A Fresh Cup of Zip
Dr. Dobb's Journal
December, 1997 by Mark Nelson
Introduction
Sun has given Java developers a new set of library components that provide support for reading
and writing Zip files. The addition of powerful compression and archiving abilities to Java is a
boon to developers, who have traditionally had to rely on either third party libraries or
proprietary tools for these functions. This article takes a quick look at how to use Sun’s java.util.zip
package, and how to avoid a few common mistakes that the library overlooks.
The New Arrival
As I’m writing this, Sun has just shipped version 1.1 of the Java Development Kit, or JDK. For a point release, JDK 1.1 has quite a few major changes. One of those changes was the addition of JAR files. JAR files are simply compressed archives that contain the various components of a Java applet, including class files, images, and sound clips. The JAR format should speed up the loading of applet components across the web, first by compressing data, and second by reducing the number of separate transactions required during the download process.
Sun has pitched its tent squarely in the middle of the kingdom of openness, so naturally the JAR file format needs to adhere to an accepted industry standard. Sun chose to use the ZIP file format, which had a couple of big advantages. First, commercial and free ZIP tools are widely available, including the deservedly revered zlib and InfoZip products. Just as importantly, the adoption of an open format like Zip allows Sun to taunt Microsoft, whose ActiveX technology requires that developers use the proprietary CAB format for compression and archiving.
To implement support for JAR files, Sun’s Java developers first ported most of zlib to pure Java.
(Some critical code has been implemented using native methods.) This took care of implementing
the deflate compression algorithm used in PKZip 2.x. They then created a fairly thin set of
wrapper classes that are used to create the archive structure around the deflated data.
The result is the java.util.zip package.
The venerable and inscrutable Zip format
Before we look at how java.util.zip
deals with Zip processing, it helps to know a little bit
about the structure of a Zip file. Figure 1 shows the layout of the Zip format. You can think of
a Zip file as a stream oriented format, which means you can create an archive by writing
sequentially without ever having to seek backwards in the output file.
Figure 1 — The Zip file format
The Zip file starts with a sequence of files, each of which can be compressed or stored in raw format. Each file has a local header immediately before its data, which contains most of the information about the file, including timestamps, compression method and file name. The compressed file contents immediately follow, and are terminated by an optional data descriptor. The data descriptor contains the file’s CRC and compressed size, which are frequently not available when writing the local file header. (If they are, the data descriptor can be skipped.)
Each file in the archive is laid down sequentially in this format, followed by a central directory at the end of the Zip archive. The central directory is a contiguous set of directory entries, each of which contains all the information in the local file header, plus extras such as file comments and attributes. Most importantly, the central directory contains pointers to the position of each file in the archive, which makes navigation of the Zip file quick and easy.
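To make the layout a little more concrete, here is a minimal sketch that peeks at the first local file header of an archive held in memory. It is not one of the article's listings. The class name LocalHeaderPeek and its helper methods are hypothetical, and the field offsets are taken from the published Zip format description rather than from anything in java.util.zip. The sketch assumes the archive starts with a local file header at offset 0 and that the file name is plain ASCII or UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: decodes a few fields of the first local file header.
class LocalHeaderPeek {
    // Every local file header starts with the bytes 'P' 'K' 3 4.
    static final int LOCAL_HEADER_SIGNATURE = 0x04034b50;

    final int flags;   // general purpose bit flags
    final int method;  // 0 = stored, 8 = deflated
    final String name; // file name stored in the header

    LocalHeaderPeek(int flags, int method, String name) {
        this.flags = flags;
        this.method = method;
        this.name = name;
    }

    // Bit 3 of the flags says the CRC and sizes follow the data in a data descriptor.
    boolean hasDataDescriptor() {
        return (flags & 0x08) != 0;
    }

    static LocalHeaderPeek parse(byte[] zip) {
        // All multi-byte values in a Zip file are little-endian.
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != LOCAL_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("no local file header at offset 0");
        }
        int flags = buf.getShort(6) & 0xFFFF;
        int method = buf.getShort(8) & 0xFFFF;
        int nameLength = buf.getShort(26) & 0xFFFF;
        // The fixed part of the header is 30 bytes; the name follows it.
        String name = new String(zip, 30, nameLength, StandardCharsets.UTF_8);
        return new LocalHeaderPeek(flags, method, name);
    }

    // Builds a one-file archive in memory so the sketch needs no input file.
    static byte[] tinyArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("hello.txt"));
            zos.write("hello, zip".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        LocalHeaderPeek h = parse(tinyArchive());
        System.out.println(h.name + " method=" + h.method
                + " dataDescriptor=" + h.hasDataDescriptor());
    }
}
```

Because the archive is written as a stream, the writer does not know the CRC or compressed size when it emits the header, which is exactly the case the data descriptor exists for.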
Key classes
The complete roster of package java.util.zip
includes 14 classes, one interface, and two
exception classes. While that may sound like a lot, most users will be able to skip over the bulk
of the package. By concentrating on the four core classes of the package, you can perform the
three most common operations on Zip files: creation, extraction, and directory reads. Those
classes are:
- ZipEntry: The ZipEntry class contains all the information needed to write both the local file header and the central directory record for a specific file, such as file sizes, time stamps, comments, attributes, etc.
- ZipFile: ZipFile is used only when reading from Zip files. It provides access to the entire central directory for the Zip file, and produces ZipEntry objects on demand for any file in the archive.
- ZipInputStream: This input stream behaves like a standard InputStream, but works on deflated streams. Calls to the read() method on this stream will decompress bytes on the fly.
- ZipOutputStream: This output stream looks like an OutputStream, but deflates data transparently to the caller. Calls to write() methods will write data in the deflated format expected by the Zip standard.
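The sample programs in this article read archives through ZipFile, so here is a small sketch of the two stream classes working together instead. It is only an illustration under my own assumptions: the class name StreamRoundTrip and its pack() and unpackFirst() helpers are hypothetical, and everything happens in memory rather than on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: deflates one file into an archive and inflates it again.
class StreamRoundTrip {
    // Writes a single entry; write() compresses behind the scenes.
    static byte[] pack(String entryName, byte[] data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(data);
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Reads the first entry sequentially; read() decompresses on the fly.
    static byte[] unpackFirst(byte[] archive) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry = zis.getNextEntry();
            if (entry == null) {
                throw new ZipException("archive holds no entries");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            // read() returns -1 at the end of the current entry, not of the archive.
            while ((n = zis.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

Note that ZipInputStream walks the local headers from the front of the archive and never consults the central directory, which is why it can work on data that cannot be seeked.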
Putting the classes to work
The first sample program, ZipList, is shown in Listing 1. This program has fewer than 50 lines of Java code, and it manages to print out a complete directory listing of a Zip file, including nicely formatted date and time stamps. To see the contents of foo.zip, ZipList.class can be executed using the following command:
java ZipList foo.zip
ZipList produces the following output for a sample Zip file:
Listing of : temp.zip
Raw Size      Size      Date       Time Name
-------- --------- --------- ---------- --------------------------
 1981717   1718555 17-Mar-97 9:05:02 PM temp/Tjava.pdf
     547       272 20-Apr-97 1:36:40 PM TEMP.DAT
   92870     41915 11-Jul-95 9:50:00 AM COMMAND.COM
    1536       110 03-Mar-97 4:55:50 AM ~OLEAPP.DOC
Coming up with this listing couldn’t be much easier. It involves just two steps. First, I create a new ZipFile object by calling the standard constructor with a filename:
ZipFile z = new ZipFile( args[ 0 ] );
Once the ZipFile object has been created, I call the entries() method, which returns an Enumeration object. The Enumeration object successively returns ZipEntry objects, one for each file in the Zip file. The methods in the ZipEntry object used to read the attributes shown in the listing include:
- getSize(): the uncompressed (raw) size of the file
- getCompressedSize(): the size of the file as stored in the archive
- getTime(): the modification time of the file
- getName(): the name of the file
With these methods, you should be able to follow the code in Listing 1 quite easily. Java makes the job even easier by providing built in classes to deal with dates, times, and strings.
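The two steps described above can be sketched in a few lines. This is not Listing 1 itself, just a compact approximation: the class name ZipListSketch and the describe() helper are hypothetical, the column layout is my own, and the date format is whatever DateFormat picks for the default locale.

```java
import java.io.IOException;
import java.text.DateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical stand-in for ZipList: one formatted line per archive entry.
class ZipListSketch {
    static List<String> describe(String zipName) throws IOException {
        List<String> lines = new ArrayList<>();
        DateFormat stamp = DateFormat.getDateTimeInstance(DateFormat.MEDIUM, DateFormat.MEDIUM);
        // Step one: open the archive, which reads its central directory.
        try (ZipFile z = new ZipFile(zipName)) {
            // Step two: walk the Enumeration of ZipEntry objects.
            Enumeration<? extends ZipEntry> entries = z.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                lines.add(String.format("%8d %9d %s %s",
                        e.getSize(),
                        e.getCompressedSize(),
                        stamp.format(new Date(e.getTime())),
                        e.getName()));
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        for (String line : describe(args[0])) {
            System.out.println(line);
        }
    }
}
```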
Extracting Files
The first sample program accompanying this article shows you how to list the contents of a Zip file. I built on that knowledge to create the program shown in Listing 2, ZipExtract.java. This GUI program lets the user enter a zip file name in a simple text box, then list the files in the Zip file in a list box. The user can then select files, and extract them in a batch process. The program is shown in action in Figure 2.
Figure 2 - ZipExtract.java at work
The method used to load up the list box in ZipExtract is a stripped down variation of the code in ZipList. Since the list box only contains the file name, the enumeration loop only has to call ZipEntry.getName().
The majority of the non-UI code in ZipExtract is found in the two methods that are called when the Extract Files button is pressed. The extractFiles() method consists of a loop that is run through once per selected file name in the list. The routine calls ZipFile.getEntry() for each file to get the ZipEntry object associated with the given file name. It would be nice if at that point there was a method called ZipEntry.extract(), but unfortunately things aren’t that simple.
The designers of java.util.zip decided to leave the actual hard work of extraction and insertion of files entirely in the hands of the package user. The ZipFile object hands you a decompressing stream (typed simply as InputStream) via the getInputStream() method, and the rest is up to you. You read bytes from the stream (which transparently decompresses), and write the output wherever you wish. I do this in the ZipExtract sample program in method extractOneFile(). I read in 100K bytes at a time, and write them to the specified output file. Since the extraction process from a ZipFile is exceptionally fast, I can supply a progress update only once every 100K bytes and still seem fairly responsive.
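The read-and-write loop described above looks roughly like the following sketch. It borrows the method name extractOneFile() from the text, but the class name ExtractSketch, the signature, and the body are my own assumptions, and the progress reporting of the GUI program is left out. The caller picks the output file, so the sketch does no checking of path names stored in the archive.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: copies one archive member to a file chosen by the caller.
class ExtractSketch {
    static long extractOneFile(ZipFile zip, String entryName, File target) throws IOException {
        ZipEntry entry = zip.getEntry(entryName);
        if (entry == null) {
            throw new FileNotFoundException(entryName + " is not in the archive");
        }
        byte[] buffer = new byte[100 * 1024]; // 100K at a time, as in the text
        long total = 0;
        try (InputStream in = zip.getInputStream(entry);
             OutputStream out = new FileOutputStream(target)) {
            int n;
            // Each read() hands back decompressed bytes; -1 marks the end of the entry.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n; // a GUI could update a progress display here
            }
        }
        return total;
    }
}
```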
A careful examination of the code in ZipExtract.java reveals a shortcoming of the java.util.zip package. When extracting files using PKUNZIP.EXE or unzip.exe, we normally expect file timestamps and protection bits to be set to the values stored in the Zip file. This doesn’t happen anywhere in ZipExtract, and it couldn’t even if I wanted it to. The java.util.zip package ducked its responsibilities in this area by not providing any sort of extract() method. Worse yet, the entire Java library leaves out the functions needed to do this on my own. If you want to set timestamps or protection bits from Java, you are going to have to resort to native methods, a decidedly un-PC proposition.
Creating Zip archives
The final sample program I wrote to illustrate this article is ZipCreate.java, shown in Listing 3. This is a simple command line program that is called with a zip file name as a command line argument, followed by a list of wild card filespecs. ZipCreate expands the wild card file specs into a list of files, removes any duplicates, and creates a Zip file with the resulting list. All this is done using a worker class I created called Zipper.
Oddly enough, creating a Zip archive doesn’t involve the ZipFile class. Method Zipper.create() in Listing 3 shows the process, which is fairly simple. A new ZipOutputStream object is created on top of an output stream opened on the archive’s filename. Files are then added to the output stream one at a time, using a ZipEntry object to control the process. The header data that precedes each file in the Zip archive is written by ZipOutputStream.putNextEntry(). The file data is then written using a series of standard write() calls, which compress transparently. Once again, the lack of a standard insert() function means we have to do all the hard work ourselves. After all of the file data is written, a call to ZipOutputStream.closeEntry() writes the data that follows the compressed file.
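The sequence just described (open the stream, then putNextEntry(), write(), and closeEntry() for each file) can be sketched as follows. This is not the Zipper class from Listing 3. The class name ZipperSketch and the create() signature are hypothetical, and the sketch assumes the caller has already removed duplicate names, since adding the same entry name twice makes putNextEntry() throw a ZipException.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: packs a list of files into a new Zip archive.
class ZipperSketch {
    static void create(File zipFile, List<File> inputs) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zipFile))) {
            byte[] buffer = new byte[8192];
            for (File input : inputs) {
                // Writes the local file header for this entry.
                zos.putNextEntry(new ZipEntry(input.getName()));
                try (InputStream in = new FileInputStream(input)) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        zos.write(buffer, 0, n); // deflated transparently
                    }
                }
                // Finishes the entry, writing whatever must follow the compressed data.
                zos.closeEntry();
            }
        } // closing the stream writes the central directory
    }
}
```

Only the bare file name is stored here, so two inputs from different directories with the same name would collide. Storing a relative path instead is a design choice the sketch leaves open.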
Listing 3 really highlights a few blank spots in the Java library. First of all, it would be great to be able to use the Java API to expand wild card file specifications. The File.list() method in Java provides the hooks to do just that, but the lack of a regular expression parser means that third party solutions are required for implementation. I poked around on the net and found a shareware package called pat 1.1 by Steve R. Brandt, which you can find at http://www.javaregex.com/. It plugged in quickly and easily, and required only a small amount of code to integrate with my app.
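To show what the File.list() hook looks like, here is a minimal sketch that filters a directory with a hand-rolled wild card matcher. It does not use the pat package, whose API I am not reproducing here. The class name WildcardSketch and both of its methods are hypothetical, and the matcher understands only '*' and '?', with no character classes or escaping.

```java
import java.io.File;
import java.util.Arrays;

// Hypothetical helper: expands a simple wild card spec within one directory.
class WildcardSketch {
    // True when name matches pattern; '*' = any run of characters, '?' = any one character.
    static boolean matches(String pattern, String name) {
        return matches(pattern, 0, name, 0);
    }

    private static boolean matches(String p, int pi, String s, int si) {
        if (pi == p.length()) {
            return si == s.length(); // pattern used up: name must be used up too
        }
        char c = p.charAt(pi);
        if (c == '*') {
            // Try letting '*' swallow zero, one, two, ... characters.
            for (int k = si; k <= s.length(); k++) {
                if (matches(p, pi + 1, s, k)) {
                    return true;
                }
            }
            return false;
        }
        if (si < s.length() && (c == '?' || c == s.charAt(si))) {
            return matches(p, pi + 1, s, si + 1);
        }
        return false;
    }

    // File.list() calls the filter once per directory member.
    static String[] expand(File dir, String pattern) {
        String[] names = dir.list((d, name) -> matches(pattern, name));
        if (names == null) {
            return new String[0]; // not a directory, or it could not be read
        }
        Arrays.sort(names);
        return names;
    }
}
```

The matching is case sensitive, which may or may not be what a given platform's users expect.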
Another glaring oversight in the Java API is found in the ZipEntry class definition. Although ZipEntry is used to carry around information about a file, such as its length, timestamps, and protection bits, none of this data is created automatically! If you want your Zip file to contain accurate timestamps and protection settings for your file, you are going to have to enter them yourself. And even worse, it appears that you will have to resort to native methods, because the Java library doesn’t have the functions you need to do this in a platform independent fashion.
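Entering the data yourself means calling the ZipEntry setters before handing the entry to putNextEntry(). The sketch below copies across a timestamp only. It assumes a JDK whose File.lastModified() is documented to return milliseconds since the epoch, which is an assumption about the runtime and not something the library discussed here guarantees. Protection bits are left out entirely, and the class name EntryMetadataSketch and its entryFor() helper are hypothetical.

```java
import java.io.File;
import java.util.zip.ZipEntry;

// Hypothetical helper: builds a ZipEntry that carries the source file's timestamp.
class EntryMetadataSketch {
    static ZipEntry entryFor(File file) {
        ZipEntry entry = new ZipEntry(file.getName());
        // Assumption: lastModified() is an absolute time in milliseconds.
        // The Zip format keeps this field at a resolution of two seconds.
        entry.setTime(file.lastModified());
        return entry;
    }
}
```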
Exceptions
Throughout these examples I’ve only glossed over the topic of errors. Of course, one of the really nice things about Java is that you can get away with a casual treatment of errors. All of my programs feature a try/catch block at a high level, which means any fatal errors thrown by java.util.zip will be caught and printed out when they occur. So rather than constantly checking flags and status bits after every library call, I can proceed as if everything works perfectly, knowing full well that errors will be caught somewhere else if and when they occur.
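A high-level try/catch of the kind described above might look like this sketch. The class name TopLevelCatchSketch and its methods are hypothetical, and the messages are my own. The worker method simply declares that it throws, and only the outermost caller decides what to do about a failure.

```java
import java.io.IOException;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

// Hypothetical example: the worker code ignores errors, the top level reports them.
class TopLevelCatchSketch {
    // No status checks here; any failure propagates as an exception.
    static int countEntries(String zipName) throws IOException {
        try (ZipFile z = new ZipFile(zipName)) {
            return z.size();
        }
    }

    // The single place where errors are turned into messages.
    static String run(String zipName) {
        try {
            return zipName + " holds " + countEntries(zipName) + " entries";
        } catch (ZipException e) {
            // ZipException is a subclass of IOException, so it must be caught first.
            return "Not a usable Zip file: " + e.getMessage();
        } catch (IOException e) {
            return "I/O error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(args[0]));
    }
}
```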
When I compare this sort of error handling to what I needed when writing demo programs for my C++ Zip library at Greenleaf Software, it’s easy to see why Java is a really great language for applications. Handling errors at a high level makes the rest of your code quite a bit easier to read, write, and maintain.
Conclusions
The creation and inclusion in Java of packages such as java.util.zip is a good move by Sun.
This sort of utility is bound to help convince people that Java is more than just a toy language
for demo applets on the World Wide Web. Personally, the idea that I can write utility and demo
programs that have a high level of platform independence is really exciting. I’ve had pretty good
luck writing command line C and C++ programs that port between various platforms and compilers,
but never this easily. And I never even considered the idea of trying to write portable GUI
programs. All this is much more feasible now.
On the downside, java.util.zip is presently a fairly shallow package. First of all, it is far too easy to create Zip files that are going to be unusable by other programs. Sun’s classes don’t check for validity of things such as file names, extra data, time stamps, and so on. Second, Sun doesn’t provide support for low level file attribute manipulation, which is really needed for a good package. And finally, it would be a good idea to make java.util.zip a little friendlier by adding functions to actually perform the insertion and extraction of files.
Compiler vendors have always had trouble deciding whether they wanted to do full scale library development. Every vendor has a few half-hearted library components, such as MS-DOS graphics libraries, complex number libraries, or container classes. Some library efforts turn into real products, Microsoft’s MFC for example. Right now Sun has only dipped a toe into the water. Time will tell whether they decide to dive in.
Source Code
- Listing 1: ZipList.java
- Listing 2: ZipExtract.java
- Listing 3: ZipCreate.java
- patbin113.zip