Dr. Dobb’s Journal December, 1997
by Mark Nelson

Introduction

Sun has given Java developers a new set of library components that provide support for reading and writing Zip files. The addition of powerful compression and archiving abilities to Java is a boon to developers, who have traditionally had to rely on either third party libraries or proprietary tools for these functions. This article takes a quick look at how to use Sun’s java.util.zip package, and how to avoid a few common mistakes that the library overlooks.

The New Arrival

As I’m writing this, Sun has just shipped version 1.1 of the Java Development Kit, or JDK. For a point release, JDK 1.1 has quite a few major changes. One of those changes was the addition of JAR files. JAR files are simply compressed archives that contain various components of a Java applet, including class files, images, sound clips, etc. The JAR format should speed up the loading of applet components across the web, first by compressing data, and second by reducing the number of separate transactions required during the download process.

Sun has pitched its tent squarely in the middle of the kingdom of openness, so naturally the JAR file format needs to adhere to an accepted industry standard. Sun chose the Zip file format, which had a couple of big advantages. First, commercial and free Zip tools are widely available, including the deservedly revered zlib and Info-ZIP products. Just as importantly, the adoption of an open format like Zip allows Sun to taunt Microsoft, whose ActiveX technology requires that developers use the proprietary CAB format for compression and archiving.

To implement support for JAR files, Sun’s Java developers first ported most of zlib to pure Java. (Some critical code has been implemented using native methods.) This took care of implementing the deflate compression algorithm used in PKZip 2.x. They then created a fairly thin set of wrapper classes that are used to create the archive structure around the deflated data. The result is the java.util.zip package.

The venerable and inscrutable Zip format

Before we look at how java.util.zip deals with Zip processing, it helps to know a little bit about the structure of a Zip file. Figure 1 shows the layout of the Zip format. You can think of Zip as a stream-oriented format, which means you can create an archive by writing sequentially without ever having to seek backwards in the output file.



Figure 1 – The Zip file format

The Zip file starts with a sequence of files, each of which can be compressed or stored in raw format. Each file has a local header immediately before its data, which contains most of the information about the file, including timestamps, compression method and file name. The compressed file contents immediately follow, and are terminated by an optional data descriptor. The data descriptor contains the file’s CRC, compressed size, and uncompressed size, which are frequently not available when writing the local file header. (If they are, the data descriptor can be skipped.)

Each file in the archive is laid down sequentially in this format, followed by a central directory at the end of the Zip archive. The central directory is a contiguous set of directory entries, each of which contains all the information in the local file header, plus extras such as file comments and attributes. Most importantly, the central directory contains pointers to the position of each file in the archive, which makes navigation of the Zip file quick and easy.
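The layout described above can be checked with a few lines of code. What follows is a minimal sketch, not part of the article's listings: it builds a one-entry archive in memory, then looks for the signature values that the Zip format places at the start of the first local header and in the end-of-central-directory record. The class name ZipLayoutSketch and its helper methods are made up for illustration, the code is written for a current JDK rather than JDK 1.1, and it assumes an archive with no trailing comment.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipLayoutSketch {
    // Signatures defined by the Zip format, stored little-endian in the file.
    static final int LOCAL_HEADER_SIG = 0x04034b50;        // "PK\003\004"
    static final int CENTRAL_HEADER_SIG = 0x02014b50;      // "PK\001\002"
    static final int END_OF_CENTRAL_DIR_SIG = 0x06054b50;  // "PK\005\006"

    // Build a one-entry archive entirely in memory, writing sequentially.
    static byte[] buildSample() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("hello.txt"));
            zip.write("hello, zip".getBytes("US-ASCII"));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Read a 4-byte little-endian integer at the given offset.
    static int readInt(byte[] b, int off) {
        return (b[off] & 0xff)
             | (b[off + 1] & 0xff) << 8
             | (b[off + 2] & 0xff) << 16
             | (b[off + 3] & 0xff) << 24;
    }

    // Scan backwards for the end-of-central-directory record, which is
    // at least 22 bytes long. Returns its offset, or -1 if not found.
    static int findEndOfCentralDir(byte[] b) {
        for (int i = b.length - 22; i >= 0; i--) {
            if (readInt(b, i) == END_OF_CENTRAL_DIR_SIG) {
                return i;
            }
        }
        return -1;
    }

    // The end record stores the offset of the central directory
    // 16 bytes past its own signature.
    static int centralDirOffset(byte[] b) {
        return readInt(b, findEndOfCentralDir(b) + 16);
    }
}
```

Reading the int at offset 0 of the sample should give LOCAL_HEADER_SIG, and reading the int at centralDirOffset() should give CENTRAL_HEADER_SIG, which matches the picture of local headers up front and the directory at the end.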

Key classes

The complete roster of package java.util.zip includes 14 classes, one interface, and two exception classes. While that may sound like a lot, most users will be able to skip over the bulk of the package. By concentrating on the four core classes of the package, you can perform the three most common operations on Zip files: creation, extraction, and directory reads. Those classes are:

ZipEntry: The ZipEntry class contains all the information needed to write both the local file header and the central directory record for a specific file, such as file sizes, time stamps, comments, attributes, etc.
ZipFile: ZipFile is used only when reading from Zip files. It provides access to the entire central directory for the Zip file, and produces ZipEntry objects on demand for any file in the archive.
ZipInputStream: This input stream behaves like a standard InputStream, but works on deflated streams. Calls to the read() method on this stream will decompress bytes on the fly.
ZipOutputStream: This output stream looks like an OutputStream, but deflates data transparently to the caller. Calls to write() methods will write data in the deflated format expected by the Zip standard.
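To make the division of labor between the two stream classes concrete, here is a small round-trip sketch. It is an illustration only, written for a current JDK; the class ZipStreamSketch and its pack() and unpack() methods are hypothetical names, not part of java.util.zip or of the article's listings. ZipOutputStream deflates on write(), and ZipInputStream inflates on read(), with getNextEntry() stepping from one local header to the next.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipStreamSketch {
    // Write each name/body pair as one entry; write() deflates transparently.
    static byte[] pack(Map<String, String> files) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> f : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(f.getKey()));
                zip.write(f.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    // Walk the archive sequentially; read() inflates transparently and
    // returns -1 at the end of the current entry.
    static Map<String, String> unpack(byte[] archive) throws IOException {
        Map<String, String> files = new LinkedHashMap<>();
        byte[] buffer = new byte[4096];
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream body = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    body.write(buffer, 0, n);
                }
                files.put(entry.getName(),
                          new String(body.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return files;
    }
}
```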

Putting the classes to work

The first sample program, ZipList, is shown in Listing 1. This program has fewer than 50 lines of Java code, and it manages to print out the entire contents of a Zip file, including nicely formatted date and time stamps. To see the contents of foo.zip, ZipList.class can be executed using the following command:

 
java ZipList foo.zip

ZipList produces the following output for a sample Zip file:

 
Listing of : temp.zip

Raw Size  Size     Date       Time        Name
--------  -------  ---------  ----------  --------------
1981717   1718555  17-Mar-97  9:05:02 PM  temp/Tjava.pdf
547       272      20-Apr-97  1:36:40 PM  TEMP.DAT
92870     41915    11-Jul-95  9:50:00 AM  COMMAND.COM
1536      110      03-Mar-97  4:55:50 AM  ~OLEAPP.DOC

Coming up with this listing couldn’t be much easier. It involves just two steps. First, I create a new ZipFile object by calling the standard constructor with a filename:

 
ZipFile z = new ZipFile( args[ 0 ] );

Once the ZipFile object has been created, I call the entries() method, which returns an Enumeration object. The Enumeration object successively returns ZipEntry objects, one for each file in the Zip file. The methods in the ZipEntry object used to read various attributes include:

 
    String getComment() 
    long getCompressedSize() 
    long getCrc() 
    byte[] getExtra() 
    int getMethod() 
    String getName() 
    long getSize() 
    long getTime() 
    boolean isDirectory() 

With these methods, you should be able to follow the code in Listing 1 quite easily. Java makes the job even easier by providing built in classes to deal with dates, times, and strings.
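For readers without Listing 1 at hand, the two steps can be sketched roughly as follows. This is not the code from Listing 1; the class name ZipListSketch, the describe() helper, and the tab-separated output format are assumptions made for illustration, and the code targets a current JDK (it relies on try-with-resources and generics).

```java
import java.io.IOException;
import java.text.DateFormat;
import java.util.Date;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class ZipListSketch {
    // Format one line of the listing from the ZipEntry accessors.
    static String describe(ZipEntry e, DateFormat fmt) {
        return e.getSize() + "\t"
             + e.getCompressedSize() + "\t"
             + fmt.format(new Date(e.getTime())) + "\t"
             + e.getName();
    }

    // Print every entry in the archive and return how many were listed.
    static int list(String zipName) throws IOException {
        DateFormat fmt = DateFormat.getDateTimeInstance();
        int count = 0;
        try (ZipFile z = new ZipFile(zipName)) {             // step 1: open
            Enumeration<? extends ZipEntry> entries = z.entries();
            while (entries.hasMoreElements()) {               // step 2: enumerate
                System.out.println(describe(entries.nextElement(), fmt));
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Listing of : " + args[0]);
        list(args[0]);
    }
}
```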

Extracting Files

The first sample program accompanying this article shows you how to list the contents of a Zip file. I built on that knowledge to create the program shown in Listing 2, ZipExtract.java. This GUI program lets the user enter a zip file name in a simple text box, then list the files in the Zip file in a list box. The user can then select files, and extract them in a batch process. The program is shown in action in Figure 2.



Figure 2 – ZipExtract.java at work

The method used to load up the list box in ZipExtract is a stripped down variation of the code in ZipList. Since the list box only contains the file name, the enumeration loop only has to call ZipEntry.getName().

The majority of the non-UI code in ZipExtract is found in the two methods that are called when the Extract Files button is pressed. The extractFiles() method consists of a loop that is run through once per selected file name in the list. The routine calls ZipFile.getEntry() for each file to get the ZipEntry object associated with the given file name. It would be nice if at that point there was a method called ZipEntry.extract(), but unfortunately things aren’t that simple.

The designers of java.util.zip decided to leave the actual hard work of extraction and insertion of files entirely in the hands of the package user. The ZipFile object provides you with an InputStream for a given entry via the getInputStream() method, and the rest is up to you. You read bytes from the stream (which transparently decompresses), and write the output wherever you wish. I do this in the ZipExtract sample program in method extractOneFile(). I read in 100K bytes at a time, and write them to the specified output file. Since the extraction process from a ZipFile is exceptionally fast, I can supply a progress update only once every 100K bytes and still seem fairly responsive.
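The read-and-write loop at the heart of extraction looks roughly like the sketch below. It is not the extractOneFile() method from Listing 2: the class name ZipExtractSketch, the extractOne() signature, and the returned byte count are assumptions for illustration, the progress reporting is omitted, and the code is written for a current JDK.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class ZipExtractSketch {
    // Copy one named entry out of an open ZipFile into the target file.
    // Returns the number of uncompressed bytes written.
    static long extractOne(ZipFile zip, String entryName, File target)
            throws IOException {
        ZipEntry entry = zip.getEntry(entryName);
        if (entry == null) {
            throw new FileNotFoundException(entryName + " is not in the archive");
        }
        byte[] buffer = new byte[100 * 1024];   // 100K per read, as in the text
        long total = 0;
        try (InputStream in = zip.getInputStream(entry);
             OutputStream out = new FileOutputStream(target)) {
            int n;
            while ((n = in.read(buffer)) != -1) {   // read() decompresses
                out.write(buffer, 0, n);
                total += n;
                // A GUI could update a progress indicator here.
            }
        }
        return total;
    }
}
```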

A careful examination of the code in ZipExtract.java reveals a shortcoming of the java.util.zip package. When extracting files using PKUNZIP.EXE or unzip.exe, we normally expect file timestamps and protection bits to be set to the values stored in the Zip file. This doesn’t happen anywhere in ZipExtract, and it couldn’t even if I wanted it to. The java.util.zip package ducked its responsibilities in this area by not providing any sort of extract() method. Worse yet, the entire Java library leaves out the functions needed to do this on my own. If you want to set timestamps or protection bits from Java, you are going to have to resort to native methods, a decidedly un-PC proposition.

Creating Zip archives

The final sample program I wrote to illustrate this article is ZipCreate.java, shown in Listing 3. This is a simple command line program that is called with a zip file name as a command line argument, followed by a list of wild card filespecs. ZipCreate expands the wild card file specs into a list of files, removes any duplicates, and creates a Zip file with the resulting list. All this is done using a worker class I created called Zipper.

Oddly enough, creating a Zip archive doesn’t involve the ZipFile class. Method Zipper.create() in Listing 3 shows the process, which is fairly simple. A new ZipOutputStream object is created on top of an ordinary output stream opened on the archive’s file name. Files are then added to the output stream one at a time, using a ZipEntry object to control the process. A call to ZipOutputStream.putNextEntry() writes the data that precedes the file in the Zip archive. The file data is then written using a series of standard write() calls, which compress transparently. Once again, the lack of a standard insert() function means we have to do all the hard work ourselves. After all of the file data is written, a call to ZipOutputStream.closeEntry() writes the data that follows the compressed file.
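The putNextEntry(), write(), closeEntry() sequence can be sketched as follows. This is not the Zipper class from Listing 3; ZipCreateSketch and its create() method are hypothetical names, entries are stored under the bare file name with no directory part, and the code assumes a current JDK.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipCreateSketch {
    // Create a Zip archive containing each of the given files.
    static void create(File archive, List<File> files) throws IOException {
        byte[] buffer = new byte[8192];
        try (ZipOutputStream zip =
                 new ZipOutputStream(new FileOutputStream(archive))) {
            for (File f : files) {
                // Writes the local file header for this entry.
                zip.putNextEntry(new ZipEntry(f.getName()));
                try (InputStream in = new FileInputStream(f)) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        zip.write(buffer, 0, n);   // deflates as it writes
                    }
                }
                // Writes whatever must follow the compressed data.
                zip.closeEntry();
            }
        }   // closing the stream writes the central directory
    }
}
```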

Listing 3 really highlights a few blank spots in the Java library. First of all, it would be great to be able to use the Java API to expand wild card file specifications. The File.list() method in Java provides the hooks to do just that, but the lack of a regular expression parser means that third party solutions are required for implementation. I poked around on the net and found a shareware package called pat 1.1 by Steve R. Brandt, which you can find at http://www.javaregex.com/. It plugged in quickly and easily, and required only a small amount of code to integrate with my app.
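To show the kind of hook File.list() offers, here is a small sketch that expands a file spec against one directory using a FilenameFilter. It does not use the pat package mentioned above; instead it substitutes a hand-written matcher that understands only the '*' and '?' wild cards. The class WildcardSketch and its methods are hypothetical names chosen for this illustration.

```java
import java.io.File;
import java.io.FilenameFilter;

class WildcardSketch {
    // True if name matches pattern, where '*' matches any run of
    // characters (including none) and '?' matches exactly one character.
    static boolean matches(String pattern, String name) {
        return matches(pattern, 0, name, 0);
    }

    private static boolean matches(String p, int pi, String s, int si) {
        if (pi == p.length()) {
            return si == s.length();
        }
        char c = p.charAt(pi);
        if (c == '*') {
            // Try letting '*' absorb zero or more characters.
            for (int k = si; k <= s.length(); k++) {
                if (matches(p, pi + 1, s, k)) {
                    return true;
                }
            }
            return false;
        }
        if (si < s.length() && (c == '?' || c == s.charAt(si))) {
            return matches(p, pi + 1, s, si + 1);
        }
        return false;
    }

    // List the names in dir that match the wild card pattern.
    static String[] expand(File dir, final String pattern) {
        String[] found = dir.list(new FilenameFilter() {
            public boolean accept(File d, String name) {
                return matches(pattern, name);
            }
        });
        return found == null ? new String[0] : found;   // null if dir is unreadable
    }
}
```

The matching here is case sensitive and works on a single directory, which is simpler than what a full regular expression package provides.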

Another glaring oversight in the Java API is found in the ZipEntry class definition. Although ZipEntry is used to carry around information about a file, such as its length, timestamps, and protection bits, none of this data is created automatically! If you want your Zip file to contain accurate timestamps and protection settings for your file, you are going to have to enter them yourself. And even worse, it appears that you will have to resort to native methods, because the Java library doesn’t have the functions you need to do this in a platform independent fashion.
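Entering the data yourself means calling the ZipEntry setters before handing the entry to putNextEntry(). The sketch below copies a file's modification time into a new entry. It rests on one loud assumption: that File.lastModified() returns milliseconds since the epoch, which current JDKs document but which may not hold for the JDK 1.1 discussed here. The class EntryMetadataSketch and its entryFor() method are hypothetical, and protection bits are not addressed at all.

```java
import java.io.File;
import java.util.zip.ZipEntry;

class EntryMetadataSketch {
    // Build a ZipEntry for a file and fill in its timestamp by hand.
    // Assumption: f.lastModified() is milliseconds since the epoch,
    // the unit that ZipEntry.setTime() expects.
    static ZipEntry entryFor(File f) {
        ZipEntry entry = new ZipEntry(f.getName());
        entry.setTime(f.lastModified());   // not filled in automatically
        return entry;
    }
}
```

The resulting entry would then be passed to ZipOutputStream.putNextEntry() in place of a bare new ZipEntry(name).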

Exceptions

Throughout these examples I’ve only glossed over the topic of errors. Of course, one of the really nice things about Java is that you can get away with a casual treatment of errors. All of my programs feature a try/catch block at a high level, which means any fatal errors thrown by java.util.zip will be caught and printed out when they occur. So rather than constantly checking flags and status bits after every library call, I can proceed as if everything works perfectly, knowing full well that errors will be caught somewhere else if and when they occur.
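A high-level handler of this kind might look like the following sketch. It is an illustration rather than code from the listings: the class ZipErrorSketch, its two methods, and the message wording are all made up. The worker method simply lets exceptions propagate, and a single try/catch near the top turns them into a report.

```java
import java.io.IOException;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

class ZipErrorSketch {
    // Worker code: no error checks, exceptions simply propagate.
    static int countEntries(String zipName) throws IOException {
        try (ZipFile z = new ZipFile(zipName)) {
            return z.size();
        }
    }

    // High-level handler: one place where failures are caught and reported.
    static String report(String zipName) {
        try {
            return zipName + ": " + countEntries(zipName) + " entries";
        } catch (ZipException e) {
            // Thrown by java.util.zip when the file is not a usable Zip archive.
            return "Not a valid Zip file: " + e.getMessage();
        } catch (IOException e) {
            // Any other I/O failure, such as an unreadable file.
            return "I/O error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(report(args[0]));
    }
}
```

ZipException is a subclass of IOException, so it has to be caught first if the two cases are to be reported differently.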

When I compare this sort of error handling with what I needed when writing demo programs for my C++ Zip library at Greenleaf Software, it’s easy to see why Java is a really great language for applications. Handling errors at a high level makes the rest of your code quite a bit easier to read, write, and maintain.

Conclusions

The creation and inclusion in Java of packages such as java.util.zip is a good move by Sun. This sort of utility is bound to help convince people that Java is more than just a toy language for demo applets on the World Wide Web. Personally, the idea that I can write utility and demo programs that have a high level of platform independence is really exciting. I’ve had pretty good luck writing command line C and C++ programs that port between various platforms and compilers, but never this easily. And I never even considered the idea of trying to write portable GUI programs. All this is much more feasible now.

On the downside, java.util.zip is presently a fairly shallow package. First of all, it is far too easy to create Zip files that are going to be unusable by other programs. Sun’s classes don’t check for validity of things such as file names, extra data, time stamps, and so on. Second, Sun doesn’t provide support for low level file attribute manipulation, which is really needed for a good package. And finally, it would be a good idea to make java.util.zip a little friendlier by adding functions to actually perform the insertion and extraction of files.

Compiler vendors have always had trouble deciding whether they wanted to do full scale library development. Every vendor has a few half-hearted library components, such as MS-DOS graphics libraries, complex number libraries, or container classes. Some library efforts turn into real products, such as Microsoft’s MFC. Right now Sun has only dipped a toe into the water; time will tell whether they decide to dive in or not.

Source Code


Listing 1 ZipList.java
Listing 2 ZipExtract.java
Listing 3 ZipCreate.java
patbin113.zip