GZIPping with Java

The gzip format is the de facto standard compression format in the UNIX and Linux worlds.  In a previous Dev Shed article titled Zip Meets Java, Kulvir demonstrated how to use the java.util.zip package to programmatically manipulate files in the ZIP format.  In this article, we’ll cover how to use the java.util.zip package to create and read files using the gzip format.

Some Background About GZIP

gzip (short for GNU zip) is a compression utility (found mainly on *NIX platforms) that produces files with the extension .gz.  gzip was created by Jean-Loup Gailly and Mark Adler as a replacement for the compress utility; it offers better compression ratios and an open (non-patented) compression algorithm.  The sister utility, gunzip, decompresses files that are in the gzip format.  The intricacies of the gzip format are beyond the scope of this article. However, you can learn more about the compression and decompression algorithms used by gzip by following the link in our references section at the end of this article.  You can learn more about the file format itself by reading RFC 1951 (the DEFLATE compressed data format) and RFC 1952 (the gzip file format).
 
Some of the key points include:

  1. A lossless compressed data format

  2. Data is compressed using the LZ77 algorithm and Huffman coding

  3. Format is not covered by patents, thus making it publicly usable without fear of legal repercussion

  4. Format includes a cyclic redundancy check value to detect data corruption and ensure data integrity
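
Two of these points can be checked directly from Java. The following is a minimal sketch, not a definitive implementation: the class name GzipRoundTrip and its helper methods are made up for illustration, and the data is compressed in memory rather than in a file. It prints the first bytes of the gzip header (RFC 1952 specifies the magic bytes 0x1f 0x8b followed by the compression method, where 8 means deflate) and confirms that decompression returns exactly the original bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; the names here are illustrative only.
class GzipRoundTrip {

    // Compress a byte array into the gzip format, entirely in memory.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decompress gzip data back into the original bytes.
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[1024];
            int len;
            while ((len = in.read(buf)) != -1) {
                bytes.write(buf, 0, len);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "Northern Cardinal, Northern Cardinal, Northern Cardinal"
                .getBytes(StandardCharsets.UTF_8);
        byte[] compressed = gzip(original);

        // Header check: prints "magic: 1f 8b, method: 8"
        System.out.printf("magic: %02x %02x, method: %d%n",
                compressed[0] & 0xff, compressed[1] & 0xff, compressed[2]);

        // Lossless check: prints "round trip ok: true"
        System.out.println("round trip ok: " + Arrays.equals(original, gunzip(compressed)));
    }
}
```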

Since its inception, the gzip format has gained a popular following.  For example, the gzip utility has been formally adopted by the GNU project. There are free downloadable utilities for various platforms that can compress and decompress files in the gzip format.  In Java, the gzip functionality lives in the java.util.zip package, which has been around since Java 1.1. 

GZIP vs. ZIP

Unlike ZIP utilities, the gzip utility does not archive multiple files into one file.  Instead, gzip compresses one file into one gzip file.  To compress multiple files, you must first use another utility called TAR, which joins multiple files into a single archive file without compressing them.
 
Compressing a file that is already compressed doesn’t do much.  For example, GIF is a compressed image format, so when you ZIP or gzip a GIF file, the resulting file will usually be pretty similar in size to the original GIF file. However, if you TAR a number of GIF files into a single file and then gzip the TAR file, a significant size saving can be achieved. This is because there may be common strings found across the joined files that gzip can compress, thus further reducing the total size of the already compressed GIF files.
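
The effect of joining files before compressing them can be sketched in a few lines. This is an illustration under stated assumptions: the JDK has no TAR support, so plain concatenation stands in for a TAR archive, and the "files" are small made-up text snippets (not GIF images) that share a common header. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo; simple concatenation stands in for a real TAR archive.
class JoinThenGzip {

    // Return the size in bytes of the gzip-compressed form of data.
    static int gzipSize(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Build a made-up "file": a header shared by every file plus one unique line.
    static byte[] sampleFile(int index) {
        String shared = "State bird report\nSource: example data\nFormat: state, bird\n";
        return (shared + "entry number " + index + "\n").getBytes(StandardCharsets.UTF_8);
    }

    // Total size when every file is gzipped on its own.
    static int separateTotal(int fileCount) {
        int total = 0;
        for (int i = 0; i < fileCount; i++) {
            total += gzipSize(sampleFile(i));
        }
        return total;
    }

    // Size when all files are joined first and gzipped once.
    static int joinedSize(int fileCount) {
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        for (int i = 0; i < fileCount; i++) {
            byte[] file = sampleFile(i);
            joined.write(file, 0, file.length);
        }
        return gzipSize(joined.toByteArray());
    }

    public static void main(String[] args) {
        System.out.println("sum of separate gzip sizes: " + separateTotal(20));
        System.out.println("gzip size of joined files:  " + joinedSize(20));
    }
}
```

The joined result comes out smaller for two reasons: the shared header is stored once and then referenced, and the fixed gzip header and trailer are paid once instead of once per file.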
 
It is also important to note that GZIP does not support encryption whereas ZIP does. The makers of GZIP find the encryption provided by ZIP to be “weak.” They also suggest using PGP if the user desires strong encryption as well as compression. In fact, PGP incorporates the gzip compression code. See our references section for a link to the PGP Website.

Enough… Let’s Write Some GZIP Code

Programmatically creating a file in the gzip format is rather straightforward.  In the code below, we do just that.  We start off by creating a new GZIPOutputStream object. The constructor of GZIPOutputStream takes an OutputStream; here we pass it a FileOutputStream object, which we instantiate using a String that specifies the location where our gzip file should be created.
 
Next, we create a FileInputStream that points to the (uncompressed) file that we wish to compress. We create a byte[] to use as a buffer to write to the GZIPOutputStream.  Then we use a while loop to read our input file and write it to our GZIPOutputStream using our buffer. The process of creating the GZIP file is completed by calling the finish and close methods on the GZIPOutputStream. Our sample code will create a file named examplegzip.gz in a directory named c:\articles, using the file statebirds.txt from the same directory as the source file to be compressed.
 
[code]
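import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// The class name is illustrative
public class GzipExample {
    public static void main(String[] args) throws IOException {
        System.out.println("Creating gzip file.");

        // Specify gzip file name (backslashes must be escaped in Java string literals)
        String gzipFileName = "c:\\articles\\examplegzip.gz";
        GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(gzipFileName));

        // Specify the input file to be compressed
        String inFilename = "c:\\articles\\statebirds.txt";
        FileInputStream in = new FileInputStream(inFilename);

        // Transfer bytes from the input file
        // to the gzip output stream
        byte[] buf = new byte[1024];
        int len;
        while ((len = in.read(buf)) > 0) {
            out.write(buf, 0, len);
        }
        in.close();

        // Finish creation of gzip file: finish() writes the remaining
        // compressed data and the trailer, close() releases the file
        out.finish();
        out.close();
    }
}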
[/code]
 
After you are finished writing the file using the GZIPOutputStream, a CRC-32 checksum value representing the original (uncompressed) data is stored in the trailer of the GZIP file, together with the length of that data.
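
To make the trailer concrete, the sketch below reads it back by hand. It assumes the layout given in RFC 1952: the last eight bytes of a gzip file hold the CRC-32 of the uncompressed data followed by its length, each as a 4-byte little-endian value. The class GzipTrailer and its methods are hypothetical names, and the data is compressed in memory to keep the example self-contained.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; the names here are illustrative only.
class GzipTrailer {

    // Compress a byte array into the gzip format, entirely in memory.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read an unsigned 32-bit little-endian value starting at the given offset.
    static long readUInt32(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt() & 0xffffffffL;
    }

    // CRC-32 stored in the trailer (first 4 of the last 8 bytes).
    static long storedCrc(byte[] gz) {
        return readUInt32(gz, gz.length - 8);
    }

    // Uncompressed length stored in the trailer (last 4 bytes).
    static long storedSize(byte[] gz) {
        return readUInt32(gz, gz.length - 4);
    }

    // CRC-32 computed independently over the uncompressed data.
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] original = "Alabama, Yellowhammer\nAlaska, Willow Ptarmigan\n"
                .getBytes(StandardCharsets.UTF_8);
        byte[] gz = gzip(original);

        System.out.println("CRC matches:  " + (storedCrc(gz) == crcOf(original)));
        System.out.println("size matches: " + (storedSize(gz) == original.length));
    }
}
```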
 
In the example above, we are reading from a file. Of course, because we are using an InputStream we could just as easily read from another data source, such as a socket.
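
As a rough sketch of that idea, the copy loop can be written against the InputStream and OutputStream types so that it does not care where the bytes come from or go to. The class name StreamGzip and the method compress are hypothetical, and a ByteArrayInputStream stands in here for a socket's input stream so that the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; the names here are illustrative only.
class StreamGzip {

    // Compress everything readable from source into target, in gzip format.
    // Closing the GZIPOutputStream also closes target; source is left open.
    static void compress(InputStream source, OutputStream target) throws IOException {
        try (GZIPOutputStream out = new GZIPOutputStream(target)) {
            byte[] buf = new byte[1024];
            int len;
            while ((len = source.read(buf)) != -1) {
                out.write(buf, 0, len);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for any other source, e.g. the stream returned by Socket.getInputStream()
        InputStream source = new ByteArrayInputStream(
                "data arriving from somewhere other than a file".getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream target = new ByteArrayOutputStream();

        compress(source, target);
        System.out.println("compressed to " + target.size() + " bytes");
    }
}
```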

And the Other Way Around: Uncompressing Your GZIP Files (Programmatically)

Decompressing a gzip file with the java.util.zip API is pretty straightforward too.  The code below decompresses the file we created in the previous section. It begins by creating a new GZIPInputStream object. We pass the GZIPInputStream constructor a FileInputStream opened on the gzip file we want to decompress (i.e., c:\articles\examplegzip.gz). Next, we use the same byte buffer technique to read from the GZIPInputStream object and write to a FileOutputStream object. To clean things up, we close our GZIPInputStream and our OutputStream. Our sample program will create a file named statebirdsclone.txt, which is exactly the same as the original statebirds.txt that we compressed in the previous section.
 
[code]
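import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

// The class name is illustrative
public class GunzipExample {
    public static void main(String[] args) throws IOException {
        // Open the gzip file (backslashes must be escaped in Java string literals)
        String inFilename = "c:\\articles\\examplegzip.gz";
        GZIPInputStream gzipInputStream =
                new GZIPInputStream(new FileInputStream(inFilename));

        // Open the output file
        String outFilename = "c:\\articles\\statebirdsclone.txt";
        OutputStream out = new FileOutputStream(outFilename);

        // Transfer bytes from the compressed file to the output file
        byte[] buf = new byte[1024];
        int len;
        while ((len = gzipInputStream.read(buf)) > 0) {
            out.write(buf, 0, len);
        }

        // Close the file and stream
        gzipInputStream.close();
        out.close();
    }
}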
[/code]
 
Once the entire file has been read through the GZIPInputStream, the read method compares the checksum of the decompressed data with the checksum that is stored in the trailer of the GZIP file.  If the values do not match, the read method throws an exception (an IOException; more specifically, a java.util.zip.ZipException).
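
The check is easy to provoke on purpose. The sketch below, which uses hypothetical class and method names and in-memory data, compresses a short string, flips a single bit of the checksum stored in the trailer (the trailer layout is taken from RFC 1952), and then tries to read both the intact and the damaged copies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; the names here are illustrative only.
class GzipCorruption {

    // Compress a byte array into the gzip format, entirely in memory.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Return true if reading gz to the end raises an IOException.
    static boolean failsToDecompress(byte[] gz) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            byte[] buf = new byte[1024];
            while (in.read(buf) != -1) {
                // Discard the data; we only care whether the read completes.
            }
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] gz = gzip("Arizona, Cactus Wren".getBytes(StandardCharsets.UTF_8));

        // Flip one bit in the stored CRC-32 (the first of the last 8 bytes).
        byte[] damaged = gz.clone();
        damaged[damaged.length - 8] ^= 0x01;

        System.out.println("intact copy fails:  " + failsToDecompress(gz));      // false
        System.out.println("damaged copy fails: " + failsToDecompress(damaged)); // true
    }
}
```

Note that the exception only surfaces on the final read, after all of the data has already been handed to the caller, so code that must not act on corrupt data should wait until the stream has been read to the end.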

Leveraging GZIP in the Web World

One very powerful application of the GZIP classes is compressing the content of a Servlet response in a J2EE Web application.  Many application servers, such as IBM WebSphere Application Server, support Servlet filtering, which is the act of modifying a Servlet’s response before sending it back to the client.  You can fairly easily create a Servlet filter that uses GZIP to compress the content that you are returning to a browser.  Most modern browsers have built-in functionality to decompress content that is returned in GZIP format.  For Websites that return a large amount of content, this technique can tremendously reduce overall response time by minimizing network transfer time (which is often the bulk of the overall transaction).  Please refer to our references section for a link to Jayson Falkner’s article, which gives a detailed description of implementing a Servlet filter that does GZIP compression.
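
The core decision such a filter makes can be sketched without the Servlet API. The code below is only an illustration with hypothetical names, not the approach from the referenced article: it assumes the filter has already obtained the value of the request's Accept-Encoding header and the uncompressed response body. A real filter would also wrap the response object and set the Content-Encoding: gzip response header whenever it compresses.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of the logic inside a compressing filter.
class GzipResponseSketch {

    // True if the client's Accept-Encoding header value mentions gzip.
    // Simplified: quality values such as "gzip;q=0" are not interpreted.
    static boolean acceptsGzip(String acceptEncoding) {
        return acceptEncoding != null
                && acceptEncoding.toLowerCase(Locale.ROOT).contains("gzip");
    }

    // Compress a response body into the gzip format, entirely in memory.
    static byte[] gzip(byte[] body) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Choose the bytes to send: compressed if the client allows it, unchanged otherwise.
    static byte[] encodeBody(String acceptEncoding, byte[] body) {
        return acceptsGzip(acceptEncoding) ? gzip(body) : body;
    }

    public static void main(String[] args) {
        StringBuilder page = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            page.append("<tr><td>state</td><td>bird</td></tr>\n");
        }
        byte[] body = page.toString().getBytes(StandardCharsets.UTF_8);

        System.out.println("original body:     " + body.length + " bytes");
        System.out.println("sent to browser:   " + encodeBody("gzip, deflate", body).length + " bytes");
        System.out.println("sent to old client: " + encodeBody(null, body).length + " bytes");
    }
}
```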
 
 
Conclusion

 

The gzip utility (and algorithm) were developed as a free alternative to the compress utility, which uses a patented algorithm.  You can learn more about these utilities at the GZIP website listed in our references section.  In this article, we showed you how to use the java.util.zip package to programmatically create and read files in the gzip format. The programmatic ability offered by the package is indispensable for programmers who want to write their own Java programs to create and read gzip format files.
 
References
