format?that uses a combination of the?LZ77?algorithm and?Huffman
coding. It was originally defined by?Phil Katz?for version 2 of his?PKZIP?archiving
tool. The file format was later specified in RFC1951.
The original algorithm as designed by Katz was patented as?U.S. Patent 5,051,745?and assigned to?PKWARE,
stated in the RFC document, an algorithm producing DEFLATE files is widely thought to be implementable in a manner not covered by patents.?This
has led to its widespread use, for example in?gzip?compressed files,?PNG?image
files and the?ZIP?file format for which Katz originally designed it.
A Deflate stream consists of a series of blocks. Each block is preceded by a 3-bit?header:
The?stored?block option adds minimal overhead, and is used for data that is incompressible.
Most compressible data will end up being encoded using method?, the?dynamic Huffman?encoding,
which produces an optimised Huffman tree customised for each block of data individually. Instructions to generate the necessary Huffman tree immediately follow the block header. The static Huffman option is used for short messages, where the fixed saving gained
by omitting the tree outweighs the percentage compression loss due to using a non-optimal (thus, not technically Huffman) code.
Compression is achieved through two steps:
Within compressed blocks, if a duplicate series of bytes is spotted (a repeated string), then a back-reference?is
inserted, linking to the previous location of that identical string instead. An encoded match to an earlier string consists of an 8-bit length (3–258 bytes) and a 15-bit distance (1–32,768 bytes) to the beginning of the duplicate. Relative back-references
can be made across any number of blocks, as long as the distance appears within the last 32?KB?of uncompressed data decoded (termed
If the distance is less than the length, the duplicate overlaps itself, indicating repetition. For example, a run of 10 identical bytes can be encoded as one byte, followed by a duplicate of length 9 beginning 1 byte ago.
The second compression stage consists of replacing commonly used symbols with shorter representations and less commonly used symbols with longer representations. The method used is?Huffman
coding?which creates an unprefixed tree of non-overlapping intervals, where the length of each sequence is inversely proportional to the probability of that symbol needing to be encoded. The more likely a symbol has to be encoded, the shorter its bit-sequence
A tree is created, containing space for 288 symbols:
A match length code will always be followed by a distance code. Based on the distance code read, further "extra" bits may be read in order to produce the final distance. The distance tree contains space for 32 symbols: