Diffstat (limited to 'gnu/usr.bin/gzip/algorithm.doc')
-rw-r--r--  gnu/usr.bin/gzip/algorithm.doc  164
1 file changed, 164 insertions, 0 deletions
diff --git a/gnu/usr.bin/gzip/algorithm.doc b/gnu/usr.bin/gzip/algorithm.doc
new file mode 100644
index 00000000000..24f7619ab69
--- /dev/null
+++ b/gnu/usr.bin/gzip/algorithm.doc
@@ -0,0 +1,164 @@

1. Algorithm

The deflation algorithm used by zip and gzip is a variation of LZ77
(Lempel-Ziv 1977, see reference below). It finds duplicated strings in
the input data. The second occurrence of a string is replaced by a
pointer to the previous string, in the form of a pair (distance,
length). Distances are limited to 32K bytes, and lengths are limited
to 258 bytes. When a string does not occur anywhere in the previous
32K bytes, it is emitted as a sequence of literal bytes. (In this
description, 'string' must be taken as an arbitrary sequence of bytes,
and is not restricted to printable characters.)

Literals or match lengths are compressed with one Huffman tree, and
match distances are compressed with another tree. The trees are stored
in a compact form at the start of each block. The blocks can have any
size (except that the compressed data for one block must fit in
available memory). A block is terminated when zip determines that it
would be useful to start another block with fresh trees. (This is
somewhat similar to compress.)

Duplicated strings are found using a hash table. All input strings of
length 3 are inserted in the hash table. A hash index is computed for
the next 3 bytes. If the hash chain for this index is not empty, all
strings in the chain are compared with the current input string, and
the longest match is selected.

The hash chains are searched starting with the most recent strings, to
favor small distances and thus take advantage of the Huffman encoding.
The hash chains are singly linked. There are no deletions from the
hash chains; the algorithm simply discards matches that are too old.

To avoid a worst-case situation, very long hash chains are arbitrarily
truncated at a certain length, determined by a runtime option (zip -1
to -9). So zip does not always find the longest possible match but
generally finds a match which is long enough.

zip also defers the selection of matches with a lazy evaluation
mechanism. After a match of length N has been found, zip searches for a
longer match at the next input byte. If a longer match is found, the
previous match is truncated to a length of one (thus producing a single
literal byte) and the longer match is emitted afterwards. Otherwise,
the original match is kept, and the next match search is attempted only
N steps later.

The lazy match evaluation is also subject to a runtime parameter. If
the current match is long enough, zip reduces the search for a longer
match, thus speeding up the whole process. If compression ratio is more
important than speed, zip attempts a complete second search even if
the first match is already long enough.

The lazy match evaluation is not performed for the fastest compression
modes (speed options -1 to -3). For these fast modes, new strings
are inserted in the hash table only when no match was found, or
when the match is not too long. This degrades the compression ratio
but saves time since there are both fewer insertions and fewer searches.
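
To make the hash-chain search concrete, here is a small C sketch of the
scheme described above. It is NOT the gzip source: the hash function,
the names (head, prev_pos, insert_string, longest_match, chain_limit)
and the fixed MAX_INPUT buffer size are invented for the example, and
the whole input is assumed to be in memory at once, so positions are
absolute byte offsets rather than offsets into a sliding 32K window.

    /* Illustrative sketch only; not the actual zip/gzip code. */

    #define MIN_MATCH   3          /* strings of length 3 are hashed      */
    #define MAX_MATCH   258        /* longest match that can be emitted   */
    #define MAX_DIST    32768      /* distances are limited to 32K bytes  */
    #define HASH_SIZE   32768
    #define MAX_INPUT   (1L << 20) /* sketch assumption: whole input fits */

    static long head[HASH_SIZE];     /* most recent position per hash, -1 = empty */
    static long prev_pos[MAX_INPUT]; /* previous position with the same hash      */

    static void init_chains(void)
    {
        int i;
        for (i = 0; i < HASH_SIZE; i++) head[i] = -1;
    }

    static unsigned hash3(const unsigned char *p)
    {
        /* any 3-byte hash works; this one is only an example */
        return ((p[0] << 10) ^ (p[1] << 5) ^ p[2]) % HASH_SIZE;
    }

    /* Link position pos into its hash chain and return the previous head. */
    static long insert_string(const unsigned char *buf, long pos)
    {
        unsigned h = hash3(buf + pos);
        long old = head[h];
        prev_pos[pos] = old;
        head[h] = pos;
        return old;
    }

    /*
     * Walk the chain from the most recent candidate towards older ones,
     * stopping at MAX_DIST or after chain_limit candidates (the -1..-9
     * speed/ratio trade-off described in the text).  Returns the longest
     * match length found, or 0 if it is shorter than MIN_MATCH.
     */
    static long longest_match(const unsigned char *buf, long pos,
                              long input_len, long cand, int chain_limit,
                              long *match_pos)
    {
        long best = 0;
        long max_len = input_len - pos;
        if (max_len > MAX_MATCH) max_len = MAX_MATCH;

        for (; cand >= 0 && pos - cand <= MAX_DIST && chain_limit-- > 0;
             cand = prev_pos[cand]) {
            long len = 0;
            while (len < max_len && buf[cand + len] == buf[pos + len])
                len++;
            if (len > best) { best = len; *match_pos = cand; }
        }
        return best >= MIN_MATCH ? best : 0;
    }

A deflate-style driver would call insert_string for each input position,
pass the returned chain head to longest_match, and then apply the lazy
evaluation rule described above by comparing the match found at the
current position with the one found one byte later.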

2. gzip file format

The pkzip format imposes a lot of overhead in various headers, which
are useful for an archiver but not necessary when only one file is
compressed. gzip uses a much simpler structure. Numbers are in little
endian format, and bit 0 is the least significant bit.

A gzip file is a sequence of compressed members. Each member has the
following structure:

 2 bytes  magic header  0x1f, 0x8b (\037 \213)
 1 byte   compression method (0..7 reserved, 8 = deflate)
 1 byte   flags
             bit 0 set: file probably ascii text
             bit 1 set: continuation of multi-part gzip file
             bit 2 set: extra field present
             bit 3 set: original file name present
             bit 4 set: file comment present
             bit 5 set: file is encrypted
             bit 6,7:   reserved
 4 bytes  file modification time in Unix format
 1 byte   extra flags (depend on compression method)
 1 byte   operating system on which compression took place

 2 bytes  optional part number (second part=1)
 2 bytes  optional extra field length
 ? bytes  optional extra field
 ? bytes  optional original file name, zero terminated
 ? bytes  optional file comment, zero terminated
12 bytes  optional encryption header
 ? bytes  compressed data
 4 bytes  crc32
 4 bytes  uncompressed input size modulo 2^32

The format was designed to allow single-pass compression without any
backwards seek, and without a priori knowledge of the uncompressed
input size or the available size on the output media. If the input does
not come from a regular disk file, the file modification time is set
to the time at which compression started.

The time stamp is useful mainly when one gzip file is transferred over
a network. In this case it would not help to keep the ownership
attributes. In the local case, the ownership attributes are preserved
by gzip when compressing and decompressing the file. A time stamp of
zero is ignored.

Bit 0 in the flags is only an optional indication, which can be set by
a small lookahead in the input data. In case of doubt, the flag is
cleared, indicating binary data. For systems which have different
file formats for ascii text and binary data, the decompressor can
use the flag to choose the appropriate format.

The extra field, if present, must consist of one or more subfields,
each with the following format:

  subfield id   : 2 bytes
  subfield size : 2 bytes (little-endian format)
  subfield data

The subfield id can consist of two letters with some mnemonic value.
Please send any such id to jloup@chorus.fr. Ids with a zero second
byte are reserved for future use. The following ids are defined:

  Ap (0x41, 0x70) : Apollo file type information

The subfield size is the size of the subfield data and does not
include the id and the size itself. The field 'extra field length' is
the total size of the extra field, including subfield ids and sizes.

It must be possible to detect the end of the compressed data with any
compression format, regardless of the actual size of the compressed
data. If the compressed data cannot fit in one file (in particular on
diskettes), each part starts with a header as described above, but
only the last part has the crc32 and uncompressed size. A decompressor
may prompt for additional data for multi-part compressed files. It is
desirable but not mandatory that multiple parts be extractable
independently, so that partial data can be recovered if one of the
parts is damaged. This is possible only if no compression state is
kept from one part to the next; the compression-type dependent flags
can indicate this.

If the file being compressed is on a file system with case-insensitive
names, the original name field must be forced to lower case. There is
no original file name if the data was compressed from standard input.
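
As an illustration of the layout above, here is a small C sketch that
reads and checks one member header from a stdio stream. It is not the
gzip source: the function names and the reduced error handling are
invented for the example, and the multi-part and encryption fields are
only skipped over, not actually handled.

    /* Illustrative header-reading sketch only; not the actual gzip code. */

    #include <stdio.h>

    #define GZ_MAGIC0   0x1f
    #define GZ_MAGIC1   0x8b
    #define GZ_DEFLATE  8

    /* flag bits, in the order given in the table above */
    #define FTEXT    0x01
    #define FCONT    0x02
    #define FEXTRA   0x04
    #define FNAME    0x08
    #define FCOMMENT 0x10
    #define FENCRYPT 0x20

    static long read_le16(FILE *f)           /* numbers are little endian */
    {
        int lo = getc(f), hi = getc(f);
        if (lo == EOF || hi == EOF) return -1;
        return lo | (hi << 8);
    }

    static int skip_zero_terminated(FILE *f) /* file name or comment */
    {
        int c;
        while ((c = getc(f)) != 0)
            if (c == EOF) return -1;
        return 0;
    }

    /* Returns the flag byte on success (so the caller knows about FTEXT
     * etc.), or -1 if this does not look like a deflate-compressed gzip
     * member.  On success the stream is positioned at the compressed data. */
    static int read_gzip_header(FILE *f)
    {
        int flags, i;
        long n;

        if (getc(f) != GZ_MAGIC0 || getc(f) != GZ_MAGIC1) return -1;
        if (getc(f) != GZ_DEFLATE) return -1;     /* 0..7 reserved           */
        flags = getc(f);
        if (flags == EOF || (flags & 0xc0)) return -1;  /* bits 6,7 reserved */

        for (i = 0; i < 4; i++) getc(f);          /* modification time       */
        getc(f);                                  /* extra flags             */
        getc(f);                                  /* operating system        */

        if (flags & FCONT) read_le16(f);          /* part number             */
        if (flags & FEXTRA) {                     /* whole extra field       */
            n = read_le16(f);
            if (n < 0) return -1;
            while (n-- > 0) getc(f);
        }
        if ((flags & FNAME)    && skip_zero_terminated(f) < 0) return -1;
        if ((flags & FCOMMENT) && skip_zero_terminated(f) < 0) return -1;
        if (flags & FENCRYPT)                     /* 12-byte encryption header */
            for (i = 0; i < 12; i++) getc(f);

        return flags;                             /* compressed data follows */
    }

The compressed data, the crc32 and the uncompressed size modulo 2^32
follow at the position where this function returns.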

Compression is always performed, even if the compressed file is
slightly larger than the original. The worst case expansion is
a few bytes for the gzip file header, plus 5 bytes every 32K block,
or an expansion ratio of 0.015% for large files. Note that the actual
number of used disk blocks almost never increases.

The encryption is that of zip 1.9. For the encryption check, the
last byte of the decoded encryption header must be zero. The time
stamp of an encrypted file might be set to zero to avoid giving a clue
about the construction of the random header.

Jean-loup Gailly
jloup@chorus.fr

References:

[LZ77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
Compression", IEEE Transactions on Information Theory, Vol. 23, No. 3,
pp. 337-343.

APPNOTE.TXT documentation file in PKZIP 1.93a. It is available by
ftp in ftp.cso.uiuc.edu:/pc/exec-pc/pkz193a.exe [128.174.5.59]
Use "unzip pkz193a.exe APPNOTE.TXT" to extract.