diff options
Diffstat (limited to 'gnu/usr.bin/gzip/algorithm.doc')
-rw-r--r-- | gnu/usr.bin/gzip/algorithm.doc | 164 |
1 files changed, 0 insertions, 164 deletions
diff --git a/gnu/usr.bin/gzip/algorithm.doc b/gnu/usr.bin/gzip/algorithm.doc deleted file mode 100644 index 52b199172b3..00000000000 --- a/gnu/usr.bin/gzip/algorithm.doc +++ /dev/null @@ -1,164 +0,0 @@ -1. Algorithm - -The deflation algorithm used by zip and gzip is a variation of LZ77 -(Lempel-Ziv 1977, see reference below). It finds duplicated strings in -the input data. The second occurrence of a string is replaced by a -pointer to the previous string, in the form of a pair (distance, -length). Distances are limited to 32K bytes, and lengths are limited -to 258 bytes. When a string does not occur anywhere in the previous -32K bytes, it is emitted as a sequence of literal bytes. (In this -description, 'string' must be taken as an arbitrary sequence of bytes, -and is not restricted to printable characters.) - -Literals or match lengths are compressed with one Huffman tree, and -match distances are compressed with another tree. The trees are stored -in a compact form at the start of each block. The blocks can have any -size (except that the compressed data for one block must fit in -available memory). A block is terminated when zip determines that it -would be useful to start another block with fresh trees. (This is -somewhat similar to compress.) - -Duplicated strings are found using a hash table. All input strings of -length 3 are inserted in the hash table. A hash index is computed for -the next 3 bytes. If the hash chain for this index is not empty, all -strings in the chain are compared with the current input string, and -the longest match is selected. - -The hash chains are searched starting with the most recent strings, to -favor small distances and thus take advantage of the Huffman encoding. -The hash chains are singly linked. There are no deletions from the -hash chains, the algorithm simply discards matches that are too old. - -To avoid a worst-case situation, very long hash chains are arbitrarily -truncated at a certain length, determined by a runtime option (zip -1 -to -9). So zip does not always find the longest possible match but -generally finds a match which is long enough. - -zip also defers the selection of matches with a lazy evaluation -mechanism. After a match of length N has been found, zip searches for a -longer match at the next input byte. If a longer match is found, the -previous match is truncated to a length of one (thus producing a single -literal byte) and the longer match is emitted afterwards. Otherwise, -the original match is kept, and the next match search is attempted only -N steps later. - -The lazy match evaluation is also subject to a runtime parameter. If -the current match is long enough, zip reduces the search for a longer -match, thus speeding up the whole process. If compression ratio is more -important than speed, zip attempts a complete second search even if -the first match is already long enough. - -The lazy match evaluation is not performed for the fastest compression -modes (speed options -1 to -3). For these fast modes, new strings -are inserted in the hash table only when no match was found, or -when the match is not too long. This degrades the compression ratio -but saves time since there are both fewer insertions and fewer searches. - - -2. gzip file format - -The pkzip format imposes a lot of overhead in various headers, which -are useful for an archiver but not necessary when only one file is -compressed. gzip uses a much simpler structure. Numbers are in little -endian format, and bit 0 is the least significant bit. -A gzip file is a sequence of compressed members. Each member has the -following structure: - -2 bytes magic header 0x1f, 0x8b (\037 \213) -1 byte compression method (0..7 reserved, 8 = deflate) -1 byte flags - bit 0 set: file probably ascii text - bit 1 set: continuation of multi-part gzip file - bit 2 set: extra field present - bit 3 set: original file name present - bit 4 set: file comment present - bit 5 set: file is encrypted - bit 6,7: reserved -4 bytes file modification time in Unix format -1 byte extra flags (depend on compression method) -1 byte operating system on which compression took place - -2 bytes optional part number (second part=1) -2 bytes optional extra field length -? bytes optional extra field -? bytes optional original file name, zero terminated -? bytes optional file comment, zero terminated -12 bytes optional encryption header -? bytes compressed data -4 bytes crc32 -4 bytes uncompressed input size modulo 2^32 - -The format was designed to allow single pass compression without any -backwards seek, and without a priori knowledge of the uncompressed -input size or the available size on the output media. If input does -not come from a regular disk file, the file modification time is set -to the time at which compression started. - -The time stamp is useful mainly when one gzip file is transferred over -a network. In this case it would not help to keep ownership -attributes. In the local case, the ownership attributes are preserved -by gzip when compressing/decompressing the file. A time stamp of zero -is ignored. - -Bit 0 in the flags is only an optional indication, which can be set by -a small lookahead in the input data. In case of doubt, the flag is -cleared indicating binary data. For systems which have different -file formats for ascii text and binary data, the decompressor can -use the flag to choose the appropriate format. - -The extra field, if present, must consist of one or more subfields, -each with the following format: - - subfield id : 2 bytes - subfield size : 2 bytes (little-endian format) - subfield data - -The subfield id can consist of two letters with some mnemonic value. -Please send any such id to <gzip@gnu.org>. Ids with a zero second -byte are reserved for future use. The following ids are defined: - - Ap (0x41, 0x70) : Apollo file type information - -The subfield size is the size of the subfield data and does not -include the id and the size itself. The field 'extra field length' is -the total size of the extra field, including subfield ids and sizes. - -It must be possible to detect the end of the compressed data with any -compression format, regardless of the actual size of the compressed -data. If the compressed data cannot fit in one file (in particular for -diskettes), each part starts with a header as described above, but -only the last part has the crc32 and uncompressed size. A decompressor -may prompt for additional data for multipart compressed files. It is -desirable but not mandatory that multiple parts be extractable -independently so that partial data can be recovered if one of the -parts is damaged. This is possible only if no compression state is -kept from one part to the other. The compression-type dependent flags -can indicate this. - -If the file being compressed is on a file system with case insensitive -names, the original name field must be forced to lower case. There is -no original file name if the data was compressed from standard input. - -Compression is always performed, even if the compressed file is -slightly larger than the original. The worst case expansion is -a few bytes for the gzip file header, plus 5 bytes every 32K block, -or an expansion ratio of 0.015% for large files. Note that the actual -number of used disk blocks almost never increases. - -The encryption is that of zip 1.9. For the encryption check, the -last byte of the decoded encryption header must be zero. The time -stamp of an encrypted file might be set to zero to avoid giving a clue -about the construction of the random header. - -Jean-loup Gailly -gzip@gnu.org - -References: - -[LZ77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data -Compression", IEEE Transactions on Information Theory", Vol. 23, No. 3, -pp. 337-343. - -APPNOTE.TXT documentation file in PKZIP 1.93a. It is available by -ftp in ftp.cso.uiuc.edu:/pc/exec-pc/pkz193a.exe [128.174.5.59] -Use "unzip pkz193a.exe APPNOTE.TXT" to extract. |