Day 15: How a .gz file is structured, redux
So I’ve finished implementing gunzip in Julia! I paired with a lot of wonderful people along the way, so I ended up trying to explain how the file format works a lot of times, probably often in a fairly confused way (sorry!). Here is another disorganized attempt at explaining gzip, mainly for my own benefit.
The basic idea behind gzip (aka the DEFLATE algorithm) is that there are 2 passes:
- LL87 pass: Replace repeated sequences with pointers to where they appeared previously in the text. There is a good diagram of this at the beginning of this page
- Huffman coding pass: Use Huffman coding to represent the text and pointers in an efficient way
There is a pretty good explanation of what Huffman coding is on the aforementioned page, so I’m just going to go ahead and talk about the gzip file format. Probably this will make no sense if you have not been absorbed in gzip for the last 5 days.
That said, here is how a gzip file is laid out, and some of the code I wrote to decode it. The github repository is here: github.com/jvns/gzip.jl
Gzip headers and metadata (20 bytes or so)
This part of the file isn’t that interesting. Every gzip file has to start
with the magic bits 1F86
. There is a 9 byte header, followed by some
variable length metadata, which includes the filename and some other stuff.
I am not sure why the filename is included in the metadata!
Apparently gunzip also knows how to deflate some other legacy compression
format, but for our purposes compression_method
is always 8
.
Here are the Julia data structures I used to store the header & metadata:
type GzipHeader
id::Vector{Uint8} # length 2
compression_method::Uint8
flags::GzipFlags
mtime::Vector{Uint8} # length 4
extra_flags::Uint8
os::Uint8
end
type GzipMetadata
header::GzipHeader
xlen::Uint16
extra::ASCIIString
fname::ASCIIString
fcomment::ASCIIString
crc16::Uint16
end
Blocks!
The rest of the gzip file after the headers and metadata is a series of blocks, to be read in order. As far as I can tell it is impossible to decompress gzip files in parallel – there’s no way to know when a block has ended without fully decompressing it.
Block header (3 bits)
Each block starts with 3 bits indicating
- Whether this block is the last block (1 bit)
- How the block is compressed (2 bits). My gunzip only supports one compression method, though one of the options is “not compressed”, so that would be easy to add.
Header for the Huffman codes (14 bits)
There is a recursive thing going on where there is a Huffman tree encoded with another Huffman tree. This header is the metadata that helps you get started with reading all these trees.
hclen
is the number of codes in the first tree (minus four)hlit
: the number of ’literal codes’ in the second treehdist
: the number of ‘distance codes’ (minus one) in the second tree
This minus four and minus one business basically drove me insane. But they are easy to represent at least:
type HuffmanHeader
hlit::Uint8
hdist::Uint8
hclen::Uint8
end
Code lengths for first Huffman tree ((hclen + 4) * 3
bits)
Next, there is a series of numbers from 0 to 7 which you can use to put
together a Huffman tree. I read these in read_first_tree()
.
Code lengths for second Huffman tree (unknown number of bits)
The second Huffman tree’s code lengths are encoded using the first Huffman tree.
So you have to read (258 + hlit + hdist)
code lengths, using the first
Huffman tree as your guide. I read these in read_second_tree_codes()
You actually end up with two Huffman trees here – they’re called
literal_tree
and distance_tree
. literal_tree
is made from the first
257 + hlit
codes, and distance_tree
is made from the next hdist + 1
codes. This was really not obvious and took me forever to figure out.
The compressed data! (unknown number of bits)
To read the compressed data, you have to
- Read a code
- If it’s the stop code, stop!
- If it represents a literal character (like ‘a’ or ‘2’), just add it to the decoded text
- If it instead represents a pointer to some previous text, figure out the length and the relative distance and copy the text.
Here is the actual code I wrote to do this!
function inflate_block!(decoded_text, bs::BitStream, literal_tree::HuffmanTree, distance_tree::HuffmanTree)
while true
# Read a code from the file
code = read_huffman_bits(bs, literal_tree)
if code == 256 # Stop code; end of block!
break
end
if code <= 255 # ASCII character
append!(decoded_text, [convert(Uint8, code)])
else # Pointer to previous text
len = read_length_code(bs, code)
distance = read_distance_code(bs, distance_tree)
# Copy the previous text into our decoded text
copy_text!(decoded_text, distance, len)
end
end
return decoded_text
end
Main function for reading a block!
And here’s the main block-reading function! It does a lot of things, but it came out in (I thought) a fairly readable way.
It modifies decoded_text
in place because it turns out that blocks can refer
to text in previous blocks. So there needs to be some shared state.
function inflate_block!(decoded_text, bs::BitStream)
head = read(bs, HuffmanHeader)
# Read the first Huffman tree
first_tree = read_first_tree(bs, head.hclen)
# Read the codes for the second Huffman tree
codes = read_second_tree_codes(bs, head, first_tree)
# Put together the tree of literals (0 - 255), stop code (256), and length codes (257-285ish)
literal_codes = codes[1:257 + head.hlit]
lit_code_table = create_code_table(literal_codes, [0:length(literal_codes)-1])
literal_tree = create_huffman_tree(lit_code_table)
# Put together the tree of distance codes (0-17)
distance_codes = codes[end-head.hdist:end]
dist_code_table = create_code_table(distance_codes, [0:length(distance_codes)-1])
distance_tree = create_huffman_tree(dist_code_table)
return inflate_block!(decoded_text, bs, literal_tree, distance_tree)
end