76 lines
2.9 KiB
Markdown
76 lines
2.9 KiB
Markdown
|
leveldb Log format
|
||
|
==================
|
||
|
The log file contents are a sequence of 32KB blocks. The only exception is that
|
||
|
the tail of the file may contain a partial block.
|
||
|
|
||
|
Each block consists of a sequence of records:
|
||
|
|
||
|
block := record* trailer?
|
||
|
record :=
|
||
|
checksum: uint32 // crc32c of type and data[] ; little-endian
|
||
|
length: uint16 // little-endian
|
||
|
type: uint8 // One of FULL, FIRST, MIDDLE, LAST
|
||
|
data: uint8[length]
|
||
|
|
||
|
A record never starts within the last six bytes of a block (since it won't fit).
|
||
|
Any leftover bytes here form the trailer, which must consist entirely of zero
|
||
|
bytes and must be skipped by readers.
|
||
|
|
||
|
Aside: if exactly seven bytes are left in the current block, and a new non-zero
|
||
|
length record is added, the writer must emit a FIRST record (which contains zero
|
||
|
bytes of user data) to fill up the trailing seven bytes of the block and then
|
||
|
emit all of the user data in subsequent blocks.
|
||
|
|
||
|
More types may be added in the future. Some Readers may skip record types they
|
||
|
do not understand, others may report that some data was skipped.
|
||
|
|
||
|
FULL == 1
|
||
|
FIRST == 2
|
||
|
MIDDLE == 3
|
||
|
LAST == 4
|
||
|
|
||
|
The FULL record contains the contents of an entire user record.
|
||
|
|
||
|
FIRST, MIDDLE, LAST are types used for user records that have been split into
|
||
|
multiple fragments (typically because of block boundaries). FIRST is the type
|
||
|
of the first fragment of a user record, LAST is the type of the last fragment of
|
||
|
a user record, and MIDDLE is the type of all interior fragments of a user
|
||
|
record.
|
||
|
|
||
|
Example: consider a sequence of user records:
|
||
|
|
||
|
A: length 1000
|
||
|
B: length 97270
|
||
|
C: length 8000
|
||
|
|
||
|
**A** will be stored as a FULL record in the first block.
|
||
|
|
||
|
**B** will be split into three fragments: first fragment occupies the rest of
|
||
|
the first block, second fragment occupies the entirety of the second block, and
|
||
|
the third fragment occupies a prefix of the third block. This will leave six
|
||
|
bytes free in the third block, which will be left empty as the trailer.
|
||
|
|
||
|
**C** will be stored as a FULL record in the fourth block.
|
||
|
|
||
|
----
|
||
|
|
||
|
## Some benefits over the recordio format:
|
||
|
|
||
|
1. We do not need any heuristics for resyncing - just go to next block boundary
|
||
|
and scan. If there is a corruption, skip to the next block. As a
|
||
|
side-benefit, we do not get confused when part of the contents of one log
|
||
|
file are embedded as a record inside another log file.
|
||
|
|
||
|
2. Splitting at approximate boundaries (e.g., for mapreduce) is simple: find the
|
||
|
next block boundary and skip records until we hit a FULL or FIRST record.
|
||
|
|
||
|
3. We do not need extra buffering for large records.
|
||
|
|
||
|
## Some downsides compared to recordio format:
|
||
|
|
||
|
1. No packing of tiny records. This could be fixed by adding a new record type,
|
||
|
so it is a shortcoming of the current implementation, not necessarily the
|
||
|
format.
|
||
|
|
||
|
2. No compression. Again, this could be fixed by adding new record types.
|