LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
Go to file
m3b 06a191b8de fix problems in LevelDB's caching code
Background:

LevelDB uses a cache (util/cache.h, util/cache.cc) of (key,value)
pairs for two purposes:
- a cache of (table, file handle) pairs
- a cache of blocks

The cache places the (key,value) pairs in a reference-counted
wrapper.  When it returns a value, it returns a reference to this
wrapper.  When the client has finished using the reference and
its enclosed (key,value), it calls Release() to decrement the
reference count.

Each (key,value) pair has an associated resource usage.  The
cache maintains the sum of the usages of the elements it holds,
and removes values as needed to keep the sum below a capacity
threshold.  It maintains an LRU list so that it will remove the
least-recently used elements first.

The max_open_files option to LevelDB sets the size of the cache
of (table, file handle) pairs.  The option is not used in any
other way.

The observed behaviour:

If LevelDB at any time used more file handles concurrently than
the cache size set via max_open_files, it attempted to reduce the
number by evicting entries from the table cache.  This could
happen most easily during compaction, and if max_open_files was
low.  Because the handles were in use, their reference count did
not drop to zero, and so the usage sum in the cache was not
modified by the evictions.  Subsequent Insert() calls returned
valid handles, but their entries were immediately evicted from
the cache, which though empty still acted as though full.  As a
result, there was effectively no caching, and the number of open
file handles rose []ly until it hit system-imposed limits and
the process died.

If one set max_open_files lower, the cache was more likely to
exhibit this beahviour, and cause the process to run out of file
descriptors.  That is, max_open_files acted in almost exactly the
opposite manner from what was intended.

The problems:

1. The cache kept all elements on its LRU list eligible for capacity
   eviction---even those with outstanding references from clients.  This was
   ineffective in reducing resource consumption because there was an
   outstanding reference, guaranteeing that the items remained.  A secondary
   issue was that there is no guarantee that these in-use items will be the
   last things reached in the LRU chain, which actually recorded
   "least-recently requested" rather than "least-recently used".

2. The sum of usages was decremented not when a (key,value) was evicted from
   the cache, but when its reference count went to zero.  Thus, when things
   were removed from the cache, either by garbage collection or via Erase(),
   the usage sum was not necessarily decreased.  This allowed the cache to act
   as though full when it was in fact not, reducing caching effectiveness, and
   leading to more resources being consumed---the opposite of what the
   evictions were intended to achieve.

3. (minor) The cache's clients insert items into it by first looking up the
   key, and inserting only if no value is found.  Although the cache has an
   internal lock, the clients use no locking to ensure atomicity of the
   Lookup/Insert pair.  (see table/table.cc:  block_cache->Insert() and
   db/table_cache.cc:  cache_->Insert()).  Thus, if two threads Insert() at
   about the same time, they can both Lookup(), find nothing, and both
   Insert().  The second Insert() would evict the first value, leaving each
   thread with a handle on its own version of the data, and with the second
   version in the cache.  It would be better if both threads ended up with a
   handle on the same (key,value) pair, which implies it must be the first item
   inserted.  This suggests that Insert() should not replace an existing value.

   This can be made safe with current usage inside LeveDB itself, but this is
   not easy to change first because Cache is a public interface, so to change
   the semantics of an existing call might break things, second because Cache
   is an abstract virtual class, so adding a new abstract virtual method may
   break other implementations, and third, the new method "insert without
   replacing" cannot be implemented in terms of the existing methods, so cannot
   be implemented with a non-abstract default.   But fortunately, the effects
   of this issue are minor, so this issue is not fixed by this change.

The changes:

The assumption in the fixes is that it is always better to cache
entries unless removal from the cache would lead to deallocation.

Cache entries now have an "in_cache" boolean indicating whether
the cache has a reference on the entry.  The only ways that this can
become false without the entry being passed to its "deleter" are via
Erase(), via Insert() when an element with a duplicate key is inserted,
or on destruction of the cache.

The cache now keeps two linked lists instead of one.  All items
in the cache are in one list or the other, and never both.  Items
still referenced by clients but erased from the cache are in
neither list.  The lists are:
- in-use:  contains the items currently referenced by clients, in no particular
  order.  (This list is used for invariant checking.  If we removed the check,
  elements that would otherwise be on this list could be left as disconnected
  singleton lists.)
- LRU:  contains the items not currently referenced by clients, in LRU order

A new internal Ref() method increments the reference count.  If
incrementing from 1 to 2 for an item in the cache, it is moved
from the LRU list to the in-use list.

The Unref() call now moves things from the in-use list to the LRU
list if the reference count falls to 1, and the item is in the
cache.  It no longer adjusts the usage sum.  The usage sum now
reflects only what is in the cache, rather than including
still-referenced items that have been evicted.

The LRU_Append() now takes a "list" parameter so that it can be
used to append either to the LRU list or the in-use list.

Lookup() is modified to use the new Ref() call, rather than
adjusting the reference count and LRU chain directly.

Insert() eviction code is also modified to adjust the usage sum and the
in_cache boolean of the evicted elements.  Some LevelDB tests assume that there
will be no caching whatsoever if the cache size is set to zero, so this is
handled as a special case.

A new private method FinishErase() is factored out
with the common code from where items are removed from the cache.

Erase() is modified to adjust the usage sum and the in_cache
boolean of the erased elements, and to use FinishErase().

Prune() is modified to use FinishErase() also, and to make use of the fact that
the lru_ list now contains only items with reference count 1.

- EvictionPolicy is modified to test that an entry with an
outstanding handle is not evicted.  This test fails with the old cache.cc.

- A new test case UseExceedsCacheSize verifies that even when the
cache is overfull of entries with outstanding handles, none are
evicted.  This test fails with the old cache.cc, and is the key
issue that causes file descriptors to run out when the cache
size is set too small.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=123247237
2016-07-06 09:15:53 -07:00
db Change std::uint64_t to uint64_t (#354) 2016-04-12 15:38:09 -07:00
doc include <assert> -> <cassert> 2015-12-09 10:34:57 -08:00
helpers/memenv LevelDB now attempts to reuse the preceding MANIFEST and log file when re-opened. 2015-08-11 14:56:39 -07:00
include/leveldb Add "approximate-memory-usage" property to leveldb::DB::GetProperty 2015-12-09 10:34:58 -08:00
issues Release LevelDB 1.14 2013-09-19 13:49:19 -07:00
port Merge pull request #272 from vapier/master 2016-01-12 10:47:33 -08:00
table Fix LevelDB build when asserts are enabled in release builds. (#367) 2016-04-15 10:58:27 -07:00
util fix problems in LevelDB's caching code 2016-07-06 09:15:53 -07:00
.gitignore Release leveldb 1.10 2013-05-14 17:03:07 -07:00
.travis.yml Added a Travis CI build file. 2016-01-14 17:41:48 -08:00
AUTHORS Release LevelDB 1.14 2013-09-19 13:49:19 -07:00
build_detect_platform Putting build artifacts in subdirectory. 2016-01-29 16:10:00 -08:00
CONTRIBUTING.md Release 1.18 2014-09-16 14:19:52 -07:00
LICENSE reverting disastrous MOE commit, returning to r21 2011-04-19 23:11:15 +00:00
Makefile Putting build artifacts in subdirectory. 2016-01-29 16:10:00 -08:00
NEWS sync with upstream @ 21409451 2011-05-21 02:17:43 +00:00
README.md add travis build badge 2016-01-15 18:29:01 +01:00
TODO Update to leveldb 1.6 2012-10-12 11:53:12 -07:00

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

Build Status

Authors: Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com)

Features

  • Keys and values are arbitrary byte arrays.
  • Data is stored sorted by key.
  • Callers can provide a custom comparison function to override the sort order.
  • The basic operations are Put(key,value), Get(key), Delete(key).
  • Multiple changes can be made in one atomic batch.
  • Users can create a transient snapshot to get a consistent view of data.
  • Forward and backward iteration is supported over the data.
  • Data is automatically compressed using the Snappy compression library.
  • External activity (file system operations etc.) is relayed through a virtual interface so users can customize the operating system interactions.

Documentation

LevelDB library documentation is online and bundled with the source code.

Limitations

  • This is not a SQL database. It does not have a relational data model, it does not support SQL queries, and it has no support for indexes.
  • Only a single process (possibly multi-threaded) can access a particular database at a time.
  • There is no client-server support builtin to the library. An application that needs such support will have to wrap their own server around the library.

Contributing to the leveldb Project

The leveldb project welcomes contributions. leveldb's primary goal is to be a reliable and fast key/value store. Changes that are in line with the features/limitations outlined above, and meet the requirements below, will be considered.

Contribution requirements:

  1. POSIX only. We generally will only accept changes that are both compiled, and tested on a POSIX platform - usually Linux. Very small changes will sometimes be accepted, but consider that more of an exception than the rule.

  2. Stable API. We strive very hard to maintain a stable API. Changes that require changes for projects using leveldb might be rejected without sufficient benefit to the project.

  3. Tests: All changes must be accompanied by a new (or changed) test, or a sufficient explanation as to why a new (or changed) test is not required.

Submitting a Pull Request

Before any pull request will be accepted the author must first sign a Contributor License Agreement (CLA) at https://cla.developers.google.com/.

In order to keep the commit timeline linear squash your changes down to a single commit and rebase on google/leveldb/master. This keeps the commit timeline linear and more easily sync'ed with the internal repository at Google. More information at GitHub's About Git rebase page.

Performance

Here is a performance report (with explanations) from the run of the included db_bench program. The results are somewhat noisy, but should be enough to get a ballpark performance estimate.

Setup

We use a database with a million entries. Each entry has a 16 byte key, and a 100 byte value. Values used by the benchmark compress to about half their original size.

LevelDB:    version 1.1
Date:       Sun May  1 12:11:26 2011
CPU:        4 x Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
CPUCache:   4096 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Raw Size:   110.6 MB (estimated)
File Size:  62.9 MB (estimated)

Write performance

The "fill" benchmarks create a brand new database, in either sequential, or random order. The "fillsync" benchmark flushes data from the operating system to the disk after every operation; the other write operations leave the data sitting in the operating system buffer cache for a while. The "overwrite" benchmark does random writes that update existing keys in the database.

fillseq      :       1.765 micros/op;   62.7 MB/s
fillsync     :     268.409 micros/op;    0.4 MB/s (10000 ops)
fillrandom   :       2.460 micros/op;   45.0 MB/s
overwrite    :       2.380 micros/op;   46.5 MB/s

Each "op" above corresponds to a write of a single key/value pair. I.e., a random write benchmark goes at approximately 400,000 writes per second.

Each "fillsync" operation costs much less (0.3 millisecond) than a disk seek (typically 10 milliseconds). We suspect that this is because the hard disk itself is buffering the update in its memory and responding before the data has been written to the platter. This may or may not be safe based on whether or not the hard disk has enough power to save its memory in the event of a power failure.

Read performance

We list the performance of reading sequentially in both the forward and reverse direction, and also the performance of a random lookup. Note that the database created by the benchmark is quite small. Therefore the report characterizes the performance of leveldb when the working set fits in memory. The cost of reading a piece of data that is not present in the operating system buffer cache will be dominated by the one or two disk seeks needed to fetch the data from disk. Write performance will be mostly unaffected by whether or not the working set fits in memory.

readrandom   :      16.677 micros/op;  (approximately 60,000 reads per second)
readseq      :       0.476 micros/op;  232.3 MB/s
readreverse  :       0.724 micros/op;  152.9 MB/s

LevelDB compacts its underlying storage data in the background to improve read performance. The results listed above were done immediately after a lot of random writes. The results after compactions (which are usually triggered automatically) are better.

readrandom   :      11.602 micros/op;  (approximately 85,000 reads per second)
readseq      :       0.423 micros/op;  261.8 MB/s
readreverse  :       0.663 micros/op;  166.9 MB/s

Some of the high cost of reads comes from repeated decompression of blocks read from disk. If we supply enough cache to the leveldb so it can hold the uncompressed blocks in memory, the read performance improves again:

readrandom   :       9.775 micros/op;  (approximately 100,000 reads per second before compaction)
readrandom   :       5.215 micros/op;  (approximately 190,000 reads per second after compaction)

Repository contents

See doc/index.html for more explanation. See doc/impl.html for a brief overview of the implementation.

The public interface is in include/*.h. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Guide to header files:

  • include/db.h: Main interface to the DB: Start here

  • include/options.h: Control over the behavior of an entire database, and also control over the behavior of individual reads and writes.

  • include/comparator.h: Abstraction for user-specified comparison function. If you want just bytewise comparison of keys, you can use the default comparator, but clients can write their own comparator implementations if they want custom ordering (e.g. to handle different character encodings, etc.)

  • include/iterator.h: Interface for iterating over data. You can get an iterator from a DB object.

  • include/write_batch.h: Interface for atomically applying multiple updates to a database.

  • include/slice.h: A simple module for maintaining a pointer and a length into some other byte array.

  • include/status.h: Status is returned from many of the public interfaces and is used to report success and various kinds of errors.

  • include/env.h: Abstraction of the OS environment. A posix implementation of this interface is in util/env_posix.cc

  • include/table.h, include/table_builder.h: Lower-level modules that most clients probably won't use directly