Update readme.md

2024-12-27 13:33:18 +08:00 · 2019-06-21 09:02:42 -07:00 · 6208e51415
commit 6208e51415
parent 69efa50a0d
1 changed file with 11 additions and 251 deletions
--- a/readme.md
+++ b/readme.md
@@ -367,257 +367,6 @@ how the design of _tbb_ avoids the false cache line sharing.



-<!--
-
-## Tested Allocators
-
-We tested _mimalloc_ with 9 leading allocators over 12 benchmarks
-and the SpecMark benchmarks. The tested allocators are:
-
- mi: The _mimalloc_ allocator, using version tag `v1.0.0`.
-  We also test a secure version of _mimalloc_ as smi which uses
-  the techniques described in Section [#sec-secure].
- tc: The [_tcmalloc_](https://github.com/gperftools/gperftools)
-  allocator which comes as part of
-  the Google performance tools and is used in the Chrome browser.
-  Installed as package `libgoogle-perftools-dev` version
-  `2.5-2.2ubuntu3`.
- je: The [_jemalloc_](https://github.com/jemalloc/jemalloc)
-  allocator by Jason Evans is developed at Facebook
-  and widely used in practice, for example in FreeBSD and Firefox.
-  Using version tag 5.2.0.
- sn: The [_snmalloc_](https://github.com/microsoft/snmalloc) allocator
-  is a recent concurrent message passing
-  allocator by Liétar et al. \[8]. Using `git-0b64536b`.
- rp: The [_rpmalloc_](https://github.com/rampantpixels/rpmalloc) allocator
-   uses 32-byte aligned allocations and is developed by Mattias Jansson at Rampant Pixels.
-   Using version tag 1.3.1.
- hd: The [_Hoard_](https://github.com/emeryberger/Hoard) allocator by
-  Emery Berger \[1]. This is one of the first
-  multi-thread scalable allocators. Using version tag 3.13.
- glibc: The system allocator. Here we use the _glibc_ allocator (which is originally based on
-  _Ptmalloc2_), using version 2.27.0. Note that version 2.26 significantly improved scalability over
-  earlier versions.
- sm: The [_Supermalloc_](https://github.com/kuszmaul/SuperMalloc) allocator by
-  Bradley Kuszmaul uses hardware transactional memory
-  to speed up parallel operations. Using version `git-709663fb`.
- tbb: The Intel [TBB](https://github.com/intel/tbb) allocator that comes with
-  the Threading Building Blocks (TBB) library \[7].
-  Installed as package `libtbb-dev`, version `2017~U7-8`.
-
-All allocators run exactly the same benchmark programs on Ubuntu 18.04.1
-and use `LD_PRELOAD` to override the default allocator. The wall-clock
-elapsed time and peak resident memory (_rss_) are measured with the
-`time` program. The average scores over 5 runs are used. Performance is
-reported relative to _mimalloc_, e.g. a time of 1.5&times; means that
-the program took 1.5&times; longer than _mimalloc_.
-
-[_snmalloc_]: https://github.com/Microsoft/snmalloc
-[_rpmalloc_]: https://github.com/rampantpixels/rpmalloc
-
-
-## Benchmarks
-
-The first set of benchmarks are real world programs and consist of:
-
- __cfrac__: by Dave Barrett, implementation of continued fraction factorization which
-  uses many small short-lived allocations -- exactly the workload
-  we are targeting for Koka and Lean.   
- __espresso__: a programmable logic array analyzer, described by
-  Grunwald, Zorn, and Henderson \[3] in the context of cache-aware memory allocation.
- __barnes__: a hierarchical n-body particle solver \[4] which uses relatively few
-  allocations compared to `cfrac` and `espresso`. Simulates the gravitational forces
-  between 163840 particles.
- __leanN__:  The [Lean](https://github.com/leanprover/lean) compiler by
-  de Moura _et al_, version 3.4.1,
-  compiling its own standard library concurrently using N threads
-  (`./lean --make -j N`). Big real-world workload with intensive
-  allocation.
- __redis__: running the [redis](https://redis.io/) 5.0.3 server on
-  1 million requests pushing 10 new list elements and then requesting the
-  head 10 elements. Measures the requests handled per second.
- __larsonN__: by Larson and Krishnan \[2]. Simulates a server workload using 100 separate
-   threads which each allocate and free many objects but leave some
-   objects to be freed by other threads. Larson and Krishnan observe this
-   behavior (which they call _bleeding_) in actual server applications,
-   and the benchmark simulates this.
-
-The second set of benchmarks are stress tests and consist of:
-
- __alloc-test__: a modern allocator test developed by
-  OLogN Technologies AG ([ITHare.com](http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/)).
-  Simulates intensive allocation workloads with a Pareto size
-  distribution. The _alloc-testN_ benchmark runs on N cores doing
-  100&middot;10^6^ allocations per thread with objects up to 1KiB
-  in size. Using commit `94f6cb`
-  ([master](https://github.com/node-dot-cpp/alloc-test), 2018-07-04)
- __sh6bench__: by [MicroQuill](http://www.microquill.com/) as part of SmartHeap. Stress test
-   where some of the objects are freed in a
-   usual last-allocated, first-freed (LIFO) order, but others are freed
-   in reverse order. Using the
-   public [source](http://www.microquill.com/smartheap/shbench/bench.zip)
-   (retrieved 2019-01-02)
- __sh8benchN__: by [MicroQuill](http://www.microquill.com/) as part of SmartHeap. Stress test for
-  multi-threaded allocation (with N threads) where, just as in _larson_,
-  some objects are freed by other threads, and some objects freed in
-  reverse (as in _sh6bench_). Using the
-  public [source](http://www.microquill.com/smartheap/SH8BENCH.zip)
-  (retrieved 2019-01-02)
- __xmalloc-testN__: by Lever and Boreham \[5] and Christian Eder. We use the updated
-  version from the SuperMalloc repository. This is a more
-  extreme version of the _larson_ benchmark with 100 purely allocating threads,
-  and 100 purely deallocating threads with objects of various sizes migrating
-  between them. This asymmetric producer/consumer pattern is usually difficult
-  to handle by allocators with thread-local caches.
- __cache-scratch__: by Emery Berger \[1]. Introduced with the Hoard
-  allocator to test for _passive-false_ sharing of cache lines: first
-  some small objects are allocated and given to each thread; the threads
-  free that object and immediately allocate another one, and access that
-  repeatedly. If an allocator allocates objects from different threads
-  close to each other this will lead to cache-line contention.
-
-
-## On a 16-core AMD EPYC running Linux
-
-Testing on a big Amazon EC2 instance ([r5a.4xlarge](https://aws.amazon.com/ec2/instance-types/))
-consisting of a 16-core AMD EPYC 7000 at 2.5GHz
-with 128GB ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
-We excluded SuperMalloc here as it uses transactional memory instructions
-that are usually not supported in a virtualized environment.
-
-![bench-r5a-1](doc/bench-r5a-1.svg)
-![bench-r5a-2](doc/bench-r5a-2.svg)
-
-Memory usage:
-
-![bench-r5a-rss-1](doc/bench-r5a-rss-1.svg)
-![bench-r5a-rss-2](doc/bench-r5a-rss-2.svg)
-
-(note: the _xmalloc-testN_ memory usage should be disregarded as it
-allocates more the faster the program runs).
-
-In the first five benchmarks we can see _mimalloc_ outperforms the other
-allocators moderately, but we also see that all these modern allocators
-perform well -- the times of large performance differences in regular
-workloads are over. In
-_cfrac_ and _espresso_, _mimalloc_ is a tad faster than _tcmalloc_ and
-_jemalloc_, but a solid 10\% faster than all other allocators on
-_espresso_. The _tbb_ allocator does not do so well here and lags more than
-20\% behind _mimalloc_. The _cfrac_ and _espresso_ programs do not use much
-memory (~1.5MB) so it does not matter too much, but still _mimalloc_ uses
-about half the resident memory of _tcmalloc_.
-
-The _leanN_ program is most interesting as a large realistic and
-concurrent workload and there is an 8% speedup over _tcmalloc_. This is
-quite significant: if Lean spends 20% of its time in the
-allocator that means that _mimalloc_ is 1.3&times; faster than _tcmalloc_
-here. This is surprising as that is *not* measured in a pure
-allocation benchmark like _alloc-test_. We conjecture that we see this
-outsized improvement here because _mimalloc_ has better locality in
-the allocation which improves performance for the *other* computations
-in a program as well.
-
-The _redis_ benchmark shows more differences between the allocators where
-_mimalloc_ is 14\% faster than _jemalloc_. On this benchmark _tbb_ (and _Hoard_) do
-not do well and are over 40\% slower.
-
-The _larson_ server workload which allocates and frees objects between
-many threads shows even larger differences, where _mimalloc_ is more than
-2.5&times; faster than _tcmalloc_ and _jemalloc_ which is quite surprising
-for these battle tested allocators -- probably due to the object
-migration between different threads. This is a difficult benchmark for
-other allocators too where _mimalloc_ is still 48% faster than the next
-fastest (_snmalloc_).
-
-
-The second benchmark set tests specific aspects of the allocators and
-shows even more extreme differences between them.
-
-The _alloc-test_ is very allocation intensive doing millions of
-allocations in various size classes. The test is scaled such that when an
-allocator performs almost identically on _alloc-test1_ as _alloc-testN_ it
-means that it scales linearly. Here, _tcmalloc_, _snmalloc_, and
-_Hoard_ seem to scale less well and do more than 10% worse on the
-multi-core version. Even the best allocators (_tcmalloc_ and _jemalloc_) are
-more than 10% slower than _mimalloc_ here.
-
-Also in _sh6bench_ _mimalloc_ does much
-better than the others (more than 2&times; faster than _jemalloc_).
-We cannot explain this well but believe it is
-caused in part by the "reverse" freeing pattern in _sh6bench_.
-
-Again in _sh8bench_ the _mimalloc_ allocator handles object migration
-between threads much better and is over 36% faster than the next best
-allocator, _snmalloc_. Whereas _tcmalloc_ did well on _sh6bench_, the
-addition of object migration caused it to be almost 3 times slower
-than before.
-
-The _xmalloc-testN_ benchmark simulates an asymmetric workload where
-some threads only allocate, and others only free. The _snmalloc_
-allocator was especially developed to handle this case well as it
-often occurs in concurrent message passing systems. Here we see that
-the _mimalloc_ technique of having non-contended sharded thread free
-lists pays off and it even outperforms _snmalloc_. Only _jemalloc_
-also handles this reasonably well, while the others underperform by
-a large margin. The optimization on _mimalloc_ to do a *delayed free*
-only once for full pages is quite important -- without it _mimalloc_
-is almost twice as slow (as then all frees contend again on the
-single heap delayed free list).
-
-
-The _cache-scratch_ benchmark also demonstrates the different
-architectures of the allocators nicely. With a single thread they all
-perform the same, but when running with multiple threads the
-allocator-induced false sharing of the cache lines causes large run-time
-differences, where _mimalloc_ is more than 18&times; faster than _jemalloc_ and
-_tcmalloc_! Crundal \[6] describes in detail why the false cache line
-sharing occurs in the _tcmalloc_ design, and also discusses how this
-can be avoided with some small implementation changes.
-Only _snmalloc_ and _tbb_ also avoid the
-cache line sharing like _mimalloc_. Kukanov and Voss \[7] describe in detail
-how the design of _tbb_ avoids the false cache line sharing.
-The _Hoard_ allocator is also specifically
-designed to avoid this false sharing and we are not sure why it is not
-doing well here (although it still runs 5&times; as fast as _tcmalloc_).
-
-
-
-## On a 4-core Intel Xeon workstation
-
-Below are the benchmark results on an HP
-Z4-G4 workstation with a 4-core Intel® Xeon® W2123 at 3.6 GHz with 16GB
-ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
-
-![bench-z4-1](doc/bench-z4-1.svg)
-![bench-z4-2](doc/bench-z4-2.svg)
-
-Memory usage:
-
-![bench-z4-rss-1](doc/bench-z4-rss-1.svg)
-![bench-z4-rss-2](doc/bench-z4-rss-2.svg)
-
-(note: the _xmalloc-testN_ memory usage should be disregarded as it
-allocates more the faster the program runs).
-
-This time SuperMalloc (_sm_) is included as this platform supports
-hardware transactional memory. Unfortunately,
-there are no entries for _SuperMalloc_ in the _leanN_ and _xmalloc-testN_ benchmarks
-as it faulted on those. We also added the secure version of
-_mimalloc_ as smi.
-
-Overall, the relative results are quite similar to those on the
-16-core AMD machine. Most allocators fare better on the _larsonN_
-benchmark now -- either due to architectural differences (AMD vs.
-Intel) or because there is just less concurrency on this 4-core
-machine.
-
-The secure mimalloc version uses guard pages around each (_mimalloc_) page,
-encodes the free lists and uses randomized initial free lists, and we
-expected it would perform quite a bit worse -- but on the first benchmark set
-it performed only about 3% slower on average, and is second best overall.
-
-->
-
 # References

 - \[1] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson.
@@ -651,3 +400,14 @@ it performed only about 3% slower on average, and is second best overall.
  Alex Shamis, Christoph M Wintersteiger, and David Chisnall.
  _Snmalloc: A Message Passing Allocator._
  In Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management, 122–135. ACM. 2019.
+
+
+# Contributing
+
+This project welcomes contributions and suggestions.  Most contributions require you to agree to a
+Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
+the rights to use your contribution. For details, visit https://cla.microsoft.com.
+
+When you submit a pull request, a CLA-bot will automatically determine whether you need to provide
+a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions
+provided by the bot. You will only need to do this once across all repos using our CLA.