Irreproducible malloc benchmarks
by jwb, Nov. 26 2021

Allocators are important for the performance of C++ programs, so I try to stay abreast of developments to ensure that I’m picking the most suitable allocator in new projects. I don’t just blindly choose TCMalloc. I do always choose TCMalloc, but not blindly. In this pursuit the published benchmarks of mimalloc have troubled me. In their technical report mimalloc beats everybody else on all benchmarks, but I’ve never found a workload where mimalloc beats TCMalloc. I decided to see if I could reproduce the results in the technical report using their own code and TL;DR I can’t reproduce them. On hardware available to me, TCMalloc beats mimalloc on the “larsonN” benchmark, and so does jemalloc.

The odd result

The technical report includes this chart (on pg. 12) of the larsonN benchmark. The benchmark was written by Microsoft as an example of problematic server programs that allocate memory on one thread but deallocate it on another. The first three allocators, from top to bottom, are mimalloc, tcmalloc, and jemalloc. The scores are normalized to mimalloc, at 1.00. The results claim that both jemalloc and tcmalloc need 3x the time that mimalloc needs to run this benchmark.
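To make the "allocate on one thread, deallocate on another" pattern concrete, here is a minimal sketch. It is not taken from larson.cpp; the function name allocate_here_free_there is hypothetical, and the real benchmark is considerably more elaborate.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not from larson.cpp: one thread allocates `count`
// blocks of `bytes` each, then a different thread frees all of them.
// Returns the number of blocks the second thread freed.
std::size_t allocate_here_free_there(std::size_t count, std::size_t bytes) {
    std::vector<char*> blocks;
    blocks.reserve(count);

    std::thread producer([&] {
        for (std::size_t i = 0; i < count; ++i) {
            blocks.push_back(new char[bytes]);
        }
    });
    producer.join();  // join() makes the producer's writes visible below

    std::size_t freed = 0;
    std::thread consumer([&] {
        for (char* block : blocks) {
            delete[] block;  // freed by a thread that did not allocate it
            ++freed;
        }
    });
    consumer.join();
    return freed;
}
```

Every block here is handed back by a thread other than the one that allocated it, which is the case the benchmark is meant to stress.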

The rematch

mimalloc-bench is available on GitHub. The instructions are a bit broken; if you follow them to the letter, it doesn’t work at all. One of the reasons I chose to focus on the larsonN benchmark is that it is self-contained in a single C++ source file, so I could cherry-pick it out of the repo. The binary is called larson-sized because it is built with -DSIZED to enable C++ sized deallocation. Their test harness executes it thusly:
./larson-sized 5 8 1000 5000 100 4141 8
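For readers unfamiliar with sized deallocation, here is a minimal sketch of the language feature. It does not reproduce what larson.cpp does under -DSIZED; the Block type and size_seen_by_delete function are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Records the size passed to the most recent sized deallocation.
static std::size_t g_last_sized_delete = 0;

// Hypothetical type for illustration only.
struct Block {
    char payload[48];

    static void* operator new(std::size_t size) {
        if (void* p = std::malloc(size)) return p;
        throw std::bad_alloc();
    }

    // Sized form: the compiler passes the object's size along with the
    // pointer, so the allocator does not have to look the size up.
    static void operator delete(void* p, std::size_t size) noexcept {
        g_last_sized_delete = size;
        std::free(p);
    }
};

// Allocates and deletes one Block, returning the size that delete received.
std::size_t size_seen_by_delete() {
    g_last_sized_delete = 0;
    Block* b = new Block;
    delete b;
    return g_last_sized_delete;
}
```

The point of the feature is that an allocator which receives the size at free time can skip a metadata lookup, so a sized build can shift the balance between allocators.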
The combatants are:
  • mimalloc 2.0.3 (from the dev-slice branch of the repo)
  • tcmalloc ubuntu package, 2.9.1-0ubuntu2
  • jemalloc ubuntu package, 5.2.1-3
  • tcmalloc upstream commit 0255dd1 (repo has neither releases nor tags 🤷 )
  • glibc ubuntu package, 2.34-0ubuntu3
Newer TCMalloc is statically linked using this Bazel file…
cc_binary(
    name = "larson-tcmalloc",
    srcs = ["larson.cpp"],
    linkstatic = True,
    malloc = "@com_google_tcmalloc//tcmalloc",
)
… and this invocation:
build -c opt --cxxopt=-std=c++20 --copt=-Wno-volatile --copt=-DCPP --copt=-DSIZED :all
The system is an Intel Core i5-8259U with 4 cores and 8 hardware threads, locked to 3600MHz, a speed this system can sustain for the length of the tests. The compiler is GCC 11.2. Tests were run ten times.
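The post does not say how the ten runs were summarized. One common approach, sketched here as an assumption rather than the method actually used, is a mean with a 95% confidence interval from Student's t distribution. The names Interval and mean_with_interval are hypothetical, and the default critical value 2.262 is the two-sided 95% value for 9 degrees of freedom, i.e. ten samples.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Interval {
    double mean;
    double low;
    double high;
};

// Mean and confidence interval for a small sample. Requires at least two
// samples. The default t_critical of 2.262 is only correct for ten samples;
// pass a different value for other sample sizes.
Interval mean_with_interval(const std::vector<double>& xs,
                            double t_critical = 2.262) {
    const double n = static_cast<double>(xs.size());

    double sum = 0.0;
    for (double x : xs) sum += x;
    const double mean = sum / n;

    double squares = 0.0;
    for (double x : xs) squares += (x - mean) * (x - mean);
    const double stddev = std::sqrt(squares / (n - 1.0));  // sample std. dev.

    const double half_width = t_critical * stddev / std::sqrt(n);
    return {mean, mean - half_width, mean + half_width};
}
```

Feeding it the ten MOps/second figures from one allocator's runs would give a row comparable to the ones in the results table, provided the same method was used there.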

Quirks

I tried mimalloc with both the dynamic override via LD_PRELOAD and statically linked into the program. This did not make a significant difference, and the average result for the static run was slower. 
Mimalloc offers an environment variable MIMALLOC_LARGE_OS_PAGES=1 which the documentation claims will cause it to use 2MiB pages if available. It seems to have the opposite effect in practice. Without this environment variable, mimalloc uses lots of huge pages on the system under test. With this set, it doesn’t use any. I’m not sure what’s going on there but it did not seem to make a significant difference in runtime, so I left it un-set. The environment variable does make mimalloc use much less memory, though, reinforcing the impression that it works backwards.
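The post does not say how huge page usage was observed. One way to check on Linux, offered here as an assumption, is to sum the AnonHugePages fields in /proc/<pid>/smaps. The function name anon_huge_pages_kib is hypothetical, and it takes a stream so that it can be pointed at a file or at sample text.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Sums the "AnonHugePages:" fields (reported in kB) from text in the
// format of Linux's /proc/<pid>/smaps.
std::size_t anon_huge_pages_kib(std::istream& smaps) {
    const std::string key = "AnonHugePages:";
    std::size_t total = 0;
    std::string line;
    while (std::getline(smaps, line)) {
        if (line.compare(0, key.size(), key) != 0) continue;  // other field
        std::istringstream fields(line.substr(key.size()));
        std::size_t kib = 0;
        if (fields >> kib) total += kib;
    }
    return total;
}
```

To use it, open /proc/<pid>/smaps for the benchmark process with std::ifstream and pass the stream in. Note that AnonHugePages counts transparent huge pages only; explicitly reserved hugetlb pages are reported in separate fields.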

Results

Allocator                 Mean MOps/second   95% Conf. Interval   Peak RSS (KiB)
glibc                     70                 69-71                61912
mimalloc                  99                 98-101               89800
jemalloc                  104                102-106              97896
tcmalloc (old, dynamic)   113                112-115              52216
tcmalloc (new, static)    119                118-121              61776

TCMalloc and jemalloc are both significantly faster than mimalloc on this benchmark. TCMalloc also wins on space. Only glibc loses out to mimalloc. This continues my streak of being unable to find a workload favorable to mimalloc. The results of their technical report remain mysterious and contrary to my experience.
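To put these throughput figures on the same footing as the report's chart, which is normalized to mimalloc at 1.00, they can be converted to relative time. This is a rough sketch that assumes a fixed amount of work, so that time is inversely proportional to operations per second; relative_time is a hypothetical helper.

```cpp
// Hypothetical helper: time relative to a baseline allocator, assuming time
// is inversely proportional to throughput. Values below 1.0 mean the other
// allocator is faster than the baseline.
double relative_time(double baseline_mops, double other_mops) {
    return baseline_mops / other_mops;
}
```

With mimalloc's 99 MOps/second as the baseline, tcmalloc (new, static) at 119 comes out at roughly 0.83 and glibc at 70 comes out at roughly 1.41, a long way from the 3x figures in the report.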