[gclist] Allocation benchmarks
Wed, 1 Nov 2000 10:28:18 -0800
I just looked at a few of the other benchmarks used in the Hoard paper.
(These came from various sources, this is not a criticism of that paper
specifically.) It seems to me that at least the larson and
linux-scalability benchmarks have a problem that's serious enough to be
classified as a bug: They allocate large objects and then don't touch them.
This seems to be the canonical allocation benchmark bug. I hate to admit
it, but GCBench also has this problem, though to a much smaller extent. (It
should touch every cache line if the line length is at least 32 bytes, and
if the allocator doesn't add too much overhead.)
I believe it's a bug because:
1) Real applications don't behave that way. C applications sometimes don't
touch the entire allocated object, but they almost always touch major parts
2) It can completely skew the results. I don't believe the results of such
a benchmark are meaningful if it's used to compare very different
allocators. It's probably OK if only the synchronization scheme is changed.
Reasons it can affect the results:
1) Much of the allocation cost can be the cost of the initial cache miss(es)
to access the object. This sort of benchmark greatly favors allocators that
don't touch the object at all, e.g. by keeping the collector data structures
in a separate area. This is uncommon for malloc/free implementations, but
an allocator that touches both ends of the object, e.g. to deal with
boundary tags, will be similarly disadvantaged, if the object is bigger than
a cache line.
2) It largely hides memory bandwidth constraints on multiprocessors.
Scalability of real applications on many real machines (e.g. all commodity
X86 multiprocessors) will be somewhat (2 processors) to severely (4
processors) limited by the bandwidth needed to initialize objects. This
doesn't model that, and scalability results will be unrealistically good in
cases in which the allocator doesn't touch much of the object.
3) It makes the benchmark largely unusable with conservative garbage
collectors, which will typically touch the entire object to clear it on
allocation. It also makes it hard to compare with something like a Java
allocator, which usually has to do the same for security reasons.
I think most of these benchmarks can be "fixed" with a memset following each
allocation (or perhaps with a call to calloc instead of malloc). Can we
agree that this is a reasonable thing to do, and name the "fixed" version of
benchmark "XXX" as "XXX-memset"?
Of course we all agree that we really need more realistic benchmarks anyway,