[gclist] ref-counting performance cost

Jeff Sturm jsturm1@home.com
Sat, 16 Sep 2000 00:37:27 -0400

Manoj Plakal wrote:
> How memory/allocation-intensive is your Sieve benchmark? It's
> a pretty small microbenchmark and artificially jacking up
> the number of primes you generate doesn't seem to generate
> meaningful numbers.

I took another look at this.  I still think the sieve is a reasonable
benchmark for Tcl and memory management:

- it uses associative arrays, of which the memory allocation graph is at
least 4 levels deep

- the script exercises the bytecode interpreter, which performs many
inline refcount operations

- memory access patterns are varied between sequential and random

- many temporary objects are allocated/released during expression

The sieve script computes primes up to 8000.  The GC interp performs
over 1700 allocations for a total of 4MB during that span.  Heap
occupancy remains at about 15% after a full collection.

> 16MB seems like a lot of heap for a small program like this.

For that reason (among others) I repeated the tests with several

- an initial heap of 4MB, instead of 16MB, to increase the frequency of
collection (to about five full mark/sweep collections per second).

- replacement of malloc/free in the RC interp with the equivalents from
the BDW collector, using it essentially as an ordinary heap manager.

- removal of the Tcl_Obj free lists, just in case they are detrimental
to the performance of the RC interp (they aren't used for the GC tests).

None of the changes except the last had much effect (less than 10%) on
the measurements.  Removing the free lists however caused the RC interp
to slow by about 15%.

It seems very difficult to find a fair performance comparison of the
different memory management techniques.

> Ref-counting will probably always be slower than periodic tracing
> (even if you do deferred RC), but I was interested to see if
> RC does really have better cache/VM locality in allocation-intensive
> benchmarks where the tracing collector is invoked more
> frequently and possibly pollutes the cache.

Too frequent collections should be avoided.  Tracing collectors appear
to benefit from an abundance of memory, and are consequently a poor
choice for small-memory configurations.  On the other hand, in none of
my tests did expanding the heap negatively impact performance until
paging sets in.  I'm not really sure why this is so... small heap
configurations that should run mostly within cache perform considerably
worse than large heaps.  (I guess cache architecture could have an
effect on the results.  I haven't adequately tested different machines.)

> Also, it would be interesting
> to see if rapid recycling of garbage by RC does help to improve allocation
> speed/locality or decrease total memory usage (even after possibly leaking
> cyclic structures).

I'd say RC yields the smallest memory sizes by a wide margin.  Even when
tuning the collector for minimum heap, the GC interp needs about twice
as much memory.  (It helps that Tcl has no cyclic references.)

On the allocation speed, I have failed to demonstrate any advantage in
RC.  I am beginning to believe that although RC has the potential for
immediate reuse of heap blocks, many programs do not exhibit enough
memory allocation patterns that foster such immediate reuse to benefit
significantly from it:

- recursive deletes tend to build long free lists that may not be
exhausted for extended durations, after which they are likely to have
their regions evicted from the cache

- many short term allocations can be performed directly on the call

- programmers sometimes identify opportunties for memory reuse and
optimize the explicit alloc/free away in source code.

It might be interesting to instrument the heap manager in some way to
learn the age of free blocks when they are reallocated.

On the topic of free lists, does anyone have a preference for LIFO vs.
FIFO recycling?