[gclist] ref-counting performance cost
Manoj Plakal
plakal@cs.wisc.edu
Fri, 15 Sep 2000 00:48:23 -0500
Hi Jeff,
How memory/allocation-intensive is your Sieve benchmark? It's
a pretty small microbenchmark and artificially jacking up
the number of primes you generate doesn't seem to generate
meaningful numbers. 16MB seems like a lot of heap for a
small program like this. Can you try a large Tcl "application"?
(if such a beast exists)
Ref-counting will probably always be slower than periodic tracing
(even if you do deferred RC), but I was interested to see if
RC does really have better cache/VM locality in allocation-intensive
benchmarks where the tracing collector is invoked more
frequently and possibly pollutes the cache. Also, it would be interesting
to see if rapid recycling of garbage by RC does help to improve allocation
speed/locality or decrease total memory usage (even after possibly leaking
cyclic structures).
Manoj
PS: On a related note, the Python interpreter used reference-counting
exclusively till the latest release of Python 2.0 which included
an optional "collect cycles" routine. I'm not sure whether
they've adapted the BDW collector to do this or whether they
wrote their own. Python allows extensions written in arbitrary
languages but I guess they need to enforce restrictions on
how Python objects get manipulated or how you can store
pointers to them, or else memory leaks.
Jeff Sturm wrote (Fri, Sep 15, 2000 at 01:28:06AM -0400) :
> Kragen Sitaker wrote:
> > I asserted that reference-counting is almost always much slower than
> > well-implemented mark-and-sweep or copying collectors; he requested
> > quantitative evidence. Real quantitative evidence requires
> > measurements of real programs, and doing those is a lot of work. Has
> > someone done them?
>
> I haven't seen much recently. Since the rules have changed a bit on
> modern hardware, I resurrected some experiments I performed earlier this
> year, and gathered a few more stats out of plain curiousity.
>
> I started with a language that is (in)famous for reference counting,
> Tcl/Tk. The core of the interpreter is implemented in C using macros
> for increment/decrement.
> I modified the interpreter core to use the BDW conservative collector,
> removing the refcount field from certain structs and changing the macros
> to a no-op. I also omitted the recursive tree deletion routines and
> applied atomic (pointer-free) allocation for character arrays.
>
> Immediately I observed a 10-30% runtime improvement in various moderate
> to complex scripts. I ran some detailed performance stats on one
> benchmark, a sieve of eratosthenes in Tcl, using the hardware
> performance counters of the Alpha CPU:
>
> Instruction Counts (in millions)
>
> cycles issues loads stores
> original 733 313 51 29
> GC tclsh 548 226 36 25
>
> Nothing too surprising. All counts are down because the code is smaller
> and less complex than the original refcount interpreter. Although the
> mark phase of GC is expensive relative to explicit deallocation, it may
> not occur often enough to be statistically significant.
>
> I was more interested in cache performance due to the bloated heap that
> is characteristic of GC (I ran the GC test with the heap fixed at 16MB):
>
> Cache Hits/Accesses
>
> instruction (16KB) data (16KB) external (1MB)
> original 86% 90% 83%
> GC tclsh 89% 93% 85%
>
> Even though reference counting increases the chance that newly allocated
> pages will be resident in cache (in theory, at least) I haven't been
> very successful in demonstrating that. (However to be fair I also
> haven't considered the efficiency of the standard Tcl allocator. The
> free lists used by default could interfere with cache performance, among
> other things.)
>
> About the only code that really benefitted from reference counting is
> Tcl's list operators, which often check for (refcount == 1) to avoid
> performing a copy. There is no simple way to do object copy-on-write
> with a GC as far as I can tell.
>
> Jeff