[gclist] ref-counting performance cost

Manoj Plakal plakal@cs.wisc.edu
Fri, 15 Sep 2000 00:48:23 -0500

Hi Jeff,

How memory/allocation-intensive is your Sieve benchmark? It's
a pretty small microbenchmark, and artificially jacking up
the number of primes you generate doesn't seem likely to produce
meaningful numbers. 16MB seems like a lot of heap for a
small program like this. Can you try a large Tcl "application"? 
(if such a beast exists)

Ref-counting will probably always be slower than periodic tracing
(even with deferred RC), but I was interested to see whether
RC really does have better cache/VM locality in allocation-intensive 
benchmarks, where the tracing collector is invoked more
frequently and can pollute the cache. It would also be interesting 
to see whether RC's rapid recycling of garbage helps improve allocation 
speed/locality or decrease total memory usage (even after possibly leaking 
cyclic structures).
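The "rapid recycling" half of this is easy to observe in CPython, whose
reference counting frees an object the instant its count hits zero; a
weakref makes the moment of death visible (a minimal sketch, not a Tcl
measurement):

```python
import weakref

class Payload:
    """Stand-in for an interpreter-level object (hypothetical)."""
    pass

p = Payload()
w = weakref.ref(p)      # lets us observe exactly when the object dies

del p                   # last reference gone: the refcount hits zero and
                        # the object is reclaimed immediately, no GC pause
assert w() is None      # already dead -- a tracing collector would have
                        # kept it around until the next collection cycle
```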


PS: On a related note, the Python interpreter used reference counting
    exclusively until the latest release, Python 2.0, which includes
    an optional "collect cycles" routine. I'm not sure whether
    they adapted the BDW collector to do this or whether they
    wrote their own. Python allows extensions written in arbitrary
    languages, but I guess it needs to enforce restrictions on
    how Python objects get manipulated and how you can store
    pointers to them, or else memory leaks result.
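    For reference, that cycle-collection interface is exposed as the
    standard gc module; here is a tiny sketch of the one case pure
    reference counting can never handle:

```python
import gc

class Node:
    def __init__(self):
        self.other = None

# Two nodes pointing at each other: each keeps the other's refcount
# above zero, so reference counting alone can never reclaim them.
a, b = Node(), Node()
a.other, b.other = b, a

del a, b                  # both are now garbage, but counts stay at 1
collected = gc.collect()  # the cycle detector finds and frees them
assert collected >= 2     # at least the two Node objects were reclaimed
```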

Jeff Sturm wrote (Fri, Sep 15, 2000 at 01:28:06AM -0400) :
> Kragen Sitaker wrote:
> > I asserted that reference-counting is almost always much slower than
> > well-implemented mark-and-sweep or copying collectors; he requested
> > quantitative evidence.  Real quantitative evidence requires
> > measurements of real programs, and doing those is a lot of work.  Has
> > someone done them?
> I haven't seen much recently.  Since the rules have changed a bit on
> modern hardware, I resurrected some experiments I performed earlier this
> year, and gathered a few more stats out of plain curiosity.
>
> I started with a language that is (in)famous for reference counting,
> Tcl/Tk.  The core of the interpreter is implemented in C using macros
> for increment/decrement.
>
> I modified the interpreter core to use the BDW conservative collector,
> removing the refcount field from certain structs and changing the macros
> to a no-op.  I also omitted the recursive tree deletion routines and
> applied atomic (pointer-free) allocation for character arrays.
>
> Immediately I observed a 10-30% runtime improvement in various moderate
> to complex scripts.  I ran some detailed performance stats on one
> benchmark, a Sieve of Eratosthenes in Tcl, using the hardware
> performance counters of the Alpha CPU:
>
>               Event Counts (in millions)
>              cycles  issues  loads  stores
> original        733     313     51      29
> GC tclsh        548     226     36      25
>
> Nothing too surprising.  All counts are down because the code is smaller
> and less complex than the original refcount interpreter.  Although the
> mark phase of GC is expensive relative to explicit deallocation, it may
> not occur often enough to be statistically significant.
> I was more interested in cache performance due to the bloated heap that
> is characteristic of GC (I ran the GC test with the heap fixed at 16MB):
>
>                     Cache Hits/Accesses
>            instruction (16KB)  data (16KB)  external (1MB)
> original            86%             90%           83%
> GC tclsh            89%             93%           85%
>
> Even though reference counting increases the chance that newly allocated
> pages will be resident in cache (in theory, at least), I haven't been
> very successful in demonstrating that.  (However, to be fair, I also
> haven't considered the efficiency of the standard Tcl allocator.  The
> free lists used by default could interfere with cache performance, among
> other things.)
>
> About the only code that really benefited from reference counting is
> Tcl's list operators, which often check for (refcount == 1) to avoid
> performing a copy.  There is no simple way to do object copy-on-write
> with a GC as far as I can tell.
>
> Jeff
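
The (refcount == 1) trick Jeff describes can be mimicked in Python,
with sys.getrefcount standing in for Tcl's refCount field (a sketch of
the idea, not Tcl's actual code):

```python
import sys

a = [1, 2]
baseline = sys.getrefcount(a)   # count while only 'a' holds the list

b = a                           # the list is now shared
assert sys.getrefcount(a) == baseline + 1

# Tcl's list commands make the analogous check and duplicate the
# object before mutating a shared value:
if sys.getrefcount(a) > baseline:
    a = list(a)                 # shared: unshare by copying first

a.append(3)                     # now safe to mutate in place
assert b == [1, 2]              # the other holder is unaffected
```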