[gclist] ref-counting performance cost

Manoj Plakal plakal@cs.wisc.edu
Tue, 5 Sep 2000 03:05:27 -0500

But this allocation cost would also be paid in a non-compacting 
or conservative mark-and-sweep, not just reference counting
(i.e., it is not a unique problem to RC unlike the problems
of space, time and cycles).

To quote some advantages of RC from the GC Book of Jones & Lins:
    - naturally incremental algorithm, no need for a disruptive 
      global reachability analysis (which people have been spending
      ages to optimize), hopefully this improves cache/VM behavior

    - work done is proportional to number of mutator operations
      rather than size of live data 

    - immediate reclaiming of garbage and finalization

    - recycling of recently allocated memory for better locality
      (and perhaps lower allocator high-water marks)

    - insensitive to heap-residency (tracing algorithms can thrash
      with increasing heap usage and may require tuning or adaptivity
      of some sort)

I'll add some more  (though these are a little doubtful):

    - can be augmented with dead-variable information from compiler
      or programmer to reclaim garbage earlier
    - A friend who implements tracing collectors asks why bother
      reclaiming garbage "all the time" by checking counts at
      every pointer store when one can just use a tracing collector
      and trigger it occasionally at allocation points only since
      there's no point in freeing faster than the program allocates. 

      For a single thread this is fine, but in a multi-processor/thread
      environment, it might make sense to free faster than the
      allocation rate of a single thread since another thread might
      be able to use that memory. 

      Of course, there is the small problem of the naive multi-threaded RC 
      (e.g., DeTreville's Modula-2+ concurrent RC collector) being 
      horribly inefficient. But one can try to play tricks to make a
      concurrent RC collector for multiple mutator threads that avoids 
      unnecessary synchronization and not suck.


Simon Peyton-Jones wrote (Tue, Sep 05, 2000 at 12:27:31AM -0700) :
> There is a fourth cost, namely the effect on allocation cost.
> Reference counting can't do compaction, so allocation has to be 
> done off a free list.  To reduce fragmentation one typically has several
> freelists, for objects of different sizes.  So the allocation code has to
> see if there's enough space on the exact-fit list, and if not try some other
> source of space.  As a result, allocation is typically done by an
> out-of-line
> call.   (Correct me if I'm wrong here.)
> In contrast, if there's a single contiguous allocation area, allocation
> consists
> only of writing the contents of the object, all in-line.  10 words, 10
> instructions.  One
> has to check that there's enough space, but that only needs to be done once
> per basic block, so the cost is amortised over all the allocations.
> Furthermore,
> if several objects are allocated, their addresses are all implicitly held by
> the
> allocation pointer (e.g. hp+10, hp+30, hp+38) which reduces register
> pressure
> somewhat.
> Finally, there's the locality issue.  Contiguous allocation improves
> locality.
> Compaction improves locality.  On the other hand, reference counting 
> re-uses hot addresses.  So it's entirely unclear to me what the tradeoff
> in cache effects is.
> For some reason the effect on allocation cost is never mentioned in papers 
> about reference counting, but I believe it is fairly large.  I'd be
> interested in any direct
> measurements people have made.  I don't have any myself.
> Simon
> | There are three problems with naive reference counting.
> | 
> | 1.  Inability to clean up cycles.
> | 
> | 2.  The time taken to do the reference counting operations.
> |     A simple
> | 	x <- y
> |     turns into
> | 	ref(y)++
> | 	if (ref(x)-- == 0) bury(x)
> | 	x = y
> | 
> | 3.  The space required for the reference counts.  In a Lisp-like
> |     system, a cons cell bloats from 2 pointers to 3.