[gclist] ref-counting performance cost
Tue, 5 Sep 2000 03:05:27 -0500
But this allocation cost would also be paid by a non-compacting
or conservative mark-and-sweep collector, not just by reference
counting (i.e., it is not a problem unique to RC, unlike the
problems of space, time and cycles).
To quote some advantages of RC from the GC Book of Jones & Lins:
- naturally incremental algorithm, no need for a disruptive
global reachability analysis (which people have spent ages
optimizing); hopefully this improves cache/VM behavior
- work done is proportional to number of mutator operations
rather than size of live data
- immediate reclaiming of garbage and finalization
- recycling of recently allocated memory for better locality
(and perhaps lower allocator high-water marks)
- insensitive to heap-residency (tracing algorithms can thrash
with increasing heap usage and may require tuning or adaptivity
of some sort)
I'll add some more (though these are a little doubtful):
- can be augmented with dead-variable information from compiler
or programmer to reclaim garbage earlier
- A friend who implements tracing collectors asks why bother
reclaiming garbage "all the time" by checking counts at
every pointer store, when one can just use a tracing collector
and trigger it occasionally, at allocation points only, since
there's no point in freeing faster than the program allocates.
For a single thread this is fine, but in a multi-processor/thread
environment it might make sense to free faster than the
allocation rate of a single thread, since another thread might
be able to use that memory.
Of course, there is the small problem that naive multi-threaded RC
(e.g., DeTreville's concurrent RC collector for Modula-2+) is
horribly inefficient. But one can try to play tricks to build a
concurrent RC collector for multiple mutator threads that avoids
unnecessary synchronization and doesn't suck.
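As a rough illustration (my sketch, not DeTreville's algorithm or any
particular collector), mutator threads can keep per-object counts with
lock-free atomic operations, paying synchronization cost only on the
count word itself; the names rc_new/rc_retain/rc_release are made up:

```c
/* Sketch only: per-object reference counts maintained with C11
 * atomics, so mutator threads never take a lock. */
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct Obj {
    atomic_size_t rc;
    /* ... payload ... */
} Obj;

static Obj *rc_new(void) {
    Obj *o = malloc(sizeof *o);
    atomic_init(&o->rc, 1);              /* creator holds one reference */
    return o;
}

static void rc_retain(Obj *o) {
    /* A relaxed increment suffices: the count cannot reach zero here. */
    atomic_fetch_add_explicit(&o->rc, 1, memory_order_relaxed);
}

static void rc_release(Obj *o) {
    /* Release on the decrement, acquire before freeing, so the thread
     * that frees the object sees all writes made by other threads. */
    if (atomic_fetch_sub_explicit(&o->rc, 1, memory_order_release) == 1) {
        atomic_thread_fence(memory_order_acquire);
        free(o);
    }
}
```

Even so, every counted pointer store still dirties a shared cache
line, which is part of why the naive concurrent schemes are slow.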
Simon Peyton-Jones wrote (Tue, Sep 05, 2000 at 12:27:31AM -0700):
> There is a fourth cost, namely the effect on allocation cost.
> Reference counting can't do compaction, so allocation has to be
> done off a free list. To reduce fragmentation one typically has several
> freelists, for objects of different sizes. So the allocation code has to
> see if there's enough space on the exact-fit list, and if not try some other
> source of space. As a result, allocation is typically done by an out-of-line
> call. (Correct me if I'm wrong here.)
> In contrast, if there's a single contiguous allocation area, allocation
> consists only of writing the contents of the object, all in-line: 10 words, 10
> instructions. One
> has to check that there's enough space, but that only needs to be done once
> per basic block, so the cost is amortised over all the allocations.
> If several objects are allocated, their addresses are all implicitly held as
> offsets from the allocation pointer (e.g. hp+10, hp+30, hp+38), which reduces
> register pressure.
> Finally, there's the locality issue. Contiguous allocation improves locality,
> and so does compaction. On the other hand, reference counting
> re-uses hot addresses. So it's entirely unclear to me what the tradeoff
> in cache effects is.
> For some reason the effect on allocation cost is never mentioned in papers
> about reference counting, but I believe it is fairly large. I'd be
> interested in any direct
> measurements people have made. I don't have any myself.
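The contiguous-allocation scheme Simon describes can be sketched
roughly like this (my illustration; the heap layout and helper names
are made up):

```c
/* Sketch of bump-pointer allocation: one limit check covers a whole
 * basic block's allocations, after which each allocation is just a
 * pointer increment plus the writes that initialize the object. */
#include <assert.h>
#include <stddef.h>

static char heap[1 << 16];
static char *hp = heap;                       /* allocation pointer */
static char *heap_limit = heap + sizeof heap;

/* One check, amortized over every allocation in the basic block. */
static int reserve(size_t bytes) {
    return (size_t)(heap_limit - hp) >= bytes;
}

/* After reserve() succeeds, allocation is in-line and branch-free. */
static void *bump(size_t bytes) {
    void *p = hp;
    hp += bytes;
    return p;
}
```

The addresses of consecutive objects are then fixed offsets from hp,
as in Simon's hp+10, hp+30, hp+38 example.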
> | There are three problems with naive reference counting.
> | 1. Inability to clean up cycles.
> | 2. The time taken to do the reference counting operations.
> | A simple
> | x <- y
> | turns into
> | ref(y)++
> | if (--ref(x) == 0) bury(x)
> | x = y
> | 3. The space required for the reference counts. In a Lisp-like
> | system, a cons cell bloats from 2 pointers to 3.
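In C, that counted assignment might look like the following sketch
(names are illustrative; note the increment of y's count happens
before the decrement of x's old referent, so x <- x is safe):

```c
/* Sketch of the naive RC write barrier for a Lisp-like cons cell.
 * The count header is the third word in the "2 pointers to 3" bloat. */
#include <assert.h>
#include <stdlib.h>

typedef struct Cell {
    size_t rc;                     /* the extra third word          */
    struct Cell *car, *cdr;
} Cell;

static void bury(Cell *c);         /* recursive reclamation         */

/* x <- y: increment the new referent first so that assigning a
 * pointer to itself does not transiently free the object. */
static void assign(Cell **x, Cell *y) {
    if (y) y->rc++;                            /* ref(y)++          */
    Cell *old = *x;
    *x = y;
    if (old && --old->rc == 0) bury(old);      /* decrement, check  */
}

static void bury(Cell *c) {
    if (c->car && --c->car->rc == 0) bury(c->car);
    if (c->cdr && --c->cdr->rc == 0) bury(c->cdr);
    free(c);
}
```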