[gclist] Finalizers & Reference counting.

David Chase chase@world.std.com
Wed, 28 Aug 2002 20:42:34 -0400


At 04:47 PM 8/28/2002 -0400, Reedy,Christopher L. (Chris) wrote:
>One argument being made here is that reference counting is preferable to
>general purpose garbage collection because it achieves significantly
>better cache behavior. (The other point being made is that pointer
>chasing in general has bad spatial locality resulting in bad cache
>behavior.)

Hans has replied to this already, but to make it more clear, whenever
you overwrite a pointer in the heap, you also touch the reference count
for the old object.  Whether you free it or not, you just touched it.
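
To make that concrete, here is a minimal sketch (names are illustrative,
not from any particular runtime) of the write barrier every refcounted
heap-pointer store has to run.  Note that the old object's count word is
read and written even when the object stays live:

```c
#include <stdlib.h>

/* Hypothetical refcounted object; the count is the first word. */
typedef struct Obj {
    long refcount;
    /* ... payload ... */
} Obj;

static Obj *obj_new(void) {
    Obj *o = malloc(sizeof *o);
    o->refcount = 1;              /* creator holds one reference */
    return o;
}

/* Store newval into *slot.  Whether or not `old` dies, its refcount
 * cache line is dirtied by the decrement. */
static void store_ptr(Obj **slot, Obj *newval) {
    Obj *old = *slot;
    if (newval)
        newval->refcount++;
    *slot = newval;
    if (old && --old->refcount == 0)
        free(old);
}
```

So every overwrite touches two objects' headers, not one.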

And, again, there is the problem of maintaining reference counts on a
multiprocessor.  Plain-old-counts lose, badly, on Pentium.
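
The MP-safe version needs atomic read-modify-write on every count
update.  In modern C (C11 atomics postdate this message, but they show
the shape of the cost) it looks like this; on x86 each of these compiles
to a lock-prefixed instruction, and on the Pentiums of the day a lock
prefix locked the entire memory bus:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical MP-safe refcounted object. */
typedef struct AObj {
    atomic_long rc;
} AObj;

/* Each retain/release is a locked bus operation on x86. */
static void retain(AObj *o) {
    atomic_fetch_add_explicit(&o->rc, 1, memory_order_relaxed);
}

static void release(AObj *o) {
    /* acq_rel so the freeing thread sees all prior writes to *o */
    if (atomic_fetch_sub_explicit(&o->rc, 1, memory_order_acq_rel) == 1)
        free(o);
}
```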

In terms of theory versus practice, I have seen a compiler, not gcc,
that spent MINUTES after a particularly large compilation
doing nothing more useful than freeing all the memory in its address
space just before calling exit.  This was a matter of working-set
locality, not cache locality -- it was thrashing the disk.  This
prompted a wee change in how malloc dealt with its management of
very large blocks -- it paid off rather well to thread them
externally, rather than internally.
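
By "threading them externally" I mean something like the following
sketch (names hypothetical): the free-list link for a large block lives
in separately allocated metadata, so putting the block on the list -- or
walking the list -- never touches the block's own pages, which may well
be swapped out.  Internal threading would store the `next` pointer in
the first words of the block itself, faulting in a page per free.

```c
#include <stdlib.h>

/* External free-list node for a large block.  The block pointer is
 * recorded but never dereferenced by list maintenance. */
typedef struct BigFree {
    void           *block;
    size_t          size;
    struct BigFree *next;
} BigFree;

static BigFree *big_free_list = NULL;

/* Push a large block onto the free list without touching its memory. */
static void free_big(void *block, size_t size) {
    BigFree *n = malloc(sizeof *n);   /* metadata, outside the block */
    n->block = block;
    n->size  = size;
    n->next  = big_free_list;
    big_free_list = n;
}
```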

I have two more datapoints to add.  Playing with a compacting
collector, I experimentally added a single locked-compare-and-swap
for each object forwarded (this was part of an experiment with a
parallelized copying-compacting collector to get a feel for the
baseline -- the idea is to use compare-and-swap to install the
forwarding pointer).  That single operation per object added 60%
to the cost of the entire collection (this on a 2-processor Pentium
III, 800MHz CPU, 133MHz memory bus, probably CAS3 memory).  So,
on the one hand, I can see how to make a copying compacting
collector use as many processors as you want running in parallel,
but

1) if it is x86, it had better be Xeon, not Pentium (i.e., they
   had better lock only the relevant cache line, not the entire
   memory bus)

and

2) "atomicity" on a shared-memory-multiprocessor really can be that
   expensive.  If a single pointer assignment (one increment, one
   decrement) costs 120% of forwarding the object through the next
   GC (assuming it would still be live), well....
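
For reference, the per-object operation in that experiment was the
standard CAS-install of a forwarding pointer, roughly like this sketch
(names illustrative): racing collector threads each make a copy and try
to publish it; exactly one CAS wins, and the losers adopt the winner's
copy.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Try to install my_copy as the forwarding pointer in `header`
 * (NULL means "not yet forwarded").  Returns the copy everyone
 * must use: ours if the CAS won, the winner's otherwise. */
static void *forward(_Atomic(void *) *header, void *my_copy) {
    void *expected = NULL;
    if (atomic_compare_exchange_strong(header, &expected, my_copy))
        return my_copy;     /* we installed the forwarding pointer */
    return expected;        /* another thread forwarded it first */
}
```

One locked compare-exchange per object, and that alone was the 60%.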

It is my understanding that load-linked/store-conditional would
normally be cheaper (on typical smallish MP memory systems).

David Chase