[gclist] Finalizers & Reference counting.

Reedy, Christopher L. (Chris) creedy@mitretek.org
Wed, 28 Aug 2002 16:47:27 -0400


I'm sure you've seen the e-mail from Linus Torvalds on the GCC mailing
list about reference counting versus garbage collection. For those who
may not have seen it, here is a quote from the e-mail:

>  I strongly suspect that what makes gcc slow is that it has absolutely
>  horrible cache behaviour, a big VM footprint, and chases pointers in
>  that badly cached area all of the time.

>  And that, in turn, is probably impossible to fix as long as gcc uses
>  garbage collection for most of its internal memory management.  There
>  just aren't all that many worse ways to f*ck up your cache behaviour
>  than by using lots of allocations and lazy GC to manage your memory.

>  The problem with bad cache behaviour is that you don't get nice spikes
>  in specific places that you can try to optimize - the cost ends up being
>  spread all over the places that touch the data structures.

One argument being made here is that reference counting is preferable
to general-purpose garbage collection because it achieves significantly
better cache behavior. (The other point is that pointer chasing in
general has poor spatial locality, and therefore bad cache behavior.)
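
To see concretely what "pointer chasing in a badly cached area" means,
here is a small, purely illustrative C++ sketch of my own (it is not
from either e-mail): the linked traversal can take a cache miss on
every node, while the contiguous traversal is friendly to the hardware
prefetcher.

    #include <vector>

    struct Node {
        int   value;
        Node* next;   // each hop may land on a cold cache line
    };

    // Pointer chasing: nodes can live anywhere in the heap, so the
    // cost is smeared across every node touched, with no single hot
    // spot to optimize.
    long sum_list(const Node* head) {
        long sum = 0;
        for (const Node* p = head; p != nullptr; p = p->next)
            sum += p->value;
        return sum;
    }

    // Contiguous layout: sequential access that the prefetcher can
    // stream, so the same logical work behaves far better in cache.
    long sum_vector(const std::vector<int>& v) {
        long sum = 0;
        for (int x : v)
            sum += x;
        return sum;
    }

Note that this contrast is about heap layout and access patterns, not
about GC versus RC as such; either memory manager can produce either
layout.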

As a programmer I'm a big fan of the way GC makes my job a lot easier.
However, the notion that GC has significant inherent performance
problems bothers me. I'm curious whether anyone has data or experiments
that might confirm or refute this claim.

  Chris

> -----Original Message-----
> From: Boehm, Hans [mailto:hans_boehm@hp.com]
> Sent: Tuesday, August 27, 2002 6:24 PM
> To: 'Jerrold Leichter'; Boehm, Hans
> Cc: 'David Chase'; gclist@iecc.com
> Subject: Re: [gclist] Finalizers & Reference counting.
>
> > -----Original Message-----
> > From: Jerrold Leichter [mailto:jerrold.leichter@smarts.com]
> > ...
> > I just don't think you can pick one particular factor (like
> > the cost of
> > synchronization) and say "Aha!  RC loses!"  The tradeoffs are
> > complex, and
> > this is just one more factor to consider.
>
> I agree.  But I still think it's a significant factor.
>
> I think we really need to distinguish between "simple RC" for which all
> reference count operations are performed synchronously by the mutator, and
> more sophisticated deferred RC schemes such as
>
> Bacon, Attanasio, Lee, Rajan, Smith, "Java without the Coffee Breaks: A
> Nonintrusive Multiprocessor Garbage Collector"
>
> I think it's fair to say that for typical Java programs the best deferred
> schemes are at least in the same ballpark as tracing collectors.  They
> seem to be slower in terms of throughput, but can get very low pause
> times.  I think they're worth examining closely, and they may win in some
> situations, though based on the published results, I would usually still
> prefer a tracing collector.  They also share a lot of performance
> characteristics with tracing collection.  In particular, you don't get
> simple timing guarantees for acyclic data structures.
>
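
(An aside to illustrate the distinction: the fragment below is my own
toy sketch, not the Bacon et al. algorithm, and it ignores cycles,
stack roots, and thread safety. Simple RC does its counting work on
every pointer store; a deferred scheme merely logs the store and
reconciles the counts later, in batches, off the hot path.)

    #include <vector>

    struct Object { long refcount = 0; /* payload ... */ };

    // Simple (synchronous) RC: every pointer store adjusts two counts
    // immediately, and a count reaching zero frees the object at once.
    void store_simple(Object*& slot, Object* newref) {
        if (newref) ++newref->refcount;
        Object* old = slot;
        slot = newref;
        if (old && --old->refcount == 0) delete old;
    }

    // Deferred RC (toy version): the store itself only appends to a
    // log; a periodic reconciliation pass applies increments first,
    // then decrements, and frees garbage in a batch.
    struct DeferredLog {
        std::vector<Object*> increments, decrements;

        void store(Object*& slot, Object* newref) {
            if (newref) increments.push_back(newref);
            if (slot)   decrements.push_back(slot);
            slot = newref;            // no counting on the fast path
        }

        void reconcile() {            // run occasionally, per "epoch"
            for (Object* o : increments) ++o->refcount;
            increments.clear();
            for (Object* o : decrements)
                if (--o->refcount == 0) delete o;   // batched frees
            decrements.clear();
        }
    };
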
> Simple synchronous reference counting seems to work well for very large
> objects in acyclic data structures, e.g. in file systems.  It also
> sometimes pays off if you can simultaneously use it for things other than
> memory management, e.g. avoiding copies.  And it may reduce space
> consumption.  But as a general memory management technique for small
> objects, it suffers from many problems:
>
> 1) Cycles.
> 2) Large pause times.  It's generally far more expensive to deallocate a
> data structure comprising most of the heap an object at a time than it is
> to trace the heap.
> 3) If you have finalizers/destructors, you need to explicitly defer
> finalization to get correct semantics.  I maintain that implementations
> that run finalizers/destructors synchronously (all of the C++ ones?) are
> just plain wrong, at least in the sense that they greatly complicate the
> programmer's job, and no real programmer will do the work to ensure that
> the results are correct.
> 4) Large mutator overhead, especially for multithreaded code, even if it
> doesn't allocate during a phase.
>
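
(Point 1 is easy to demonstrate. Below is the standard C++
illustration, using std::shared_ptr as a stand-in for simple RC: a
two-object cycle whose counts can never reach zero.)

    #include <memory>

    struct Peer {
        std::shared_ptr<Peer> other;   // strong reference into the cycle
    };

    int main() {
        auto a = std::make_shared<Peer>();
        auto b = std::make_shared<Peer>();
        a->other = b;   // a -> b
        b->other = a;   // b -> a: both counts are now 2
        // When a and b go out of scope, each count drops to 1, never
        // to 0, so neither destructor runs and both objects leak.  A
        // tracing collector would reclaim them; with RC you must break
        // the cycle by hand (e.g. make one edge a std::weak_ptr).
        return 0;
    }
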
> My impression is that for a multithreaded application operating on small
> objects, the cost of atomic operations doesn't always matter, but it
> ensures that you almost always lose on time performance.  If a piece of
> code runs within the cache, the atomic operations will slow it down by a
> large factor, since you've added 40-400 cycles to every pointer
> assignment, and your code used to run fast enough that this is a big deal.
> (Based on my experiments, there is basically nothing else you can do
> during these operations, so clever scheduling doesn't help much.)  If it
> doesn't fit in the cache, the reference-counted version will almost
> certainly greatly increase the number of cache misses, since pointer
> assignments now touch both the old and new referenced object.  In this
> case, even the single-threaded version will take the hit, though the cost
> of the atomic operations is much less of an issue.
>
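
(The cost model is visible in what an atomically reference-counted
pointer assignment has to do. The sketch below is mine, loosely
modeled on what a shared_ptr-style smart pointer does on assignment:
two atomic read-modify-writes that touch the headers of both the old
and the new referent, i.e. two potentially cold cache lines per
pointer store.)

    #include <atomic>

    struct Counted { std::atomic<long> refcount{1}; /* payload ... */ };

    void assign(Counted*& slot, Counted* newref) {
        // Atomic RMW #1: touches the new referent's header.
        if (newref)
            newref->refcount.fetch_add(1, std::memory_order_relaxed);
        Counted* old = slot;
        slot = newref;
        // Atomic RMW #2: touches the old referent's header; the last
        // decrement is the one that deallocates.
        if (old &&
            old->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete old;
    }
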
> Unfortunately real measurements are hard to come by.  But the above paper
> makes many major improvements to the simple RC algorithm and gets
> uniprocessor runtimes that are still appreciably slower than tracing.  (It
> does include a cycle collector, which affects the cost.)
>
> Hans

This is informal and not an official Mitretek Systems position.
Dr. Christopher L. Reedy, Senior Principal Software Engineer
Mitretek Systems, 3150 Fairview Park Drive South, Falls Church, VA 22042
Email: creedy@mitretek.org  Phone: (703) 610-1615  FAX: (703) 610-2203