GC performance - was Re: [gclist] GC topics

boehm.PARC@xerox.com
Fri, 23 Feb 1996 11:05:36 PST


"As I told you earlier, I do not have this limitation: I can use the most
optimizing C compiler with YAFL, and then again, I believe this is not the
proper way of answering the question: apparently, even you seem to admit
that conservative GC *is* slower than precise GC, and given the liberty
to implement any of those systems in a brand new compiler, language
or environment (so your optimizing compiler issue does not apply),
precise GC has clear advantages."

Perhaps you should tell us how you keep track of pointers in YAFL.  We are
mixing together many issues here.  Assuming you are generating C code, I think
the possibilities are roughly:

1) In the single-threaded case, you could explicitly keep a data structure that
allows the collector to find all roots.  All evidence I've seen so far suggests
that this costs on the order of a factor of 2 for pointer-intensive C/C++ code.
Either pointer assignments or pointer variable creation (or both) become
expensive.  Much of the cost here is often associated with preventing the C
compiler from keeping pointer variables in registers.  If you can do this with only 20%
overhead for pointer-intensive optimized C code, I'm impressed.  (One
difference here is that you can optimize in generating the C code.  The other
experiments of which I'm aware used C++ "smart" pointers, which tend to be
compiled poorly by C++ compilers.)  This does give you completely portable code
with "precise" collection.  In the multi-threaded case, this is probably too
horrible to contemplate.
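To make the first option concrete, here is a minimal "shadow stack" sketch in C (a hypothetical illustration, not YAFL's actual scheme): the generated code registers the address of every pointer-valued local in a side structure, so the collector can enumerate all roots precisely.  Taking the address of each local is also what forces the C compiler to keep it in memory rather than a register, which is where much of the overhead comes from.

```c
#include <stddef.h>

/* Hypothetical root-registration structure maintained by generated code. */
#define MAX_ROOTS 1024

static void *gc_roots[MAX_ROOTS];  /* addresses of live pointer variables */
static size_t gc_nroots = 0;

static void gc_push_root(void **slot) { gc_roots[gc_nroots++] = slot; }
static void gc_pop_roots(size_t n)    { gc_nroots -= n; }

/* Shape of generated code for a function with one heap-pointer local. */
static int use_object(void) {
    void *obj = NULL;        /* would point into the GC heap */
    gc_push_root(&obj);      /* register on entry; &obj forces obj
                                into memory, visible to the collector */
    /* ... function body: the collector can find obj at any time ... */
    gc_pop_roots(1);         /* unregister on every exit path */
    return 0;
}
```

Note that every call and return pays the registration cost, and every exit path (including any nonlocal one) must pop its roots, which is part of why the measured overhead for pointer-intensive code tends to be large.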

2) You could try to use debugger information to locate pointers.  Paul Wilson
has apparently had reasonable results with this sort of approach.  I've
personally never trusted it enough to rely on it, especially with optimized
code.  The result wouldn't be portable, but you might avoid the performance hit.

3) You could use a partially or fully conservative approach.  The collector
would run a bit more slowly.  If we assume that you generate essentially all
the code, and hence you can easily limit the static data areas scanned by the
collector, and allocate pointer-free objects suitably, the penalty for the
conservative stack scan is minimal, certainly much less than 20%.  If you
wanted to limit the possibility of false references you could also supply
layout information for heap objects, but that wouldn't seriously affect typical
performance.  What you lose is the ability to allocate from a contiguous free
area.  The crucial gain is that you don't need to tell the collector which
registers, etc. contain pointers at what points.
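The core of a conservative scan can be sketched in a few lines (an illustrative toy, not the actual collector): treat every word in a region such as the stack as a potential pointer, and retain anything that falls inside the heap.  Values that merely happen to look like heap addresses are the "false references" mentioned above.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy "heap": in a real collector this test would consult the
   allocator's page/block tables, not a single address range. */
#define HEAP_SIZE 4096
static char heap[HEAP_SIZE];

static int looks_like_heap_pointer(uintptr_t word) {
    return word >= (uintptr_t)heap && word < (uintptr_t)(heap + HEAP_SIZE);
}

/* Scan an aligned region word by word, counting values that must
   conservatively be treated as roots. */
static size_t conservative_scan(uintptr_t *start, uintptr_t *end) {
    size_t hits = 0;
    for (uintptr_t *p = start; p < end; p++)
        if (looks_like_heap_pointer(*p))
            hits++;
    return hits;
}
```

The point is that no cooperation from the compiled code is needed: nothing has to be registered, and registers can be scanned the same way after being spilled, which is why the mutator-side overhead is essentially zero.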

If you are generating code directly, you have the added option of statically
generating descriptors for PC ranges so that the collector can find pointer
variables in registers, etc.  This has less run-time cost than the first
option.  It has some other disadvantages, though:

1. The size of these tables is not completely trivial if you assume collections
are restricted to a small set of program points.  (See the PLDI '92 paper by
Diwan, Moss, and Hudson.)

2. Depending on the expressiveness of the language used in your table, compiler
optimizations will be slightly restricted.

3. It's hard to support asynchronous collection at arbitrary program points.
Getting around this in a multi-threaded system will cost you portability and/or
runtime overhead.

4. You can't use a standard, well-tuned, compiler back-end.  You have to build
your own.  This is probably the most serious problem.  I believe it's the main
reason SRC Modula 3 still uses a conservative stack scan.
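The PC-range descriptors mentioned above can be pictured as a statically emitted table mapping code ranges to pointer-location masks (a simplified sketch; real formats, such as the one in the Diwan/Moss/Hudson paper, also describe registers and derived pointers).  The collector looks up each return address on the stack to learn which frame slots hold pointers:

```c
#include <stdint.h>
#include <stddef.h>

struct stack_map {
    uintptr_t pc_start, pc_end;  /* code range this entry covers        */
    uint32_t  live_slot_mask;    /* bit i set => frame slot i is a ptr  */
};

/* Emitted by the compiler back-end, sorted by pc_start.
   The addresses and masks here are made up for illustration. */
static const struct stack_map maps[] = {
    { 0x1000, 0x1040, 0x5 },  /* slots 0 and 2 hold pointers */
    { 0x1040, 0x10a0, 0x2 },  /* slot 1 holds a pointer      */
};

/* Binary search for the map entry covering a return address. */
static const struct stack_map *find_map(uintptr_t pc) {
    size_t lo = 0, hi = sizeof maps / sizeof maps[0];
    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if (pc < maps[mid].pc_start)      hi = mid;
        else if (pc >= maps[mid].pc_end)  lo = mid + 1;
        else return &maps[mid];
    }
    return NULL;  /* no entry: collection not permitted at this PC */
}
```

The NULL case illustrates disadvantage 3: if tables are only emitted for a small set of safe points, a thread stopped at an arbitrary PC has no descriptor, and the runtime must somehow roll it forward to a safe point.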

It's also worth repeating that the "predictability" you get from precise
collection only materializes if the rest of the system defines what objects are
accessible when.  Very few existing systems do, even if they have a "precise"
collector.  This significantly constrains your implementation, and often incurs
an added performance cost.

Hans