GC performance - was Re: [gclist] GC topics

Paul R. Wilson wilson@cs.utexas.edu
Fri, 23 Feb 1996 18:57:50 -0600


>Date: Fri, 23 Feb 1996 11:05:36 PST
>From: boehm.PARC@xerox.com
>Subject: Re: GC performance - was Re: [gclist] GC topics
>To: gclist@iecc.com

[Warning: I came into this discussion late---I only just subscribed to
          this list---so some of the following comments may be off base
          due to lack of context.  In particular, my comments below
          are mostly relevant to precise non-moving GC, not copying GC.]

>Perhaps you should tell us how you keep track of pointers in YAFL.  We are
>mixing together many issues here.  Assuming you are generating C code, I think
>the possibilities are roughly:
>
>1) In the single-threaded case, you could explicitly keep a data structure that
>allows the collector to find all roots.  All evidence I've seen so far suggests
>that this costs on the order of a factor of 2 for pointer-intensive C/C++ code.
>Pointer assignments and/or pointer variable creation become expensive.
>Much of the cost here is often associated with preventing the C compiler from
>keeping pointer variables in registers.

One trick (which Dave Detlefs has used, if I recall correctly) is to make
two copies of each stack-allocated pointer, one of which can be register
allocated and the other of which can't.  Pointer writes update both
copies, but reads go only through the register-allocatable one.  The hit
is one store instruction per pointer update.  This still doesn't tell you
exactly how to find the non-register-allocated pointers, though, and a
non-conservative stack scan may mean some extra cost at every pointer
creation.
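
Here's a minimal sketch of the two-copy trick in C++ (all names are mine,
for illustration; this is not Detlefs's actual code):

    struct Node { Node *next; };

    // Gc-visible pointer slots for one stack frame, chained so the
    // collector can walk every frame.  Because the frame's address
    // escapes (via gc_frame_top), the slots can't be register allocated.
    struct GCFrame {
        void    *slots[4];
        GCFrame *prev;
    };
    static GCFrame *gc_frame_top = nullptr;

    static Node *make_node() { static Node n = { &n }; return &n; }
    static void  use(Node *) {}

    void example() {
        GCFrame frame = { { nullptr }, gc_frame_top };
        gc_frame_top = &frame;          // push this frame's slots

        Node *p = make_node();          // fast, register-allocatable copy
        frame.slots[0] = p;             // gc-visible shadow copy

        p = p->next;                    // every pointer update pays one
        frame.slots[0] = p;             // extra store, to the shadow

        use(p);                         // reads touch only the fast copy

        gc_frame_top = frame.prev;      // pop on exit
    }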

One thing I've thought about doing is having the "extra" gc-visible pointer
carry a tag field that would reduce the conservatism (and cost?) of a
conservative stack scan.  The stack-allocated slot would actually be two
fields, one holding a magic cookie value and the other the gc-visible
copy of the pointer.  A stack scan would just look for magic cookies,
and then check the next field.  This would cost an extra instruction at
each pointer creation, to initialize the cookie field.
Some optimization might be possible for multiple pointers in the same
stack frame, if you're generating C code automatically.

Has anybody tried this?
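
For concreteness, here's a rough sketch of the layout and scan I have in
mind (the cookie value and all names are made up):

    #include <cstdint>

    const std::uintptr_t GC_COOKIE = 0xC00C1EA5u;  // made-up magic value

    struct TaggedSlot {
        std::uintptr_t cookie;  // written once, when the pointer is created
        void          *ptr;     // gc-visible copy of the pointer
    };

    // Walk stack words; a cookie marks the next word as a pointer.  A
    // data word that happens to equal the cookie gives a false positive,
    // which is the residual conservatism mentioned above.
    void scan_stack(std::uintptr_t *lo, std::uintptr_t *hi,
                    void (*visit)(void *)) {
        for (std::uintptr_t *w = lo; w + 1 < hi; ++w)
            if (*w == GC_COOKIE)
                visit(*reinterpret_cast<void **>(w + 1));
    }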

>  If you can do this with only 20%
>overhead for pointer-intensive optimized C code, I'm impressed.

Anybody have stats on this?  

For our purposes, it might help significantly to treat the ++ operator
specially.  If a lot of pointer-intensive C code is really just using
++ to iterate through arrays, then we could skip updating the "extra"
gc-visible pointer copy---for a legal C/C++ program, you wouldn't
be stepping outside an object, so just keeping the initial pointer
into the object should work for GC purposes.

We should probably try this with our smart pointer interface for C++.
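
As a smart pointer, the ++ special case might look something like this
(hypothetical names, not our actual interface):

    // The shadow slot keeps the base pointer; operator++ only bumps the
    // fast copy, since a legal C/C++ program stays inside the object.
    template <class T>
    struct gc_ptr {
        T  *fast;    // register-allocatable working copy
        T **shadow;  // address of the gc-visible slot for this pointer

        gc_ptr &operator=(T *q) {
            fast    = q;
            *shadow = q;     // ordinary updates write both copies
            return *this;
        }
        gc_ptr &operator++() {
            ++fast;          // interior step: the shadow still points
            return *this;    // into the same object, so skip the store
        }
        T &operator*() const { return *fast; }
    };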

>  (One
>difference here is that you can optimize in generating the C code.  The other
>experiments of which I'm aware used C++ "smart" pointers, which tend to be
>compiled poorly by C++ compilers.)  This does give you completely portable code
>with "precise" collection.  In the multi-threaded case, this is probably too
>horrible to contemplate.

Smart pointers are definitely a pain.  We use them for C++, but we're not
especially happy with them.  For one thing, you can't specialize an
overloading on storage class (e.g., static vs. auto), which makes it
hard to do some obvious and trivial compile-time optimizations.
(E.g., if you're going to scan the whole stack, you can skip the write
barrier costs for pointer assignments into stack-allocated pointers.)
Another obnoxious thing about C++ is that you can't precisely control
coercions and whatnot to make smart pointers look exactly like built-in
pointers.  C++ is almost good for this, but kind of sucks.  (I wonder
if Ada 95 is better in this regard?)
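
To make the storage-class complaint concrete, here is the write-barrier
half of such an interface, reduced to a sketch (illustrative only;
"remember" stands in for whatever the write barrier actually does):

    inline void remember(void *slot) { (void)slot; /* record the slot */ }

    template <class T>
    class ptr {
        T *p;
    public:
        ptr() : p(nullptr) {}
        ptr &operator=(T *q) {
            remember(&p);   // the barrier runs even for stack slots:
            p = q;          // overload resolution can't see the storage
            return *this;   // class of *this, so we can't specialize
        }
        T *operator->() const { return p; }
        operator T *() const { return p; }  // coercions like this never
    };                                      // quite match built-in pointers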

This is part of the motivation for the "exocompiler" approach we're
developing.  You basically want simple compiler hooks that are a little
cleaner and more powerful than operator overloading and macros, and
then you can do a lot of simple analyses and optimizations outside
the compiler proper, using a compile-time metaobject protocol.

>2) You could try to use debugger information to locate pointers.  Paul Wilson
>has apparently had reasonable results with this sort of approach.  I've
>personally never trusted it enough to rely on it, especially with optimized
>code.  The result wouldn't be portable, but you might avoid the performance 
>hit.

We've never trusted it in general, either.  We don't use debug output to
parse the stack.  We *do* use it to figure out heap layouts of data structures.
This is much more reliable---compiler optimizations generally don't mess
with the layout of heap objects.

For C++, we use smart pointers to do write barrier stuff, but we scan
heap-allocated objects precisely to find the pointers.

This approach is actually more portable than it seems---we can use code
from gdb to parse a variety of debug output formats and give us the info
we need to build type descriptors for use at runtime.  (We use this
for the Texas persistent store as well as our real-time GC.)

(It also doesn't add any runtime overhead---you can compile the program
once with debugging on to get the type info, then recompile with debugging
off to avoid any bloat or interference with optimizations.)
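
For concreteness, the runtime descriptors built from the debug info look
roughly like this (field names are illustrative, not the actual Texas
structures):

    #include <cstddef>

    // One descriptor per heap type, built from parsed debug output.
    struct TypeDescriptor {
        const char        *name;         // type name from the debug info
        std::size_t        size;         // object size in bytes
        std::size_t        nptrs;        // number of pointer fields
        const std::size_t *ptr_offsets;  // byte offsets of those fields
    };

    // Precise scan of one heap object is then just a table walk.
    void scan_object(char *obj, const TypeDescriptor *td,
                     void (*visit)(void **)) {
        for (std::size_t i = 0; i < td->nptrs; ++i)
            visit(reinterpret_cast<void **>(obj + td->ptr_offsets[i]));
    }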

For RScheme, we use a different technique for precisely finding pointers.
We're generating C code automatically, but we're putting in safe points
that are the only places GC can occur (our thread system uses the same
safe points for thread switching).  To a first approximation, the safe
points are at every RScheme procedure call or backward branch, to put
a bound on how long a program can go without checking for a thread
switch or GC.  In between safe points, we can register allocate stuff,
but at safe points everything has to be gc-visible.
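
A rough picture of what the generated code looks like around a safe point
(heavily simplified, and not actual RScheme output):

    void *vm_regs[16];            // gc-visible "registers" (made-up name)
    volatile int pending = 0;     // set when a GC or thread switch is due
    static void handle_event() { pending = 0; /* run GC / switch threads */ }

    void loop_body(int n) {
        void *r0 = vm_regs[0];    // between safe points, values can live
        for (int i = 0; i < n; ++i) {   // in C locals, i.e. in registers
            /* ... compute with r0 ... */

            vm_regs[0] = r0;      // SAFE POINT: flush to gc-visible state
            if (pending)          // checked at every call and backward
                handle_event();   // branch, bounding the delay
            r0 = vm_regs[0];      // reload: the GC may have moved things
        }
    }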

This is not cheap, but it buys us a lot of flexibility for strange language
implementation experiments.  We think we can get the cost of safe points
down to a reasonable level by loop unrolling, etc., but we will never
generate the best code in the world.  (Our current goal is an incredibly
extensible and retargetable compiler with good performance, rather than
all-out screaming performance.  Eventually, a full-blown exocompiler
should generate very good code, but it probably won't do it by compiling
to C.)