[gclist] Fast-allocating non-copying GC's

Paul R. Wilson wilson@cs.utexas.edu
Mon, 18 Mar 1996 10:16:51 -0600


>From majordom@iecc.com Mon Mar 18 01:23:04 1996
>Date: Mon, 18 Mar 96 00:12:07 GMT
>From: geert@sun3.iaf.nl (Geert Bosch)
>To: gclist@iecc.com
>Subject: Re: [gclist] Fast-allocating non-copying GC's
>Precedence: bulk
>Reply-To: gclist@iecc.com
>
>Dear ``garbage collectors'',
>
>After some time of just reading this list, I think it's time to fire
>my share of stupid questions. Currently I'm working on implementing
>a GC-friendly allocator for Ada-95 (a language in the ALGOL family).
>Below I'll give a basic outline of how I want to implement the
>allocator. This posting is rather long, I'm sorry.

You may want to use our GC.  It has a lot of the properties you're
looking for, and the others can be added.  I think it would be an excellent
choice for Ada, since it's a real-time GC and many Ada users are 
real-time types.

We'll give it away under the GNU General Public License (or probably the
less restrictive GNU library license, or something even less restrictive
like a TCL-style license, so people can use it in commercial apps).

You might also consider Hans Boehm's collector, or Joel Bartlett's.


>   If anybody would like to comment on this, I'd really 
>welcome those comments.
>
>Assumptions:
>   - Changes to the compiler are hard or even impossible

  I wonder.  Maybe Hosking, Moss et al.'s changes to the GNU compiler
  will be applicable to Ada 95?   Either way, the same GC design
  will probably be attractive if you want real time.

>   - Implementation will be 'proof of concept' of the
>     used algorithms, so
>   - Portability is not an issue: the allocator should
>     work for a 32-bit computer with VM pages of 4 kB 
>     running the GNAT Ada compiler

Our GC is like this now, though the page size can be changed trivially.

>   - The algorithm should allow GC to proceed concurrently
>     with the mutator, in the presence of OS support for
>     efficient read/write barriers (using VM techniques)


>The choices I had to make, with my arguments:
>
>(1) Reference counting versus Mark and Sweep
>
>    Reference counting only works for certain types of objects
>    without requiring changes to the compiler. Besides that 
>    handling cyclic garbage is too hard, so no reference-
>    counting for me.

I think you mean "Reference counting vs. Tracing" here.  Mark-sweep
is just one kind of tracing.

>(2) Precise versus Conservative GC
>
>    Since changing the compiler (GNAT) in a way that makes 
>    precise GC possible is way too hard and time-consuming,
>    I have chosen for a conservative GC.
>
>    Consequences:
>    
>    (2.1) => There must be a way to quickly determine if a 
>             value can possibly be a pointer.
>
>    (2.2) => Given some pointer (which can possibly point into 
>             an object) it must be possible to determine the 
>             bounds of the object.
> 
>         (?) In Ada only a very small subset of the objects
>             can contain pointers that point inside objects.
>             The compiler knows about these cases. Would
>             it be worth the effort to have the compiler 
>             pass this info to the allocator at allocation 
>             time?

Our GC is suitable for this.  It doesn't currently do conservative collection
(we use precise collection via compiler cooperation for RScheme, and via
a smart pointer interface for C++), but all you'd have to add would
be the conservative scan.  It already handles interior pointers, etc.
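
To make that concrete, the interior-pointer handling amounts to roughly
the sketch below.  This is just an illustration in C with made-up names
(page_info(), deduce_object_base()), not our actual code: given a
candidate word, look up the page it falls on, and if it's a heap page,
use the per-page object size to round the word down to an object
boundary.

    /* Sketch only (hypothetical names): map a candidate word to the start
       of the object it may point into, assuming a heap carved into 4 kB
       pages where all objects on a page have the same size. */
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12                          /* 4 kB VM pages */
    #define PAGE_SIZE  ((uintptr_t)1 << PAGE_SHIFT)

    typedef struct {
        size_t obj_size;   /* size of every object on this page; 0 = not ours */
    } page_info_t;

    extern page_info_t *page_info(uintptr_t addr);  /* per-page table lookup */

    /* Return the object's base address if 'word' could point into the heap,
       or 0 if it can be ruled out. */
    static uintptr_t deduce_object_base(uintptr_t word)
    {
        page_info_t *pi = page_info(word);
        if (pi == NULL || pi->obj_size == 0)
            return 0;                               /* not one of our pages */
        uintptr_t page_start = word & ~(PAGE_SIZE - 1);
        uintptr_t offset     = word - page_start;
        return page_start + (offset / pi->obj_size) * pi->obj_size;
    }

Objects big enough to span pages need a little more bookkeeping (e.g.
each page records how far back the first object starts), but the idea
is the same.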

Our collector also supports precise scanning of objects on the heap,
WITHOUT modifying the compiler.  We have a tool that extracts debug
output from object files and uses it to construct type descriptor records.
(You can then recompile the program without the debugging option, so
there's no runtime space or time hit for having debug output in your
executable.  This is easily automated in the makefile.)  The main version
of this tool uses code from GDB, so it understands any compiler debug
format that GDB understands.  I think it could be adapted to GNAT
trivially.
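
To give a flavor of what the descriptors look like, here's a sketch
(the layout and field names are made up for illustration, not the
tool's actual output): each heap type gets a record of where its
pointer fields live, and a precise scan just walks those offsets.

    /* Sketch: a type descriptor built offline from the compiler's debug
       output.  Illustrative layout, not the tool's real format. */
    #include <stddef.h>

    typedef struct {
        size_t        size;             /* object size in bytes             */
        size_t        num_ptr_fields;   /* number of pointer-valued fields  */
        const size_t *ptr_offsets;      /* byte offset of each such field   */
    } type_descriptor;

    /* Precisely scan one heap object, given its descriptor. */
    static void scan_object_precisely(char *obj, const type_descriptor *td,
                                      void (*trace)(void *))
    {
        for (size_t i = 0; i < td->num_ptr_fields; i++) {
            void *referent = *(void **)(obj + td->ptr_offsets[i]);
            if (referent != NULL)
                trace(referent);        /* mark (or copy) the referent */
        }
    }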

It would be easy to add conservative scanning, especially if you snarfed
the code from Hans's collector.
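
The conservative part itself is just a loop over the stack, registers,
and statics that treats every word as a possible pointer; something
like this sketch, reusing the hypothetical deduce_object_base() from
above, with mark_object() standing in for the collector's marking
entry point:

    /* Sketch: conservatively scan a word-aligned range (e.g. the stack). */
    extern void mark_object(void *obj);        /* stand-in for the marker */

    static void scan_range_conservatively(uintptr_t *lo, uintptr_t *hi)
    {
        for (uintptr_t *p = lo; p < hi; p++) {
            uintptr_t base = deduce_object_base(*p);
            if (base != 0)
                mark_object((void *)base);     /* ambiguous root: mark, never move */
        }
    }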

>(3) Copying/moving versus non-copying/moving
>
>    Using conservative GC implies using a non-copying 
>    and non-moving allocator. If this is not true, I'd 
>    like to find a counter-example. 
>    (There must be some exception ;-)

Yes, Joel Bartlett's mostly-copying GC.  The idea is that you copy
most data, but leave in place the objects you can't safely move.  For
this to work, you want precise scanning of objects in the heap, so that
you know where most pointers are and can copy their referents.

As far as I know, Bartlett's GC currently requires the programmer to
supply methods to scan each kind of object.  Our type descriptor tool
would fix that nicely.  If anybody wants to use it with Bartlett's
GC, let us know.

Anyway, Bartlett's GC just conservatively scans the stack and globals
to find possible ("ambiguous") roots.  Those objects are "pinned"---their
addresses can't change.  Other objects can be relocated freely.
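
In code, the pinning decision looks something like this (a sketch with
invented names, not Bartlett's actual code; it reuses PAGE_SIZE and
deduce_object_base() from the earlier sketches): pages hit by ambiguous
roots get promoted in place, and everything reachable only through
precisely-known pointers is still free to be copied.

    /* Sketch (invented names): the pinning step of a mostly-copying
       collector. */
    extern void promote_page_in_place(uintptr_t page_addr);  /* "pin" a page */

    static void process_ambiguous_root(uintptr_t word)
    {
        uintptr_t base = deduce_object_base(word);
        if (base != 0)
            promote_page_in_place(base & ~(PAGE_SIZE - 1));
        /* Objects referenced only by precise pointers are never pinned,
           so they can be copied and compacted as usual. */
    }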

>(4) Because of (2.1) and (2.2) I've chosen to have a 
>    table in which I can lookup any possible pointer 
>    and get its size and some other attributes which
>    I'll describe in the next section.
>
>    The table has one entry per VM-page of 4 kB, so
>    the overhead for this table is 4 bytes per VM page 
>    of 4 kB. (Sparse allocation is used for this table.
>    For a typical application using 10 MB of memory, only
>    12 or 16 kB RAM is needed for the table)
>
>    (4.1) => A result of this is that objects in the
>             same VM-page must have the same size.

This is pretty much how our GC currently works, but it'd be nice to support
a better allocator.  
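
For what it's worth, the per-page table is cheap to keep sparse with an
ordinary two-level scheme; here's a sketch for a 32-bit machine with
4 kB pages (the 10/10/12-bit split and the names are illustrative, not
our GC's actual layout).  Each leaf covers 4 MB of address space and
costs 4 kB, so a 10 MB heap touches three or four leaves, which is
about the 12 or 16 kB Geert estimates.

    /* Sketch: a sparse two-level page-info table, 32-bit addresses,
       4 kB pages, 4 bytes of info per page.  page_info_t and PAGE_SHIFT
       are as in the earlier sketch. */
    #define TOP_BITS     10
    #define LEAF_BITS    10                     /* 10 + 10 + 12 = 32 */
    #define LEAF_ENTRIES ((uintptr_t)1 << LEAF_BITS)

    typedef struct {
        page_info_t entries[LEAF_ENTRIES];      /* 1024 * 4 bytes = 4 kB */
    } leaf_table_t;

    static leaf_table_t *top_table[1 << TOP_BITS];   /* mostly NULL */

    page_info_t *page_info(uintptr_t addr)
    {
        leaf_table_t *leaf = top_table[addr >> (PAGE_SHIFT + LEAF_BITS)];
        if (leaf == NULL)
            return NULL;                        /* no heap pages up there */
        return &leaf->entries[(addr >> PAGE_SHIFT) & (LEAF_ENTRIES - 1)];
    }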

Segregating objects by size has disadvantages for fragmentation.  The
best allocators (fragmentation-wise) coalesce adjacent free blocks
into large blocks.  I think this can be done without violating the
constraints needed for interior pointers, and fairly efficiently,
but it's hairier.

(See our allocator survey on our web site for more info on basic
allocator design, and some experimental data.)

>Small storage-pools
>~~~~~~~~~~~~~~~~~~~
>The main storage pool is not suitable for small objects.
>Because object sizes are rounded up to the nearest integral
>number of VM-pages, there is much internal fragmentation.
>
>Objects which have more than 25% internal fragmentation when
>directly allocated using the main storage pool are allocated
>from small storage pools. Small storage-pool types are 
>descendants of the main storage-pool type. Note that only
>the pool types are represented by data-structures (Ada tagged
>records). The only per pool data is the 32-bit of information
>in the page-info table.
>
>Pool-selection
>~~~~~~~~~~~~~~
>Objects larger than 3 * VM_Page_Size are always allocated
>directly from the Main storage-pool. For smaller objects 
>a table is used to determine the storage-pool type. The
>actual allocation is done using a dispatching call to the
>pool-type.
>
>Consequences
>~~~~~~~~~~~~
>Positive
>~~~~~~~~
>+ Easy to define custom storage pools for
>  storing pointerfree data, data which can
>  be scanned precisely, data referenced
>  by external pointers etc.
>+ Objects can essentially be 'header-less'
>+ A per object-size freelist can be used,
>  for fast allocation
>+ Implementation is easy
>
>Negative
>~~~~~~~~
>- Locality of reference: different objects allocated
>  at about the same time, might be in different storage pools.

  This is probably much less of a problem than it seems at first glance.
  There's a non-obvious granularity issue here, and I think that segregation
  is probably *good* for locality, all other things being equal.
  (Unfortunately, they're not equal;  coalescing is also good because
  fragmentation is bad for locality.  What you'd probably like to have
  is an allocator that segregates heuristically, but coalesces too.)

  Remember that you have many hundreds or thousands of pages of main
  memory on a reasonable computer.   Suppose you have several pages,
  or perhaps many pages, of each important object size.  As long as
  the objects of each size are well clustered, it usually doesn't
  matter whether they're clustered separately or together.

  For example, suppose I have a data structure that consists of
  a bunch of objects of type A, and a bunch of objects of type
  B.  The A's might be tree nodes in a tree, holding pointers to
  the B's.  If I have 10 pages worth of A's and 12 pages worth of
  B's, and each is well-clustered, I have the data structure clustered
  into 22 pages.  If I have it clustered in an intermixed way,  it
  still takes up 22 pages.

  The overall data structure may have locality properties that are
  favored or disfavored by segregating its components by type.
  For example, if I don't use the tree very often, and it's never
  mostly in memory, it may be good to segregate the tree nodes
  (A's) from the B's.  That way I can search a well-clustered tree and 
  only take one more fault when I access the B I was looking for.
  (Then again, maybe not---do I access a bunch of B's during a search,
  to look at their keys?  Depends on whether my tree nodes store keys,
  or just pointers to objects holding their own key info.)

  The reason I think that segregation is likely to pay off is that
  when you create a bunch of objects of different types, you may
  have only objects of one or two types survive to be part of some
  longer-lasting data structure.  If you interleave objects of
  different sizes during allocation, you may lose, because you'll
  be spacing out those objects that end up being part of the same
  structure.  What goes in between *may* be fine---it may be another
  kind of object that's part of the same data structure, or a data
  structure that will be used at the same time---but it may not, in
  which case you may lose big.

  My guess is that separating objects of different sizes usually doesn't
  hurt, but interleaving them *may* hurt a lot.  Not enough data yet
  to be sure.

>- I can see really horrible fragmentation scenarios:
>     (Imagine this on a machine with 8 MB of free 
>      physical RAM for the application, and a 256 kB
>      cache)
>
>    1. Allocate 7 MB worth of objects
>       with a size of 64 bytes.
>    2. Free most of them, keeping objects nr N,
>       where (N * Object_Size) mod VM_Page_Size = 0.
>    3. Repeat (1) and (2) dividing the object size by 2 until
>       it's 4 bytes
>    4. Now we've 222208 bytes of living objects left, 
>       taking up 35 MB of RAM
>    5. Now the application starts accessing these remaining
>       objects randomly.

This is the worst case.  I don't think it's ever been observed, but
the data are sketchy.
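
(If I follow the scenario, the arithmetic does check out: each pass
leaves one survivor per 4 kB page, 7 MB is 1792 pages, and the sizes
sum to 64+32+16+8+4 = 124 bytes, so you keep 1792 * 124 = 222208 bytes
of live data pinning down 5 * 1792 pages = 35 MB.)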

>Allocating new objects on 'old' pages first would be good to 
>help prevent fragmentation problems as described above. Old 
>pages will live long anyways, so the gaps in it are essentially
>``free''. A young page might die quickly when new objects are
>allocated on old pages first.

This may not be a good idea.  If the old objects *do* die, you've gone
and put other live objects on their pages, and those may *not* die.  If
you left the old pages alone, then you'd get the whole pages back when
the old objects die. 

Also, it may have seriously weird locality consequences.  By mixing
objects created during different phases, you may end up diluting
your locality---when you need the old objects in memory, you have
to bring in the young ones, and vice versa.

>The point is that I don't know how bad the fragmentation problems
>will be in practice with an allocator that groups objects of the
>same size. If somebody has information about the trade-offs between
>the different kinds of fragmentation and locality of reference, I'd
>like to know.

Yeah, me too :-).

At this point, this stuff is a black art.  I hope we'll have a bunch
of fun new data Real Soon Now, though.

>Thanks for reading, anyway!
>
>   Geert Bosch
>