[gclist] Fast-allocating non-copying GC's

Paul R. Wilson wilson@cs.utexas.edu
Fri, 15 Mar 1996 17:30:08 -0600


>To: gclist@iecc.com
>From: David.Ungar@Eng.Sun.COM (David Ungar)
>Subject: Re: [gclist] Fast-allocating non-copying GC's
>
>At 9:46 AM 3/15/96, David Chase wrote:
>...
>>
>>> BTW, I tried a "customizing allocator" (aka multiple free lists)
>>> in the first version of Berkeley Smalltalk.
>>> My experience with it was so different from Paul and David Chase's that
>>> I think context (e.g. St vs. C, interactive vs. real-time)
>>> must play a very important role.
>>
>>This is weird.  I have gone out and measured this myself, and friends
>>of mine have measured it as well in other applications, and what
>>we've always discovered is that something like 90% of the objects
>>allocated fall into two or three (smallish-object-size) lists.
>>This supports your assertion that header size matters, because an
>>8-byte header is a big deal if the user only requests 16 bytes of
>>memory for data.


I was going to say:
  Not necessarily (though I think the assertion is basically true to
  a significant extent).  Short-lived objects don't matter much to
  memory usage---or to locality, if you have a roughly LIFO free-list
  discipline.

But of course I had assumptions creeping around the back of my head;
I've been thinking about conventional allocators too much, not GC's.

For a GC, it's different, because you don't get any of the space
back until the next GC of the youngest generation.  So all the
headers do sit around taking up space.

As far as I know, though, there's not a very strong correlation between
size and lifetime, except that very large objects may tend to be
particularly long-lived.  That alone may be significant, however.
We should look into this---we have some data lying around waiting
to be analyzed.

If there *is* a positive correlation between size and lifetime, GC's
are losing with respect to immediate manual reclamation in the sense
that they lose space to headers on small objects for a longer time.

With explicit reclamation, you can allocate small objects as often as
you like and not use much space for them, because you reuse the same
space very quickly.
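
The classic way to get that quick reuse is size-segregated free lists
used as stacks.  A minimal sketch (names made up):

    #include <stddef.h>
    #include <stdlib.h>

    #define NCLASSES 8
    #define GRAIN    16                 /* size classes 16, 32, ..., 128 */

    typedef struct block { struct block *next; } block;
    static block *free_lists[NCLASSES];

    /* Freed blocks are pushed on the front of their class's list and
       popped off the front, so reuse is LIFO: the block you get back
       is usually the one freed most recently, likely still in cache. */
    static void *alloc_small(size_t size) {
        size_t c = (size + GRAIN - 1) / GRAIN - 1;
        if (c >= NCLASSES) return malloc(size);     /* big: general pool */
        if (free_lists[c]) {
            block *b = free_lists[c];
            free_lists[c] = b->next;
            return b;
        }
        return malloc((c + 1) * GRAIN);             /* list empty: get more */
    }

    static void free_small(void *p, size_t size) {
        size_t c = (size + GRAIN - 1) / GRAIN - 1;
        if (c >= NCLASSES) { free(p); return; }
        block *b = p;
        b->next = free_lists[c];                    /* push on front: LIFO */
        free_lists[c] = b;
    }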

FIFO reuse is a locality disaster if the interval of reuse is bigger than
the cache (or main memory, viewed as a VM cache) size.  It systematically
defeats LRUish caching strategies because the *oldest* free memory
(probably not touched for a long time) is reused first. 

This brings up an interesting point.  Most GC's reuse memory in a nearly
FIFO manner.  If your reuse interval crosses over the cache size, you're
hosed---at the main memory level, you suddenly start paging as fast as
you're allocating, rather than paging just a little. 

To avoid this, you can reuse memory in a mostly LIFO fashion.  This is
obnoxious if you want to get it absolutely right---e.g., in a copying
collector you'd want to allocate upward through memory on every other
cycle, and downward through memory on the others.  (I think I pointed
this out in some paper or other, and I think somebody else did so
independently.  (Ben Zorn?))

(The idea here is that you reuse the memory you allocated most lately,
so you won't start missing the cache until you've allocated back
through a cache-full of memory that was the last memory you allocated
last time.  If your reuse interval almost fits in the cache, you'll
only miss toward the end of a GC cycle.  The amount of missing is
therefore roughly proportional to the stuff that won't fit in cache,
rather than to the allocation rate.)
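
Here's roughly what the alternation looks like for a contiguous
allocator (just a sketch; the collection itself and survivor handling
are elided):

    #include <stddef.h>

    /* Bump allocation that flips direction after each collection: an
       upward cycle ends near the top of the space, and the following
       downward cycle starts there, so the memory reused first is the
       memory allocated (and touched) most recently. */
    static char heap[1 << 20];
    static char *low  = heap;                   /* upward bump pointer   */
    static char *high = heap + sizeof heap;     /* downward bump pointer */
    static int upward = 1;

    static void *gc_alloc(size_t n) {
        n = (n + 7) & ~(size_t)7;               /* 8-byte alignment */
        if ((size_t)(high - low) < n)
            return NULL;                        /* time to collect */
        if (upward) { void *p = low; low += n; return p; }
        else        { high -= n; return high; }
    }

    static void gc_done(void) {   /* survivors assumed evacuated elsewhere */
        upward = !upward;         /* next cycle allocates the other way */
        low  = heap;
        high = heap + sizeof heap;
    }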

Luckily, we don't have to get this exactly right to get the major
benefit.  For a collector like Bartlett's, where memory isn't contiguous,
you can do it on a pagewise basis.  You allocate in pages that have been
freed most lately, but still allocate in the same direction as usual
within that page.  This will give you the basic LIFO effect at the
granularities that matter.  (You'll reuse a cache full of recently-used
pages before you reuse other pages;  this will help at the level of
any cache big enough to hold multiple pages.)
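
In sketch form, the page-level version is just a stack of pages:

    #define PAGESIZE 4096

    typedef union page { union page *next; char bytes[PAGESIZE]; } page;
    static page *free_pages;       /* head = most recently freed page */

    /* At GC time, push each page that came entirely free on the front,
       so it's the next page the allocator grabs. */
    static void release_page(page *p) {
        p->next = free_pages;
        free_pages = p;
    }

    /* When the allocator needs a fresh page, pop from the front, then
       allocate in the same direction as usual within the page. */
    static page *grab_page(void) {
        page *p = free_pages;
        if (p) free_pages = p->next;
        return p;
    }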

We thought about this a little in designing our treadmillish fake-copying
collector.  One possibility is to allocate through the free list backwards,
but that's tricky to implement without some overhead, so we settled for
something weaker.  When we link a list of reclaimed blocks
into the free list, we might want to put it at the front of the free
list, ahead of free memory that had been free for longer.  I think we
did that, but I don't yet know how much difference it makes.
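
The weaker version is just a splice at the head of the list, something
like (sketch; the real code differs in detail):

    typedef struct blk { struct blk *next; } blk;

    /* Link a just-reclaimed chain of blocks ahead of memory that has
       been free longer, so the reclaimed blocks are reused first. */
    static void splice_reclaimed(blk **free_list, blk *first, blk *last) {
        last->next = *free_list;     /* older free memory goes behind */
        *free_list = first;          /* newest free memory allocated next */
    }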

>>  (And this is why I like to use BIBOP whenever
>>possible, and why I think that the language's type system and
>>GC headers should be integrated whenever possible.)
>
>I agree about the lists working w.r.t. object sizes.
>What led me away from the free-list allocator towards the
>contiguous allocator was the speed issue.
>
>With a contiguous allocator, you can keep the
>allocation pointer in a register, and even put its
>limit in either a register or an immediate field of an instruction.
>The new, hot stuff is all contiguous in memory, which helps VM
>and caching behavior--yes I know you can blow the cache anyway,
>but it still seems better than following lists.

This is kind of subtle.  (The following is about conventional allocators,
not GC's.  GC's are even more subtle.  Bleah.)

A good nonmoving allocator gets surprisingly good spatial locality of
memory reuse.  For example, for best fit we have pictures that clearly
show that most objects come free at almost the same time as their neighbors,
due to phase behavior.  This means that you tend to get back relatively
large contiguous blocks.  That in turn means that when you allocate
that memory later, you do it by splitting off consecutively-located
small blocks from a big block (i.e., in address order).  (This is also
why fragmentation for real programs seems to be very low.  Any allocator
that doesn't have good spatial locality of memory reuse is likely to
have bad fragmentation too.)

There are potential gotchas here, though.  Coalescing the small
blocks back together may itself cause misses.  If you allocate more than
a cache-full of blocks before most of them come free, a lot of
them will be out of the cache.  Just coalescing them together
in the usual way will touch all of their headers as they are
freed.  Bad news.  (Of course, if you get the space back quickly
enough, it doesn't matter---they'll generally still be in the
cache and there's no problem.  But then, in that case there's
no problem anyway.)
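
The header-touching is easy to see in a generic boundary-tag coalescer.
A sketch (not any particular allocator; sentinel tags at the heap ends
are assumed, and free-list unlinking is omitted):

    #include <stddef.h>

    /* Boundary tags: each block is [header][payload][footer], with the
       size and free bit duplicated so both neighbors can be examined. */
    typedef struct { size_t size; size_t free; } tag;

    static tag *footer_of(tag *h) {
        return (tag *)((char *)(h + 1) + h->size);
    }

    static void coalesce_free(void *payload) {
        tag *h = (tag *)payload - 1;
        tag *nh = footer_of(h) + 1;    /* successor's header: possible miss */
        if (nh->free)
            h->size += 2 * sizeof(tag) + nh->size;
        tag *pf = h - 1;               /* predecessor's footer: possible miss */
        if (pf->free) {
            tag *ph = (tag *)((char *)pf - pf->size) - 1;
            ph->size += 2 * sizeof(tag) + h->size;
            h = ph;
        }
        h->free = 1;
        *footer_of(h) = *h;            /* keep the footer in sync */
    }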

Now consider a free list allocator, or a general coalescing allocator
with deferred coalescing.  (That is, most of the time you just use
``quick lists'' in a stacklike fashion, but sometimes you do coalescing
to fight fragmentation.)

This kind of allocator has really good temporal locality---the LIFO
use of conventional free lists ensures that you usually use memory
soon after it's freed.  (Of course, if you use a FIFO discipline, you
make things worse.  How much worse is a very subtle issue.)
This is good, as far as it goes.  (This kind of allocator also tends
to have good spatial locality if the original chopping up of memory
does.)

One potential problem is that when this isn't enough to keep the
basic miss rate down, the misses you get are *read* misses.  Depending
on the architecture, read misses may be much more expensive than
write misses---or no different at all.  (If you have a write-validate
cache, you can allocate in the cache without reading garbage from
memory.  If you don't, you just fault on the block and read the
garbage in before you can overwrite it.) 

The reason you get read misses is that you're searching free lists,
reading the headers.

You can delay the read misses, if you allocate objects without looking
at their headers---you can allocate a block without looking at
its header, and only look at its header the next time you allocate
from that list.  But *then* you'll take a read miss if you've got
an architecture that's friendly to write misses and didn't load
the whole cache block---you'll miss on the part it didn't load.
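
A sketch of the deferred read (free() would have to flush the deferred
pointer first; I'm glossing over that):

    #include <stddef.h>

    /* The link word lives in a header the application never touches.
       Handing out the head block doesn't dereference it; the read of
       its link (the potential read miss) is postponed until the next
       allocation from the same list. */
    typedef struct blk { struct blk *next; } blk;
    typedef struct { blk *head; blk *deferred; } freelist;

    static void *alloc_deferred(freelist *fl) {
        if (fl->deferred) {                 /* the postponed header read */
            fl->head = fl->deferred->next;
            fl->deferred = NULL;
        }
        blk *b = fl->head;
        if (!b) return NULL;                /* refill the list here */
        fl->deferred = b;                   /* don't touch b->next yet */
        return b + 1;                       /* payload follows the header */
    }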

One possibility would be to use some kind of weird load instruction
that will load the missing header words in a nonblocking way, so
that it can be overlapped with whatever the application program
is doing to initialize the object.  We're fairly daunted by that
kind of hairiness.

Another possibility is to keep the free list information off to the
side, rather than as headers directly on the objects.  You'll still
take misses on that stuff, but since you can make it contiguous
in memory, you'll get good spatial locality and take fewer misses.
That's pretty hairy to make really fast.  (Once you do that, you
might want to allocate from a bitmap instead of free lists, like Hans
does.)
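
A sketch of the bitmap approach (the general idea only, not Hans's
actual code; __builtin_ctz is GCC/Clang's find-first-set):

    #include <stddef.h>
    #include <stdint.h>

    #define GRAIN   16
    #define NGRAINS 4096

    static char heap[NGRAINS * GRAIN];
    static uint32_t bitmap[NGRAINS / 32];   /* set bit = grain is free */

    /* The sweep sets a bit for each grain it reclaims. */
    static void mark_free(void *p) {
        size_t g = (size_t)((char *)p - heap) / GRAIN;
        bitmap[g / 32] |= UINT32_C(1) << (g % 32);
    }

    /* Allocation scans the dense bitmap instead of headers scattered
       through the heap, so the metadata misses are clustered. */
    static void *alloc_grain(void) {
        for (size_t w = 0; w < NGRAINS / 32; w++) {
            if (bitmap[w]) {
                int bit = __builtin_ctz(bitmap[w]);
                bitmap[w] &= ~(UINT32_C(1) << bit);   /* now allocated */
                return heap + (w * 32 + bit) * GRAIN;
            }
        }
        return NULL;                                  /* time to GC */
    }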

>Then, when it's time to free those dead young objects,
>you just reuse the space, without overhead for
>doubly-linked lists.

>I guess my disaffection for free lists comes as much
>from a look at reclamation as from the free list part,
>and pertains to a ST/Self/Friendly-compiler
>context.

For the most part, being able to allocate contiguously without
touching the free memory first is advantageous if you have
a write-validate cache.  How important this is depends on a lot
of things:

 * how good your basic locality is.  (If you're not missing much
   anyway, friendliness to write misses isn't an issue.)

 * whether you have a write-validate cache.  Many machines don't.

 * what kinds of data structures and algorithms you use.  (If most
   of your misses are for grinding on big vectors or arrays, it
   doesn't matter much.  Header misses are no big deal.)

 * things like prefetching, cache-to-memory bandwidth, and how
   closely clustered your misses are.  Prefetching may work
   almost as well as write-validate, by fetching stuff before
   it's needed without missing on it.  Then again, it may not,
   because it may be hard to issue prefetches far enough ahead
   to do any good, or you may get bandwidth-limited if you're
   fetching too much stuff you don't need.  (E.g., fetching
   whole cache blocks when all you need is the header, because
   the rest of the block is garbage space that's going to
   get overwritten during initialization anyway.)


In trying to pick between basic GC types, you have to remember that
copying GC's are losing in terms of overall misses---they have to
have space to copy into.  Maybe they make that up by having cheaper
misses (write misses) for some architectures.  Maybe they don't.
You usually have to evict something even if you're allocating
in-cache without reading.  If you have enough write bandwidth and
write buffering, that may work fine.  If not, you may lose.

Sliding compaction wins if it helps everything fit in cache, but
loses bigtime if it doesn't, by making multiple passes over the data, at
least sometimes in FIFO order.  (Maybe the direction of a pass
can be changed trivially, though---e.g., the pointer-adjustment
phase can probably work either way.  That might cut paging
significantly.)

What a mess.

I haven't been able to pick a winner yet on these kinds of grounds.