GC performance - was Re: [gclist] GC topics

David Chase chase@centerline.com
Thu, 22 Feb 96 12:52:42 EST


> Well, I guess I must be dumb, or misinformed, so please help me.

Misinformed is my guess :-).

> .. My
> initial idea of a conservative GC was about a system which does
> scan memory, and uses heuristics to determine what might be valid pointers.
> If this assumption is true, it maybe is easy, but seems like very low-level
> stuff to me. How efficient can such a test be ?

The outline (to get you a basic conservative GC with annoying pauses,
and some spurious pointers, not the Boehm-Weiser collector) is:

1. Define a root set.
   a. registers
   b. stack
   c. data and bss

   Harder:

   d. data and bss for shared libraries.
   e. thread stacks and registers.
   f. mmap'd and shared memory.
   
   These are your root objects, each with an address and a size.
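Step 1 can be sketched as a flat table of (address, size) pairs. The names below (`struct root`, `scan_roots`) are hypothetical, and the registers are assumed to have been spilled to the stack first:

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per root area: the stack, data and bss, thread stacks,
   and so on.  Each root is just an address and a size. */
struct root { void *base; size_t size; };

static uintptr_t visit_sum;                 /* toy visitor: sum the words */
static void sum_word(uintptr_t w) { visit_sum += w; }

/* Walk every pointer-aligned word in every root area.  A real collector
   would feed each word to the "is this a pointer?" test of step 2.
   Returns the number of words visited. */
static size_t scan_roots(const struct root *roots, size_t n,
                         void (*visit)(uintptr_t))
{
    size_t words = 0;
    for (size_t i = 0; i < n; i++) {
        const uintptr_t *area = roots[i].base;
        size_t count = roots[i].size / sizeof(uintptr_t);
        for (size_t j = 0; j < count; j++, words++)
            visit(area[j]);
    }
    return words;
}
```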
   
2. Mark from your root objects.  For each 32-bit-aligned quantity p found
   there, you need to decide whether it is a pointer.
   
   a. lo <= p < hi ?
   b. pageindex = p >> logpagesize
   c. pageoffset = p & pageoffmask
   d. info = pageinfo[pageindex]
   
   The info may indicate that this is a page full of small objects
   or a part of a single large object.  Or, it may indicate that 
   this is an unallocated page.  For now, assume the small object 
   case.

   e. objno = pageoffset >> info.log_object_size

   At this point, we could be picky about whether or not this is 
   an interior pointer.  In practice, if you are dealing with C/C++,
   it is best to treat interior pointers as valid references.
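Treating interior pointers as valid means normalizing p back to its object's base before marking. A minimal sketch for the small-object case, assuming 4 KB pages and power-of-two object sizes (the function name is mine):

```c
#include <stdint.h>

/* Given a (possibly interior) pointer into a page of power-of-two-sized
   objects, return the base address of the object it points into.
   log_object_size is that page's "info" value from the outline. */
static uintptr_t object_base(uintptr_t p, unsigned log_object_size)
{
    uintptr_t pageoffset = p & 4095;                       /* 4 KB pages */
    uintptr_t objoffset  = pageoffset & ((1u << log_object_size) - 1);
    return p - objoffset;               /* back up to the object's start */
}
```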
   
   f. markstuff = somefunctionof(info,pageindex,objno)
   I'm being vague here because I'm not sure of the best way.
   Old versions of the B-W collector kept all the mark bits clumped
   together at the beginning of the page containing the object (for
   the small objects), so the address of the bits (2 per object) is
   formed by something like

     (p - pageoffset) + (objno >> 2)

   so you load that, and then shift it right by
   
     ((objno & 3) << 1)

   to get at the proper mark bits.

   You could also look this stuff up from an array.  The point is, 
   it's not costly, especially on modern hardware that can cheerfully
   perform two ALU ops in a single cycle (and which will stall for
   at least 3, more likely 6 or 12, on a cache miss).  This sort 
   of stuff can be helpfully hand-scheduled for certain machines
   to run really well.
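   The clumped-mark-bit computation above can be written out as follows
   (2 mark bits per object, stored at the start of the object's page;
   4 KB pages and the function name are my assumptions):

```c
#include <stdint.h>

/* Fetch an object's two mark bits.  (p - pageoffset) is the page base,
   where the page's mark bits are clumped; objno >> 2 picks the byte
   (4 objects per byte), and (objno & 3) << 1 picks the bit pair. */
static unsigned mark_bits(uintptr_t p, uintptr_t pageoffset, unsigned objno)
{
    const uint8_t *byte = (const uint8_t *)((p - pageoffset) + (objno >> 2));
    unsigned shift = (objno & 3) << 1;
    return (*byte >> shift) & 3;
}
```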

   g. If the object is already marked, or already freed, you're done.
   If not, set the mark bit and queue the object for later
   scanning.  If it is pointer-free, of course, don't scan it.

3. When all done, sweep.

This is simplified, but not much.

> It maybe is easy to write, but how easy would it be to optimize ?  What 
> happens with portability ? If I am wrong regarding what such GC systems 
> are, please let me know.

Depends on what you mean by "optimize" and "portability".  Remember 
that I'm also worried about the performance of the entire system, 
and conservative GC is the most friendly GC for a very-optimizing 
compiler (there are still potential problems, but they are smallest 
for a conservative GC).  Or, to be more concrete, let's suppose that 
I have an application, using precise GC, and GC takes 5% of the total 
time.  By switching to conservative GC, let's say I double the time 
spent in GC, but I get to use an off-the-shelf compiler/optimizer 
for the rest of the code that is 10% faster than what I was using.  
Elapsed time goes from 95+5 = 100 seconds to 85.5 + 10 = 95.5 seconds, 
or a 4.5% speedup.  For throughput-constrained applications, the 
conservative GC is a clear winner. 
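The arithmetic above, as a throwaway helper (the 95/5 split, the doubled GC time, and the 10% compiler win are assumed inputs, not measurements):

```c
/* Fractional speedup from switching GCs: mutator time shrinks by
   compiler_win, GC time grows by gc_factor. */
static double conservative_speedup(double mutator, double gc,
                                   double compiler_win, double gc_factor)
{
    double before = mutator + gc;                       /* e.g. 95 + 5  */
    double after  = mutator * (1.0 - compiler_win)      /* e.g. 85.5    */
                  + gc * gc_factor;                     /* e.g. 10      */
    return (before - after) / before;
}
```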

Separating pointer-free objects from pointer-containing objects is 
another very helpful optimization.  The "is-this-a-pointer-and-where- 
does-it-point?" test can be optimized somewhat from a scheduling 
point of view (good compilers can do this automatically). 

I'll give an example, slightly scheduled, by which I mean I optimistically 
execute fault-free ALU operations. In this example, I use a mark 
byte, containing at least the following flag values: 

  1 = alloc
  2 = mark
  4 = pointer-containing

The "info" flag for a page, if it is less than 12, contains the log 
of the size of the (power-of-two-sized) objects stored on that page.  
Otherwise, it contains some encoding of the larger object containing 
that page, that I haven't gotten around to specifying here. 

do_mark(p) {

  p' = p - lo;
  pageoffset = p & 4095;

  ok = p' /* unsigned */ < hi - lo;
  pageindex = p' >> 12;

  if (!ok) return;                      // branch

  t2 = p - pageoffset;
  info = pageinfo[pageindex];           // load, could be a character

  issmall = info < 12;
  objno = pageoffset >> info;
  t1 = (~0 << info);
  
  objoffset = pageoffset & ~t1;
  mbyteaddr = objno + t2;
  if (!issmall) goto big;               // branch

  mbyte = *(unsigned char *) mbyteaddr; // load
  pbegin = p - objoffset;
  psize = 1 << info;
  t3 = mbyte & 3;

  mbyte |= 2;
  if (t3 != 1) return;

  *mbyteaddr = mbyte;                   // store
  if (mbyte & 4)                        // branch
     queue_for_mark(pbegin,psize);

  return;

big:
  /* Using some encoding of the "info", figure out whether
     this is the initial page of a big object or not, and
     whether it is marked or not. */

}

I tried to group the operations into independent clumps, and singled 
out branches and loads as potentially expensive. Obviously, this 
is the sort of thing that you'd like to inline, but I wrote it as 
a subroutine for ease of presentation. My guess is that it could 
dismiss an obvious non-pointer in 3 cycles.  Actual pointers, to 
freed or marked objects, would be dismissed in about 8 cycles.  Pointers 
that were marked but not scanned, 9 cycles, and something like 12 
cycles for a pointer to an allocated, unmarked, scannable object. 
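For completeness, the loop driving do_mark might look like the sketch below, with a toy fixed-size worklist standing in for queue_for_mark (all names hypothetical; the real scan would run each word through the pointer test above):

```c
#include <stddef.h>
#include <stdint.h>

#define MAXQ 64
struct extent { const uintptr_t *begin; size_t size; };  /* (pbegin, psize) */
static struct extent markq[MAXQ];
static int qlen;

static void queue_for_mark(const uintptr_t *begin, size_t size)
{
    markq[qlen].begin = begin;
    markq[qlen].size  = size;
    qlen++;
}

static size_t words_scanned;                 /* stand-in for do_mark(word) */
static void mark_word(uintptr_t w) { (void)w; words_scanned++; }

/* Pop queued extents and scan every word they contain, conservatively. */
static void mark_loop(void)
{
    while (qlen > 0) {
        struct extent e = markq[--qlen];
        for (size_t i = 0; i < e.size / sizeof(uintptr_t); i++)
            mark_word(e.begin[i]);
    }
}
```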

The code could be further improved, though substantially complicated, 
by scanning two or four pointers at a time. The idea is to generate 
latency between load operations and uses of the operands, so that 
the CPU has something to do while waiting for the load to complete 
(modern chips do this to some extent, and they'll only do it more 
in the future). A cache miss loading the "info" or the mark byte 
could easily double the cost of this code, so it is not a bad idea 
to attempt to cover the cost of a cache miss.
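The two-at-a-time idea, sketched in a toy one-page setting (lo, hi, pageinfo, and mark_one are hypothetical stand-ins): the address arithmetic and range tests for p and q are independent, so a scheduler can have both pageinfo loads in flight at once, covering part of a miss.

```c
#include <stdint.h>

static uintptr_t lo, hi;                 /* heap bounds, set at startup */
static unsigned char pageinfo[1];        /* one 4 KB page in this toy   */
static int marks_taken;

static void mark_one(uintptr_t p, unsigned info)
{
    (void)p; (void)info;                 /* the per-object work of do_mark */
    marks_taken++;
}

/* Process two candidate words per call: the ALU work for p and q is
   independent, giving the CPU something to do during either load. */
static void do_mark2(uintptr_t p, uintptr_t q)
{
    uintptr_t pp = p - lo, qq = q - lo;  /* independent range tests */
    int okp = pp < hi - lo;
    int okq = qq < hi - lo;
    unsigned infop = okp ? pageinfo[pp >> 12] : 0;   /* loads can overlap */
    unsigned infoq = okq ? pageinfo[qq >> 12] : 0;
    if (okp) mark_one(p, infop);
    if (okq) mark_one(q, infoq);
}
```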

speaking for myself,

David