GC performance - was Re: [gclist] GC topics
David Chase
chase@centerline.com
Thu, 22 Feb 96 12:52:42 EST
> Well, I guess I must be dump, or misinformed, so please help me.
Misinformed is my guess :-).
> .. My
> initial idea of a conservative GC was about a system which does
> scan memory, and uses heuristics to determine what might be valid pointers.
> If this assumption is true, it maybe is easy, but seems like very low-level
> stuff to me. How efficient can such a test be ?
The outline (to get you a basic conservative GC with annoying pauses,
and some spurious pointers, not the Boehm-Weiser collector) is:
1. Define a root set.
a. registers
b. stack
c. data and bss
Harder:
d. data and bss for shared libraries.
e. thread stacks and registers.
f. mmap'd and shared memory.
These are your root objects, each with an address and a size.
2. Mark from your objects. For each 32-bit-aligned quantity p found
there, you need to decide if it is a pointer.
a. lo <= p < hi ?
b. pageindex = p >> logpagesize
c. pageoffset = p & pageoffmask
c. info = pageinfo[pageindex]
The info may indicate that this is a page full of small objects
or a part of a single large object. Or, it may indicate that
this is an unallocated page. For now, assume the small object
case.
d. objno = pageoffset >> info.log_object_size
At this point, we could be picky about whether or not this is
an interior pointer. In practice, if you are dealing with C/C++,
it is best to treat interior pointers as valid references.
e. markstuff = somefunctionof(info,pageindex,objno)
I'm being vague here because I'm not sure what's the best way.
Old versions of the B-W compiler kept all the mark bits clumped
together at the beginning of the page containing the object, for
the small objects, so the address of the bits (2 per object) is
formed by something like
(p - pageoffset) + (objno >> 2)
so you load that, and then shift it right by
((objno & 3) << 1)
to get at the proper mark bits.
You could also look this stuff up from an array. The point is,
it's not costly, especially on modern hardware that can cheerfully
perform two ALU ops in a single cycle (and which will stall for
at least 3, more likely 6 or 12, on a cache miss). This sort
of stuff can be helpfully hand-scheduled for certain machines
to run really well.
f. If the object is already marked, or already freed, you're done,
if not, then set the mark bit, and queue the object for later
scanning. If it is pointerfree, of course, don't scan it.
3. When all done, sweep.
This is simplified, but not much.
> It maybe is easy to write, but how easy would it be to optimize ? What
> happens with portability ? If I am wrong regarding what such GC systems
> are, please let me know.
Depends on what you mean by "optimize" and "portability". Remember
that I'm also worried about the performance of the entire system,
and conservative GC is the most friendly GC for a very-optimizing
compiler (there's still possible problems, but they're most reduced
for a conservative GC). Or, to be more concrete, let's suppose that
I have an application, using precise GC, and GC takes 5% of the total
time. By switching to conservative GC, let's say I double the time
spent in GC, but I get to use an off-the-shelf compiler/optimizer
for the rest of the code that is 10% faster than what I was using.
Elapsed time goes from 95+5 = 100 seconds to 85.5 + 10 = 95.5 seconds,
or a 4.5% speedup. For throughput-constrained applications, the
conservative GC is a clear winner.
Separating pointer-free objects from pointer-containing objects is
another very helpful optimization. The "is-this-a-pointer-and-where-
does-it-point?" test can be optimized somewhat from a scheduling
point of view (good compilers can do this automatically).
I'll give an example, slightly scheduled, by which I mean I optimistically
execute fault-free ALU operations. In this example, I use a mark
byte, containing at least the following flag values:
1 = alloc
2 = mark
4 = pointer-cointaining
The "info" flag for a page, if it is less than 12, contains the log
of the size of the (power-of-two-sized) objects stored on that page.
Otherwise, it contains some encoding of the larger object containing
that page, that I haven't gotten around to specifying here.
do_mark(p) {
p' = p - lo;
pageoffset = p & 4095;
ok = p' /* unsigned*/ < hi
pageindex = p' >> 12;
if (!ok) return // branch
t2 = p - pageoffset;
info = pageinfo[pageindex]; // load, could be a character
issmall = info < 12;
objno = pageoffset >> info;
t1 = (~0 << info);
objoffset = pageoffset & ~ t1;
mbyteaddr = objno + t2;
if (!issmall) goto big; // branch
mbyte = *(unsigned byte *) mbyteaddr // load
pbegin = p - objoffset;
psize = 1 << info;
t3 = mbyte & 3
mbyte |= 2;
if (t3 != 1) return;
*mbyteaddr = mbyte; // store
if (mbyte & 4) // branch
queue_for_mark(pbegin,psize);
return;
big:
/* Using some encoding on the "info", figure out whether
this is the initial page of a big object or not, and
whether it is marked or not. */
}
I tried to group the operations into independent clumps, and singled
out branches and loads as potentially expensive. Obviously, this
is the sort of thing that you'd like to inline, but I wrote it as
a subroutine for ease of presentation. My guess is that it could
dismiss an obvious non-pointer in 3 cycles. Actual pointers, to
freed or marked objects, would be dismised in about 8 cycles. Pointers
that were marked but not scanned, 9 cycles, and something like 12
cycles for a pointer to an allocated, unmarked, scannable object.
The code could be further improved, though substantially complicated,
by scanning two or four pointers at a time. The idea is to generate
latency between load operations and uses of the operands, so that
the CPU has something to do while waiting for the load to complete
(modern chips do this to some extent, and they'll only do it more
in the future). A cache miss loading the "info" or the mark byte
could easily double the cost of this code, so it is not a bad idea
to attempt to cover the cost of a cache miss.
speaking for myself,
David