[gclist] taking bets on a new collector
Tue, 19 Aug 2003 14:18:08 -0700 (PDT)
> From: "Boehm, Hans" <email@example.com>
> Overall, not having studied this in great detail, I would guess that
> this design will give you appreciably less throughput than our
> collector, even in its incremental mode, but probably with shorter
> pause times.
That's pretty much my bet, too, although I'm hoping it will be fun to
try to give you a run for your money throughput-wise and, who knows,
maybe there will be some emergent optimizations I can do with access
patterns that will make it a real horse race.
Part of the reason I'm interested in persuing this is to consider the
Can the added overhead of enforcing the three-color invariant
in the mutator be leveraged to give the tracer compensatory
In other words, can I use the extra work done by the mutator's write
barrier to make the tracer faster than it would be in a
non-incremental collector? Sufficiently faster that overall
throughput is improved? Taking advantage of properties of caches and
paging are what _might_ make that possible.
> Is there a reason not to sweep incrementally during allocation
> (see e.g. my ISMM 2000 paper)? That tends to reduce the memory
> cost of sweeping from 1 extra cache miss per object to none.
This collector does sweep incrementally, during allocation.
And I hadn't read the paper but I have now. It makes a lot of sense
and is interesting.
> Speaking from experience, there are two distinct problems with
> keeping metadata in the same page with the actual data:
Thanks for this, by the way.
> 2) Changing the color of a pointer-free object touches the page.
> If paging ever becomes an issue, this is likely to be fatal,
> since it will often double the number of pages that the
> collector needs to be memory resident.
That's a troubling observation that I hadn't thought about at all.
This is one of the results from your paper, I see. I'm not _certain_
it will be applicable to my collector.
For example, each GC-bits header in my collector, on a system on which
operating system pages are 1K words, describes objects allocated
across 4 operating system pages. So if there are N operating system
pages containing live, pointerless objects, my tracer will only be
visiting N/4 pages for those. (To be sure, the GC bits in a header
occupy only 1/8 of the OS page on which they reside -- so packing them
into a separate space would dramatically reduce the number of pages
the tracer touches when coloring pointerless objects -- but I'm saying
that in this case, that number of pages is already discounted so that
optimizing it further _may_ be less worthwhile.)
Another example: since this is an incremental collector, aside from
the tracer, the other interesting question is the number of pages that
are examined by the mutator while looking at GC bits. The mutator
pays some markup in operating system pages touched. With integrated
headers, that mark-up is strictly less than with separated headers.
(In the case of my current allocator, it's a bit under than 25% less.)
One optimization I had in mind -- the pseudo-generational hack that I
described -- will help to schedule the tracer so that it more often
opportunisticly follows the mutator in its access patterns.
But, whatever -- it'll be interesting to play around with and thanks
for the tip. It's a relatively small matter for me to modify my
collector to separate mark bits. I'm not going to do that eagerly --
lemme build some more of the surrounding application so that I have
something to measure and then it will make sense to try.
And in terms of the larger system I'm building: my hope is to get the
interface to the allocator/collector Right in the sense that it's easy
to swap out the incremental collector for a different one. When it's
time to write a stop-and-mark/incremental-sweep collector, I now know
what to do!
> 1) Any part of the algorithm that looks mostly at a specific
> piece of metadata tends to perform poorly with low associativity
> caches. In a direct mapped 4K cache, all instances of a given
> per-page metadata field collide. For real caches, the effect is
> generally less severe, but still noticeable.
Gah. Dumb hardware design ;-)
p.s.: modern hardware is _such_ fun:
It is worth pointing out that in our experiments,
the execution time always varies inversely with the
number of instructions executed.
-- Hans J. Boehm