[gclist] write barrier via VM traps
Paul R. Wilson
wilson@cs.utexas.edu
Tue, 27 May 1997 11:45:09 -0500
I agree with Marc and Eliot that on standard OS's, trap costs dominate
and scanning times are secondary.
Two relevant pieces of work are Jochen Liedtke's work on the L4
microkernel, which shows how to make traps really really fast,
and Hank Levy (and somebody else at Washinton) has shown simple
ways to make them significantly faster. (One simple trick
is to let trap handlers simply resume execution of the program
that trapped, without returning into the kernel and then
having the kernel resume the program---you only have to go in and
out of the kernel once, rather than twice.
As far as lobbying vendors, Apple has used sub-page protections
already. The Newton guys got the ARM guys to put sub-age
protections into the ARM 6 processors, in support of things
like persistence and compressed virtual memory. Walter Smith
and Bob Welland were the guys behind this at Apple. Walter
is now at Microsoft, but you could probably get a testimonial
from him about the virtues of this idea.
Eliot is all too right about the fact that there aren't enough
empirical studies of this stuff. I think this is really unfortunate
because sub-page protections are useful for a bunch of things, including
efficient checkpointing and adaptive disk prefetching, but all of
those are poorly studied at this point. All of them are very relevant
to Java, too. Java needs persistence, as Sun knows. If it ends
up being used for performance-critical distributed programming,
it will also benefit a lot from efficient checkpointing to cope
with node failures.
Sub-page protections and fast traps would also be very good for
distributed virtual memory, where the hardware page size can be
a huge problem due to false sharing.
So sub-page protections ought to be pitched for multiple purposes,
rather than just GC. Efficient and flexible VM primitives should
let you use the TLB for a variety of advanced things.
WRT GC in particular, the fastest write barrier seems to be
Ungar and Hoelzle's variant of my variant of Sobalvarro's
"card marking" write barrier. They use bytes as dirty bits,
which allows them to mark a card in two or three instructions.
This approach has limitations, though---the bitmap (bytemap,
really) is larger, and it would be advantageous to have a two-level
structure like Sobalvarro's original scheme, IF you can make it
fasy. With very fast traps or access to dirty bits, you could
use the pagewise dirty bits as a coarse-grained map of the
areas to be scanned, with software dirty bits telling you which
small cards to actually scan. (This is too slow if you do both
levels purely in software.)
If you add access to sub-page dirty bits, that would probably
be the best thing---you wouldn't need trap overhead, the unit
to be scanned wouldn't be too big, and you wouldn't need
a software write barrier on each pointer store.