[gclist] precise access barrier with hardware dirty bits

Paul R. Wilson wilson@cs.utexas.edu
Thu, 8 Jul 1999 15:45:19 -0500

>From majordom-gclist-out-owner-wilson=cs.utexas.edu@iecc.com Thu Jul  8 12:26:38 1999
>Date: Thu, 8 Jul 1999 19:21:34 +0200
>From: Francois-Rene Rideau <fare@tunes.org>
>To: Garbage Collection mailing-list <gclist@iecc.com>
>Subject: [gclist] precise access barrier with hardware dirty bits
>Dear all,
>   I had an idea last night that I wanted to share with you GC experts.
>The idea was to use dirty bits of paging hardware to achieve _precise_
>read or write barriers for a GC or persistent store,
>by using one logical page mapping by logical object on a same (physical) page?
>i.e. a same physical page P1 could contain three logical objects A, B, C,
>at different offsets, and would accordingly be mapped three times
>at different addresses logical virtual memory. Assuming your language
>is strongly typed and your implementation is otherwise correct,
>you could then track down _precisely_ at a _fine grained_
>which of the objects on the page is being read or modified,
>despite the hardware having only page-level granularity!

Funny you should mention this... I was just discussing this idea
with someone yesterday.  I've been thinking about it for several
years but haven't found a compelling argument for it.  That
doesn't mean there isn't one, by any means---but I think you generally
need to do it in combination with some other constraint to see
how it's worth it.

When you use page control, you usually need to do it on a large
granularity for it to be a win---you've got to amortize the
trap cost at least across several accesses to an object (no problem
here) and preferably across accesses to several objects.  If
you can't do this amortization, you lose big compared to fine-grained

If you're doing this with individual small objects (e.g., a few words),
you've got a problem if you can't afford to trap on each one separately.

An intermediate strategy is to use this trick to fake smaller pages;
e.g., divide each page into 4 pieces, and use each quarter through
a different VM page.   That'll let you adjust granularity tradeoffs,
but in the extreme case of one small object per page, it's expensive.

For a few applications, even this is quite acceptable.  Suppose, for
example, you're trying to generate an object reference trace, and you 
put each object in its own virtual page, access protect all of the 
pages, and use the order you trap on them to tell what order you touched
the objects in.  (Your page fault handler just writes the faulted-on
page number to a trace file.)   Your program may run 100 times slower
due to the trap overhead, but you may be willing to run a program for
several days to get a trace you'll use in a bunch of experiments.

(We do essentially this to get page-level reference traces for studies
of compressed caching and whatnot.  We don't separate the objects,
but this trick would let you do that, too.  Either way, you can
keep the cost from being really enormous by leaving a small number
of pages unprotected at any given time, and managing a FIFO queue
of unprotected pages.  This means that you won't notice any
touches to a page after it's been trapped on and unprotected,
but as long as you re-protect the pages after a while, that's
OK---you'll at least notice the first touch to a page over any
reasonable time period, which is what's critical for caching

Another application where it's probably acceptable is a debugging
allocator.  I believe there's a debugging allocator for Linux
that puts many objects by themselves on pages, with unmapped
or protected pages in between.  This is especially useful
for detecting array bounds overruns---you put each array
at the end of a page, with an unmapped or protected page
after it, so that as soon as you run past the end of the
array, you get a fault, and can map backwards to the array
you overran.  (To detect overruns walking backward off an
array, you could do a different run with the arrays at the
beginnings of pages, and with unmapped pages ahead of them.)

This is acceptable for a debugging allocator, because it's
generally OK to have things run slower when you're debugging,
but is not cool in production code except for small arrays.


Hans is right about TLB misses being a potential performance
problem---the more virtual pages you touch, the more TLB
entries you touch, and the more likely you are to thrash
your TLB and get some significant slowdown.

It can be worse than that, though, if your operating system
is stupid.  

A while back, Linux got pretty slow for things that did
a lot of sparse memory mapping.  Linus didn't see why anybody
would need a whole lot of differently-mapped and/or differently-
protected regions in one process, so he took out the AVL-tree
memory mapping structures and replaced them with linear lists. 
Then some people howled that their debugging allocator
now made programs run several times slower than before, so
they decided to put the AVL trees back.

Many years ago, Mach had this problem.  (I don't know if it
does now, and certainly hope not.)  It's vmap data structure
was just a linear list.  I complained about this to one
of the Mach folks, who said they weren't worried about
it because most unix processes only have about 5 segments,
or maybe a couple more because of a memory-mapped file
or two.  I told them that if I just wanted to run UNIX
programs, I wouldn't need Mach at all!  (Why bother to
have all that fancy VM control if it costs too much to use?)


Anyway, I've been thinking about this sort of thing a lot
lately, because we're builting a virtualization library
for user-level virtual memory primitives.  The idea is
to multiplex the actual page protections and handlers,
so that you can have different libraries that map and
protect memory, and then you can mix and match them
and they won't conflict---e.g., you can use page protections
for checkpointing, or pointer swizzling, or virtual
memory tracing, or whatever, and combine different libraries
that use them in different ways.   Equally important,
you can use such libraries with programs that use memory
protections themselves, which has been a problem in the

This is something I've been wanting to do for years and
years, and we finally got around to figuring out how
to do it.

When it's fully cool and working, this should let you do
some spiffy things---like:

   1.  stacking a checkpointer on top of a tracer 
       to trace the checkpointing as well
       as the application execution, or 
   2.  stacking the a tracer on top of the checkpointer,
       the tracing of the application as well as the application's
       own execution.
   3.  stack a checkpointer on top of a tracer on top
       of a checkpointer, so that you can checkpoint
       the whole process of tracing an app that uses

You can think of each library that plugs into our
virtualization framework as a function that takes
a memory model and implements another memory model
in terms of it---the checkpointer runs on top of 
a memory image, and exports a memory image that 
traces itself when you run a program in it.  

Alternatively, you can think of it as modifying a program
to checkpoint itself, then modifying that program to
trace _itself_, and then modifying that program to
checkpoint itself.  The "modification" is all done
with interposition tricks and you don't modify the
actual code at all.

This ends up looking like stackable virtual machines,
but all at the user level within a process, and with
a framework for layers to collaborate---e.g., so that
a checkpointer can coordinate with a GC to avoid 
checkpointing garbage.

So far, all we've got built is a monolithic tracer that
uses two levels of protections---its own and the
applications.  But figuring out how to do that 
shows how to do the more general things.

It should be pretty straightforward to generalize the
two levels to n levels and provide a plug-in
interface for various libraries that use VM tricks.
Then we can have stackable/composable memory management
libraries, for lots of things:  checkpointing, swizzling,
compressed VM caching, GC's, tracing, distributed shared
memory, debugging allocators, etc., etc.

Your object-separator could then be another library
that plugs into our framework, and could be used in
combination with various other kinds of things to
get you lots of different compositions with different
semantics and performance characteristics---e.g., layer
an object-separator on a distributed shared memory
to reduce false sharing, etc.