[gclist] More questions for the FAQ

Paul R. Wilson wilson@cs.utexas.edu
Fri, 22 Mar 1996 08:55:26 -0600


>From: Nick Barnes <nickb@harlequin.co.uk>
>
>Save needs to be "very fast" in user interface terms. i.e. less than
>about half a second. The disk access alone is going to be at least ten
>milliseconds. These days, that's hundreds of thousands of processor
>cycles. Most Word documents are only a few tens of K. This particular
>M$ decision is bad to the point of being farcical.
>
>Nick Barnes

I basically agree, but I think the issue is a little more subtle.  
This is actually a nice example of trying to combine fault-tolerance
features (checkpointing) and fault prevention features (GC).
I think it's an example of the kind of thing that will come up often
in trying to make large-scale fault-tolerant GC's.

The interesting thing about this example is that it involves long-term
explicit checkpointing---the user says "save this state---I may want
to come back to it."

This is somewhat different from the normal kind of checkpointing that
people talk about.  People usually think of checkpointing as saving
very recent changes to disk, so that you don't have to "roll back" very
far in case of a failure---you go back to the latest short-term checkpoint
and resume running from there.

But for real apps, and for real fault tolerance, you often need more.  You
may have states much further in the past that you want to be able to
recover.  This is usually handled by different mechanisms, often brutally
ad hoc.  One of the cleaner mechanisms is the way database systems often
archive their checkpoints, so that old checkpoints are never lost.  This
lets you recover to a state far in the past in case of database corruption,
to try to repair a big mistake.  That's very different from trying to
restore to the last stabilized transaction boundary.

Even for "normal" distributed fault tolerance, though you often have a
hierarchy of recovery points, depending on the scope of a fault.  For
example, a single process may usually be able to recover locally if it
didn't affect lots of other processes---it may be able to roll back a 
little and restart, maybe with help from a message logging system to replay
some messages.  In bad cases, though, you may have to roll lots of processes
far back in time to recover a consistent state of the system.  (e.g., in
the easy cases you may be able to roll a process back a second or so
and replay to get caught up.  In the hard cases, you may have to roll
everything back several minutes.  (Or even hours or days or weeks or
months, in the case of increasingly horrible and insidious failure
scenarios.)

Usually, current systems don't handle this in general and graceful ways.
The "fault tolerance" mechanisms cope with common easy failures (like
a crash of a machine), but lose it completely if you have nastier
problems.  Then you "restore" by doing a _reboot_, and sorting things
out manually to get the system going.

So it seems to me that if you're serious about providing fault tolerance
to *users* (e.g., explicit saves), or if you're serious about fault
tolerance that's general and robust (i.e., graceful degradation), you
have to worry about keeping checkpoints at multiple granularities
of time.

Now back to MS Word...

If you have an incremental save feature, where you only save the
changed parts (i.e., incremental checkpointing) then saving a large
document may be very fast if the changes are small.  If you don't
have a generational GC (or something like it), GCing the same document
may be slow.  For example, if the document doesn't fit in RAM, or
hasn't been entirely loaded into RAM, a full GC may cause lots of
disk I/O, where just saving the changes won't.
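
To make the contrast concrete, here's a rough C sketch of what I mean by
incremental saving.  The chunk layout and all the names (chunk_t, doc_t,
incremental_save) are invented for illustration; this isn't how Word or
anything else actually does it.

#include <stdio.h>

#define CHUNK_SIZE 4096
#define MAX_CHUNKS 1024

/* A document broken into fixed-size chunks; the editor sets the dirty
   flag on a chunk whenever it modifies it. */
typedef struct {
    unsigned char data[CHUNK_SIZE];
    int dirty;
} chunk_t;

typedef struct {
    chunk_t chunks[MAX_CHUNKS];
    int nchunks;
} doc_t;

/* Incremental save: only chunks changed since the last save hit the
   disk, so the I/O cost is proportional to the edit, not the document. */
void incremental_save(doc_t *doc, FILE *f)
{
    for (int i = 0; i < doc->nchunks; i++) {
        if (doc->chunks[i].dirty) {
            fseek(f, (long)i * CHUNK_SIZE, SEEK_SET);
            fwrite(doc->chunks[i].data, 1, CHUNK_SIZE, f);
            doc->chunks[i].dirty = 0;
        }
    }
    fflush(f);
}

A full, non-generational GC of the same document has no such bound; it
has to visit everything reachable.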

Of course, we know how to build generational GC's and incremental
checkpointing, and we know how to combine them pretty well.  If people
just used a nice persistent language (rather than C), this would
therefore be automatic.  You'd at least GC the new stuff that you're
saving at incremental checkpoints.  (This is the normal way reachability-based
persistence works.  The p-store acts like an old generation, and the new
stuff that's reachable from it at a checkpoint is traversed and made
persistent as well.)
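
Here's a crude C sketch of that checkpoint-time tracing, just to show
the shape of it.  obj_t, write_to_store(), and the rest are invented; a
real persistent store would work at the page or object-table level and
use proper write barriers to find the modified persistent objects.

#include <stddef.h>

#define MAX_CHILDREN 4

typedef struct obj {
    int persistent;                         /* already in the p-store? */
    struct obj *children[MAX_CHILDREN];
} obj_t;

/* Placeholder: append this object's current contents to the p-store. */
extern void write_to_store(obj_t *o);

/* Promote a newly reachable transient object and everything new below
   it.  Already-persistent objects act like an old generation: we stop
   the trace when we reach one. */
static void promote(obj_t *o)
{
    if (o == NULL || o->persistent)
        return;
    o->persistent = 1;
    write_to_store(o);
    for (int i = 0; i < MAX_CHILDREN; i++)
        promote(o->children[i]);
}

/* Called at a checkpoint for each persistent object modified since the
   last checkpoint: its fields may now reach new transient data. */
void checkpoint_modified(obj_t *modified)
{
    write_to_store(modified);               /* save its new contents */
    for (int i = 0; i < MAX_CHILDREN; i++)
        promote(modified->children[i]);
}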

You still get conservatism (unreclaimed garbage) in the way generational GC's
do, but an incremental background GC would fix that.

So to make this kind of mess go away, what you want is a persistent store
with incremental checkpointing, generational GC, and a full GC that
can run in the background without messing up the checkpointing. 

Unfortunately, application programmers typically use C, so none of this is
automatic at the language level. Often they end up hacking the checkpointing
and the "GC" independently, and never get the generational property integrated
properly with the checkpointing mechanism.

Also, there is a space issue involved in GCing and checkpointing.  With
incremental checkpointing, you have to save to disk (or other nonvolatile
storage) anything that's modified (and non-garbage).  If your background
collector goes around and modifies everything (e.g., by straightforward
copying), then you'll end up with lots of new copies of stuff.  You
also have to retain the old versions from the users' checkpoints.
(The "saved" states they need to be able to refer to.)

The I/O costs of checkpointing the background GC's work don't seem like a
big deal for this app; the GC can just run slowly, occasionally doing
some work and saving its changes to disk.

The space costs are potentially problematic.  In the worst case, you have 
two copies of everything: the old state that the user said to save, and the
latest state.  A GC can increase the likelihood of the bad cases by wandering
around changing things (e.g., copying them, or modifying mark fields, or
whatever).  It can even make them worse, if you're not careful, by having
two copies of the *recent* state, due to the GC's own changes, as in a
copying GC simply layered on top of a normal checkpointing mechanism.

In practice, for normal apps, a good automatic persistent storage mechanism
will probably do better, on average, than the horrible hacks that people
do in C;  for example, many apps "save" by simply writing everything
out to disk, because it's easier than incremental checkpointing.  But
many apps *do* do incremental checkpointing, and it's good to preserve
its space efficiency while providing GC.

For a lot of apps, it doesn't really matter---if you do have two or even
three copies of a document on disk, it's not a big problem, as long as
incremental operation keeps the I/O from becoming a latency bottleneck.  
(On the other hand, if you're operating from a floppy, like most
word processors sometimes do, this may be a serious issue.)

As data sets get larger, coordinating checkpointing and GC becomes more
important.  For example, a database may fill almost all of the available
disk, and long-term logging and checkpointing is often done to tape.  The
whole database may be written to a tape (or many tapes) periodically, and
reorganizing the database involves trashing the disk version and reloading
from tape.  In this kind of system, you just don't have disk space for two
or three copies of a big fraction of your data---certainly not everything.  


If you're interested in being able to build large, high-performance,
fault-tolerant systems, all of this stuff bears on your choice of
GC algorithms.  Normal copying and compacting algorithms seem bad---they
change everything by moving everything around, and the generational principle
isn't going to fix it entirely---you may not change everything _often_, but
you may change it often enough that almost everything ends up being written
to disk as part of some long-term checkpoint. 

Nonmoving GC's seem to have an edge here, because they don't change
data that don't change---in principle, they can leave the data alone
(side-effects-wise) if the application does.  (If the app changes them,
they're going to have to be checkpointed anyway.)

Current implementations of mark-sweep and fake copying seem to share
this problem to a large degree.  Often the metadata about which memory
is free is intermingled with the actual application data in memory.
For example, our fake copier has headers on the objects that get modified
during tracing, to move blocks from one set (subject-to-gc) to another
(known-reachable).  Some mark-sweep systems put the mark bits in headers.

Hans's doesn't, if I remember correctly, but when he sweeps the bit table,
he links the free blocks up into a free list.   That may cause unnecessary
checkpointing of garbage memory in a checkpointed system.
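
To make that concrete, here's roughly what keeping the mark bits off to
the side looks like in C.  Tracing writes only into the side table, so
the pages holding the objects themselves stay clean as far as the
checkpointing mechanism can tell.  (The heap layout and names are
invented; this isn't Hans's code or ours.)

#include <stdint.h>
#include <string.h>

#define HEAP_BYTES  (16u << 20)              /* say, a 16MB heap */
#define GRANULE     16                       /* smallest object alignment */
#define NGRANULES   (HEAP_BYTES / GRANULE)

static uintptr_t heap_base;                  /* set when the heap is created */
static uint8_t mark_bits[NGRANULES / 8];     /* lives outside the heap pages */

static size_t granule_of(void *p)
{
    return ((uintptr_t)p - heap_base) / GRANULE;
}

int is_marked(void *p)
{
    size_t g = granule_of(p);
    return (mark_bits[g >> 3] >> (g & 7)) & 1;
}

void set_mark(void *p)                       /* only the side table is dirtied */
{
    size_t g = granule_of(p);
    mark_bits[g >> 3] |= (uint8_t)(1u << (g & 7));
}

void clear_all_marks(void)                   /* between collections */
{
    memset(mark_bits, 0, sizeof mark_bits);
}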

Some of the checkpointing costs can be reduced by diffing.  If you save
a clean copy of the page in RAM, you can just write out the diffs
between the old version and the new version, instead of the whole page.
(There's a neat trick for this---just xor the two versions to zero
anything that hasn't changed, and then run-length-encode the diffs.
You'll usually get runs of whole words that don't change, so RLE works
like a charm by compressing away the resulting zeroes.)
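
In code, the trick is about this simple.  (The word size and the
encoding format here are arbitrary choices for the sketch; a real
system would tune both.)

#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS 1024

/* Diff a dirty page against its clean copy.  The xor is zero wherever
   nothing changed, so we emit (zero-run-length, xor-word) pairs and the
   long runs of unchanged words compress down to a single count.
   Returns the number of words written to out (at most 2 * PAGE_WORDS). */
size_t diff_page(const uint32_t *clean, const uint32_t *current,
                 uint32_t *out)
{
    size_t n = 0, i = 0;

    while (i < PAGE_WORDS) {
        size_t run = 0;
        while (i + run < PAGE_WORDS &&
               (clean[i + run] ^ current[i + run]) == 0)
            run++;
        out[n++] = (uint32_t)run;             /* length of the unchanged run */
        i += run;
        if (i < PAGE_WORDS) {
            out[n++] = clean[i] ^ current[i]; /* one changed word's xor */
            i++;
        }
    }
    return n;
}

Recovering a page from a diff is the same walk in reverse: skip over
each recorded run, then xor the recorded word back in.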

Another approach is to coordinate a copying GC with checkpointing in
a weird way.  A good example of this is Andrew Tolmach's time-travel
debugger for SML of NJ.  I'd built a prototype of a time-travel
debugger that coordinated checkpointing with a copying GC to avoid
checkpointing garbage.  Andrew Tolmach (and Andrew Appel) turned this
on its head in a very interesting way.  They use the copying GC to
make the checkpoints, by capturing continuations.  That is, the
checkpointing mechanism is built at the data-structure level, _above_
the GC.

When you make a checkpoint, you do it by making a shared-structure
copy of all of the reachable data; the only new objects you make in the
heap are the ones representing the new state.  Since both the old and new
program states live in the same heap, keeping one copy of the heap
actually retains all of the checkpoints.

This is trivial in Standard ML of NJ (or Scheme) _if_ you don't use
side-effects.  All you have to do to save a state is capture the
continuation (roughly, the state of the stack), and that will preserve
anything reachable from it.  That's your checkpoint.  Any part of the
heap state that's shared with earlier checkpoints will be shared in the
resulting reachability graph.  Any old objects that are part of the new
state will simply be shared, any new objects that are live will be part
of the new state, and the GC will throw away anything that isn't part
of any checkpoint.
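
If it helps, here's the idea transliterated into C with immutable cons
cells (obviously not how SML of NJ actually represents things, and the
names are made up).  A checkpoint is just another remembered root;
unchanged structure is shared between checkpoints for free, and a
tracing GC keeps exactly what some checkpoint can still reach.

#include <stdlib.h>

typedef struct cell {
    int value;
    struct cell *next;
} cell_t;

static cell_t *cons(int value, cell_t *next)
{
    cell_t *c = malloc(sizeof *c);
    c->value = value;
    c->next = next;
    return c;
}

/* "Update" element i without side effects: rebuild the prefix, share
   the entire suffix with the old version. */
cell_t *set_nth(cell_t *list, int i, int value)
{
    if (list == NULL)
        return NULL;
    if (i == 0)
        return cons(value, list->next);
    return cons(list->value, set_nth(list->next, i - 1, value));
}

/* A checkpoint is just a remembered root.  Everything reachable from
   some checkpoint stays live; everything else is garbage. */
#define MAX_CHECKPOINTS 64
static cell_t *checkpoints[MAX_CHECKPOINTS];
static int ncheckpoints;

void save_checkpoint(cell_t *current_state)
{
    if (ncheckpoints < MAX_CHECKPOINTS)
        checkpoints[ncheckpoints++] = current_state;
}

Capturing a continuation does the same thing one level up: the stack
becomes just another root, and everything reachable from it is the
checkpointed state.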

If you do have side-effects (and both SML of NJ and Scheme do), it's
trickier, and if you have lots of side effects, you may have performance
problems.  What you do for side-effects is to save the latest state of
a side-effected field at each checkpoint.  (That is, anything that's
side effected between checkpoints gets saved so that it can be restored.)
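
One way to arrange that (hypothetical names, and simplified to restoring
only the most recent checkpoint): a write barrier logs the value a field
held before each assignment, so rolling back to the last checkpoint is
just a matter of undoing the log.

#include <stddef.h>

typedef struct {
    void **slot;          /* the field that was assigned */
    void  *old_value;     /* what it held before the assignment */
} undo_entry_t;

#define LOG_MAX 4096
static undo_entry_t undo_log[LOG_MAX];
static size_t log_len;

/* Use this in place of a raw store to a mutable heap field.  (A real
   system would grow the log and record each slot only once per
   checkpoint interval.) */
void logged_store(void **slot, void *new_value)
{
    if (log_len < LOG_MAX) {
        undo_log[log_len].slot = slot;
        undo_log[log_len].old_value = *slot;
        log_len++;
    }
    *slot = new_value;
}

/* Restore every mutated field to its value at the last checkpoint,
   undoing the stores in reverse order, then start a fresh log. */
void restore_last_checkpoint(void)
{
    while (log_len > 0) {
        log_len--;
        *undo_log[log_len].slot = undo_log[log_len].old_value;
    }
}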

I believe that Nettles and O'Toole's replication copying GC has some similar
properties, but it's not fresh enough in my mind to comment on.  I'm not
clear on what the implications are for checkpointing costs.