POS

Paul Prescod papresco@calum.csclub.uwaterloo.ca
Mon, 12 May 1997 14:31:27 -0400


BRIAN SPILSBURY wrote:
> Application oriented check-pointing is going to cause pain and agony
> in the universe at large unless you want to have unix-style apps with
> minimal crossover. Imagine I have my 'mail' app which relies on the
> state of some 'queue' app. Now this queue app is being used by a bunch
> of things, so we can't just throw it into the 'mail' app.
> 
> Off we go, our mail app checkpoints. Question. Does it save information
> in the 'queue' app? If so, where do we stop? If not, why do we bother?

I don't really understand the problem. Does the mail program only add
things to the queue when they are to be sent? If so, the mail app calls
the (queue-this) function which passes a pointer to a mail message
object to the queue object which could be running in the same thread or
another one.
 
> If the system explodes before the queue app also checkpoints then we have
> now got two applications with disjoint views of the universe.

The queue object should checkpoint whenever a message object is added to
it. If the system dies before it does this checkpoint that message is
lost, but I believe this to be an unavoidable problem in multitasking
systems. A "single-thread of execution" model that starts and ends with
a REPL would not have this problem but as soon as you get two threads of
execution they are presumably checkpointed at different times. This is
especially the case when you have multiple processors.
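
A minimal sketch of the "checkpoint on every add" idea, written in
Python purely for illustration. The class name CheckpointingQueue, the
JSON file format and the crash_before_checkpoint flag are all
hypothetical; only the name queue_this echoes the (queue-this) function
mentioned above. It shows that a message added just before a crash is
lost, while everything checkpointed earlier survives.

```python
import json
import os
import tempfile


class CheckpointingQueue:
    """Hypothetical queue object that checkpoints after every add."""

    def __init__(self, path):
        self.path = path
        self.messages = []
        # Recover whatever the last completed checkpoint recorded.
        if os.path.exists(path):
            with open(path) as f:
                self.messages = json.load(f)

    def queue_this(self, message, crash_before_checkpoint=False):
        self.messages.append(message)
        if crash_before_checkpoint:
            # Simulate the system dying between the add and the checkpoint.
            raise RuntimeError("system died before checkpoint")
        self._checkpoint()

    def _checkpoint(self):
        # Write to a temporary file and rename it into place, so a crash
        # in the middle of a write never leaves a half-written checkpoint.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.messages, f)
        os.replace(tmp, self.path)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "queue.json")
    q = CheckpointingQueue(path)
    q.queue_this("message 1")
    try:
        q.queue_this("message 2", crash_before_checkpoint=True)
    except RuntimeError:
        pass
    # "Reboot": a fresh object sees only what was checkpointed.
    return CheckpointingQueue(path).messages
```

Calling demo() gives back only the first message. The mail app, if it
checkpointed on its own schedule, might still believe the second one
was queued, which is the out-of-sync case under discussion.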

So yes, there is the possibility in any networked system, even when the
"network" is just a network of processors, for having crashes with data
out of sync. I consider the likelihood of this to be very, very low, but
non-zero. Do you propose, then, to halt all CPUs and checkpoint them
all at once?
 
> This really bites, not to mention that you now have a special notion
> of 'application'.

No we don't. We have notions of "objects" and "threads of execution."
The notion of an application will naturally arise as a group of objects
that work together and I would not be at all surprised if at some later
time this idea gets formalized (for instance for security reasons, or
package distribution).
 
> I don't know how many milliseconds of user time you'll lose, but I'd guess
> that it won't be significant, noting that before we write the memory to disk
> we can copy it elsewhere, tell the disk-writer-thingy to please write this
> chunk, mark that page we've copied as 'probably clean', and if it's
> not dirty when we've finished writing our copy, we can mark it as
> 'really clean'. While we're writing to disk, your ray tracer can happily
> chug along.

But what if I simply don't want to allocate 64MB to an intermediate
stage of a process that I can just restart? An even better example is an
HTTP server where the current state of the app is absolutely trivial and
useless. I don't even *want* that state back if the machine reboots.

Note also that the program might be writing to memory faster than the
checkpointer can write to disk. How do you get a coherent vision of all
of RAM without slowing things down?
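
To make the objection concrete, here is a rough simulation of the
quoted "probably clean" / "really clean" scheme, in Python for
illustration only. The Page class, checkpoint_page function and the
during_write callback are hypothetical names, and a dict stands in for
the disk. A page that is written again while its copy is going to disk
never reaches 'really clean'.

```python
DIRTY = "dirty"
PROBABLY_CLEAN = "probably clean"
REALLY_CLEAN = "really clean"


class Page:
    """Hypothetical memory page with a cleanliness mark."""

    def __init__(self, data):
        self.data = data
        self.state = DIRTY

    def write(self, data):
        # Any write by the running program makes the page dirty again.
        self.data = data
        self.state = DIRTY


def checkpoint_page(page, disk, index, during_write=None):
    # Copy the page elsewhere (bytes are immutable, so keeping a
    # reference to the old value is a safe copy).
    snapshot = page.data
    page.state = PROBABLY_CLEAN
    if during_write is not None:
        # The program keeps running while the disk write is in flight.
        during_write(page)
    disk[index] = snapshot  # the slow disk write completes
    if page.state == PROBABLY_CLEAN:
        page.state = REALLY_CLEAN  # nobody touched it in the meantime
    return page.state


def demo():
    disk = {}
    idle = Page(b"idle")
    busy = Page(b"v1")
    idle_state = checkpoint_page(idle, disk, 0)
    busy_state = checkpoint_page(
        busy, disk, 1, during_write=lambda p: p.write(b"v2")
    )
    return idle_state, busy_state, disk
```

In demo() the idle page ends up 'really clean', but the busy page is
still 'dirty' and the disk holds the stale b"v1". A program that
rewrites its pages faster than they can be flushed keeps them in that
state indefinitely.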
 
> A program oriented commit will cause more problems than not unless you
> put your programs in boxes and don't let them talk or share data very
> much, which kinda spoils the point of a lispos :)

I think that programs live in boxes called "objects" and they talk and
share data by "messages" or "functions." If there is a single thread of
execution then a single, system-controlled checkpoint can stop things
from getting out of sync, as you describe, but with multiple threads
(esp. on multiple processors) the system will presumably have to
checkpoint one CPU at a time and even with a single CPU there will be a
problem "keeping up" with processes that write a lot of memory quickly.
 
> One organizes his workspace so that it can be cleaned.
> Consider the functional nature of a unix directory. We can
> 'add', 'list', 'remove', 'link', 'timestamp'.
> 
> So we can think about using something even as simple as an alist.
> 
> ((important-files (read-me . ??) (todo . ??)) (boring-files (dont-readme . ??)))
> 
> then we can write an (ls) function if we really want to and get this
> 
> (ls)
> .
> ..
> important-files
> boring-files
> 
> (ls important-files)
> .
> ..
> read-me
> todo
> 
> So, I don't really see that we need to beat people around the head
> to make them organize themselves in one particular way, it's pretty
> trivial to do.

If this "ls" program is going to be useful everywhere, then we MUST beat
people over the head so that they ALL use an ALIST. But people on this
mailing list (including me) want much more interesting document control
than that. We want versions, meta-information up the wazoo and so forth.
And presumably we want to be able to work with all objects defined
across the system uniformly, no matter which program created them. So we
must define a standard for organizing application data such that it
conforms to the needs of the maintenance program. By the time you've got
this standard working, you have reintroduced most of the code that the
transparent filesystem was supposed to get rid of.
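
A small sketch of why the convention has to be shared, in Python for
illustration. The nested dict mirrors the quoted alist, with None
standing in for the unspecified "??" values; the ls function and the
mail_app_data layout are hypothetical. The generic tool works only on
data that follows the one agreed layout.

```python
# Same shape as the quoted alist; None stands in for the "??" values.
workspace = {
    "important-files": {"read-me": None, "todo": None},
    "boring-files": {"dont-readme": None},
}


def ls(tree, *path):
    """List the names under the given path in a nested-dict workspace."""
    node = tree
    for name in path:
        node = node[name]
    if not isinstance(node, dict):
        # Data that does not follow the convention cannot be listed.
        raise TypeError("not organized as a directory: %r" % (path,))
    return [".", ".."] + list(node)


# A hypothetical application that organized its data its own way.
mail_app_data = {"mail": [("inbox", None), ("outbox", None)]}


def demo():
    top = ls(workspace)
    sub = ls(workspace, "important-files")
    try:
        ls(mail_app_data, "mail")
        nonconforming_ok = True
    except TypeError:
        nonconforming_ok = False
    return top, sub, nonconforming_ok
```

Here ls handles the workspace that follows the convention and fails on
the mail application's list of pairs. Adding versions or other
meta-information would mean extending the convention, and every
program would have to follow the extension too.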
 
> Well, I would have thought that you'd flag things at creation time
> as to how you wanted them to persist. Something like wrappers, for example;
> 
> (allocate-synchronous
>   (make-instance 'big-important-thingy))

Unfortunately this instance of big-important-thingy doesn't have a
*name* so the user can't know when it is safe to clean it up. On Unix if
I notice that my HTTP server is taking up too much memory (e.g. it has a
leak) then I kill it and start again. Navigating memory blocks (whether
written in C or Lisp) is too hard because they are only named by cryptic
programmer-names. Yes, Lisp apps can have leaks too (of a different
sort). But when I kill my persistent-HTTP server all of the data that
was associated with it disappears (e.g. configuration files)!
 
 Paul Prescod