zzz doc
Francois-Rene Rideau <fare@tunes.org>
Tue Apr 30 15:06:02 2002
Hi, jao!
here are a few comments about the docs for zzz.
[To Tunes users:
Jao on IRC wanted to code something for Tunes, and I proposed that to him.
zzz is a side project for Tunes to build a robust persistent job queue,
whose API would be the basis for the further Tunes scheduler.
Jao decided to implement it with mzscheme, on top of linux.
If successful, this project could replace at, cron, init,
the mail queue daemon, and more. Hopefully,
it could handle robust backups and robust web mirroring.
Jao is developing it all in mzscheme,
using noweb as a literate programming tool.
He sent me an early version of the docs, which I'm commenting on.
Hopefully, a later version of it will appear in the Tunes CVS.
The details are not important for the current discussion.
Notable sources of inspiration:
dah28's Tube, Erlang, linear logic.
at, cron, init, postfix.
]
------>8------>8------>8------>8------>8------>8------>8------>8------>8------
* send next version on to the tunes@tunes.org mailing-list.
------>8------>8------>8------>8------>8------>8------>8------>8------>8------
Miscellaneous ideas on the design of a future version of zzz:
* the robustification (timeout, kill and restart, keep-alive messages, etc.)
should be done erlang-style on top of a generic message-passing API.
* the user will usually use pre-robustified routines,
or routines that wrap their primitive job (sending email, doing backup, etc)
inside a standard robustifier with various options:
(mirror-site (url "http://tunes.org/") :timeout 5 :on-failure mail-me)
* There are essentially two kinds of objects: linear and persistent.
linear objects are handled once. persistent objects remain.
active objects are always linear.
actual code, documents and logs are typically persistent
(though the map from symbol to code/document/log would typically
be a linear object at some higher level).
linear jobs are done once (at);
persistent jobs are respawned at some interval (cron).
persistent jobs can be achieved by respawning linear jobs
(one interpretation of the exponential operator, in linear logic).
* the linux implementation might use the file-system as a repository
for persistence, especially if we don't trust the underlying programming
language implementation too much, and/or we want to parallelize things
at the unix level (separate processes working on separate files).
so that jobs can be recursive, we'd use directories.
* special care should be taken to have robust atomicity:
with unix, this means rename, link, unlink, are to be used a lot
(after an fsync of the file to supersede old content), since they are
the only system calls that guarantee atomicity when simultaneously
manipulating a non-trivial amount of data.
Such atomic file replacement supposes that temporary files are
easily distinguished by their filename or directory location;
atomic rename(2) supposes files are created on the right filesystem.
* A lower-level driver for persistence would similarly have to deal
with updating superblocks or block heads (with the older version of a superblock
remaining valid until the new one is successfully committed).
Low-level drivers for hierarchical persistent data
can be experimented with OTOP, using read/write and fsync, or mmap and msync
(one thread being blocked by the sync while the other ones go on;
fsync and msync have file-wide coarseness, though;
maybe linux on the proper filesystem has a system call for committing
data with finer grain.)
* to allow for roll-back, job handlers would periodically save their state
or log changes to their state. Beware of atomicity problems.
Opening a file in the job's directory and writing an S-EX to it is fine.
If READ fails, the file wasn't successfully committed.
Job recovery would thus consist in rebuilding the state from logs
before continuing execution. Saving the whole state is the case
where the log has exactly one element (which needs to be atomically updated).
* READing the data in a job's main file (if present),
in a reader with properly restricted capabilities,
could very well be the restart function;
i.e., the source for the restart function IS the job's main file.
* Before recursively deleting a done/cancelled/foo job's directory,
atomically move it to a queue of directories to purge. Before beginning
to delete it, atomically move it to a proper place and stop its processes
(calling their rollback function, within a robustifier that has a timeout).
* correctly killing clusters of processes that compose a job is a task
that requires proper bookkeeping. See how unix usually does it by
writing the pid to a /var/run/foo.pid file, which works fine iff
there is only one process for the job, or if that process can robustly
clean up the subprocesses it launched, if asked to.
* monitoring jobs to ensure that they are live, that they don't eat too
many resources, that they have the right priority, etc.,
is another painful job that could be done erlang-style,
with some generic monitor code, and job-specific routines to detect
liveness, resource-correctness, etc.
An advanced monitor could even kill -STOP/-CONT depending on load.
* add proper merging/splitting primitives for objects within a queue.
This could be used to optimize jobs, to store logs efficiently,
to provide a synthetic user interface to what the system
is doing or has done, etc.
* hooks on adding objects to a queue (:before or :after methods
on the queueing function) can have monitors intercept things.
This allows for log analyzers to work reactively,
to put timestamps on certain transactions,
to do some access control, usage statistics, etc.
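
To make the atomicity and roll-back points above concrete, here is a
minimal sketch in Python (not mzscheme), assuming a unix system.
The names atomic_write, save_state and load_state, the ".tmp-" prefix
and the "state" file name are all hypothetical, and JSON merely stands in
for a READable S-EX. It is only one way to combine fsync and rename,
not the zzz design itself.

```python
import json
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data` (bytes), all or nothing."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file is created in the target directory so that the
    # rename stays on one filesystem; the ".tmp-" prefix (an assumption)
    # makes leftovers from a crash easy to recognize and purge.
    fd, tmp = tempfile.mkstemp(prefix=".tmp-", dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # new content is on disk before the rename
        os.replace(tmp, path)     # atomic rename over the old version
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # Sync the directory too, so that the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def save_state(job_dir, state):
    """Commit the whole state of a job (a log with exactly one element)."""
    atomic_write(os.path.join(job_dir, "state"),
                 json.dumps(state).encode("utf-8"))


def load_state(job_dir):
    """Return the last committed state, or None if nothing readable exists."""
    try:
        with open(os.path.join(job_dir, "state"), "rb") as f:
            return json.loads(f.read().decode("utf-8"))
    except (FileNotFoundError, ValueError):
        # If reading fails, the file wasn't successfully committed.
        return None
```

A job handler would call save_state at each checkpoint and load_state on
restart; a recovery pass could delete any ".tmp-" files it finds in the
job's directory.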
------>8------>8------>8------>8------>8------>8------>8------>8------>8------
Future enhancements for code/heap reification, as compared to the Tube:
* handle circularity the usual way CL handles *print-circle*
* have an incremental way of sending code, by referring to known
procedures and other composite objects through GUIDs or other local IDs,
even though they might be constant chunks of data.
* have some distributed directory mechanism for GUIDs
(this part will replace the DNS, with proper crypto).
GUIDs refer to *projects*, i.e. they are like LISP symbols,
that can be bound and rebound to new values in various contexts;
but some projects are declared constants,
some are declared centrally versioned, etc.
Standard transformations between GUIDs and URLs should exist.
* Base the information the sender writes on a model
of what the recipient knows. "Models" are themselves first-class objects.
* For connected communication, have a protocol that is failsafe against
optimistic assumptions about what the recipient knows:
the recipient may ask about objects it doesn't have;
the sender may thereupon assume it does have them;
if the recipient decides to forget, the sender will resend, etc.;
if the recipient already knows them, it can do some hash-consing
with the GUIDs declared in the message.
* For disconnected communication (archiving to a file, making a
package for distribution on CD-ROM, very high-latency communication
such as interplanetary or floppy communication, etc.),
accurate (or more accurately, conservative) assumptions are required
about what the reader already knows, though these assumptions
are not local to every single message, but to a "domain" of messages
being sent together: parts of a stream or a CD-ROM could conceivably
be extracted that are valid messages that assume information available
somewhere else in the stream or CD-ROM.
* see T3P in the Migration subproject
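
The connected-communication item above can be sketched as follows, in
Python. This is only an illustration under loud assumptions: the Sender
and Recipient classes, the believed_known set (a very crude "model" of
the recipient) and the use of direct method calls in place of a network
are all hypothetical, and GUIDs are plain strings.

```python
class Recipient:
    def __init__(self):
        self.store = {}  # guid -> object content

    def receive(self, inline, refs):
        """Accept a message; return the referenced GUIDs we don't have."""
        self.store.update(inline)
        return [g for g in refs if g not in self.store]

    def forget(self, guid):
        """The recipient is free to drop objects at any time."""
        self.store.pop(guid, None)


class Sender:
    def __init__(self, objects):
        self.objects = objects        # guid -> content we are able to send
        self.believed_known = set()   # our model of what the recipient knows

    def send(self, recipient, guids):
        """Send `guids`, inlining only what the model says is unknown.

        Returns the number of objects transmitted in full.
        """
        inline = {g: self.objects[g] for g in guids
                  if g not in self.believed_known}
        refs = [g for g in guids if g in self.believed_known]
        missing = recipient.receive(inline, refs)
        sent = len(inline)
        if missing:
            # The optimistic model was wrong: resend what was asked for.
            recipient.receive({g: self.objects[g] for g in missing}, [])
            sent += len(missing)
        # From now on, assume the recipient has all of these.
        self.believed_known.update(guids)
        return sent
```

For instance, a first send of two objects transmits both in full, a
second send transmits only references, and if the recipient has
forgotten one of them in between, only that one is resent.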
------>8------>8------>8------>8------>8------>8------>8------>8------>8------
Yours freely,
[ François-René ÐVB Rideau | Reflection&Cybernethics | http://fare.tunes.org ]
[ TUNES project for a Free Reflective Computing System | http://tunes.org ]
NAPOLEON: What shall we do with this soldier, Giuseppe? Everything he
says is wrong.
GIUSEPPE: Make him a general, Excellency, and then everything he says
will be right.
-- G. B. Shaw, "The Man of Destiny"