A more generic portable binary encoding format

Francois-Rene Rideau rideau@clipper
Mon, 12 Dec 94 19:38:04 MET


>>[...]
>>    That's why the LLL may be necessary, but only as a part of a more generic
>> effort to allow inter-computer communication of objects.
> 
> Absolutely. So let's define our primitives NOW, and stick to them. When I 
> say primitives, I really mean that! Nothing that we don't absolutely NEED 
> on the low level should be present there. Throw out all unnecessary rubbish.

   Beware: what may be unnecessary to most may be indispensable to some; and
most may badly need things that most others deem unnecessary.
   So, we must separate what's useful, from what's just expedient; we must
keep in the "kernel", the "grammar", generic constructs, and put in the
"library", the "vocabulary", specific primitives.


> So, what primitives do we need? I would suggest:

> A. "Passive items" -
> 	1. constants
> 	2. variables
  These are some very basic generic constructs to me...
Perhaps we may also add monotonic variables; but I think they can and should
be put elsewhere...
  And why do you call it "passive" ? Functions can be constants or
variables !!! Being constant or variable has *nothing to do* with being
active or passive...

> 	3. blocks of memory
   All computers do not have a linear memory model (see hardware LISP machines,
or software virtual machines of any kind that may be the underlying system).
   Programmers seldom use linear memory; sometimes they use arrays, but even
when they do, half of the time they wish the stubborn array implementation
had been done in some other hacked way better suited to their particular
program. Conversely, people using functions or case statements often wish
their construct were implemented with a raw memory block.
   That's why I think that memory blocks are a very low-level thing whose
use should be decided by the optimizer (whether human or computer), not the
high-level programmer (sometimes the same human/computer, though); but in
consultation with him (high-level annotations may help the optimizer decide).
   I'm not saying that our LLL won't provide these, but that they shan't be
a part of the grammatical kernel of our system, just one of the lowest-level
(and very important) entries of the standard vocabulary (library).

> 	5. 32 bit integers
> 	6. reals
   These are fine; but should really be included in the standard library, not
the kernel: people may wanna use any kind of integers or reals, with any size,
fixed or floating point, etc. On the other hand, the computer has only a small
number of different builtin number sizes. We should allow transition from one
to the other, thus support both approaches. There should be generic *library*
routines to handle these...

> 	4. pointers
   Here should be the basic abstraction of the low-level side of our system.
The low-level begins with pointers. Local pointer size is host-defined;
pointer size inside a file may be file-defined (or sub-file defined).
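
   To make the file-defined pointer size concrete, here is a minimal sketch
in Python. The layout (one header byte giving the width in bytes, then
little-endian pointers) and the names write_pointers / read_pointers are
assumptions made up for this illustration, not a proposed format.

```python
# Hypothetical layout: 1 header byte = pointer width in bytes,
# followed by the pointers themselves, little-endian, at that width.

def write_pointers(pointers, width):
    """Encode file-relative pointers using a file-defined width (in bytes)."""
    out = bytes([width])
    for p in pointers:
        # to_bytes raises OverflowError if p does not fit in `width` bytes
        out += p.to_bytes(width, "little")
    return out

def read_pointers(data):
    """Decode pointers; the width comes from the file, not from the host."""
    width = data[0]
    body = data[1:]
    return [int.from_bytes(body[i:i + width], "little")
            for i in range(0, len(body), width)]

small = write_pointers([0, 7, 300], width=2)   # compact file
large = write_pointers([0, 7, 300], width=8)   # roomy file
```

The reader recovers the same pointers from either file; only the space spent
on them differs, and the host's own pointer size never appears.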


> B. "Active items"
> 	1. verbs
> 	2. "user defined functions" (I will call these "friends")
> 	3. "objects"
I see no clear distinction between these...

>		[...] Specialisation and functionality on
> 		different levels are very dissimilar. Why? Because this
> 		heterogeneity WORKS! I think that if we religiously stick
> 		to a monomorphic design philosophy (eg "objects at almost
> 		all levels" we are making a terminal mistake.
I completely agree there is heterogeneity; what I want is smooth interaction
and communication between different objects, and the possibility of direct
access between any indirectly linked universes.


> Now, back to my types..
> 
> Integers and reals are self-explanatory.
> Constants would come in two flavours:
> 	a. Universal to the whole system, and accepted by all:
> 		- ASCII CHARACTERS
> 		- a wide variety of symbols eg the Greek alphabet,
> 			characters with accents and special characters
> 			eg. French, Polish, Serbo-Croat..
> 			even Kanji..
> 		- a dictionary of the commonest 64k English words

   All these kinds of things should be library objects, not unremovable
elements of the system !!!
   Various users or implementations may require or provide different languages,
with their own encodings (uncompressed, compressed, etc) and dictionaries.

> 	b. Local and specialised. Eg. If you are an engineer, one (or more)
> 		dictionaries of eg common engineering terms, if you are a
> 		doctor, several relevant dictionaries of medical terms.
   In the library vision, each object (even if multiple copies are around the
world) is uniquely tagged and PGP-signed (err, PGP is recommended but not
necessary; other protocols may be supported). Thus, you just request some
globally identified (more or less distant) object and there it is ! For distant
objects a copy will be made (which is why constant objects are preferred to
variable ones: because no automatic update process is needed, not to mention
synchronized access or modification...).
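
   A rough sketch of why constant objects copy so cheaply, in Python. The
class ObjectStore, the use of a SHA-256 content hash as the global tag, and
the single "remote" link are all assumptions for illustration; signing is
left out entirely.

```python
import hashlib

class ObjectStore:
    """Toy store for constant objects, keyed by a hash of their content."""

    def __init__(self, remote=None):
        self.objects = {}
        self.remote = remote  # another ObjectStore standing in for a distant site

    def put(self, data):
        tag = hashlib.sha256(data).hexdigest()
        self.objects[tag] = data
        return tag

    def get(self, tag):
        if tag not in self.objects:
            if self.remote is None:
                raise KeyError(tag)
            data = self.remote.get(tag)
            # A constant object never changes, so checking the copy against
            # its tag once is enough; no update protocol is needed afterwards.
            if hashlib.sha256(data).hexdigest() != tag:
                raise ValueError("copy does not match its tag")
            self.objects[tag] = data
        return self.objects[tag]

far = ObjectStore()
tag = far.put(b"dictionary of medical terms, edition 1")
near = ObjectStore(remote=far)
copy = near.get(tag)   # fetched once, then served locally
```

A variable object would need the update and synchronization machinery that
this sketch gets to skip.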


> Variables would be locally defined and could be allocated ANY value.
  Let's have variable scoping like in *any* good language (i.e. not C), with
more or less local things; but what matters is: if you "see" an object, you
can access it, whether it is implemented as local, remote, distributed, or
whatever...



> Blocks would be collections of words of a requested size. They could 
> occupy RAM or disc space or whatever .. this would be transparent to the 
> user,  and management would be on a very low level, BUT you would be able 
> to examine / acquire performance attributes (How long does it take to 
> move data from block X to block Z). Notice how I said collections of _words_.
> This adheres strictly to my concept of having everything as 32 bit words, 
> and is vital to ensure the integrity and simplicity of the model.

   Firstly, the user shouldn't even see blocks -- this should be transparent !
Do word-processing secretaries manipulate blocks ? No, they manipulate
formatted text documents. Do mathematicians manipulate blocks ? No, they
manipulate symbolic expressions. Do number-crunchers manipulate blocks ?
No, they manipulate reals, matrices, functions. Nobody wants to use blocks
any more than they want to explicitly save documents (imagine a
"Save the documents !" or "Save the matrices !" association...).
   Blocks are fine for implementors, and that's all. When I use the system,
I never ever wanna see blocks or pages. Let the system hackers have all the
fun, and not disturb *unwilling* users with implementation dependencies that
only annoy them (if they're willing, that's another problem: we *must* give
those users access to it, as long as it is secure).

   Now, why always have 32-bit words ?
   I mean, this could be something standard in our low-level libraries, but
why force the system, whatever the future may bring us, to have 32-bit words ?
I admit this could be a low-level convention for our first *implementations*,
but not a system-wide specification ! Will you force a program which uses
integers only up to 10000, plus a 12-digit score number, running on an 8-bit
machine, to use 32- or 64-bit numbers ?
   Let's allow users (humans or computers) to choose !
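
   The arithmetic behind that example can be worked out in a few lines of
Python. The helper names and the list of builtin widths are hypothetical;
the point is only that the needed size follows from the declared range.

```python
def bits_needed(max_value):
    """Smallest number of bits that can hold every integer in 0..max_value."""
    return max(1, max_value.bit_length())

def bytes_needed(max_value):
    """Same, rounded up to whole 8-bit bytes."""
    return (bits_needed(max_value) + 7) // 8

def host_width(max_value, widths=(8, 16, 32, 64)):
    """Smallest builtin width (from an assumed host list) that fits, or None."""
    need = bits_needed(max_value)
    for w in widths:
        if w >= need:
            return w
    return None  # no builtin size fits: use a library multi-word integer

counter_bits = bits_needed(10_000)           # integers up to 10000
score_bits = bits_needed(999_999_999_999)    # largest 12-digit number
```

So the counters fit in 14 bits (2 bytes) and the score in 40 bits (5 bytes);
on an 8-bit machine neither has any reason to be padded out to 32 or 64 bits.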



> Pointers would be just that - pointers to offsets within a particular 
> block, and again, entirely generic (not giving a damn whether the "block 
> is composed of RAM, disc memory, or whatever) but susceptible to detailed 
> timing analysis.
> NOTE THAT POINTERS ARE IN EFFECT NUMBERS, AND CAN THUS BE MANIPULATED AS 
> NUMBERS - THIS SIMPLIFIES OUR TASK IMMENSELY.
   No, No, No, and yes.
   I mean, surely the internal representation of a pointer in a given host
will be an integer; but let's just not specify that; it comes naturally
on machines with flat memory, but other solutions will fit much better for
other architectures (lisp machines; big multiprocessors with tiny processors,
etc).
   You gain *nothing* by over-specifying things about pointers for portable
code; nothing. Let the human or computer optimizer do this for the particular
architecture the code is executed on.
   In the meantime, let's have abstract operators in the only case where
pointers and integers actually interact: arrays. Inlining them is quite easy.
Not mixing integers and pointers is very simple to justify: that's the only
way to avoid crashes that come from synthesizing pointers from integers...
Surely, you gain nothing by doing the latter, except unportable hacks (which
the human/computer optimizer can do itself, with or without your help through
annotations).
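
   One way to picture such abstract operators, sketched in Python. The names
Ref, Array and element are invented for this example; a real LLL would
enforce the separation in its type rules rather than by class design.

```python
class Ref:
    """Opaque reference: no arithmetic, no conversion to or from integers."""
    __slots__ = ("_cells", "_index")

    def __init__(self, cells, index):
        self._cells = cells
        self._index = index

    def get(self):
        return self._cells[self._index]

    def put(self, value):
        self._cells[self._index] = value

class Array:
    def __init__(self, size, fill=0):
        self._cells = [fill] * size

    def element(self, i):
        """The one place where an integer and a reference meet; bounds-checked."""
        if not 0 <= i < len(self._cells):
            raise IndexError(i)
        return Ref(self._cells, i)

a = Array(4)
r = a.element(2)
r.put(99)
```

Here `r + 1` is simply not defined, and `a.element(4)` is refused, so a
reference can only ever designate a cell that exists. How the references are
represented on a given host stays the optimizer's business.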



> Now we move onto active items..
> 
> There are three that I consider necessary:
> 
> 1. Verbs. These are _simple_ low-level operators eg ADD, SUB, GET (from a 
> pointer into a variable), PUT (into a pointer location) etc.
> 2. "Friends". These are user defined amalgamations of verbs, constants, 
> etc. They can be used locally, but if you want to transport them to other 
> users or systems, you must package them up into:-
> 3. Objects (modules or whatever). These should preferably be compiled.

   Let's have some LLL-dependent instruction set, undefined at high-level.
This way, we allow later changes of LLL, or using any kind of language
(particularly assembly, or Scheme, or the SELF virtual machine, etc) for a
LLL (if you're masochistic enough, you could use plain C as a LLL).
   And why have a hierarchical system at all ?
   Let heterogeneity appear *naturally*, from the fact that we have atoms and
constructors. In our standard file format, there should be a constructor for
modules; but any language expressed therein can provide its own constructors
for its own kind of modules, etc...
   As for packaging objects to move them, that's another (quite important)
issue, linked to that of having global (world-wide) identification mechanisms,
and good algorithms/heuristics to determine the limits of an object.
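
   As a small illustration of "atoms and constructors", here is a Python
sketch. Atom, Node, check and the constructor names are all hypothetical;
the idea shown is only that the kernel knows the shape of terms, while the
set of constructors is vocabulary that any language can extend.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    value: object

@dataclass(frozen=True)
class Node:
    constructor: str
    children: tuple

# Vocabulary: constructor names supplied by libraries, not by the kernel.
vocabulary = {"module", "tuple"}

def check(term, vocab):
    """Kernel-level check: every constructor used must be in the vocabulary."""
    if isinstance(term, Atom):
        return True
    return (term.constructor in vocab
            and all(check(c, vocab) for c in term.children))

m = Node("module", (Atom("add"), Node("tuple", (Atom(1), Atom(2)))))
scheme_m = Node("scheme-module", (Atom("car"),))

standard_ok = check(m, vocabulary)
foreign_ok = check(scheme_m, vocabulary)
extended_ok = check(scheme_m, vocabulary | {"scheme-module"})
```

The standard module constructor is just one entry; a language bringing its
own kind of module adds a name to the vocabulary and the kernel is unchanged.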




>> Also note that we may also change the format, if one day we find
>> ways to considerably enhance it by using uncompatible techniques. 
> 
> Aaaaaaaaaaaaaaaaaagh. Why not just get it right first time, and leave lots
> of room for extensions!!
   Of course, that's what we're doing. But imagine that we later find that
with another format, we can save 5% space overhead, and 20% time overhead;
shouldn't we move from one to the other ? I mean, I hope this won't be the
case; but we should stay open-minded enough for any eventuality.


>> * It must be secure: a signature system may allow to identify the authors,
>> trustees, trusters, of a module, so that only trusted modules may be
>> actually evaluated.
> 
> Absolute security may be difficult (or impossible, in fact probably 
> impossible) to achieve. We do want the system to also work, and not spend 
> most of its time looking for Joe MacCarthays under every bush!
   Absolute security is seldom possible; but in a distributed system, there
are still a lot of idiot-proof checks that can and should be done. Checking
the originator of some trusted low-level code is one of them; ensuring that
all your code is garbage-collection aware is another one.


>> * It must be type-safe: any kind of typing can be supported by the format,
>> from the simplest one (no check) to the most complicated one (check of
>> program proof). Type modules are thus available; use only objects you
>> trust.
> 
> Whew. Tricky. Also, overheads (even in disabling type checking).
   Not so much overhead, and much more security. And not *that* tricky: this
should be included in the execution module opened to run some executable
object.



>>    Another basic constructor may be the (implicit of explicit) choice:
>> "choose whichever of these you please". So for example, some executable
>> game code may be given in multiple format (i386, M68K, PPC, sun4, 6502
>> or any assembly; or our LLL, or another LLL, or the new version of the
>> LLL, or TAOS' LLL, or ANDF, or even high-level language code !), and
>> the system will choose whichever fits best for speed and/or security.
>> Explicit choice is just the standard tuple/record construct.
>>
> Clumsy, potentiality for _big_ overheads, and generally to be avoided.
   I don't agree. That's just having generic support for choice in the
system; if humans choose, that's a browser or interaction system; if the
system chooses, that's a heuristic or any program.
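
   The overhead can be as small as one pass over the alternatives, as this
Python sketch suggests. The function choose, the scoring tables and the
placeholder code bodies are assumptions for illustration only.

```python
def choose(alternatives, score):
    """Return the best-scoring alternative; score() gives None when unusable."""
    best = None
    best_score = None
    for alt in alternatives:
        s = score(alt)
        if s is not None and (best_score is None or s > best_score):
            best, best_score = alt, s
    if best is None:
        raise LookupError("no usable alternative")
    return best

# One object offered in several encodings (bodies are placeholders).
game = [("i386", b"<i386 code>"), ("M68K", b"<M68K code>"), ("LLL", b"<LLL code>")]

# Hypothetical host preferences: native code first, portable LLL as fallback.
m68k_host = {"M68K": 10, "LLL": 5}
other_host = {"LLL": 5}

on_m68k = choose(game, lambda alt: m68k_host.get(alt[0]))
on_other = choose(game, lambda alt: other_host.get(alt[0]))
```

The constructor itself is the same in both cases; only the scoring function
changes, and it could just as well be a human picking from a browser.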


>>    Here we see that the semantics of the Low-Level Object Encoding Format
>> (LLOEF -- please try find a funnier acronym) are deeply related to those of
>> a high-level language for the system. Actually, the semantics should be
>> *the same*, and the LLOEF *is* the standard implementation of the HLL.
> 
> The low level & hi level are intimately concerned with one another. Screw 
> the LL and you can forget about the HL ever working properly..
   You may screw a language by over-defining things as well as by
under-defining them...

>>    Now, what about the LLL ?
>> Well, we saw that a LLL is only *one specific* way of encoding
>> low-level code in a portable manner; people may choose whichever
>> available LLL they please;
>> but of course, we'll provide the best one ever to be (-8, won't we ?
>> A "same" LLL may come in multiple kind of flavors (e.g. Mike's LLL with
>> 16, 32, or 64 bit stacks).

> Yeeargh. KISS. See my previous comm.
   Simple means leaving things undefined, not overdefining them. (What was the
other "S" of KISS again ?)


--    ,        	                                ,           _ v    ~  ^  --
-- Fare -- rideau@clipper.ens.fr -- Francois-Rene Rideau -- +)ang-Vu Ban --
--                                      '                   / .          --
MOOSE project member. OSL developer.                      |   |   /
				^
				corrected (2 p's was a french spelling)

Dreams about The Universal (Distributed) Database.       --- --- //
Snail mail: 6, rue Augustin Thierry 75019 PARIS FRANCE   /|\ /|\ //
Phone: 033 1 42026735                                    /|\ /|\ /