A more generic portable binary encoding format

Johan Van Schalkwyk jvs@iconz.co.nz
Tue, 13 Dec 1994 01:49:28 +1300 (NZDT)


Dear Joyous Tuneful people..

On Sun, 11 Dec 1994, Francois-Rene Rideau wrote:

> 
>    There's been a lot of discussion about the low-level language for our
> system.
>    Of course we need such language, and it's a big part of our project.
> Now, I think that it's only a part, and not the major one: code is not
> everything, especially low-level one. What the user sees, what we all
> want to use, is high-level objects. A low-level encoding is meant to
> be transparent; 

Which is why it is so important to get it right, in fact, vitally 
important to do so! I firmly believe that the reason why most systems are 
such unmitigated fuckups is because people have skimped on the low level 
philosophy because they're in such a rush to pleasure themselves playing 
around with high-level constructs!!



>    Now, as I said, not only low-level running code is necessary in the
> system: we also need generic high-level code (that sometimes *cannot* be
> compiled efficiently to LLL, and may require a later compile and/or
> interpretation). And we need data: numbers, text, graphics, sound. But
> also data constructors and abstractors: arrays, lists, associations, sets,
> relations, distributions, trees, syntactic trees, graphs, lambda-expressions,
> etc, etc. Actually, we need any kind of high-level object.
>    That's why the LLL may be necessary, but only as a part of a more generic
> effort to allow inter-computer communication of objects.

Absolutely. So let's define our primitives NOW, and stick to them. When I 
say primitives, I really mean that! Nothing that we don't absolutely NEED 
on the low level should be present there. Throw out all unnecessary rubbish.

So, what primitives do we need? I would suggest:

A. "Passive items" -
	1. constants
	2. variables
	3. blocks of memory
	4. pointers
	5. 32 bit integers
	6. reals

B. "Active items"
	1. verbs
	2. "user defined functions" (I will call these "friends")
	3. "objects"

Before I expand on these, I would like to introduce a bit of heresy!

		I do not believe for a minute that it is desirable
		to preserve the same (OO, or whatever) format / structure
		throughout all levels of design! I will do even worse
		than this, I will argue by analogy. Look at _all_ the
		complex "hierarchical" systems in nature (be they quarks
		--> particles --> atoms --> molecules etc, or
		complex molecules --> organelles --> cells --> tissues
		--> organs --> organisms) and, although certain similarities
		may exist on all levels, you will be amazed at the differences
		between the levels. Specialisation and functionality on
		different levels are very dissimilar. Why? Because this
		heterogeneity WORKS! I think that if we religiously stick
		to a monomorphic design philosophy (eg "objects at almost
		all levels") we are making a terminal mistake.

Now, back to my types..

Integers and reals are self-explanatory.
Constants would come in two flavours:
	a. Universal to the whole system, and accepted by all:
		- ASCII CHARACTERS
		- a wide variety of symbols eg the Greek alphabet,
			characters with accents and special characters
			eg. French, Polish, Serbo-Croat..
			even Kanji..
		- a dictionary of the commonest 64k English words
	b. Local and specialised: if you are an engineer, one (or more)
		dictionaries of common engineering terms; if you are a
		doctor, several relevant dictionaries of medical terms..
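
To make the two flavours of constant concrete, here is a minimal sketch in Python. The tag values, the 16-bit tag / 16-bit index split and the tiny stand-in word lists are all assumptions made for illustration; the scheme above names the categories but not their encoding. A 64k dictionary has 65536 entries, so an index into it fits in 16 bits and leaves the other half of a 32 bit word free for a tag.

```python
# Hypothetical tags: the categories come from the text, the numbers do not.
TAG_ASCII = 0x01           # universal: an ASCII character
TAG_UNIVERSAL_WORD = 0x02  # universal: entry in the shared 64k dictionary
TAG_LOCAL_WORD = 0x03      # local: entry in a specialised dictionary

# Stand-ins for the real dictionaries (assumed contents).
UNIVERSAL_WORDS = ["the", "of", "and", "system"]
LOCAL_WORDS = ["torque", "flange"]          # eg an engineer's dictionary


def encode_constant(tag, index):
    """Pack a tag and a 16-bit dictionary index into one 32 bit word."""
    if not 0 <= index < (1 << 16):
        raise ValueError("index does not fit in 16 bits")
    return (tag << 16) | index


def decode_constant(word):
    """Recover the text a constant word stands for."""
    tag, index = word >> 16, word & 0xFFFF
    if tag == TAG_ASCII:
        return chr(index)
    if tag == TAG_UNIVERSAL_WORD:
        return UNIVERSAL_WORDS[index]
    if tag == TAG_LOCAL_WORD:
        return LOCAL_WORDS[index]
    raise ValueError("unknown constant tag")


message = [
    encode_constant(TAG_UNIVERSAL_WORD, UNIVERSAL_WORDS.index("system")),
    encode_constant(TAG_LOCAL_WORD, LOCAL_WORDS.index("torque")),
    encode_constant(TAG_ASCII, ord("!")),
]
decoded = [decode_constant(w) for w in message]
```

The receiver needs the same local dictionary to decode a local constant, which is the price of keeping the universal set small.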

Variables would be locally defined and could be assigned ANY value.

Blocks would be collections of words of a requested size. They could 
occupy RAM or disc space or whatever .. this would be transparent to the 
user, and management would be on a very low level, BUT you would be able 
to examine / acquire performance attributes (how long does it take to 
move data from block X to block Z?). Notice how I said collections of _words_.
This adheres strictly to my concept of having everything as 32 bit words, 
and is vital to ensure the integrity and simplicity of the model.

Pointers would be just that - pointers to offsets within a particular 
block, and again, entirely generic (not giving a damn whether the "block" 
is composed of RAM, disc memory, or whatever) but susceptible to detailed 
timing analysis.

NOTE THAT POINTERS ARE IN EFFECT NUMBERS, AND CAN THUS BE MANIPULATED AS 
NUMBERS - THIS SIMPLIFIES OUR TASK IMMENSELY. Note also that, as a block 
is composed of _words_, and each of our data types can be represented by 
a 32 bit word, a block can contain _any_ item, and any mix of items. 
In practice, I have found this setup to be very powerful.
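
A rough sketch of blocks and pointers along these lines, in Python. The `Block` class, the helper names and the pointer layout (block number in the high 16 bits, word offset in the low 16 bits) are assumptions for illustration only; all the text fixes is that a pointer is a number referring to an offset within a block, and that everything is a 32 bit word.

```python
WORD_MASK = 0xFFFFFFFF  # every item is held as one 32 bit word


class Block:
    """A collection of words of a requested size (backing store is hidden)."""

    def __init__(self, size):
        self.words = [0] * size

    def get(self, offset):
        return self.words[offset]

    def put(self, offset, value):
        self.words[offset] = value & WORD_MASK


def make_pointer(block_no, offset):
    """Hypothetical layout: high 16 bits = block number, low 16 = offset."""
    return ((block_no & 0xFFFF) << 16) | (offset & 0xFFFF)


def split_pointer(ptr):
    return (ptr >> 16) & 0xFFFF, ptr & 0xFFFF


blocks = {1: Block(8)}

p = make_pointer(1, 2)
p = p + 3                                 # a pointer is a number: just add to it
block_no, offset = split_pointer(p)

blocks[block_no].put(offset, ord("A"))    # a character constant
blocks[block_no].put(offset + 1, p)       # a pointer, stored like any other word
```

Because the pointer is itself a 32 bit word, it can sit in a block next to integers and constants with no special casing.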

Now we move onto active items..

There are three that I consider necessary:

1. Verbs. These are _simple_ low-level operators eg ADD, SUB, GET (from a 
pointer into a variable), PUT (into a pointer location) etc.

2. "Friends". These are user defined amalgamations of verbs, constants, 
etc. They can be used locally, but if you want to transport them to other 
users or systems, you must package them up into:-

3. Objects (modules or whatever). These should preferably be compiled.
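
As an illustration of verbs and friends, here is a minimal sketch in Python. The argument order of each verb, the `run` helper and the representation of a friend as a list of (verb, arguments) steps are assumptions, not a description of the system I actually use; only the verb names ADD, SUB, GET and PUT come from the list above.

```python
WORD_MASK = 0xFFFFFFFF

memory = [0] * 16   # one block of 32 bit words
variables = {}      # locally defined variables, each holding one word


# Verbs: simple low-level operators (argument order is an assumption).
def ADD(dst, a, b):
    variables[dst] = (variables[a] + variables[b]) & WORD_MASK


def SUB(dst, a, b):
    variables[dst] = (variables[a] - variables[b]) & WORD_MASK


def GET(dst, ptr):
    """Fetch the word a pointer variable refers to into a variable."""
    variables[dst] = memory[variables[ptr]]


def PUT(ptr, src):
    """Store a variable into the location a pointer variable refers to."""
    memory[variables[ptr]] = variables[src]


def run(friend):
    """Interpret a friend: execute its steps in order."""
    for verb, *args in friend:
        verb(*args)


# A "friend": a user defined amalgamation of verbs.
# This one increments the word that pointer variable "p" refers to.
increment = [
    (GET, "tmp", "p"),
    (ADD, "tmp", "tmp", "one"),
    (PUT, "p", "tmp"),
]

variables.update(p=3, one=1)
memory[3] = 41
run(increment)
```

The interpreter is a three-line loop, which is the sense in which interpretation stays simple when the primitives are kept this small.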

I have been using a system similar to this for several years, and for me 
it works well. It also seems to be sufficient. Although it seems fairly 
complex on first inspection, it actually is pretty straightforward, and 
interpretation and even compilation are a breeze!


> 
>    A given host may have its own optimized format for in-memory or even
> disk objects; a given parallel multi-CPU system may have its own optimized
> communication protocol. A local net may have its own optimized (or particular)
> object format for specific applications. But we *must* provide some generic
> communication format, so that machines all over the world may understand each
> other.

Yep. For ideas, vide supra.

>    This format must include *everything* possible, or to be possible one day.
> * It must be extensible: new encodings may be dynamically added to the format,
> by just lazily calling an external module (that will have to be present for
> the object to be decoded); see that extensibility *is* power: just *anything*
> should/may be directly expressed using this (extensible) format.

Yep.

> Also note that we may also change the format, if one day we find
> ways to considerably enhance it by using uncompatible techniques. 

Aaaaaaaaaaaaaaaaaagh. Why not just get it right first time, and leave lots
of room for extensions!!


> * It must be easy to deal with: being used heavily, encoding/decoding
> speed is important.

Yes. 

> * It must be compact: for the same reason, size is important; so it must be
> a binary format, with recommendations for optimized text encoding; and it
> should support any kind of possible object compression,

Yes, provided we do not screw up the basics just to save a bit or two of 
space.

> through the natural use of the (recursive) module system.
> * It must be portable: recommendations exists so that the encoding is
> independent on the size or format of words in the architecture.

Portable & compact yes. I'm not absolutely sure what you're getting at 
with the rest.


> * It must be secure: a signature system may allow to identify the authors,
> trustees, trusters, of a module, so that only trusted modules may be actually
> evaluated.

Absolute security may be difficult (or impossible, in fact probably 
impossible) to achieve. We do want the system to also work, and not spend 
most of its time looking for Joe McCarthys under every bush!


> * It must be type-safe: any kind of typing can be supported by the format,
> from the simplest one (no check) to the most complicated one (check of
> program proof). Type modules are thus available; use only objects you
> trust.

Whew. Tricky. Also, overheads (even in disabling type checking).
> 
> 
>    As I see things, objects are communicated by group, let's say *modules*.
> Actually, it's even simpler to say that's there's objects are communicated
> one by one, *but* that there's a generic way to encapsulate multiple objects
> into only one, which is known as a module (or generic object multiplexer).
>    A module is thus a (uniquely identified) object, that interacts with
> other modules by requiring or providing sub-objects. The basic constructor
> of a module may be the lazy evaluator, with explicit or implicit arguments:
> you ask for arguments with such properties. Implicit ones are given by the
> system (e.g. module asks: "gimme a .gz decompactor"; system replies: 
> "ok, here's gzip 1.2.4"). Explicit ones are explicitly given by the user.
> Of course, the system may also ask for user configuration/interaction when
> giving implicit parameters...

... going up .. See my comment about cells --> tissues --> organs above.

>    Another basic constructor may be the (implicit or explicit) choice:
> "choose whichever of these you please". So for example, some executable
> game code may be given in multiple format (i386, M68K, PPC, sun4, 6502 or any
> assembly; or our LLL, or another LLL, or the new version of the LLL, or
> TAOS' LLL, or ANDF, or even high-level language code !), and the system
> will choose whichever fits best for speed and/or security. Explicit choice is
> just the standard tuple/record construct.


Clumsy, potential for _big_ overheads, and generally to be avoided.


> 
>    Here we see that the semantics of the Low-Level Object Encoding Format
> (LLOEF -- please try find a funnier acronym) are deeply related to those of
> a high-level language for the system. Actually, the semantics should be
> *the same*, and the LLOEF *is* the standard implementation of the HLL.

The low level & hi level are intimately concerned with one another. Screw 
the LL and you can forget about the HL ever working properly..

> 
>    Now, what about the LLL ?
> Well, we saw that a LLL is only *one specific* way of encoding low-level code
> in a portable manner; people may choose whichever available LLL they please;
> but of course, we'll provide the best one ever to be (-8, won't we ?
> A "same" LLL may come in multiple kind of flavors (e.g. Mike's LLL with
> 16, 32, or 64 bit stacks).
> 

Yeeargh. KISS. See my previous comm.


> 
>    Ok. Whatdyathinkofit ?

As noted. JVS.

P.S.

> MOOSE project member. OSL developper.                     |   |   /
				  ^^^^^ developer?
01h50. Bedtime. Bye!