A more generic portable binary encoding format

Johan Van Schalkwyk jvs@iconz.co.nz
Wed, 14 Dec 1994 03:09:01 +1300 (NZDT)


Yeehaa. Here we go..

Thanks Mike for your comments. I agree..

On Mon, 12 Dec 1994, Francois-Rene Rideau wrote:

> >>[...]
> >>    That's why the LLL may be necessary, but only as a part of a more generic
> >> effort to allow inter-computer communication of objects.
> > 
> > Absolutely. So let's define our primitives NOW, and stick to them. When I 
> > say primitives, I really mean that! Nothing that we don't absolutely NEED 
> > on the low level should be present there. Throw out all unnecessary rubbish.
> 
>    Beware: what may be unnecessary to most may be indispensable to some; and
> some may badly need things that most others deem unnecessary.
>    So, we must separate what's useful, from what's just expedient; we must
> keep in the "kernel", the "grammar", generic constructs, and put in the
> "library", the "vocabulary", specific primitives.

*** NOPE. I am trying to define a MINIMAL and SUFFICIENT set of 
**PRIMITIVES** to create a fast, efficient and general system. So, if you 
feel that I've left out anything, or included anything that is redundant, 
mention that specific thing, rather than resorting to generalities, which 
are irrelevant to the argument!!



> 
> 
> > So, what primitives do we need? I would suggest:
> 
> > A. "Passive items" -
> > 	1. constants
> > 	2. variables
>   These are some very basic generic constructs to me...
> Perhaps we may also add monotonic variables; but I think it may and should be
> put elsewhere...
>   And why do you call it "passive" ? Functions can be constants or
> variables !!! Being constant or variable has *nothing to do* with being
> active or passive...

We'll get there. In time (after a few weeks of arguing) it will become 
absolutely crystal clear why I make the distinction!! It all boils down 
to my objective of writing core code of <= 20 lines!

> 
> > 	3. blocks of memory
>    Not all computers have a linear memory model (see hardware LISP machines,
> or software virtual machines of any kind that may be the underlying system).
>    Programmers seldom use linear memory; sometimes they use arrays, but even
> when they do, half of the time, they wish the stubborn array implementation
> had been implemented in some other hacked way more fit for their particular
> program. Conversely, people using functions, or case statements often wish
> their construct was implemented with a raw memory block.
>    That's why I think that memory blocks are some very low-level thing whose
> use should be decided by the optimizer (whether human or computer), not the
> high-level programmer (sometimes the same human/computer, though); but in
> consultation with him (high-level annotations may help the optimizer decide).
>    I'm not saying that our LLL won't provide these, but that it shan't be
> a part of the grammatical kernel of our system, just one of the lowest-level
> (and very important) of the standard vocabulary (library).


*** DING ** NOPE. I did quite a lot of work with lisp back in about 
1978-9. It has some brilliant features (the basic ideas are really good), 
but a lot of the implementation SUCKS. I have used some features of lisp 
in language design, but I'm not about to re-define my whole approach to 
low level design because some wanker has designed a lisp system that has 
nodes or whatever as its fundamental structure, or has automatic GC, or 
whatever. Lisp stands for "Lots of Irritating Single Parentheses" as well 
as "List Processing". Also when I was tinkering with Lisp, its string 
handling was atrociously clumsy. Grabbing a chunk of memory (of whatever 
type) and manipulating it aggressively is pretty basic to me. Please 
explain how one can cope (especially _on a low level_) entirely without 
it! Also please explain in what circumstance(s) it is redundant. 
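To make the "grab a chunk of memory and manipulate it aggressively" primitive concrete, here is a minimal sketch (in Python, purely illustrative -- the names `request_block`, `put` and `get` are mine, not a committed design): a block is just a requested number of 32-bit words, untyped, to be read and written however you please.

```python
# Illustrative sketch: a "block" as the low-level primitive described above.
# A block is just a requested number of 32-bit words (dwords); whether it
# lives in RAM or on disc is the allocator's business, not the user's.

MASK32 = 0xFFFFFFFF

def request_block(n_dwords):
    """Grab a chunk of memory: n_dwords zeroed 32-bit words."""
    return [0] * n_dwords

def put(block, offset, value):
    """Store a 32-bit word at a dword offset."""
    block[offset] = value & MASK32

def get(block, offset):
    """Fetch the 32-bit word at a dword offset."""
    return block[offset]

# Manipulate it "aggressively": treat the same words as characters,
# integers, whatever -- the block itself is untyped.
b = request_block(16)
put(b, 0, 0x48692121)           # four ASCII bytes packed in one word
put(b, 1, -1)                   # stored as 0xFFFFFFFF: two's complement
assert get(b, 0) == 0x48692121
assert get(b, 1) == 0xFFFFFFFF
```

The point of the sketch is that nothing below `put`/`get` is visible to the user: the block could be RAM, disc, or anything else.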



> 
> > 	5. 32 bit integers
> > 	6. reals
>    These are fine; but should really be included in the standard library, not
> the kernel: people may wanna use any kind of integers or reals, with any size,
> fix or floating point, etc. On the other side, the computer has got a little
> number of different builtin number sizes. We should allow transition from one
> to the other, thus support both approaches. There should be generic *library*
> routines to handle these...


NO!! Big fundamental philosophical mistake (in my humble opinion). Size 
of integer (or whatever) depends on "granularity" of system. As mentioned 
before, 32 bits is SUFFICIENT for almost all real problems (except eg 
nonlinear problems etc - at least as far as I can see). Many (most?) 
current systems, and certainly the ones we are using, will provide 32 
bits. So why screw around with multiple options 1 bit, 2 bits, 13 bits, 
27 bits, 31 bits, 32 bits etc when 32 bits is SUFFICIENT? In fact, in my 
opinion, 32 bits is MORE THAN SUFFICIENT. Example: try & create a RISC 
system (I mean design hardware more or less gate by gate) with 16 bit 
words (instructions..). Near impossible. 20 bits ok. 32 bits -- probably 
excessive. Try the same for many other problems, & you'll probably agree 
that 32 is MORE THAN ENOUGH. So make it our standard. If people want 
less, then they can just chop off the 32 bits to 20 or 8 or whatever. If 
they want more, (eg mucking around with complexity / chaos) then we 
provide a 64 bit float that is removed from our BASIC SYSTEM ARCHITECTURE 
WHICH IS 32 BITS. Again, when we come to coding, having a single 32-bit 
approach will be found to be most effective!!
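The "chop off the 32 bits to 20 or 8 or whatever" idea is easy to make precise: with one standard 32-bit word, a narrower integer is just the same word masked down. A quick sketch (Python, for illustration only):

```python
# With 32 bits as the single standard word, narrower integers cost
# nothing: they are the same word with the high bits masked off.

MASK32 = 0xFFFFFFFF

def chop(word, bits):
    """Truncate a 32-bit word to its low `bits` bits."""
    return word & ((1 << bits) - 1)

w = 0x12345678 & MASK32
assert chop(w, 32) == 0x12345678   # the full word, unchanged
assert chop(w, 20) == 0x45678      # a 20-bit "RISC-sized" value
assert chop(w, 8)  == 0x78         # a byte
```

Anyone wanting less than 32 bits just masks; anyone wanting more (the chaos/complexity crowd) gets the separate 64-bit float, outside the basic 32-bit architecture.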


> 
> > 	4. pointers
>    Here should be the basic abstraction of the low-level side of our system.
> The low-level begins with pointers. Local pointer size is host-defined;
> pointer size inside a file may be file-defined (or sub-file defined).

** DING. I'm referring to a pointer simply as a number indicating an 
offset in a block. Needless to say, this is just the number of dwords 
from the start of the (requested) block.  FILE ??? Doesn't exist. Not a 
primitive. Block, yes. File, unnecessary. Maximum value of pointer is 
conditional upon requested size of block. Out of range forces exception 
or error. (much more about error trapping much later on).
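A sketch of exactly that notion of pointer (Python, illustrative only; `deref` is my name for the operation): the pointer is a plain number, its maximum legal value is fixed by the requested size of the block, and out of range forces an error.

```python
# A pointer is just a number: the dword offset from the start of a
# requested block. Its maximum value is conditional on the block's
# requested size; out of range forces an exception.

def request_block(n_dwords):
    return [0] * n_dwords

def deref(block, ptr):
    """Follow a pointer (a plain dword offset) into its block."""
    if not (0 <= ptr < len(block)):
        raise IndexError("pointer %d out of range for block of %d dwords"
                         % (ptr, len(block)))
    return block[ptr]

blk = request_block(8)
assert deref(blk, 7) == 0          # last legal offset
try:
    deref(blk, 8)                  # one past the end
    assert False, "should have trapped"
except IndexError:
    pass                           # error trapped, as required
```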


> 
> 
> > B. "Active items"
> > 	1. verbs
> > 	2. "user defined functions" (I will call these "friends")
> > 	3. "objects"
> I see no clear distinction between these...


*** Verbs are primitive "functions": they take what they need off the 
stack and put back what they want. Friends are user-defined 
agglomerations (for want of a better word, at least until we have 
decided on notation, eg pre/in/postfix) of other primitives. Objects are 
packages with well defined interacting characteristics, and 
incorporating attributes of encapsulation, etc.

Dreadful analogy : verbs -- enzymes
		   friends - enzyme systems &c within cells
		   objects - cells!
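A rough model of the verb/friend distinction (the names and stack convention here are my own guesses, not a committed design): verbs are primitive operators that take what they need off the stack and push back what they want; a friend is just a user-defined agglomeration of verbs replayed in order.

```python
# Verbs: primitive stack operators. They pop their arguments and
# push their results; nothing else.

def ADD(stack): b, a = stack.pop(), stack.pop(); stack.append(a + b)
def SUB(stack): b, a = stack.pop(), stack.pop(); stack.append(a - b)
def DUP(stack): stack.append(stack[-1])

def friend(*verbs):
    """A friend: a user-defined agglomeration of verbs (or friends)."""
    def run(stack):
        for v in verbs:
            v(stack)
    return run

DOUBLE = friend(DUP, ADD)       # a friend built only from primitives

s = [3, 4]
ADD(s)                          # verb:   3 4 -> 7
DOUBLE(s)                       # friend: 7   -> 14
assert s == [14]
```

In the enzyme analogy: `ADD` is an enzyme, `DOUBLE` is an enzyme system; an object would then be a whole cell packaging such systems behind a defined interface.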


> 
> >		[...] Specialisation and functionality on
> > 		different levels are very dissimilar. Why? Because this
> > 		heterogeneity WORKS! I think that if we religiously stick
> > 		to a monomorphic design philosophy (eg "objects at almost
> > 		all levels" we are making a terminal mistake.
> I completely agree there is heterogeneity; what I want is smooth interaction
> and communication between different objects, and eventuality of a direct
> access between any indirectly linked universes.


Again, this will only occur if interactions are well defined and methods 
of communication are sufficient and simple.

> 
> 
> > Now, back to my types..
> > 
> > Integers and reals are self-explanatory.
> > Constants would come in two flavours:
> > 	a. Universal to the whole system, and accepted by all:
> > 		- ASCII CHARACTERS
> > 		- a wide variety of symbols eg the Greek alphabet,
> > 			characters with accents and special characters
> > 			eg. French, Polish, Serbo-Croat..
> > 			even Kanji..
> > 		- a dictionary of the commonest 64k English words
> 
>    All these kinds of things should be library objects, not unremovable
> elements of the system !!!
>    Various users or implementations may require or provide different languages,
> with their own encodings (uncompressed, compressed, etc) and dictionaries.
> 

It depends on what you want to do. Me, I want a system that really works, 
that can interact readily with real human beings, that ultimately 
understands plain English. The primitives in English are English words. 
If you want to mercilessly hack every English word from our system, go 
ahead. Me, I would quite fancy my schema. I have no more problem with 
having common English words as atoms than I do in having ASCII 
characters. Presumably you would at least allow me ASCII? If we've the 
space, why not shove in a lot of other symbols? I was really pissed off 
the other day because I had to compose a fax to Czechoslovakia, and 
couldn't find half the characters I needed, so had to alter all the c's 
etc by hand. Etc.
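One way (my own sketch, not anything agreed in this thread) the commonest-64k-English-words idea could live alongside ASCII in a single 32-bit constant space: a tag in the high bits of the word says how to read the low bits.

```python
# Hypothetical tagged 32-bit constants: the tag byte in the high bits
# says whether the low bits are an ASCII character or an index into a
# dictionary of up to 64k common words. Tags and layout are invented
# here purely for illustration.

TAG_ASCII = 0x01 << 24      # low 8 bits: an ASCII character
TAG_WORD  = 0x02 << 24      # low 16 bits: index into a 64k-word dictionary

dictionary = ["the", "of", "and", "system"]   # toy stand-in for 64k words

def decode(atom):
    tag = atom & 0xFF000000
    if tag == TAG_ASCII:
        return chr(atom & 0xFF)
    if tag == TAG_WORD:
        return dictionary[atom & 0xFFFF]
    raise ValueError("unknown tag")

assert decode(TAG_ASCII | ord('A')) == 'A'
assert decode(TAG_WORD | 3) == "system"
```

Accented and non-Latin characters would simply be further tags in the same 32-bit space, which is the point: having common English words as atoms is no stranger than having ASCII characters as atoms.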



> > 	b. Local and specialised. Eg. If you are an engineer, one (or more)
> > 		dictionaries of eg common engineering terms, if you are a
> > 		doctor, several relevant dictionaries of medical terms.
>    In the library vision, each object (even if multiple copies are around the
> world) is uniquely tagged and PGP-signed (err, PGP is recommended but not
> necessary; other protocols may be supported). Thus, you just require some
> globally identified (more or less distant) object and there it is ! For distant
> objects a copy will be made (which is why constant objects are preferred to
> variable ones: because no automatic update process is needed, not to talk
> about synchronized access or modification...).

I'd like to see the low level implementation of your suggestion vs mine 
before I would be prepared to comment definitively. I suspect that speed 
might differ considerably.


> > 
> > Variables would be locally defined and could be allocated ANY value.
>   Let's have variable scoping like in *any* good language (i.e. not C), with
> more or less local things; but what matters is: if you "see" an object, you
> access it, whether it is implemented as local, remote, distributed, or
> whatever...


Agreed, in principle. Details, we can argue about. 
> 
> 
> 
> > Blocks would be collections of words of a requested size. They could 
> > occupy RAM or disc space or whatever .. this would be transparent to the 
> > user,  and management would be on a very low level, BUT you would be able 
> > to examine / acquire performance attributes (How long does it take to 
> > move data from block X to block Z). Notice how I said collections of _words_.
> > This adheres strictly to my concept of having everything as 32 bit words, 
> > and is vital to ensure the integrity and simplicity of the model.
> 


I think that Mike has answered this one. We're talking LLL.
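The "examine / acquire performance attributes" idea from the quoted passage is worth a sketch (Python, illustrative only; a real LLL would expose this as a primitive, not via `time.monotonic`): where a block lives is hidden, but how long it takes to move data from block X to block Z is still observable.

```python
# Management of where a block lives (RAM, disc, whatever) is
# transparent, but performance attributes are not: you can measure
# how long moving data from block X to block Z takes.

import time

def request_block(n_dwords):
    return [0] * n_dwords

def move(dst, src):
    """Copy src's dwords into dst, returning elapsed seconds."""
    t0 = time.monotonic()
    dst[:len(src)] = src
    return time.monotonic() - t0

x = request_block(100000)
z = request_block(100000)
elapsed = move(z, x)
assert elapsed >= 0.0      # the attribute is observable, whatever its value
assert z == x
```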


>    Firstly, the user shouldn't even see blocks -- this should be transparent !
> Do word-processing secretaries manipulate blocks ? No, they manipulate
> formatted text documents. Do mathematicians manipulate blocks ? No, they
> manipulate symbolic expressions. Do number-crunchers manipulate blocks ?
> No, they manipulate reals, matrices, functions. Nobody wants to use blocks
> any more than they want to explicitly save documents (imagine a
> "Save the documents !" or "Save the matrices !" association...).
>    Blocks are fine for implementors, and that's all. When I use the system,
> I never ever wanna see blocks or pages. Let the system hackers have all the
> fun, and not disturb *unwilling* users with implementation dependencies that
> only annoy them (if they're willing, that's another problem: we *must* provide
> those ones access to it, as long as it is secure).
> 
>    Now, why have always 32 bit words ?
>    I mean, this could be something standard in our low-level libraries, but
> why force the system in whatever the future may give us to have 32-bit words ?
> I admit this could be a low-level convention for our first *implementations*,
> but not a system-wide specification ! Will you force a program which uses
> integers only up to 10000, but a 12 digit score number, running on a 8 bit
> machine, to use 32 or 64 bit numbers ?
>    Let's allow users (humans or computers) to choose !
> 

*** 1. See my notes above.
*** 2. Because I want ease of use, a small core, portability, correct 
granularity for a practical system. I _loved_ Z80 more than any other 
assembler, but 8/16 bits is not great in the real world. Why make the 
whole system many times more complex just for some idiot tinkering with a 
6502?

> 
> 
> > Pointers would be just that - pointers to offsets within a particular 
> > block, and again, entirely generic (not giving a damn whether the "block 
> > is composed of RAM, disc memory, or whatever) but susceptible to detailed 
> > timing analysis.
> > NOTE THAT POINTERS ARE IN EFFECT NUMBERS, AND CAN THUS BE MANIPULATED AS 
> > NUMBERS - THIS SIMPLIFIES OUR TASK IMMENSELY.
>    No, No, No, and yes.
>    I mean, surely the internal representation of a pointer in a given host
> will be an integer; but let's just not specify that; it will come immediately
> on machines with flat memory, but other solutions will fit much better for
> other architectures (lisp machines; big multiprocessors with tiny processors,
> etc).



*** If we really want to resolve this without bloodshed, we should 
probably examine detailed case studies. Do you have enough knowledge 
about "lisp machines, big multiprocessors etc"? I don't.




>    You gain *nothing* at over-specifying things about pointers for portable
> code; nothing. Let the human or computer optimizer do this for the particular
> architecture the code is executed on.


** You gain everything by correct specification. But let us write the 
code, and then decide whether we are "over-specifying" or "appropriately 
specifying" rather than becoming emotional!



>    In the meantime, let's have abstract operators in the only case where
> pointers and integers actually interact: arrays. Inlining them is quite easy.
> Not mixing integers and pointers is very simple to explain: that's the only
> way to avoid crashes that come from synthesizing pointers with integers...
> Surely, you gain nothing at doing the latter, but unportable hacks (which
> the human/computer optimizer can do itself, with or without your help through
> annotations).


I have not (I hope) ever said that pointers and integers are _identical_. 
I merely said that pointers are integers referring to a position in a 
block. This does not imply that you can simply take any old integer and 
"use it as a pointer" and expect to get away without an error - you first 
POINT the pointer to the block, then you play! 
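The "first POINT the pointer to the block, then you play" discipline can be sketched directly (Python, names mine, for illustration only): a pointer is a number you may freely do arithmetic on, but dereferencing it before it has been bound to a block is an error, which answers the crash-by-synthesized-pointer objection.

```python
# A pointer is manipulated as a number, but any old integer is not a
# pointer: it must first be POINTed at a block, and dereferences are
# then checked against that block's bounds.

class Pointer:
    def __init__(self):
        self.block = None       # not yet POINTed anywhere
        self.offset = 0

    def point(self, block, offset=0):
        self.block, self.offset = block, offset

    def __iadd__(self, n):      # pointer arithmetic: it's just a number
        self.offset += n
        return self

    def deref(self):
        if self.block is None:
            raise RuntimeError("pointer never POINTed at a block")
        if not (0 <= self.offset < len(self.block)):
            raise IndexError("offset out of range")
        return self.block[self.offset]

p = Pointer()
try:
    p.deref()                   # playing before POINTing: an error
    assert False
except RuntimeError:
    pass
p.point([10, 20, 30])
p += 2                          # manipulated as a number...
assert p.deref() == 30          # ...but only valid within its block
```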


> 
> 
> 
> > Now we move onto active items..
> > 
> > There are three that I consider necessary:
> > 
> > 1. Verbs. These are _simple_ low-level operators eg ADD, SUB, GET (from a 
> > pointer into a variable), PUT (into a pointer location) etc.
> > 2. "Friends". These are user defined amalgamations of verbs, constants, 
> > etc. They can be used locally, but if you want to transport them to other 
> > users or systems, you must package them up into:-
> > 3. Objects (modules or whatever). These should preferably be compiled.
> 
>    Let's have some LLL-dependent instruction set, undefined at high-level.
> This way, we allow further change of LLL, or using any kind of language
> (particularly assembly, or Scheme, or the SELF virtual machine, etc) for a
> LLL (if you're masochistic enough, you could use plain C as a LLL).

This is absolutely fucking nuts. Let us identify the NECESSARY AND 
SUFFICIENT LLL instruction set that we require, not some arbitrary crappy 
instruction set parasitised from (ugh) C.



>    And why have a hierarchical system at all ?
>    Let heterogeneity appear *naturally*, from the fact that we have atoms and
> constructors. In our standard file format, there should be a constructor for
> modules; but any language expressed therein can provide its own constructors
> for its own kind of modules, etc...
>    As for packaging objects to move them, that's another (quite important)
> issue, linked to that of having global (world-wide) identification mechanisms,
> and good algorithms/heuristics to determine the limits of an object.
> 
> 


This sounds like a recipe for inheriting all the ills that plague all 
languages everywhere! (Whoops, sorry, not a very constructive comment).



> 
> 
> >> Also note that we may also change the format, if one day we find
> >> ways to considerably enhance it by using incompatible techniques. 
> > 
> > Aaaaaaaaaaaaaaaaaagh. Why not just get it right first time, and leave lots
> > of room for extensions!!
>    Of course, that's what we're doing. But imagine that we further see that
> with another format, we can skip 5% space overhead, and 20% time overhead;
> shouldn't we move from one to the other ? I mean, I hope this won't be the
> case; but we should stay open-minded enough for any eventuality.
> 
> 
> >> * It must be secure: a signature system may allow to identify the authors,
> >> trustees, trusters, of a module, so that only trusted modules may be
> >> actually evaluated.
> > 
> > Absolute security may be difficult (or impossible, in fact probably 
> > impossible) to achieve. We do want the system to also work, and not spend 
> > most of its time looking for Joe McCarthys under every bush!
>    Absolute security is seldom possible; but in a distributed system, there
> are still a lot of idiot-proof checks that can and should be done. Checking
> the originator of some trusted low-level code is one of them; ensuring that
> all your code is garbage-collection aware is another one.

Agreed, at least in part. I have not touched on GC. A whole new ballgame. 
It's now 3am & I'm flying to South Africa tomorrow, so I think it's bye 
for now. I'll continue this as soon as I get connected in South Africa. 
It may be a few days. Bye for now!!!

JVS.