Strongly typed system

Armin Rigo arigo@planetquake.com
Fri, 5 May 2000 13:05:06 +0200 (METDST)


[First of all, excuse me if the principles I set out here have already been
discussed. I am new to the Tunes project and the mailing list archives are
very large!]

There are two basic principles a modern system should, in my opinion, rely
on:

 * as far as possible, the system manages communication
   between applications (let's call them "modules"); it
   is *not* the application that should directly ask the
   system (or other modules) for some functionality.

 * the system must know exactly what it is doing at any time.

For more about my position on the first point see
http://student.ulb.ac.be/~arigo/bazar/infobase/why.html.

The second point calls for an advocacy of strongly typed systems. [Note:
below I go against some important points of languages like Slate. If I
miss something important, please attack my position.]

Let's consider the language level first. It seems quite natural to me to
ask that the type of the objects you are working with define the
possible operations on them. It seems obvious, but as far as I know no
programming language applies this rule at the syntax level: using an
operator, calling a function or sending a message is always something
defined (and allowed) by the syntax alone. This is true whether you
consider Pascal/C++-like languages, in which "foo(a,b)" is syntactically
correct whatever "foo", "a" and "b" are, or Lisp/Forth-like languages,
where it is even worse, because the problem of a fixed syntax is solved by
introducing a very general unified syntax, so that something like "(1 2
3)" is always syntactically correct.

I believe that such a general syntax misses the point. Of course it opens
the door to powerful metaprogramming possibilities, but it lets us write
far too many "correct" but meaningless expressions. This goes against
Tunes' philosophy of Precision. Also, the fact that the syntax itself no
longer means anything, in the sense that all expressions share the same
syntax, is not satisfying in my opinion.

I think that the very type of the objects should define the syntax of the
valid expressions that deal with them. This calls for a tight interaction
between the parsing, compiling and running passes, of course, but it is
something we want anyway. For example, a type like "function with 2
arguments" should inherently define the syntax "foo(*,*)" -- and let the
programmer customize it; but my point here is to narrow the gap between
"well-formed" and "meaningful" expressions.
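The idea above can be sketched in a few lines. This is only an
illustration (all names are invented, not part of any existing system):
a "function with 2 arguments" type object supplies both the call syntax
pattern and the well-formedness check, so an expression with the wrong
number of arguments is rejected at the syntax level, not later.

```python
# Hypothetical sketch: the *type* of an object supplies the syntax
# patterns that are valid for it.

class FunctionType:
    """A 'function with N arguments' type, which inherently defines
    the call syntax foo(*,...,*)."""
    def __init__(self, arity):
        self.arity = arity

    def syntax_pattern(self, name):
        # e.g. a 2-argument function yields "foo(*,*)"
        return "%s(%s)" % (name, ",".join("*" * self.arity))

    def accepts_call(self, args):
        return len(args) == self.arity

def check_expression(env, name, args):
    """An expression is well-formed only if the type of `name`
    allows a call with that many arguments."""
    t = env.get(name)
    return isinstance(t, FunctionType) and t.accepts_call(args)

env = {"foo": FunctionType(2)}
assert env["foo"].syntax_pattern("foo") == "foo(*,*)"
assert check_expression(env, "foo", ["a", "b"])      # matches foo(*,*)
assert not check_expression(env, "foo", ["a"])       # rejected at parse time
assert not check_expression(env, "bar", ["a", "b"])  # unknown name: no syntax at all
```

In a real system the patterns would of course be richer and programmer-
customizable; the point is only that the parser consults types.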

This solution should be seen in a more general scope, based on the
principle that the type of the data that the language/the system/whatever
else manipulates is fundamental. For example, with that idea in mind, a
parser-generated binary version of the source code could be very compact:
there is no need to encode, say, the number of arguments of an
expression -- this is determined by the type of the expression itself.
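The compactness argument can be made concrete with a toy encoding
(entirely invented for illustration): because the type table already
records each operator's arity, the encoded stream never stores an
argument count; the decoder recovers it from the types.

```python
# Toy sketch: arity comes from the type table, not from the stream.

ARITY = {"foo": 2, "neg": 1}   # type information known to both sides

def encode(name, args):
    # just the operator followed by its operands, no count byte
    return [name] + list(args)

def decode(stream, pos=0):
    name = stream[pos]
    n = ARITY[name]            # arity determined by the type
    args = stream[pos + 1 : pos + 1 + n]
    return (name, args), pos + 1 + n

expr, end = decode(encode("foo", ["a", "b"]))
assert expr == ("foo", ["a", "b"]) and end == 3
```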

This seems to address a number of the problems listed in "LLL
difficulties" (http://tunes.org/LLL/index.html#Difficulties). All data in
memory must be of known type, so there is no need for tricks like using
the low bit of integers to specify whether the data is an integer or a
pointer. Similarly, the "mark" pass of the garbage collector becomes
straightforward and opens the door to great optimizations.

Warning: I am not saying that there must be a way, given a pointer to
"raw data", to know the type of the data it points to -- neither by a
high-level page directory look-up, nor by adding type information in the
header of every object. Instead, the way we *reached* an object should be
enough to know its type. This applies to the GC too: during the
traversal of all living objects, it starts from roots of well-known
types and follows pointers according to the type definitions; the latter
also specify the type of the data each pointer points to. This can be done
in a flexible way, e.g. through type parameters: a pointer could point to
data of type T, with T to be fetched from some other field of the
structure that holds the pointer. For example, a bitmap image could be a
structure with "width", "height" and "pixel-type" fields, plus a pointer
to an array of [height] arrays of [width] items of type [pixel-type].
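The bitmap example can be sketched as follows (all names invented):
the collector never inspects a per-object header; it follows pointers
according to the type definition, and a pointee type parameter such as
the bitmap's pixel type is fetched from a sibling field of the structure
holding the pointer.

```python
# Sketch of type-directed GC traversal with type parameters.

class PtrTo:
    """Pointer field whose pointee type is given by another field."""
    def __init__(self, type_field):
        self.type_field = type_field

BITMAP_TYPE = {
    "width": "int",
    "height": "int",
    "pixel_type": "type",           # holds the type parameter T
    "pixels": PtrTo("pixel_type"),  # [height] x [width] items of T
}

def trace(obj, typedef, visit):
    """Follow every pointer field, resolving type parameters from
    sibling fields -- no per-object type header needed."""
    for field, ftype in typedef.items():
        if isinstance(ftype, PtrTo):
            pointee_type = obj[ftype.type_field]
            visit(obj[field], pointee_type)

found = []
bitmap = {"width": 2, "height": 1, "pixel_type": "rgb24",
          "pixels": [[(0, 0, 0), (255, 255, 255)]]}
trace(bitmap, BITMAP_TYPE, lambda target, t: found.append((t, target)))
assert found == [("rgb24", [[(0, 0, 0), (255, 255, 255)]])]
```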

How efficient is this for the GC? Parsing a complex type definition for
each object we encounter is certainly not efficient. We need some kind of
GC-optimized version of the type definition. Note that with reflection
this means we can have automatically generated GC code for each type of
data. This seems to me both easily implementable and the fastest
solution: there is no need to scan anything to find the pointers to
sub-objects; a generated routine simply calls the GC routines
corresponding to the types of the sub-objects at given offsets. These
subroutines are subject to all the usual optimizations (e.g. inlining).
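A minimal sketch of the "generated GC code per type" idea (names are
illustrative): instead of parsing the type definition during every
collection, we generate, once per type, a mark routine with the pointer
offsets baked in, which simply calls the mark routines of the
sub-objects' types.

```python
# Sketch: one generated mark routine per type, offsets baked in.

def make_mark_routine(pointer_offsets, mark_table):
    """Generate a routine that marks an object and recurses into its
    pointer fields at fixed offsets -- nothing is scanned at GC time."""
    def mark(obj, marked):
        if id(obj) in marked:
            return
        marked.add(id(obj))
        for offset, target_type in pointer_offsets:
            child = obj[offset]
            if child is not None:
                mark_table[target_type](child, marked)
    return mark

mark_table = {}
# a cons cell: two pointer fields, both pointing to cons cells
mark_table["cons"] = make_mark_routine([(0, "cons"), (1, "cons")],
                                       mark_table)

tail = [None, None]
root = [tail, None]
marked = set()
mark_table["cons"](root, marked)
assert marked == {id(root), id(tail)}
```

In a compiled system these generated routines would be ordinary code and
thus subject to inlining and the other optimizations mentioned above.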

A final note on the "infix pointer" problem (pointers pointing inside a
larger structure): either we differentiate between pointers that can
and cannot point inside another object, or (as I would prefer) the type
of the pointer gives enough information about the "range" of the data
that can be reached through it. In other words, if we have a big
structure in memory which is no longer used, except for a pointer to one
of its small parts, this pointer's type should ensure that we can only
reach that small part through it (no pointer arithmetic unless otherwise
specified by the pointer type). The GC would then simply keep this small
part in memory, because it has been reached, but completely ignore (and
thus free) the rest of the big structure.

Of course this raises the question of finalization: in this compact
model, the GC would not even know about the objects it should finalize,
so calling finalization routines would be a problem. Maybe we can believe
that in a "brave new world" of persistence etc. there is no more need for
finalization. For the remaining exceptions we could consider a special
set of memory pages containing arrays of atomic (raw) objects which
require finalization. For example, if we need file handles that must be
explicitly closed, we could have an array of file handles known to the
GC, so that when some region of this array is freed, the GC can call any
necessary finalization routine.
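The file-handle suggestion could look something like this (a sketch with
an invented API, not a real GC interface): raw objects needing
finalization live in a table the GC knows about, and when the GC finds a
slot unreachable it runs the finalization routine before freeing it.

```python
# Sketch of a GC-known table of raw objects requiring finalization.

class HandleTable:
    def __init__(self, close):
        self.slots = {}        # slot index -> raw handle
        self.close = close     # finalization routine
        self.next = 0

    def alloc(self, handle):
        i = self.next
        self.next += 1
        self.slots[i] = handle
        return i

    def sweep(self, reachable):
        """Called by the GC: finalize every slot not in `reachable`."""
        for i in list(self.slots):
            if i not in reachable:
                self.close(self.slots.pop(i))

closed = []
table = HandleTable(close=closed.append)
a = table.alloc("file-A")
b = table.alloc("file-B")
table.sweep(reachable={a})     # slot b was freed by the GC
assert closed == ["file-B"] and a in table.slots
```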


Armin Rigo