ReiserFS: Interesting (for me) article

Armin Rigo arigo@ulb.ac.be
Wed Feb 6 08:15:01 2002


Hello Sean,

On Wed, 6 Feb 2002, Sean Hogan wrote:
> the desire. Elderly programs should be untroubled in their operation. Many
> worthwhile projects fail because they emphasize how much they wish to change
> rather than asking of the user the minimal collection of changes necessary
> to achieve the added functionality.

Working on Psyco I recently discovered (or more probably became aware of) an
interesting way to handle strong legacy compatibility while supporting more
features incrementally, as optimizations only. This is the programmer's side
of what you ask of the user, but I am convinced that the two are linked (the
very distinction between programmer and user should eventually become more
of a continuum).

Here is a description of Psyco that gradually shifts towards these
Tunesey ideas (thanks to Fare for the discussions and the basic ideas).

--------------------------

I am currently working on an extension module for the Python language
(interpreted, strongly and dynamically typed), called 'Psyco'. Psyco runs
Python scripts several times faster than the interpreter core. It works as
an abstract interpreter that, operating on the Python pseudo-code, yields a
partially specialized version of the code. Specialization is done on
variables, each aspect of which can be specialized or not: its type, its
value for integers, its length for lists, or even the type of the i-th
item for n-tuples, and so on. The Psyco kernel dynamically chooses which
aspects to specialize. The produced code can be raw machine code or
interpreted code of a lower level than Python's bytecode.
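
To make this concrete, here is a minimal sketch in plain Python (all names
are invented for illustration; Psyco itself is written in C) of the kind of
per-variable record such a specializer works with:

    class VarAspects:
        """Hypothetical record of the aspects of one variable that the
        specializer may or may not have pinned down."""
        def __init__(self, known_type=None, known_int_value=None,
                     known_length=None, known_item_types=None):
            self.known_type = known_type              # e.g. int, list, tuple
            self.known_int_value = known_int_value    # the value, if an int
            self.known_length = known_length          # the length, if a list
            self.known_item_types = known_item_types  # per-item, for tuples

    # Code specialized for "x is the int 3, y is a pair of ints" would be
    # generated under these constraints:
    x = VarAspects(known_type=int, known_int_value=3)
    y = VarAspects(known_type=tuple, known_length=2,
                   known_item_types=(int, int))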

The base technique used by Psyco, which I am trying to develop in a
long-term project, is the following. Psyco is no replacement for the
classical Python interpreter; in fact, it requires it to work. Let us
think about the classical interpreter as a set of functions written in C.
Executing these functions emulates the execution of the Python code. Psyco
adds 'meta-variables' and 'meta-functions' which stand for the variables
and functions of the classical interpreter when abstract interpretation
takes place. At this time, each meta-variable describes the constraints
that will be satisfied by the corresponding variable during the actual
execution (like having a given type, or being stored in a given CPU
register at a given moment...). Each meta-function abstractly performs the
job of the corresponding function: for example, to add two Python
variables, the classical interpreter implements a function that has to
decode their types; the corresponding meta-function tries to determine
these types from the meta-variable constraints, at
abstract-interpretation time. Ultimately, a function with no
meta-function, or whose meta-function fails to handle a particular case,
triggers the emission of machine code that will call the function at
run-time. This means that everything works fine with no special support;
the only goal of meta-functions is to speed up the execution by moving as
much work as possible to pre-run-time. This can also be seen as providing
each function with a default meta-function which, unless reimplemented,
just emits the machine-code call to the function.
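
As a sketch only (the dispatch logic and the emitter interface below are
invented for illustration, not Psyco's actual API), the meta-function for
addition could look like this:

    def meta_add(mv_a, mv_b, emitter):
        """Meta-function for '+': mv_a and mv_b are meta-variables (as
        in the sketch above); emitter accumulates the produced code."""
        if mv_a.known_type is int and mv_b.known_type is int:
            if (mv_a.known_int_value is not None and
                    mv_b.known_int_value is not None):
                # Both values known at abstract-interpretation time:
                # do the addition now and emit no code at all.
                return emitter.constant(mv_a.known_int_value +
                                        mv_b.known_int_value)
            # Types known but not values: emit the fast integer addition.
            return emitter.int_add(mv_a, mv_b)
        # Unhandled case: fall back on the interpreter's C function.
        return emitter.call_c('PyNumber_Add', mv_a, mv_b)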

The same technique is applied to writing machine code. Each 'elementary
operation', like adding two integers, is first written as a C function. To
port Psyco to a new platform, all one has to do is support emitting the
'CALL' machine instruction. If no meta-function is found at
abstract-interpretation time, a call to the C function is automatically
written. To get reasonable performance one should (but does not have to)
supply the Psyco core with meta-functions for specifically supported
operations; the implementation of each such meta-function is to emit the
corresponding machine-code instruction. This is nothing more than an
optimization of the default meta-implementation emitting the C function
call.
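
Put as a sketch (again with invented names): a back-end only has to
provide the call emission, and every specialized meta-function is an
optional override of a default that goes through it:

    class BaseEmitter:
        """A port must supply emit_call; everything else has a default."""
        def emit_call(self, c_function, args):
            raise NotImplementedError   # the one mandatory primitive

        def int_add(self, a, b):
            # default meta-implementation: call the C-level operation
            return self.emit_call('c_int_add', (a, b))

    class X86Emitter(BaseEmitter):
        def emit_call(self, c_function, args):
            return 'CALL %s' % c_function

        def int_add(self, a, b):
            # optional optimization: one instruction instead of a call
            return 'ADD %s, %s' % (a, b)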

I believe that this technique is interesting because it hints at an open
reflexive system able to make reasonably good use of the available
external resources (both hardware and software). It goes beyond the
'primitive/user-level' dualization of the operations provided by a
language. We could start thinking about a more general frame in which each
operation can have one or several implementations at various levels.
'Non-primitive' generalizes to 'with only one implementation, at the same
level as the language'; 'primitive' generalizes to 'without an
implementation in the language but with a meta-implementation for each
supported platform'. Such a general frame should be able to integrate many
more interesting cases.
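
The bookkeeping of such a frame might look like this (a sketch; every
name is hypothetical):

    # One operation, several implementations at several levels.
    implementations = {
        'int_add': {
            'python': lambda a, b: a + b,   # same level as the language
            'C':      'c_int_add',          # the interpreter's C function
            'x86':    'ADD',                # per-platform meta-implementation
        },
        'harmonic_mean': {
            'python': lambda a, b: 2 * a * b / (a + b),
        },   # the generalized 'non-primitive': one in-language implementation
    }

    def pick(op, platform):
        """Prefer the most specialized implementation available."""
        levels = implementations[op]
        return levels.get(platform, levels.get('C', levels['python']))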

Psyco is nothing but a small step in this direction: functions have just
one implementation, with possibly one meta-implementation, or one for each
of some of the supported platforms.

A more complete frame should moreover be able to express the levels that
meta-implementations link, generalizing the idea that a function written
in C and the machine code that calls it on some CPU are linked by Psyco's
default meta-implementation, and that a compiler can be based on a set of
other meta-implementations able to write better code for the elementary
operations. This requires a reflexive system.

Ideally, the frame would specify very detailed implementations of the
basic operations in a basic language (e.g. the integers and their
operations by inductive constructions). This of course already exists in
various languages. Equally of course, no compiler is able to guess how
such implementations could possibly run fast on modern processors: this
requires expressing things outside of the original language, which
meta-implementations can do. A meta-implementation of the
already-implemented integer addition would emit the 'ADD' machine
instruction and express the fact that this gives a better implementation
of the original one in the current context.
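
Sketched in Python rather than in such a basic language (and with a
hypothetical emitter interface): the inductive in-language implementation,
and the meta-implementation that supersedes it from outside:

    def add(a, b):
        """The fully formal in-language implementation: addition defined
        by induction on b, using only successor and predecessor."""
        return a if b == 0 else add(a + 1, b - 1)

    def meta_add(emitter, a, b):
        # Steps outside the language: computes nothing, but emits code
        # and, in a real frame, would record that this code implements
        # add() in the current context.
        return emitter.emit('ADD', a, b)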

Naturally, the idea generalizes when one considers that the original
language, together with each of the target machine languages, are only
levels among others, all of which should be defined in the frame itself.
'Levels' and 'relations between levels' are just the usual concepts not
only of source-to-compiled but also of interface-to-implementation.

Starting from there, the programmer is no longer constrained to a fixed
level by the language they use, but can build a whole program over numerous
levels by linking them with the meta-implementations they wish (a
privilege traditionally reserved for compilers). Of course, levels are
always implicitly present (in compilers, assemblers, linkers...).
Expressing them explicitly could let us go beyond all the usual tricks one
uses to glue various levels together. Besides, the effective formalization
of a language has numerous advantages (e.g. proofs of correct
implementations, experiments with, and better yet everyday use of, changes
in the language...).

Finally, such a frame would become really interesting when the links
between most levels form loops, that is, when we have meta-implementations
(explicit or default ones) that let us go from any one level to another,
directly or not. For example, instead of a meta-implementation of
addition that writes an 'ADD' instruction, one can more abstractly define
the correspondence between language operations and machine instructions
(in data structures) and automatically produce meta-implementations from
it, just like one can consider a default meta-implementation in Psyco as
automatically produced from the description of the signature of the C
function. From it, other meta-implementations can be automatically
derived, e.g. to disassemble machine code, or to translate it into
higher-level code that can be interpreted or recompiled on another
machine.
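
A sketch of that derivation (invented names): the correspondence is plain
data, and the meta-implementations are manufactured from it mechanically:

    # The language-operation/machine-instruction correspondence, as data:
    opcode_table = {
        'int_add': 'ADD',
        'int_sub': 'SUB',
        'int_mul': 'IMUL',
    }

    def make_meta(instruction):
        def meta(emitter, a, b):
            return emitter.emit(instruction, a, b)
        return meta

    meta_implementations = dict(
        (op, make_meta(instr)) for op, instr in opcode_table.items())

    # ...just as a default meta-implementation can be manufactured from
    # the signature of a C function:
    def make_default_meta(c_name, arity):
        def meta(emitter, *args):
            assert len(args) == arity
            return emitter.emit('CALL', c_name, *args)
        return meta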

None of the above benefits is new; each one is a large research subject.
But I believe that a number of them have seen few applications because of
the unnatural distortion one needs to express relationships between levels
inside one such level. A compiler is split into passes that each translate
from one level to another; when the passes are explicitly coded, the
compiler is flexible: one can locally change the techniques involved, or
the source or target levels. An interpreter is likewise more flexible
than a compiler for the same language. On the other hand, a just-in-time
compiler must be very fast and typically has to be deeply rewritten for
each platform. One of the interests of the above frame is to reduce the
costs incurred by flexibility. This can be done by tying together (by
specialization) the implementations or meta-implementations that
correspond to the various passes that apply in a given situation. Indeed,
if they are written in a sufficiently high-level language, the composition
of two passes can be optimized, e.g. by specializing the second pass to
the particular output of the first pass.
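
A toy instance of this fusion, with both passes shrunk to lookup tables so
that specializing the second pass to the first's output is just a table
composition (illustrative only):

    # pass 1: source operator -> intermediate operation
    pass1 = {'+': 'int_add', '-': 'int_sub'}
    # pass 2: intermediate operation -> machine instruction
    pass2 = {'int_add': 'ADD', 'int_sub': 'SUB'}

    def translate_slow(op):
        return pass2[pass1[op]]       # flexible: two passes, two lookups

    # Specializing pass 2 to the outputs of pass 1 fuses the two passes:
    fused = dict((src, pass2[ir]) for src, ir in pass1.items())

    def translate_fast(op):
        return fused[op]              # same result, one lookup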

By the way, the same technique allows for theoretically trivial
optimizations, e.g. reducing a JPEG image decoder into an algorithm that
only extracts the size of the image from a file. (Note that this reduces a
potentially time-consuming algorithm to a constant-time one.)
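
For concreteness, here is roughly what the reduced algorithm amounts to (a
hand-written sketch, not the output of any actual specializer): all that
survives of the decoder is the skipping of segments up to the frame header
and the reading of two 16-bit fields:

    import struct

    def jpeg_size(data):
        """Return (width, height) of a JPEG image without decoding it."""
        assert data[:2] == b'\xff\xd8', 'not a JPEG (no SOI marker)'
        i = 2
        while i + 4 <= len(data):
            marker = data[i + 1]
            # SOF0..SOF15 (except DHT/JPG/DAC) carry the frame header.
            if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
                height, width = struct.unpack('>HH', data[i + 5:i + 9])
                return width, height
            # Any other segment starts with a 2-byte big-endian length.
            seg_len, = struct.unpack('>H', data[i + 2:i + 4])
            i += 2 + seg_len
        raise ValueError('no frame header found')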

Besides pure optimization, numerous maintenance tasks that high-level
languages hide in the compiler (memory, disk, connections...) can be
expressed as a choice of meta-implementations. The programmer is then free
to choose the ones most appropriate to their problem. Moreover, expressing
this very fact means that trying several meta-implementations and choosing
the best-suited one can even be done automatically, without the programmer
or the user being aware of it. (One important application is the dynamic
rewriting of data structures depending on the current usage patterns.)
Thanks to some language independence, this is also a step towards the
incorporation of research-level techniques (in garbage collection,
caching...) into directly useful industrial systems.
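
A toy example of such dynamic rewriting (purely illustrative; the
container, the trigger and its thresholds are all invented):

    class AdaptiveBag:
        """Starts as a list (cheap ordered appends) and rewrites itself
        into a set once the observed usage is dominated by membership
        tests: the kind of decision the frame could take automatically."""
        def __init__(self):
            self._items = []
            self._is_set = False
            self._lookups = 0

        def add(self, x):
            if self._is_set:
                self._items.add(x)
            else:
                self._items.append(x)

        def __contains__(self, x):
            self._lookups += 1
            if (not self._is_set and self._lookups > 32
                    and len(self._items) > 64):
                # usage-pattern trigger: rewrite the representation
                self._items = set(self._items)
                self._is_set = True
            return x in self._items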

In summary, I consider as a main target the effective formalization, in an
open system, of the levels a programmer handles, and subsequently of all
aspects of a computer that software can handle.


Armin