[virtmach] Virtual Machines and Code Representation

Peter William Lount peter@smalltalk.org
Fri, 21 Jul 2000 15:36:24 -0700


Hi,

What is a virtual machine? Is it an interpreter? Is a virtual machine a
collection of primitives? Is it the low level components that tie a system
together?  Certainly the Smalltalk, Self and Java systems out there have
virtual machines. What about squeak? It has an interpretive VM that is
written in minimal Smalltalk and "compiled" to C. What are the boundries of
a VM and when is a VM not a VM?

In looking at various VM's, starting with early Pascal systems that used
P-Codes, most systems use byte-codes as an intermediate representation for
program code. Some VM's interpret these "byte-codes" while others compile
the byte-codes to native code just before execution (thus Just In Time or
JIT compiling). Oberon changed this by bypassing the "byte-code" stage and
using the compiliers parse-tree representation directly. Other systems have
intrepreted the parse-tree instead of compiling to native code. So there
are a variety of techniques that a virtual machine can use. 

Whether or not a particular Virual Machine uses byte-codes or parse-trees
for storing, interpreting or compiling to native code is not germain to the
issue of whether it's truely a virtual machine or not. This is a completely
independent issue.

For example, Smalltalk has about 240 bytes codes whereas the Self system
has reduced this to 7 (or 8) byte codes. Certainly if one were to modify
Smalltalk or Self so that they stored their program code as 'parse-trees'
and compiled directly to native code would not change the fact that they
are Virtual Machines. 

This leads us back to the question of what is a virtual machine? It seems
that it's something that isolates the "real hardware machine" from the
language or system above it. Almost like a hardware abstraction layer that
newer operating systems have. A virtual machine can consist of a number of
primitives. The original Smalltalk systems have about 240 primitives mostly
linked with byte codes while Self has hundreds of primitives which are not
linked to the 7 (or 8) bytes codes. A virtual machine assists in
portability across different kinds of hardware and operating systems. In
addition a virtual machine may be it's own operating system (as in the
original Smalltalk-80 system and the newer JavaOS).

The advantage of the "parse-tree" representation of programs and program
fragements is that there is more information avaliable about the context
that the code is in that gets lost with byte codes. Simply put, byte-codes
lose information. They have enough information to compile to native code,
but if that extra information was still avalible then it's possible that
additional optimizations could be applied to the code.

Newer processors like the Pentium III, PowerPC, PA-RISC, Alpha and now the
IA64 Itanium are pushing the compiler optimization technoloy beyond the
qualility and speed of what could be produced even a year or two ago. In
particular chip architectures like the IA64 will push software compiling
technolgy further and faster into the future since these chips have been
designed to rely heavily on software optimization, rather than hardware
tricks, to gain significant speed improvements. It's just silly and short
sighted to cripple a virtual machine with fixed byte codes in the light of
these new hardware improvements. The more flexible parse-tree
representations keep information that byte-codes don't and thus are much
more useful to code optimizers.

The pase tree representation is also elegant in that it simplifies the
number of representations from four (source text, parse tree, byte-codes,
native code) to three (source text, parse tree and native code). Simple but
not simplistic is better. Compressed parse-trees offer better space
efficiency than byte-codes. Smaller sizes means faster loading times.

At the 1998 OOPSLA a question was asked when the JAVA byte codes would be
improved. The answer, from a member of the team at SUN who is responsible
for the byte codes, was that it's very difficult to change the byte codes
since there are so many virtual machines out there from different vendors.
All would have to upgrade their systems at the same time. A difficult and
potentially nasty problem that essentially left the question only vaguely
answered with "were working on it and hope to have an upgrade in the
future". Not a pretty picture.

If Java was based on a parse-tree representation then there would be no
byte codes in need of being added or replaced. The source text would simply
be parsed into a parse-tree representation, compressed, stored and
transmitted around. When a java client (lets say a web browser) gets a
compressed parse-tree it simply compiles it to native code, intreprets it,
or even compiles it to byte-codes and then to native code. Byte codes make
a huge difference to the interoperatability of the "java" programs
distributed as "byte-codes" amoungst the various vendors virtual machines.
A compressed parse-tree representation makes NO difference in how any
particular vendor implements their version of the virtual machine. They can
make the best choices based on their own criteria leading to vastly
different virtual machine designs that can all run the same code. Try this
with your grandfather's byte-coding virtual machine. 

Of course there must be a "standard" or agreement on the "binary" and
"object" or "data structure" format of the compressed parse-tree
representation for this to work. This is an area that is well understood
from compiler research and production systems - although I'm sure that
there could be competing views on the best parse tree representation as
well. Ultimately the best representation for a program may be it's source
code and the language standard. This then bypasses all the above and gives
ultimate flexibily to the virtual machine implementor. 

The Envy Smalltalk code management system stores "source code" and avoids
all the pitfalls of byte-codes, parse-trees, native code and their like.
The system works incredibly well and is impervious to byte code changes
that bring Java to it's knees.

The beauty of the parse-tree representation does not stop with these
advantages. Since the parse tree is a direct reflection of the human (or
machine) written source code the parse-tree can be used for many other uses
including, but not limited to, visualization of the program, debugging,
education, storage, version control, self modifying code, and merging of
two small code fragments. Design symmetry and zen of minimalism at it's
best.

These advantages, and others, make for a compelling argument in favour of
modifying current and future virtual machine designs to use compressed
parse-tree represetations, like the Oberon system has pioneered. As the
state of the art in virtual machine technology progresses it may be that
compressed parse-trees will become the gold standard for virtual machines
and that bytes codes will become a thing of the past.

When building a system many factors come into play and basic design
choices can have huge impacts upon the results. The choice of
representation of programs in a virtual machine is one of these significant
choices. Yet supprisingly existing virtual machines could be retro fitted
without gutting the entire system. Virtual Machines are also a collection
of design choices that provide benefits over other systems. The exciting
part is that it seems that the range of possible, usefull and effective
design choices is expanding almost as fast as the hardware systems that
they make use of. Enjoy.

Food for thought,

All the best,

Peter William Lount
peter@smalltalk.org
http://www.smalltalk.org