[LONG] comments on Tunes and the list archives
David Jeske
jeske@home.chat.net
Sun, 4 Oct 1998 12:09:07 -0700
Please forgive any redundancy between what I'm going to say and what
the Tunes discussions and documentation already cover, and if possible
point me to the relevant discussion. I've tried to tie my thoughts in as
much as possible.
This is split into "Practical" and "Logical" sections:
I. Practical
A. Kernels
Macro vs. micro vs. exo kernel designs have largely been centered around
choosing different 'speed vs. safety' tradeoffs. However, few groups
strive for new architectures which achieve both. The MIT Exokernel
has made interesting improvements in the "speed + safety" sum, particularly
for network stacks and the coexistence of multiple foreign filesystems. I
know relatively little about other systems which actually achieved
a performance speedup, but others have tried. The special-case approach
(as used in the MIT Exokernel) has been to create a domain-specific
language which can be safely brought into the kernel so that it can
run before the expensive kernel->user space transition. In fact, while
the early MIT Exokernel actually compiled these DSLs to machine code,
the current i386 version uses no such fancy runtime compiling, thus
demonstrating that the entire "speed + safety" advantage is gained by
being able to push code across the kernel->user space boundary.
(IMO) We should be focusing on removing the presence of this boundary from
the 'high-level language', so that the TUNES translator can
transparently decide which side of that boundary (and others)
generated instructions sit on.
1. Microkernels
A simple example of the paradox of preserving the kernel->user boundary
can be found in any multi-server microkernel. The logical safety of
the microkernel comes from providing more fine-grained isolation
than the macrokernel. However, the mechanism the microkernel uses to
enforce that isolation is the hardware MMU, so every interaction must
cross an address-space boundary. This results in many lines
of code in every server which are dedicated simply to shoving data into
and pulling it out of an IPC stream.
The 'safety/isolation' argument of the microkernel seeks to isolate code,
which, if taken to its logical conclusion, leads to a 'layering' of
servers, where every server does only its simple job. Take for example
the standard filesystem case. The 'proper' microkernel mechanism for
handling disks is to use a layered approach where the 'raw' disk driver
exports a device which is opened by a 'partition' server, which then
exports partition devices which can be talked to by 'filesystem' servers
(or other partition servers, operating as 'sub-partitions'). This is
a uniform model, something like:
<raw disk> -> disk driver -> N * <disk device>
<disk device> -> partition server -> N * <disk device>
<disk device> -> filesystem server -> filesystem
<disk device> -> database server -> database
Which (being repetitive for the sake of clarity) would often look
like this:
<raw disk> -> disk driver -> partition server -> filesystem server
                                              -> filesystem server
                                              -> database server
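As a rough sketch, the uniform <disk device> interface each layer exports
might look like the following in C. (The names here are hypothetical;
this is not code from Mach, VSTa, or the Exokernel.)

    /* Hypothetical sketch of the uniform <disk device> interface. */
    #include <stddef.h>

    struct blkdev {
        void *state;                            /* layer-private data */
        int (*read)(void *state, long block, void *buf, size_t nblocks);
        int (*write)(void *state, long block, const void *buf, size_t nblocks);
    };

    /* A 'partition server' layer: exports a blkdev whose block numbers
     * are relative to the partition, implemented on top of the blkdev
     * below it (the raw disk driver, or a parent partition). */
    struct partition {
        struct blkdev *below;   /* the device exported by the layer below */
        long start_block;       /* absolute block where this partition begins */
    };

    static int part_read(void *state, long block, void *buf, size_t nblocks)
    {
        struct partition *p = state;
        return p->below->read(p->below->state, p->start_block + block,
                              buf, nblocks);
    }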
However, microkernels themselves end up repeatedly violating their safety
arguments for speed. For example:
Mach: pulled the entire disk driver and filesystem codeblock into the
microkernel itself
VSTa: pulls the partitioning code into the disk driver
2. How this applies to TUNES
To directly respond to a passage from long ago on the list:
>>>> Apart for microkernels (that completely suck), I completely agree.
>>
>>> Well, can you explain WHY a microkernel sucks?
>>
>> He seems to think they're slow. I'm just listening, but I
>> haven't heard why yet.
>
> On a single processor machine sending messages to itself, yes it may be
> slow (although in my experience they haven't been), but on a
> multiprocessor machine (or networked) they gain speed since
> processing can proceed in parallel.
Given that paragraph, I'll pose a question and then answer it.
"When should we be incurring the IPC overhead?"
The answer is: ONLY when it's going to give us a speed advantage.
If we take the above paragraph as law (for the sake of discussion),
then we should only incur the IPC overhead when we
are operating on a multiprocessor machine, AND when the two halves
of the communication are not in lockstep, AND when the user->user IPC,
plus the hardware context switch overhead, plus the user->user return IPC
will take only a small percentage of the probable 'work time' of
the other side.
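To make that rule concrete, here is a minimal sketch of the decision the
translator would be making; the cost names and the 5% threshold are
invented for illustration.

    /* Sketch of the IPC-vs-inline decision; all costs are illustrative. */
    struct placement_costs {
        double ipc_send;        /* user->user IPC cost                    */
        double ipc_return;      /* user->user return IPC cost             */
        double context_switch;  /* hardware context switch overhead       */
        double est_work;        /* probable 'work time' of the other side */
    };

    /* Use IPC only on a multiprocessor, only when the two halves are not
     * in lockstep, and only when the round-trip overhead is a small
     * percentage of the work it overlaps with. */
    int should_use_ipc(const struct placement_costs *c, int ncpus, int lockstep)
    {
        double overhead = c->ipc_send + c->context_switch + c->ipc_return;
        if (ncpus < 2 || lockstep)
            return 0;
        return overhead < 0.05 * c->est_work;  /* 5% is an arbitrary cutoff */
    }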
This means that we need to abstract out the multi-server/IPC boundary,
and allow the translator to decide whether the generated code is
structured into a macrokernel (i.e. no IPC, no user->user space boundary)
or a microkernel (i.e. full serialized IPC, user->user space boundary)
physical organization.
In simple cases like the 'partition server', I think it's pretty clear that
the simple math required to redirect disk access from the relative
<partition 1 + 1000> to the absolute <block 5000> (i.e. assuming the
start of partition 1 at block 4000) will just be inlined into the same
address space as the disk server, much as VSTa has done manually (see the
sketch after this list). However, if we let the translator do it, we win
in many ways:
- we guarantee safety for that code
- we allow easy plug-in of different implementations of the same
block of code. This is because they will fit into the layered
'model', but they don't easily fit into VSTa's or Mach's
optimized 'fast path'.
- we can put layers of indirection anywhere, while only incurring the
  computational overhead of that layer... whereas in a microkernel, most
  of the time it's the 'structural' overhead of the layer which is
  not acceptable.
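For the partition case above, the entire 'layer' the translator would be
free to inline amounts to this (using the example block numbers from above):

    /* The whole 'partition server' computation: relative block 1000 in a
     * partition starting at absolute block 4000 becomes absolute block 5000.
     * Inlined into the disk server's address space, no IPC, serialization,
     * or context switch remains; only this addition. */
    long partition_to_absolute(long partition_start, long relative_block)
    {
        return partition_start + relative_block;   /* 4000 + 1000 == 5000 */
    }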
B. Translator
I've come up with many of the exact same thoughts as are expressed in the
'Meta-translator' documentation. Some of my justifications for the
translator concept have been a little more practical:
1. Allow cross boundary (kernel->user, function call, net IPC) optimization
I talked about improving both speed and safety by optimizing across
the kernel->user (or user->user) space boundary above in the 'Kernel'
section. However, this occurs just as much across units as small as
a function.
There was a compiler created for MIPS, which I'm not entirely familiar
with, that as I understand it would perform cross-function 'whole program'
optimization to give a net speed gain. However, as software is
layered into shared libraries, dynamically loaded code, or polymorphic
code, the ability of a 'compile time' system to perform these
optimizations is removed.
As I'm sure you're all aware, SELF demonstrated that these
optimizations can be performed at run-time. However, SELF found that
a single execution of a given program context was a good indication
of the types required for other invocations of the same context. In
other words, these kinds of optimizations remain largely static in the
case of static bindings. I believe that much of the reason the SELF
environment is so heavy to execute (large memory requirements) is that
binding within the environment is done in a 'language is the system
image' style like Smalltalk. If a 'strictly' layered approach were
taken, where each layer talked only to the layer below, I believe that
a SELF-like approach would have much lighter memory requirements, and
far fewer 'polymorphic' relationships to generate dynamic code
for. (I'm not advocating the SELF VM as the basis for the meta-vm;
however, it does have many of the properties required, as I'm sure
you are all also well aware from the Merlin/TUNES overlap.)
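For anyone less familiar with the SELF work, here is a crude sketch in C
of the kind of run-time specialization I mean. (The structures are
invented; SELF itself specializes on maps of real objects, not C structs.)
A polymorphic call site gets replaced with a guard on the type observed
during a previous execution plus an inlined body, falling back to full
dynamic dispatch when the guess is wrong.

    /* Crude illustration of type-feedback specialization. */
    enum type_tag { T_POINT, T_OTHER };

    struct object { enum type_tag tag; double x, y; };

    extern double generic_get_x(struct object *o);  /* slow, fully dynamic path */

    /* Specialized call site generated after observing that 'o' was a
     * T_POINT on the first execution of this program context. */
    double get_x_specialized(struct object *o)
    {
        if (o->tag == T_POINT)
            return o->x;          /* inlined monomorphic fast path */
        return generic_get_x(o);  /* guard failed: fall back to dynamic dispatch */
    }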
2. Financial advantages for using translation
Current x86 CPUs are constantly performing the work of translating
x86 instructions into simpler RISC ops. Furthermore, most modern
processors use quite a bit of room on the ASIC for blocks and routing
of sections which serve only to maintain the 'serial' nature of the
instruction stream while exploiting instruction level parallelism.
I would like to remove the work that the hardware is repeatedly
doing to schedule instructions and retire them in sequential order,
opting instead for a multi-level cache of translated code,
and allowing the compiler to manage the hardware's internal state.
In a sense, I'd like something closer to a 'pure VLIW' piece
of hardware, with a software translation and caching scheme to run
non-native instructions on the low-level hardware.
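A minimal sketch of the software side of that caching scheme follows; the
hardware, the translate step, and all of the names are hypothetical.

    /* Hypothetical translated-code cache: translate a non-native code
     * block on first use, then run the cached native version. */
    #include <stddef.h>
    #include <stdint.h>

    #define TCACHE_SLOTS 1024

    struct tcache_entry {
        const void *source;         /* identity of the non-native code block */
        void      (*native)(void);  /* translated code, ready to run         */
    };

    static struct tcache_entry tcache[TCACHE_SLOTS];

    /* Assumed translator: compiles one block for the underlying hardware. */
    extern void (*translate_block(const void *source))(void);

    void run_block(const void *source)
    {
        size_t slot = ((uintptr_t)source >> 4) % TCACHE_SLOTS;
        if (tcache[slot].source != source) {       /* miss: translate and cache */
            tcache[slot].source = source;
            tcache[slot].native = translate_block(source);
        }
        tcache[slot].native();                     /* run the native code */
    }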
Even if the translation scheme has less optimal performance than modern
'translation in hardware' processors, there are other advantages to
be gained, namely:
- cross-function optimization can better utilize registers, and avoid
  nested conditionals, which are incredibly bad for pipeline performance.
- currently 'binary compatibility' demands an unwarranted price premium
  in hardware (i.e. Intel chips are too expensive)
- the ability to choose underlying hardware based on performance needs
will increase. (if you need much better floating point performance,
get a chip with more or faster floating point units)
I really believe the biggest argument in support of the translator is
that economics dictate that hardware needs to be cheaper. As the number
of users approaches infinity, the per-unit cost of software approaches
zero, while the per unit cost of hardware approaches the production
cost (including production, packaging, testing, and shipping) of
the hardware. The translator (IMO) will allow cheaper hardware to be
built with comparable performance, even if only because it will
do away with the monopoly software momentum of
'binary instruction sets'.
C. Goals
I particularly like this summary of the possible goals I found in the
archives:
> 1. Goals
> --------
> I'm still not sure about the long term goals of the (tunes)
> project. What do we really want to do?
> - just have a bit of fun writing YET ANOTHER "O/S"
> (however smart, cute, fast, etc)?
> - make a really significant contribution to "o/s" design?
> - kick Mickeysoft up the a*** by writing such a hot-shit system
> with universal application that everyone leaps on the
> bandwagon and uses it extensively on their PCs
> (dream on)...
> - slavishly pay tribute to some "object-oriented" paradigm
> simply because it seems like a good idea..
> - etc?
I can't speak for everyone, and this is a grave oversimplification.
However, for me the core idea of a Tunes-like system is
"administration-less data organization and visualization". This is
apparent in the original 'Tunes CD database case study', and in
most discussions about Tunes.
Tunes could just as likely live within another OS as it could be
another OS... especially for the purpose of market penetration...
II. Logical
A. Translator
1. Providing a flat space for 'self-describing codeblocks' to accumulate
That probably sounds confusing; I don't have a better way to describe
it yet. Hopefully someone can help after reading this section.
The discussions I've seen on the meta-translator don't emphasize the
idea of 'establishing uniqueness' and then 'layering information' on
top nearly as much as my own ideas do. I have admitted to myself that
we (in the computer community) clearly have not solved the translation
problem, and it's possible that it will never be truly solved.
Instead of focusing on solving the translation problem up front, my
interest has instead been focused on creating what I think of as a
rigid 'super-typed' system for describing blocks to the system. The
translation system itself (as I envision it) doesn't actually know
the right way to translate things, but instead merely provides the
lowest-common-denominator framework for storage of blocks (data and code)
so that externally written translators can do useful translations.
Imagine this simplified case: current day compiling of a C program.
Currently, there is no way to uniformly describe to a system a block of
C code such that it can plug in different 'compilers' and produce
output data. Furthermore, that output data is not structured in a way
in which the system can do anything with it. Instead, systems up to
this point largely rely on 'ad-hoc' organizations which they create
in the 'single hierarchy filesystem'.
Now imagine this simplified case, where the _only_ change that we make is
to rigidly specify the parameters required to compile a C program into
a workable solution. The 'compiler' might not be a
compiler at all, but an interpreter; or
it might be a compiler which produces an intermediate form of the program
which needs an interpreter to run, but which is specified in a
form such that the system as a whole _knows_ what interpreter it needs
to run, because the form identifies itself in a robust way. Along with
this, the program itself would need to be changed slightly to be able to
access whatever datafiles were part of it out of this 'rigid' package
which had been created, instead of out of the ad-hoc organization of a
standard filesystem.
Looking back, that wasn't the best description I've ever given, but it'll
do for now.
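As a very rough sketch of what 'rigidly specifying' such a package might
look like (every field name and identifier here is made up for
illustration):

    /* Invented sketch of a self-identifying 'rigid package'. */
    struct form_id {
        unsigned char uuid[16];   /* globally unique identifier of a 'form',
                                     e.g. "ANSI C source" or "xyz bytecode" */
    };

    struct named_block {
        const char   *name;       /* role within the package, not a filesystem path */
        const void   *bytes;      /* the source text, datafile, etc. */
        unsigned long length;
    };

    struct package {
        struct form_id            form;         /* the form this package is in        */
        struct form_id            wanted_form;  /* the form it must reach to be 'run' */
        const struct named_block *blocks;       /* sources, datafiles, etc.            */
        unsigned long             nblocks;
    };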
If we didn't have any concept of what the appropriate 'high-level language'
was, the above system would allow us to create independent multi-level
translation tools which could automatically be plugged into the system,
and which the system could automatically utilize to 'run' a program
in a given form.
In a nutshell, it's really a simple source-to-source translation
system where the system actually understands what 'form' each level is
in, and has enough information to automatically run a program to get
to the next level. Just like a simple 'unit conversion system', it
merely knows it has 'miles' on the left side and it wants 'meters'
on the right, so it has to apply "miles to feet", "feet to inches",
"inches to centimeters", and "centimeters to meters".
I think significant progress towards the idea of Tunes could be made
with such a system, even though there would be no single HLL specified.
B. Aspect oriented interaction/optimization
Tunes strikes me as a concept where functional units should perform work
based on logical constructs, not on optimization details. Optimization
details should really be figured out orthogonally to functional blocks.
If you have not already, check out 'Aspect Oriented Programming' at
http://www.parc.xerox.com/spl/projects/aop/
There are some very interesting, if somewhat hard-to-grok, ideas there.
The ideas I've presented above, where optimization would occur across
boundaries, already start to incorporate Aspects at the most basic
level: namely, that the sequence of source instructions doesn't
necessarily translate to a linear sequence of machine instructions,
but instead serves as a description of the work to be done.
C. Removal of 'ad-hoc' organization/Metadata
I've read this section a few times, and I'm not very happy with it, but
through discussion I'm convinced that what I'm trying to say will
eventually come out.
This is speaking more to the 'data management' aspects of Tunes.
Current systems rely on 'ad-hoc' organizations which are created (much
of the time) in the 'traditional single-hierarchy filesystem'. That is,
applications give ad-hoc meanings to the hierarchy itself. Worse yet,
they can only give one meaning (because there is only one hierarchy),
so when they need another meaning, they are forced to bury it in a
proprietary data format. Neither meaning is self-identifying (NOTE: that
is very different from self-describing).
I suppose this idea could be described as 'providing more manageable
ad-hoc' just as much as it can be described as 'removal of ad-hoc'.
However, the idea is to allow entities (code or data blocks) to be
identified only as 'unique' in an absolute sense, and then to layer on
top of them whatever meanings accumulate over time. In my opinion, the
important characteristic is that every chunk of data is self-identifying.
That is, there is little confusion about whether a file is a codeblock or
a jpeg file, etc.
Analogously, I should be able to ask the system for data 'about' a
piece of data. It should be transparent where this data came from, and
whether it was derived from the data itself by a codeblock right then,
or whether it was stored as an attached field on the record. There should
be no confusion about what property I'm asking about (use some kind of
UUID/GUID).
For example, say there is a jpeg file, and some agent wants to know its
width and height. It should be trivial for the system to give back its
width and height (in pixels), whether it has to run code to do it or
not. If requesting the width and height of images is a common operation,
then the system may choose to store (i.e. cache) that data, but the
requester just asks for it.
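A minimal sketch of the requester's side of that (the property IDs and
the query_property entry point are invented, not a real API): the caller
names the property by a unique ID and never needs to know whether the
answer was stored or derived on the spot.

    /* Invented sketch of asking the system for data 'about' a piece of data. */
    #include <stdio.h>

    typedef unsigned char guid_t[16];

    /* Assumed system entry point: returns a stored attribute with this
     * property ID, or runs whatever codeblock can derive it, possibly
     * caching the result.  Returns 0 on success. */
    extern int query_property(const void *entity, const guid_t property,
                              void *out, unsigned long out_len);

    static const guid_t PROP_WIDTH_PIXELS  = { 0x01 /* rest of a made-up UUID */ };
    static const guid_t PROP_HEIGHT_PIXELS = { 0x02 };

    void print_image_size(const void *jpeg_entity)
    {
        long w = 0, h = 0;
        if (query_property(jpeg_entity, PROP_WIDTH_PIXELS,  &w, sizeof w) == 0 &&
            query_property(jpeg_entity, PROP_HEIGHT_PIXELS, &h, sizeof h) == 0)
            printf("%ld x %ld pixels\n", w, h);
    }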
Converters will inevitably need to be made, either because two people
specify different IDs for the same concept (i.e. width in pixels), or
because two people expect different things for the same ID (i.e.
one requester context assumed that it was width in screenspace, while
another assumed it was width in pixel space of the file).
D. Software Correctness by statistics, not by provability
I have an idea about validating correctness of a given codeblock by
sampling the huge user base over which the codeblock runs, instead of
trying to 'micro-validate' that codeblock.
This can be likened to the ideas I've talked about above, where in
this case, I'm defining 'software compatibility' as an entity, and
layering the 'metadata' about what is compatible on top, as that
data collects, instead of trying to pre-ordain compatibility by
having some developer arbitrarily choose a version number and test
a particular version. Get rid of the 'ad-hoc' version number, and
instead allow an expressive way to speak of the compatibility of
different 'guaranteed unique' builds of software.
For example, if a new implementation of a given codeblock
is released, it might be released
in 'test' mode, where the previous implementation would still be run
when the service was required; however, in spare CPU cycles, the
machine would run the new implementation with the same data, and
verify that (at whatever level it was supposed to) it works as the
old implementation did. This could happen on all machines, with the
results tabulated at a server and redistributed.
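A rough sketch of that shadow execution (every name here is invented):
the old implementation still answers the request, and in spare cycles the
new one is run against the same input, with agreements and disagreements
tallied for later reporting.

    /* Invented sketch of running a replacement codeblock in 'test' mode. */
    #include <string.h>

    struct impl {
        /* both implementations take the same input and fill the same output */
        int (*run)(const void *input, void *output, unsigned long out_len);
    };

    static unsigned long agreements, disagreements;  /* later tabulated upstream */

    int serve_request(const struct impl *old_impl, const struct impl *new_impl,
                      const void *input, void *output, unsigned long out_len,
                      int cpu_is_idle)
    {
        int rc = old_impl->run(input, output, out_len);  /* old code still answers */

        if (cpu_is_idle && new_impl != NULL) {           /* spare cycles: shadow run */
            unsigned char shadow[4096];
            if (out_len <= sizeof shadow &&
                new_impl->run(input, shadow, out_len) == rc &&
                memcmp(shadow, output, out_len) == 0)
                agreements++;
            else
                disagreements++;
        }
        return rc;
    }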
In some cases 'test' mode wouldn't work very well, because the new block
is supposed to return different data. However, it may be enough to
just run with the new block and validate that software 'works'
correctly.
In fact, it may be enough merely to have the 'guru' users submit
compatibility information, and have that transparently trickle down
to the normal end users, utilizing the fact that the system will have
'cutting edge' users and 'non-skilled' simple users.
Regardless of the mechanism for determining this compatibility
information, it (IMO) is much more powerful to derive this information
from the real-world performance of a codeblock than it is to have
a developer choose to 'intend' it to be 'mostly the same as version 1.4'.
--
David Jeske (N9LCA) + http://www.chat.net/~jeske/ + jeske@chat.net