[LONG] comments on Tunes and the list archives

David Jeske jeske@home.chat.net
Sun, 4 Oct 1998 12:09:07 -0700


Please forgive any redundancy between what I'm going to say and what the
Tunes discussions and documentation already cover, and if possible
point me to the relevant discussion. I've tried to tie my thoughts in
to the existing material as much as possible.

This is split into "Practical" and "Logical" sections:

I. Practical

 A. Kernels

    Macro vs. Micro vs. Exo kernel designs have largely been centered around
    choosing different 'speed vs. safety' tradeoffs. However, few groups
    strive for new architectures which achieve both. The MIT Exokernel
    has made interesting improvements in the "speed + safety" sum, particularly
    for network stacks and the coexistence of multiple foreign filesystems. I
    know relatively little about other systems which actually achieved
    a performance speedup, but others have tried. The special-case approach
    (as used in the MIT Exokernel) has been to create a domain-specific
    language which can be safely brought into the kernel so that it can
    run before the expensive kernel->user space transition. In fact, while
    the early MIT Exokernel actually compiled these DSLs to machine code,
    the current i386 version uses no such fancy runtime compiling, thus
    demonstrating that the entire "speed + safety" advantage is gained by
    being able to push code across the kernel->user space boundary.

    (IMO) We should be focusing on hiding this boundary from the
    'high-level language', so that the TUNES translator can
    transparently decide which side of that boundary (and others) the
    generated instructions sit on.
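
    As a purely illustrative sketch of "pushing code across the
    boundary", imagine a tiny, bounded predicate language that user code
    hands to the kernel, in the spirit of the exokernel packet filters.
    None of the names below come from the MIT Exokernel; they are made
    up for this example, in C:

      /* A user-supplied filter the kernel can verify and run before
       * the expensive kernel->user transition. Bounded and pointer-free,
       * so it is safe no matter which process installed it. */

      #include <stddef.h>
      #include <stdint.h>

      struct match_op {
          uint16_t offset;   /* byte offset into the packet */
          uint8_t  value;    /* byte value which must match */
      };

      struct packet_filter {
          size_t          n_ops;    /* bounded so the kernel can check */
          struct match_op ops[16];
      };

      /* Runs inside the kernel: a bounded loop over in-range bytes. */
      static int filter_matches(const struct packet_filter *f,
                                const uint8_t *pkt, size_t len)
      {
          size_t i;
          for (i = 0; i < f->n_ops && i < 16; i++) {
              if (f->ops[i].offset >= len ||
                  pkt[f->ops[i].offset] != f->ops[i].value)
                  return 0;     /* not for this filter's owner */
          }
          return 1;             /* deliver to the owning process */
      }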

    1. Microkernels

    A simple example of the paradox of preserving the kernel->user boundary
    can be found in any multi-server microkernel. The logical safety of 
    the microkernel comes from providing more fine-grained isolation 
    than the macrokernel. However, the mechanism used by the microkernel 
    is the hardware MMU. This results in many lines
    of code in every server which are dedicated simply to shoving data into
    and pulling it out of an IPC stream. 

    The 'safety/isolation' argument of the microkernel seeks to isolate code,
    which, if taken to its logical conclusion, leads to a 'layering' of
    servers, where every server does only its one simple job. Take for example
    the standard filesystem case. The 'proper' microkernel mechanism for
    handling disks is to use a layered approach where the 'raw' disk driver
    exports a device which is opened by a 'partition' server, which then
    exports partition devices which can be talked to by 'filesystem' servers
    (or other partition servers, operating as 'sub-partitions'). This is
    a uniform model something like:

    <raw disk> -> disk driver -> N * <disk device>

    <disk device> -> partition server  -> N * <disk device>
    <disk device> -> filesystem server -> filesystem
    <disk device> -> database server   -> database

    Which (being repetitive for the sake of clarity) would often look
    like this:

    <raw disk> -> disk driver -> partition server -> filesystem server
                                                  -> filesystem server -> 
                                                            database server
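
    As a guess at what that uniform model could look like as an
    interface (the names are illustrative, not taken from any particular
    microkernel), every layer exports the same <disk device> shape, so
    layers stack freely:

      #include <stdint.h>

      /* The one interface every layer exports and consumes. */
      struct block_device {
          void    *state;        /* layer-private data            */
          uint64_t n_blocks;     /* size of this (virtual) device */
          int (*read_block)(void *state, uint64_t block, void *buf);
          int (*write_block)(void *state, uint64_t block,
                             const void *buf);
      };

      /* A partition server, filesystem server, or database server
       * each takes a struct block_device below it; partition servers
       * additionally re-export the same interface above them. */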

    However, microkernels themselves end up repeatedly violating their safety
    arguments for speed. For example:

    Mach: pulled the entire disk driver and filesystem codeblock into the
          microkernel itself
    VSTa: pulls the partitioning code into the disk driver

    2. How this applies to TUNES

    To directly respond to a passage from long ago on the list:

    >>>> Apart for microkernels (that completely suck), I completely agree.
    >> 
    >>>   Well, can you explain WHY a microkernel sucks?  
    >> 
    >> He seems to think they're slow.  I'm just listening, but I 
    >> haven't heard why yet.
    > 
    >  On a single processor machine sending messages to itself, yes it may be
    >  slow (although in my experience they haven't been), but on a 
    >  multiprocessor machine (or networked) they gain speed since 
    >  processing can proceed in parallel.

    Given that paragraph, I'll pose a question and then answer it.

    "When should we be incurring the IPC overhead?"

    The answer is: ONLY when it's going to give us a speed advantage.
    If we trust the above paragraph as law (for the sake of discussion),
    then the answer is that we should only incur the IPC overhead when we
    are operating on a multiprocessor machine, AND when the two halves
    of the communication are not in lockstep, AND when the user->user IPC,
    plus the hardware context switch overhead, plus the user->user return
    IPC, amount to only a small percentage of the probable 'work time' of
    the other side.

    Which means that we need to abstract out the multi-server/IPC boundary,
    and allow the translator to decide whether the generated code is
    structured into a macrokernel (i.e. no IPC, no user->user space boundary)
    or a microkernel (i.e. full serialized IPC, user->user space boundary)
    physical organization.
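
    Written down as a predicate, the decision the translator would have
    to make might look like the sketch below (the cost fields and the
    5% threshold are invented for illustration):

      #include <stdbool.h>
      #include <stdint.h>

      struct placement_info {
          bool     multiprocessor;  /* more than one CPU available?   */
          bool     lockstep;        /* caller blocks on the reply?    */
          uint64_t ipc_cost_cycles; /* user->user IPC + context switch
                                       + user->user return IPC        */
          uint64_t work_cycles;     /* probable 'work time' of callee */
      };

      /* Incur the IPC overhead only when it buys parallel speedup. */
      static bool worth_using_ipc(const struct placement_info *p)
      {
          /* round trip must be under ~5% of the callee's work time */
          return p->multiprocessor
              && !p->lockstep
              && p->ipc_cost_cycles * 20 < p->work_cycles;
      }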

    In simple cases like the 'partition server', I think it's pretty clear
    that the simple math required to redirect a disk access from the relative
    <partition 1 + 1000> to the absolute <block 5000> (i.e. assuming the
    start of partition 1 at block 4000) will just be inlined into the same
    address space as the disk server, much as VSTa has done manually (see
    the sketch after this list). However, if we allow the translator to do
    it, we win in many ways:
     - we guarantee safety for that code
     - we allow easy plug-in of different implementations of the same
        block of code. This is because they will fit into the layered
        'model', but they don't easily fit into VSTa's or Mach's 
        optimized 'fast path'.
     - we can put layers of indirection anywhere, while only incurring the
       computational overhead of that layer... whereas in a microkernel, most
       of the time it's the 'structural' overhead of the layer which is
       not acceptable.
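
    Here is that <partition 1 + 1000> -> <block 5000> case as a sketch
    (with hypothetical names). All the partition layer really does per
    request is one addition and a bounds check, which is exactly the
    kind of code a boundary-aware translator can inline straight into
    the disk server's address space instead of paying an IPC round trip:

      #include <stdint.h>

      struct partition {
          uint64_t start;    /* e.g. 4000: first absolute block  */
          uint64_t length;   /* number of blocks in this part    */
      };

      /* Relative block 1000 of a partition starting at block 4000
       * becomes absolute block 5000. Returns -1 if out of range. */
      static int64_t to_absolute_block(const struct partition *p,
                                       uint64_t rel)
      {
          if (rel >= p->length)
              return -1;
          return (int64_t)(p->start + rel);
      }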

 B. Translator
  
    I've come up with many of the exact same thoughts as are expressed in the
    'Meta-translator' documentation. Some of my justifications for the 
    translator concept have been a little more practical:

    1. Allow cross boundary (kernel->user, function call, net IPC) optimization

      I talked about improving both speed and safety by optimizing across
      the kernel->user (or user->user) space boundary above in the 'Kernel'
      section. However, this occurs just as much across units as small as
      a function.

      There was a compiler for MIPS, which I'm not entirely familiar
      with, which I understand would perform cross-function 'whole program'
      optimization to give a net speed gain. However, as software is
      layered into shared libraries, dynamically loaded code, or polymorphic
      code, the ability of a 'compile time' system to perform these
      optimizations is removed.

      As I'm sure you're all aware, SELF demonstrated that these
      optimizations can be performed at run-time. However, SELF found that
      a single execution of a given program context was a good indication
      of the types required for other invocations of the same context. In
      other words, these kinds of optimizations remain largely static in the
      case of static bindings. I believe that much of the reason the SELF
      environment is so heavy to execute (large memory requirements) is that
      binding within the environment is done in a 'language is the system
      image' style like Smalltalk. If a 'strictly' layered approach were
      taken, where each layer talked only to the layer below, I believe that
      a SELF-like approach would have much lighter memory requirements, and
      have far fewer 'polymorphic' relationships to generate dynamic code
      for. (I'm not advocating the SELF VM as the basis for the meta-vm;
      however, it does have many of the properties required, as I'm sure
      you are all also well aware from the Merlin/TUNES overlap.)
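
      A rough sketch of the kind of run-time specialization SELF
      pioneered, written here in C purely for illustration: a call site
      remembers the receiver type it saw last time, and as long as the
      same type keeps showing up (which, per the above, it usually
      does), dispatch is one compare instead of a full lookup. A
      translator could go further and inline the target outright.

        #include <stddef.h>

        typedef struct object { const struct type *type; } object;
        struct type { void (*draw)(object *self); };

        /* One cache per call site: the last type seen and the
         * method that was resolved for it. */
        struct call_site_cache {
            const struct type *seen_type;
            void (*target)(object *self);
        };

        static void call_draw(struct call_site_cache *c, object *obj)
        {
            if (c->seen_type != obj->type) {    /* miss: rebind    */
                c->seen_type = obj->type;
                c->target    = obj->type->draw; /* full lookup     */
            }
            c->target(obj);                     /* fast path       */
        }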

    2. Financial advantages for using translation
   
      Current x86 CPUs are constantly performing the work of translating 
      x86 instructions into simpler RISC ops. Furthermore, most modern
      processors use quite a bit of room on the ASIC for blocks and routing
      of sections which serve only to maintain the 'serial' nature of the
      instruction stream while exploiting instruction level parallelism.

      I would like to remove the work that the hardware is repetitively
      doing to schedule instructions and retire them in sequential order,
      opting instead to keep a multi-level cache of translated code
      and to allow the compiler to manage the hardware's internal state.
      In a sense, I'd like to have something closer to a 'pure VLIW' piece
      of hardware, with a software translation and caching scheme to run
      non-native instructions on the low-level hardware.
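
      A sketch of the 'multi-level cache of translated code' idea
      (everything here is invented for illustration; a real system
      would also manage eviction and multiple cache levels): the first
      time a block of non-native instructions runs, it is translated
      and cached, and from then on the hardware only executes native
      code.

        #include <stddef.h>
        #include <stdint.h>

        typedef void (*native_block)(void);

        struct xlate_entry {
            uint64_t     guest_pc;  /* address in non-native program */
            native_block native;    /* cached translated code        */
        };

        #define CACHE_SIZE 1024
        static struct xlate_entry cache[CACHE_SIZE];

        /* The translator itself lives elsewhere. */
        extern native_block translate(uint64_t guest_pc);

        static void run_block(uint64_t guest_pc)
        {
            struct xlate_entry *e = &cache[guest_pc % CACHE_SIZE];
            if (e->guest_pc != guest_pc || e->native == NULL) {
                e->guest_pc = guest_pc;
                e->native   = translate(guest_pc);  /* slow path   */
            }
            e->native();                            /* cached path */
        }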

      Even if the translation scheme performs less well than modern
      'translation in hardware' processors, there are other advantages to
      be gained, namely:

       - cross-function optimization can better utilize registers, and avoid
         nested conditionals which are incredibly bad for pipeline performance.
       - currently 'binary compatibility' demands an unwarranted price premium
         in hardware (i.e. intel chips are too expensive)
       - the ability to choose underlying hardware based on performance needs
         will increase. (if you need much better floating point performance,
         get a chip with more or faster floating point units)
    
      I really believe the biggest argument in support of the translator is
      that economics dictate that hardware needs to be cheaper. As the number
      of users approaches infinity, the per-unit cost of software approaches
      zero, while the per-unit cost of hardware approaches its production
      cost (including manufacturing, packaging, testing, and shipping).
      The translator (IMO) will allow cheaper hardware to be
      built with comparable performance, even if only because it will
      do away with the monopoly software momentum of
      'binary instruction sets'.

 C. Goals

    I particularly like this summary of the possible goals I found in the
    archives:

    > 1. Goals
    > --------
    > I'm still not sure about the long term goals of the (tunes) 
    > project. What do we really want to do?
    >         - just have a bit of fun writing YET ANOTHER "O/S"
    >                 (however smart, cute, fast, etc)?
    >         - make a really significant contribution to "o/s" design?
    >         - kick Mickeysoft up the a*** by writing such a hot-shit system
    >                 with universal application that everyone leaps on the
    >                 bandwagon and uses it extensively on their PCs 
    >                 (dream on)...
    >         - slavishly pay tribute to some "object-oriented" paradigm 
    >                 simply because it seems like a good idea..
    >         - etc?

    I can't speak for everyone, and this is a grave oversimplification.
    However, for me the core idea of a Tunes-like system is
    "administration-less data organization and visualization". This is
    apparent in the original 'Tunes CD database case study', and in
    most discussions about Tunes.

    Tunes could just as likely live within another OS as it could be 
    another OS... especially for the purpose of market penetration...

II. Logical

 A. Translator

    1. Providing a flat space for 'self-describing codeblocks' to accumulate
    
    That probably sounds confusing; I don't have a better way to describe
    it yet. Hopefully someone can help after reading this section.

    The discussions I've seen on the meta-translator don't emphasize the
    idea of 'establishing uniqueness' and then 'layering information' on
    top nearly as much as my own ideas do. I have admitted to myself that
    we (in the computer community) clearly have not solved the translation
    problem, and it's possible that it will never be truly solved. 
   
    Instead of focusing on solving the translation problem up front, my
    interest has instead been focused on creating what I think of as a
    rigid 'super-typed' system for describing blocks to the system. The
    translation system itself (as I envision it) doesn't actually know
    the right way to translate things, but instead merely provides the
    lowest-common-denominator framework for storage of blocks (data and code)
    so that externally written translators can do useful translations.

    Imagine this simplified case: current-day compiling of a C program.
    Currently, there is no way to uniformly describe to a system a block of
    C code such that it can plug in different 'compilers' and produce
    output data. Furthermore, that output data is not structured in a way
    in which the system can do anything with it. Instead, systems up to
    this point largely rely on 'ad-hoc' organizations which they create
    in the 'single hierarchy filesystem'.

    Now imagine that the _only_ change we make is to rigidly specify the
    parameters required to compile a C program into a workable solution.
    The 'compiler' might not be a compiler at all, but an interpreter; or
    it might be a compiler which produces an intermediate form of the program
    which would need an interpreter to run, but which was specified in a
    form such that the system as a whole _knew_ what interpreter it needed
    to run, because it identified itself in a robust way. Along with this,
    the program itself would need to be changed slightly to be able to
    access whatever datafiles were part of it out of this 'rigid' package
    which had been created, instead of the ad-hoc organization of a
    standard filesystem.

    Looking back, that wasn't the best description I've ever given, but it'll
    do for now.
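
    To make it slightly more concrete, a 'rigid' description of that C
    program package might look something like the struct below. The
    field names and the idea of UUID-tagged 'forms' are my own
    illustration, not an existing TUNES interface; the point is only
    that the package states, in a machine-readable way, what form it is
    in and what it needs to reach a runnable form:

      /* An opaque, globally unique reference to a stored block. */
      struct codeblock_ref {
          const char *uuid;
      };

      struct c_program_package {
          const char *source_form_uuid;    /* "ISO C source", by id  */
          struct codeblock_ref sources[8]; /* the .c blocks          */
          struct codeblock_ref headers[8]; /* the .h blocks          */
          struct codeblock_ref data[8];    /* datafiles reached via
                                              the package, not via an
                                              ad-hoc filesystem path */
          const char *wanted_form_uuid;    /* e.g. native code, or
                                              some interpreter's
                                              bytecode form          */
      };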

    If we didn't have any concept of what the appropriate 'high level language'
    was, the above system would allow us to create independent multi-level
    translation tools which could automatically be plugged into the system,
    and which the system could automatically utilize to 'run' a program
    in a given form.

    In a nutshell, it's really a simple source to source translation
    system where the system actually understands what 'form' each level is
    in, and has enough information to automatically run a program to get
    to the next level. Just like a simple 'unit conversion system', it
    merely knows it has 'miles' on the left side, and it wants 'meters'
    on the right, so it has to apply "miles to feet", "feet to inches",
    "inches to centimeters" and "centimeters to meters".

    I think significant progress towards the idea of Tunes could be made
    with such a system, even though there would be no single HLL specified.

 B. Aspect oriented interaction/optimization

    Tunes strikes me as a concept where functional units should perform work
    based on logical constructs, not on optimization details. Optimization
    details should really be figured out orthogonally to the functional blocks.

    If you have not already, check out 'Aspect Oriented Programming' at
    
    http://www.parc.xerox.com/spl/projects/aop/

    These are some very interesting, if a bit hard to grok, ideas.

    The ideas I've presented above, where optimization would occur across
    boundaries, already start to incorporate Aspects at the most basic
    level: namely, that the sequence of source instructions doesn't
    necessarily translate to a linear sequence of machine instructions,
    but instead serves as a description of the work to be done.

 C. Removal of 'ad-hoc' organization / Metadata

    I've read this section a few times, and I'm not very happy with it, but
    through discussion I'm convinced that what I'm trying to say will
    eventually come out.

    This is speaking more to the 'data management' aspects of Tunes.

    Current systems rely on 'ad-hoc' organizations which are created (much
    of the time) in the 'traditional single-hierarchy filesystem'. That is,
    applications give ad-hoc meanings to the hierarchy itself. Worse yet,
    they can only give it one meaning (because there is only one hierarchy),
    so when they need another meaning, they are forced to bury it in a
    proprietary data format. Neither meaning is self-identifying (NOTE: that
    is very different from self-describing).

    I suppose this idea could be described as 'providing more manageable
    ad-hoc' just as much as it can be described as 'removal of ad-hoc'.
  
    However, the idea is to allow entities (code or data blocks) to be
    identified only as 'unique' in an absolute sense, and then to layer
    on top of them whatever meanings accumulate over time. In my opinion, the
    important characteristic is that every chunk of data is self-identifying.
 
    That is, there is little confusion about whether a file is a codeblock or
    a jpeg file, etc. 

    Analagously, I should be able to ask the system for data 'about' a
    piece of data. It should be transparent where this data came from, and
    whether it was derived from the data itself by a codeblock right then,
    or whether it was stored as an attached field on the record. There should
    be no confusion about what property I'm asking about (use some kind of
    UUID/GUID).

    For example, there is a jpeg file, and some agent wants to know its
    width and height. It should be trivial for the system to give back its
    width and height (in pixels), whether it has to run code to do so or
    not. If requesting the width and height of images is a common operation,
    then the system may choose to store (i.e. cache) that data, but the
    requester just asks for it.
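
    A sketch of what asking for that property might look like, with
    every name below hypothetical: the requester names the property by
    a unique id, and never learns whether the answer was a stored field
    or was derived by running a codeblock on the spot.

      #include <stdint.h>

      /* Stand-in for some agreed-upon GUID for 'width in pixels'. */
      #define PROP_WIDTH_IN_PIXELS "width-in-pixels-guid"

      struct property_value { int64_t as_int; };

      /* Provided by the system: returns a cached attribute or runs
       * the deriving codeblock; the caller cannot tell which, and
       * the system may cache the result if the request is common. */
      extern int get_property(const char *entity_uuid,
                              const char *property_uuid,
                              struct property_value *out);

      static int64_t jpeg_width(const char *jpeg_uuid)
      {
          struct property_value v;
          if (get_property(jpeg_uuid, PROP_WIDTH_IN_PIXELS, &v) != 0)
              return -1;    /* property unavailable */
          return v.as_int;
      }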

    Converters will inevitably need to be made, either because two people
    specify different IDs for the same concept (i.e. width in pixels), or
    because two people expect different things for the same ID (i.e. 
    one requester context assumed that it was width in screenspace, while
    another assumed it was width in pixel space of the file).


 D. Software Correctness by statistics, not by provability

    I have an idea about validating correctness of a given codeblock by
    sampling the huge user base over which the codeblock runs, instead of
    trying to 'micro-validate' that codeblock. 

    This can be likened to the ideas I've talked about above: in
    this case, I'm defining 'software compatibility' as an entity, and
    layering the 'metadata' about what is compatible on top as that
    data collects, instead of trying to pre-ordain compatibility by
    having some developer arbitrarily choose a version number and test
    a particular version. Get rid of the 'ad-hoc' version number, and
    instead allow an expressive way to speak of the compatibility of
    different 'guaranteed unique' builds of software.

    For example, if a new implementation of a given codeblock
    is released, it might be released
    in 'test' mode, where the previous implementation would still be run
    when the service was required; however, in spare CPU cycles, the
    machine would run the new implementation with the same data, and
    verify that (at whatever level it was supposed to) it works as the
    old implementation did. This could happen on all machines, and the
    results tabulated by a server and redistributed.
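
    A sketch of 'test' mode (hypothetical names throughout): the old
    implementation still answers the request, and in spare cycles the
    new implementation is run on the same input and its result compared.
    Mismatch counts are what would be tabulated and sent upstream.

      #include <stdbool.h>
      #include <stddef.h>
      #include <string.h>

      struct result { char bytes[256]; size_t len; };

      typedef void (*impl_fn)(const void *in, size_t len,
                              struct result *out);

      struct shadow_test {
          impl_fn old_impl;          /* answer actually used        */
          impl_fn new_impl;          /* candidate under test        */
          unsigned long runs;
          unsigned long mismatches;  /* reported to a server later  */
      };

      extern bool spare_cpu_available(void);  /* hypothetical helper */

      static void serve(struct shadow_test *t, const void *in,
                        size_t len, struct result *out)
      {
          struct result candidate;

          t->old_impl(in, len, out);        /* trusted answer        */

          if (spare_cpu_available()) {      /* only in spare cycles  */
              t->new_impl(in, len, &candidate);
              t->runs++;
              if (candidate.len != out->len ||
                  memcmp(candidate.bytes, out->bytes, out->len) != 0)
                  t->mismatches++;
          }
      }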

    In some cases 'test' mode wouldn't work very well, because the new block
    is supposed to return different data. However, it may be enough to
    just run with the new block and validate that software 'works'
    correctly.

    In fact, it may be enough merely to have the 'guru' users submit
    compatibility information, and have that transparently trickle down
    to the normal end users, utilizing the fact that the system will have
    'cutting edge' users as well as 'non-skilled' simple users.

    Regardless of the mechanism for determining this compatibility 
    information, it (IMO) is much more powerful to derive this information
    from the real-world performance of a codeblock than it is to have
    a developer choose to 'intend' it to be 'mostly the same as version 1.4'.
    
-- 
David Jeske (N9LCA) + http://www.chat.net/~jeske/ + jeske@chat.net