Tunes LLL/i386

Francois-Rene Rideau rideau@ens.fr
Thu, 4 Apr 1996 20:14:31 +0200 (MET DST)


>>    Anyway, we shall try to make most of the code independent from whatever
>> hack we use here, by systematically using access macros,
>> both for reads and for writes.
>
> Yes, and preferably find a means to make it more of a mid-level
> construct, being a pointer to some object with a static structure stuck in
> front, as well as run-time, thread-specific/unique data/pointers.
> Presumably, the LLL compiler would use a tiny part of this (the static
> part), but HLL would be able to extend it, obviously with words defined
> for the purpose.  Sound good?
   Quite good.
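
   To make that concrete, here is a minimal C sketch of what such
access macros might look like; the names CELL_READ/CELL_WRITE and the
idea of later redefining the write macro are my own assumptions, not
something we've settled on:

    /* Hypothetical sketch: every read or write of a GC-visible cell
     * goes through a macro, so the representation (and any barrier a
     * future GC may need) can change without touching client code. */
    #include <stdint.h>

    typedef uintptr_t cell;                /* one machine word */

    /* trivial versions: plain memory access, no barrier */
    #define CELL_READ(addr)        (*(cell *)(addr))
    #define CELL_WRITE(addr, val)  (*(cell *)(addr) = (cell)(val))

    /* a later GC could redefine CELL_WRITE to record the store, e.g.
     * (remember_store((cell *)(addr)), *(cell *)(addr) = (cell)(val)),
     * where remember_store is a hypothetical barrier routine. */

The HLL could then extend the same accessors with whatever run-time or
thread-specific data it wants stuck in front of the object.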


>>>>[ESP as code pointer]
>>    ESP is fast. ESP as code means RET-based threading, [...]
> Ah, I see. I'll have to think about that a bit. Interesting idea, with
> "complicated" implications. As for modular coding: I'm still trying to
> figure out what kind of architecture to use, whether to go with a
> straight compiler, or use a threaded interpreter, and write a compiler
> afterwards.
   Also note that RET-based threading means we must either disable
interrupts, or copy/compile one-shot threaded code on the stack,
which is the main disadvantage. However, this copying could be done
in chunks, which would reduce some other kind of overhead.
   Again, all these tricks should be hidden, wherever possible.


>>>>[shared stack/heap register]
> Sounds like you're talking about backwards growing heaps.

   Or an upwards-growing stack ;) That is, both grow in the same direction,
as they share the same memory space.
(note: another mixed stack/heap can grow down from the other end...)

> So you're saying tangle the stack and the data together?
   Yes. At least, for usual HLL-linked code.
Not for real-time and/or interrupt-handling stuff.

> Won't that make the GC even harder to do?
   No, because there will be only one stack/heap space to look up
instead of two (well, there can be other stacks and heaps elsewhere, too).

> I would have thought separation of data and stack would actually help.
   What would help the GC (but not the cache)
is to separate a data heap/stack from a pointer heap/stack.

> What you're describing is a lot like the GNU obstacks idea.
   In this case, rather like heap-based continuations
in the spirit of many ML and Scheme implementations
(there are obstacks further down).

> While we could certainly do that, I would think it would get extraordinarily
> complicated to mix a regular stack in with it. I'll have to think about
> that one a while.
   Well, many Scheme and *ML implementations have done it for a long time.



>> [Garbage Collection]
>>> It sounds like you're talking about memory objects
>>> with various attributes, in addition to size.
>>> A flags cell might take care of this?
>>    Yes, we will have to. But first we must have a basis to work with,
>> that is, a heap-less FORTH system.
>
> 8( That makes things a lot harder... I'd rather you had said,
> "replacable-heap Forth-like system." I can't do much without some kind of
> heap. On the other hand, I _can_ write the rest of the system so that the
> GC-less heap code can be replaced. (I think.)
   Yeah, what I meant is such a "replaceable-heap Forth-like system",
with a trivial (non-reclaiming) GC implementation plugged in to begin with.


>>>>[32-bit aligned pointers]
> Just makes strings harder. :-(
   Also harder is the requirement to determine the beginning of a chunk
from a pointer, so either we have only chunk-aligned pointers,
or it's a mess (or we have a way to distinguish one from the other, etc).

> Again, I'll have to think about it, which
> I can't do very well right now. (I have 80+ e-mails to read/answer. I'm
> on too many mailing lists. Maybe I'll drop Roland's OS list... ;)
   Yeah, I know that, too. Two weeks ago,
there were several tens of messages per day on gclist@iecc.com.
I hope Roland will find his way, but he's not off to a good start, I think:
he's going to have neither hardware nor software protection,
just like MSDOS or MacOS. Only "trust us serious developers" protection.
Ouch.


>>> Well, I'll try and implement without GC, to get things working. Then,
>>> once some of this stuff is up and running, we'll have to add it, since I
>>> don't know enough about how it works, and would rather debug it in a
>>> working environment, anyway.
>>    Exactly. However, be prepared to modify a lot of things
>> when GC arrives :(
>
> That's the problem, I don't quite know which things will need to be 
> modified. Otherwise, it wouldn't be a problem. Maybe I should wait, study 
> how GC is done, and then write the GC into the heap code from the start. 
> Then I'll know exactly how it works, and therefore know what it will break.
   As for non-GC-specific code, the most important constraints are these:
* At safe points, we must be able to find all the pointers of any given
 object, interpret them so as to find the pointed-to objects,
 and correct them if a pointed-to object is moved.
* There must always be a safe point before any dynamic resource allocation
 that may exhaust some resource, so we get a chance to trigger a GC first.
* Some GC implementations will require read or write barriers (that is,
 code executed any time certain kinds of variables, or all of them,
 are read or written).
 Many GCs will have trivial such barriers (i.e. no instruction at all),
 or hardware barriers (done by the MMU at the page level);
 but if we want to support any GC in a modular way,
 so we can speed-test them, we mustn't rely on that,
 so we should always access GC'ed memory through macros.
* If we want some checkpoint-based resilient persistence,
 we must control object modifications (i.e. have a write barrier),
 so we always have a consistent version of objects in persistent memory.
* All this GC stuff may take into account additional attributes,
 like being read-only, long-lived, real-time, in physical memory,
 of a given type, having references to external objects,
 needing a finalizer, etc.
 There must hence be some way to select which version of
 the GC-related macros to use according to the attributes of an object,
 even though a simple GC implementation may disregard this information.

I hope I haven't forgotten any constraint.
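
   To make the write-barrier and safe-point constraints concrete, here
is a minimal C sketch; every name in it (gc_collect, heap_free_bytes,
heap_bump, GC_WRITE) is hypothetical, and a trivial first GC would just
leave the barrier empty:

    #include <stddef.h>

    extern void   gc_collect(void);        /* hypothetical collector entry */
    extern size_t heap_free_bytes(void);   /* hypothetical */
    extern void  *heap_bump(size_t n);     /* hypothetical bump allocator */

    /* write barrier: a no-op for a trivial GC; a generational GC or a
     * persistent store would record the modified slot here */
    #define GC_WRITE(slot, val)  (*(slot) = (val))

    /* every allocation that may exhaust the heap sits behind a safe
     * point, so the GC gets a chance to run first */
    static inline void *gc_alloc(size_t n)
    {
        if (heap_free_bytes() < n)  /* safe point: all roots findable */
            gc_collect();
        return heap_bump(n);
    }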


>>>> [Guard pages] show how allocating small or large stuff is different [...]
>>> Ok. So we use them for large items, but not for small ones?
>> No. We use guard pages for small items, because the use/trap ratio
>> is such that traps have little performance impact.
>> Large objects will use an explicit bound check.
>
> I'm confused... I would have thought the reverse: large stacks (or
> expected large) would get 4k page granularity, but (with guard pages)
> would be prevented from overflowing into other things. Small stacks
> (known/predicted likely to stay small) would be allocated on a heap
> somewhere and wouldn't grow at all.
>
> Please try and unconfuse me.
   I misunderstood your question:
what I meant is that guard pages can be used as a write barrier
for dynamically extended heaps and stacks
(i.e. when the guard page faults, there is an attempt to grow the area).
Then I meant that in such a growing area,
only small allocations would use the guard-page mechanism,
because the trap/allocation ratio is small, and a 1/n-th of a trap
is cheaper than a software bound check for a large n.
However, large allocations, having a high trap probability,
would rather do the usual software bound checking,
which is cheaper than a trap. A sketch of the distinction follows below.
   Now, surely, as you're talking about which areas should be page-guarded,
of course it's only the large and/or dynamically growing ones !
After all, page-guarding is page-grained, so by the time a small object
reaches the guard page, it'll have overwritten other small objects
(unless you give every small object a full page to live in ;)
Stacks that we've proven won't overflow (or that we trust anyway)
needn't be page-guarded. "Stacks" whose behavior is unpredictable
might be implemented as linked lists of objects.
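
   Here is a rough C sketch of that small/large distinction, assuming a
downward-growing area whose low end is closed by a guard page; the
names (sp, limit, grow_area, SMALL_MAX) are placeholders of mine:

    #include <stddef.h>

    extern char *sp;                    /* current allocation frontier */
    extern char *limit;                 /* top of the guard page */
    extern void grow_area(size_t need); /* hypothetical: commits more pages */

    #define SMALL_MAX 256               /* illustrative threshold */

    static inline void *area_alloc(size_t n)
    {
        if (n <= SMALL_MAX)
            /* small: no check at all; if we ever step onto the guard
             * page, the trap handler grows the area and resumes */
            return sp -= n;
        /* large: an explicit bound check is cheaper than a likely trap */
        if ((size_t)(sp - limit) < n)
            grow_area(n);
        return sp -= n;
    }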


>>> Problem is,
>>> with stacks you can't easily tell how large they possibly will get,
>>> making allocation on a heap risky.
>> That's why the heap dynamically grows,
>> and why GC is here to shrink it when it's too large.
>
> But stacks don't move that well, and when they overflow into other
> objects, it's usually too late to fix them. Now if stacks are grown away
> from other objects, they won't be as likely to break, right?
   In my trick (assuming the heap is growing down),
the stack, being made of *transient* objects, grows down under the
bottom of the heap. Memory is allocated by framing part of the stack
into a heap object. This, of course, is only for low-level programming,
and is how the HLL produces code
(ESP as the main shared stack/heap pointer might be quite good).
Any function that calls a subfunction with possible heap allocation
must frame its stack before calling such an escaping subroutine.
But that's not a performance problem,
as time-critical code portions do not allocate dynamic memory anyway;
unless you're writing some kind of optimal interpreter for graph reduction,
in which case such a low-level convention would make heap allocation
as fast as possible anyway, or you use some separate stack.
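
   A minimal C sketch of that framing trick, with a shared downward
bump pointer standing in for ESP (the header tag and all names here are
hypothetical):

    #include <stddef.h>
    #include <stdint.h>

    typedef uintptr_t cell;
    extern cell *top;             /* shared stack/heap frontier (ESP) */

    /* hypothetical header: word count shifted left, low bit as a tag */
    #define OBJECT_HEADER(nwords)  (((cell)(nwords) << 2) | 1)

    /* push transient cells with *--top = value; then, before calling
     * anything that may allocate, frame them into a real heap object
     * by prepending a header */
    static cell *frame(cell *start)
    {
        size_t nwords = (size_t)(start - top);
        *--top = OBJECT_HEADER(nwords);
        return top;               /* the freshly framed object */
    }

Code that returns before any allocation or safe point just pops its
transient cells and never pays for the framing.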


>>> Stacks implemented with a 4k
>>> granularity, is admittedly, probably wasteful, but safer.
>> They should no doubt be used for real-time and privileged threads.
>>
>>> There are, of
>>> course, times where a 256 byte stack will be enough, but those aren't
>>> common enough, or as predictable.
>>    Threads can still allocate such space if they know what they're doing.
>> Not a high-priority feature, though.
> 
> Which space? Small, fixed size heap stack, or larger granularity stack? 
> Need to know which would be lower priority feature.
   Page-guarded, page-granular stacks/heaps are IMHO a high-priority feature:
such stacks are needed for system interrupt/trap handlers,
and they will be most useful for dynamically grown heaps.
Fixed-size stuff is low-priority, unless you have drastic memory constraints,
and will be implemented on top of the page-granular stuff anyway.
   The page allocator will have an HLL hook for heuristics,
and there will be a possibility to lock stuff in physical memory.
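
   Just to illustrate the mechanism on a POSIX-like host (the real LLL
would of course program the i386 paging hardware directly), a
page-guarded, downward-growing stack could be set up like this:

    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Reserve 'pages' pages and make the lowest one inaccessible, so
     * an overflow faults instead of silently trashing what lies below. */
    static void *alloc_guarded_stack(size_t pages)
    {
        size_t psz = (size_t)sysconf(_SC_PAGESIZE);
        char *base = mmap(NULL, pages * psz, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return NULL;
        mprotect(base, psz, PROT_NONE);    /* guard page at the low end */
        return base + pages * psz;         /* initial top-of-stack */
    }

The fault handler can then either grow the area (remapping the guard
page further down) or treat the overflow as fatal.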
   

>>>>> Also, I think it is probably a good idea to require
>>>>> that stacks be protected
>>>>> from other threads, or assume that other threads will not modify them.
>>>>    Yeah of course, but for the above.
>>> Meaning only on page-granular stacks?
>>    Not *only* such stacks, but there sure should be facilities for
>> such stacks to exist, particularly for privileged and real-time threads.
>
> Hmm. That was really vague. (Both of us.)
   I meant that not only page-granular stacks should exist,
but that page-granular stacks are the high-priority ones.
Now I see I was confused and confusing in that context.
As for more protection (like hiding pages when switching threads),
I don't think it's worth the hassle to support it,
though any low-level programmer may do it explicitly for his
own threads (which can be especially useful for writing protection boxes
and emulators).

> Page based stacks (with guard pages)
> can be protected with the paging hardware, if this were desired.
   And should systematically be, IMHO: little overhead, great gain,
no two cases to check.



>>>>> [problems of efficiency vs possibility of migration]
>>>>    That's a very difficult question,
>>>> all the more when migration is wanted.
>>>> As I see things, all would be kept abstract by the HLL,
>>>> and an interpreter would manage composition of objects at first.
>>>> That is, objects, including "code" would be structures on the heap,
>>>> and contain little code, but for some optimized routines.
>>> Are you talking about source code, or something like intermediate code?
>>> And this would be HLL standard behaviour?
>>    What I'm thinking about would be like intermediate code and/or
>> an interned (preparsed) version of source code,
>> which indeed would be standard HLL behavior.
>> Facilities would of course be provided in the LLL to hand-build
>> and hand-access such constructs, so that the HLL and the LLL
>> live in tight cooperation.
> 
> This sounds like where I was talking about a replaceable parser. In other 
> words, allow the HLL to tell the LLL compiler what words it wants to
> compile directly, without using the LLL parser code, or even FIND.
> Let the HLL create its syntax tree, or whatever, and when it wants to
> compile, call the compiler word with a pointer (which the HLL can obtain)
> and let the LLL compiler do the ->binary work. The HLL can then do
> optimizations with regard to choice of words, and factor, inline or
> whatever to suit.
> This would essentially allow multiple interfaces to be built on top of 
> the same LLL compiler.
   Well, this too. However, I'm convinced that *eventually*,
the HLL should do ->binary directly,
as this would allow it to do better register allocation
and instruction scheduling.
   What I meant is that migration needs objects to be as standalone as
possible, while efficiency needs them to be as inter-optimized as possible.
You can't have both perfect at the same time, so there's a compromise.
My point is that by keeping multiple versions of objects,
you can guarantee the high-level semantics,
while benefitting from low-level optimizations:
when you must migrate an object (that is,
trade some low-level component for another),
you discard whichever versions were too specialized to migrate efficiently,
and transpose (and perhaps respecialize) the highest-level version
that can be.
Hence we discard linked x86 code when migrating to a SPARCstation,
and recompile/interpret whatever higher-level versions of the object
are available.
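
   As a sketch of what "keeping multiple versions" might mean concretely
(in C, with names of my own invention), an object could simply carry a
list of representations, and migration would unlink the ones that are
too machine-specific:

    #include <stddef.h>

    enum version_kind { SRC_TEXT, INTERMEDIATE, NATIVE_X86 };

    struct version {
        enum version_kind kind;
        void *data;                 /* source, interned tree, or code */
        struct version *next;
    };

    struct object {
        struct version *versions;   /* the highest-level one is always kept */
    };

    /* before migrating to a foreign architecture, drop native code;
     * the GC reclaims it, and the target recompiles from what remains */
    void drop_native(struct object *o)
    {
        struct version **vp = &o->versions;
        while (*vp) {
            if ((*vp)->kind == NATIVE_X86)
                *vp = (*vp)->next;
            else
                vp = &(*vp)->next;
        }
    }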


>>> What you're talking about sounds like compiled objects are disposable,
>>> while source/intermediate objects would always be retained.
   Right.

> So a predigested LLL source in some HLL internal format, which can be fed
> into a LLL compiler would be migratable? (See preceding section.)
   This is *one* solution among others. Surely, the LLL might provide it.
And as usual, the HLL will eventually sort things out,
as we learn to develop heuristics from experience we accumulate.


>>>>>[Compiling words in buffers]
>>    Let's do it with special code primitives,
>> with an interface like GNU's obstacks (that we could steal).

> Obstacks is not only written in C, it's also a mess (for C).
> Reimplementing is probably a good idea, sadly enough.
   Well, let's not steal it, then.

> Anyway, I don't get this:
> why would we allocate space for compiled words using Obstacks?
   Because we seldom know how much space a compiled procedure will
take before we compile it; and then, why recompile it from scratch
instead of storing the relocatable result in obstacks that we
just copy into one compact relocated/relocatable object?
   Obstacks are just the buffers you need: they grow,
until you've finished, then they shrink to what you'd have had
if you could know the size beforehand.
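
   Since we'd rather reimplement than steal the GNU code, here is a
minimal C sketch of the obstack idea (grow, then copy into an
exact-size object); a real implementation would chain chunks instead of
calling realloc, but the interface is the same:

    #include <stdlib.h>
    #include <string.h>

    struct growbuf { char *base; size_t len, cap; };

    /* append bytes as the compiler emits them */
    static void grow(struct growbuf *b, const void *data, size_t n)
    {
        if (b->len + n > b->cap) {
            size_t cap = b->cap ? b->cap * 2 : 128;
            if (cap < b->len + n) cap = b->len + n;
            b->base = realloc(b->base, cap);   /* no error check: sketch */
            b->cap = cap;
        }
        memcpy(b->base + b->len, data, n);
        b->len += n;
    }

    /* compilation finished: copy into a compact object of the exact
     * size, and leave the buffer ready for the next word */
    static void *finish(struct growbuf *b, size_t *out_len)
    {
        void *obj = malloc(b->len);
        memcpy(obj, b->base, b->len);
        *out_len = b->len;
        b->len = 0;
        return obj;
    }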

> I wanted to use a standard (albeit GC'ed) heap,
> with allocations being made after the exact size of the compiled object
> is determined.
   That's precisely the point !
Of course, if you can guarantee that compiles can be done
completely before it is time for another safe point,
then you can just grow the code on the (upward-growing) heap,
and frame it when it's finished.

   Question: what approximate time granularity shall we expect between
checkpoints ? half a millisecond ? half of 55 milliseconds ?
And how to enforce it ?

> I was actually
> toying with a heap implementation that allocates by dwords. It doesn't
> seem to be practical, though.
   Why not practical ?
Anyway, at some point in the future, we might have object-type-dependent
heaps, so that the size of an object is known from some tag in the pointer.


>> [Relocation of Code words:
>> PIC can be done through relative and indirect addressing,
>> code relocation with complicated reloc descriptor,
>> or slow code recompilation/discard is needed...]
>>    I think we should first implement relocation,
>> because dynamic linking requires it anyway.
>> Of course, as you suggest elsewhere,
>> we could make LLL source the standard "object file",
>> and support dynamic recompilation rather than relocation,
>> at least for a moment.
>> Having only recompilation and no relocation means
>> that we must know the destination address before we compile,
>> and allocate compiled code size before next safe point.

> Words can be made PIC within themselves. It's addressing other words, and
> subsequently being linked to that is the problem. Now there are only
> two solutions that I can see to that one:
>
> 1. use absolute addressing, and never move anything.
> 2. use some kind of table of pointers. threading would have to be done
> through the pointer table. GC could take advantage of this, however. :)
   Again, pointer tables are relocation info that is forever rechecked
at run time.
So more generally, either we have no relocation info, and can't relocate,
just keep or discard (a discard may be followed by a recompile),
or we do have relocation info, which gets all the more complicated
as we manage different kinds of relocations.
It's always a matter of tradeoff between run-time and GC-time overhead.
   Now, there can be more involved than just overall time spent:
we might have additional requirements on GC response time,
consistency with a resilient persistent data store,
the ability for real-time threads to inspect data during GC,
the ability to prevent or recover from memory overflows, etc.
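
   For the record, solution 2 above might look like this in a C sketch
(names are hypothetical): every word is reached through one table slot,
so the GC can move compiled code by patching a single pointer, at the
price of an extra memory load per call.

    #include <stddef.h>

    typedef void (*word_fn)(void);

    extern word_fn word_table[];     /* maintained by the linker/GC */

    /* threading through the table: runtime "relocation" at each call */
    static inline void call_word(size_t index)
    {
        word_table[index]();
    }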



>>> [Indirect addressing]
>>    The biggest problem about movable code is *return addresses*
>> (and more generally continuations, as a return address is just that).
>> If these are to be relocated, this means some costly behavior,
>> as we must be able to track down the beginning of any return code chunk.
>>    The first way to do it is to have some complicated call/receive
>> convention, so that return addresses always begin a code chunk,
>> or that some indirect or base+offset return address be used.
>>    Another way is to have some way to retrieve the beginning
>> of a chunk of memory, but that means support in the GC algorithm,
>> so either the GC always does it (=cost), or we add a case to it (=cost).
>>    As you see, we never get something for nothing.
>> My hope is that we can eventually test all possible tradeoffs,
>> and choose the best measured system,
>> while allowing people to recompile Tunes with whichever they prefer.
>> Every address space could have its own GC management, too.
>
> I'll have to think about this one a while. Maybe I'll think of something
> brilliant. (It would be nice...)


>>    Again, remember that:
>> * indirect addressing (including relative addr) is just *runtime linking*,
>>  so that to have a fair idea of the tradeoff,
>>  we should measure total link+execution time for both the tricks.
>> * routines that return before any safepoint needn't care about all that.
> 
> True, but we don't want to be recompiling/relinking any place we can 
> avoid it, so a solution that minimizes computations is obviously 
> desirable.
   Yes, but my point is precisely that we can't know which solution
minimizes computations unless we experiment with them all,
which has never been done yet. This is why we must be modular enough
to allow all those tricks to be tried.


>>    Well, why have such restriction [fixed size buffer for compilation] ?
>> Again, something like GNU obstack may be used:
>> you accumulate data on a backward linked list of buffers,
>> and when it's done, it is compacted into a one contiguous buffer.
>
> Because this would mean we'd have to compile and link/relocate in two
> stages, which I was trying to avoid. I was looking for a way to go
> straight from a definition to a binary, which, once complete could be
> copied to a allocated space, and would then work, in place. Preferably
> with minimal overhead/complications. (Nice, clean, fast working compiler,
> which would then be easier to modify/experiment with.)
   That's exactly what obstacks do. The disadvantage of obstacks
is their code bulk; their advantage is that you don't have to care about
growing objects beyond predicted limits, or about surviving safe points.
I think the advantage far outweighs the disadvantage,
all the more when it allows easy escapes to the full power of the HLL,
whereas fixed-size buffers just don't.
   As for efficiency, I remind you that if you guessed the right size,
the obstack behaves just like your fixed-size buffer.
And relocation information may be useful for a moving GC,
or for producing object files anyway, enabling cross-compiling,
deferred linking, etc.
   Really, there's little to gain by going away from relocation, I think.


>>[The LLL and HLL should be mostly independent through good
>> modular programming style, so first focus on having a prototype,
>> without need to bother with eventual enhancements]
> Right. But all this is helping immensely. Once I study GC a little more,
> I should be able to get some coding done, I think. (When I can get some
> time again.)
   Yup.
   As for GC docs, you can find a lot around Henry Baker's
(hbaker@netcom.com); in particular, he has designed a real-time moving GC,
for use in any mission-critical environment !
   See his page, as pointed to by the Tunes pages...

--    ,        	                                ,           _ v    ~  ^  --
-- Fare -- rideau@clipper.ens.fr -- Francois-Rene Rideau -- +)ang-Vu Ban --
--                                      '                   / .          --
Join the TUNES project for a computing system based on computing freedom !
		   TUNES is a Useful, Not Expedient System
WWW page at URL: "http://www.eleves.ens.fr:8080/home/rideau/Tunes/"