Tunes LLL/i386

Nathan Hawkins
Thu, 4 Apr 1996 19:42:21 -0500 (EST)

On Thu, 4 Apr 1996, Francois-Rene Rideau wrote:
> >>>>[ESP as code pointer]
> >>    ESP is fast. ESP as code means RET-based threading, [...]
> > Ah, I see. I'll have to think about that a bit. Interesting idea, with
> > "complicated" implications. As for modular coding: I'm still trying to
> > figure out what kind of architecture to use, whether to go with a
> > straight compiler, or use a threaded interpreter, and write a compiler
> > afterwards.
>    Also please see that RET-based means we either disable interrupts,
> or copy/compile one-shot threaded code on stack, which is the main
> disadvantage. However, this copying could be done by chunks,
> which would reduce some other kind of overhead.
>    Again, all these tricks should be hidden, whereever possible.

Interrupts shouldn't need to be disabled; the x86 will need to switch 
stacks to handle them safely, anyway. As for RET-threading, I'm starting 
to see how this could work.

> >>>>[shared stack/heap register]
> > Sounds like you're talking about backwards growing heaps.
>    Or upwards-growing stack ;) That is, both grow in the same direction,
> as they share the same memory space.
> (note: another mixed stack/heap can grow down from the other end...)

Were you planning on having more than one thread share a heap and do 
this? After some careful thought, I can see a way to make this work, by 
allowing a stack to flow around objects, and let the holes be GC'd later. 
But I can only see this working for an r-stack. A normal data stack 
couldn't easily be made to handle being broken into pieces, whereas 
I've though of a way an r-stack could be hacked to do it. But I don't see 
how threading more than one stack into a heap would work. As for 
upwards-growing stack, you couldn't do it very efficiently on an x86. :-(

> >> [Garbage Collection]
> > 8( That makes things a lot harder... I'd rather you had said,
> > "replacable-heap Forth-like system." I can't do much without some kind of
> > heap. On the other hand, I _can_ write the rest of the system so that the
> > GC-less heap code can be replaced. (I think.)
>    Yeah, what I meant is such a "replaceable-heap Forth-like system",
> with a trivial (non-reclaiming) GC implementation plugged in to begin with.
> >>>>[32-bit aligned pointers]
> > Just makes strings harder. :-(
>    Also harder is the requirement to determine the beginning of a chunk
> from a pointer, so either we have only chunk-aligned pointers,
> or it's a mess (or we have a way to distinguish one from the other, etc).

Depending on how the heap is implemented, it should be possible to figure 
out which chunk a pointer belongs to, by keeping the chunks in something 
like a skip list, and then searching that for nearest lower chunk to the 
pointer in question. It would probably be slow, but it should work. Or 
maybe there are much better ways.

[Roland Nilsson's OS project]
> I hope Roland will find his way, but he's not well started, I think:
> he's going to have neither hardware nor software protection,
> just like MSDOS or MacOS. Only "trust us serious developers" protection.
> Ouch.
Well, we both tried to tell him. Best we can do, probably.

[GC considerations]
>    As for non GC-specific code,
> the most important is that
> * at safe points, we must be able to find all pointers of any given object
>  to interpret them so as to find the pointed object,
>  and to correct them back if the pointed object is moved.
> * there's always a safe point before do any dynamic resource allocation
>  that may exhaust some resource, so we get a chance to trigger a GC before.
> * some GC implementations will require that there be
>  read or write barriers (that is, code executed anytime certain kinds
>  or variables, or any of them, is read or written).
>  Many GC's will have trivial such barriers (i.e. no instruction),
>  or hardware barriers (done by the MMU at the page level);
>  but if we want to support any GC in a modular way,
>  so we can speed-test them, we mustn't rely on that,
>  so always access GC'ed memory through macros.

This part sounds vaguely like Forth CREATE DOES> words, but a bit more 
general or abstract. Is that correct?

> * If we want some checkpoint-based resiliant persistency,
>  we must control object modifications (i.e. have a write barrier),
>  so we always have a consistent version of objects in persistent memory.

That would be difficult to do efficiently in software. Paging access bits 
would do it, but with the usual granularity problem. In this case, it 
might be worth it.

> * All this GC stuff may take into account additional attributes,
>  like being read-only, long-lived, real-time, in physical memory,
>  of given type, having references to external objects,
>  needing a finalizer, etc.
>  There must hence be some way to determine which version of
>  all the GC-related macros according to the attributes of an object,
>  even though a simple GC implementation may disregard this information.
> I hope I haven't forgotten any constraint.

Would it be possible to keep track of those attributes in something very
stable, like static bit fields? It would be simpler, although probably a 

> > Please try and unconfuse me.
>    I misunderstood your question:
> what I meant is that guard pages can be used as a write barrier
> for dynamically extended heaps and stacks
> (i.e. when the guard page faults, there is an attempt to grow the area).
> Then I meant that in such a growing area,
> only small allocations would use the guard page mechanism,
> because the trap/allocation ratio is small, and an n-th of trap
> is cheaper than a software bound check for a large n.
> however, large allocations with high trap probability
> would rather do usual software bound checking,
> which is cheaper than a trap.
>    Now, surely, as you're talking about what areas should be page-guarded,
> of course it's only the large and/or dynamically growing ones !
> After all page-guarding is page-grained, so by the time a small object
> reaches the guard page, it'll have overwritten other small objects,
> (unless you give every small object a full page to live into ;)
> Stack that we've proven won't overflow (or that we trust anyway)
> needn't be page-guarded. "Stacks" whose behavior is unpredictable
> might be implemented as linked lists of objects.

Stack that we've proven won't overflow seems likely to be a small 
catagory. Linked lists of objects would more like a heap than a stack, 
although you can do a heap as a stack, you can't normally break sections 
of stack apart without imposing unpleasant constraints on the code. 
Languages like C do this by placing local variables there, and thereby 
make it difficult to handle variable numbers of parameters or return values.

>    I my trick (assuming the heap is growing down),
> the stack, being made of *transient* objects, grows down under the
> bottom of the heap. Memory is allocated by framing part of the stack
> into a heap object. This of course, is only for low-level programming,
> and is how the HLL produces code
> (ESP as the main shared stack/heap pointer might be quite good).
> Any function that calls a subfunction with possible heap allocation
> must frame its stack before to call such escaping subroutine.
> But that's not any performance problem,
> as time-critical code portions do not allocate dynamic memory anyway,
> unless you're writing some kind of optimal interpreter for graph reduction,
> in which case such a low-level convention would make heap allocation
> as fast as possible anyway, or you use some separate stack.

So far, I'm getting a picture of three separate stacks: data stack, 
r-stack, and a heap stack, which we may be sharing with other threads. I 
don't see how we can interleave all three without some problems. The 
r-stack and the heap can be broken in pieces fairly well, since we can 
assume that a heap allocator would be called, and hence it would be able 
to break the r-stack around the obstack-like heap objects.  Data stack, on 
the other hand, I just can't see working like that.

> >>>>> Also, I think it is probably a good idea to require
> >>>>> that stacks be protected
> >>>>> from other threads, or assume that other threads will not modify them.
> >>>>    Yeah of course, but for the above.
> > Page based stacks (with guard pages)
> > can be protected with the paging hardware, if this were desired.
>    And should systematically be, IMHO: little overhead, great gain,
> no two cases to check.

In this case I meant that for some stacks, especially in critical and 
volatile parts of the system such as IRQ and page fault stacks could
be protected from other threads with the page permission bits. (If we can 
do guard pages on such stacks, we could protect the whole stack.) Most 
things shouldn't need this, but I'm a little paranoid about IRQ 
handlers... I've had too many bad experiences with them, not to be.

> >>>>> [problems of efficiency vs possibility of migration]
> > This sounds like where I was talking about a replacable parser. In other 
> > words, allow the HLL to tell the LLL compiler what words it wants to
> > compile directly, without using the LLL parser code, or even FIND.
> > Let the HLL create its syntax tree, or whatever, and when it wants to
> > compile, call the compiler word with a pointer (which the HLL can obtain)
> > and let the LLL compiler do the ->binary work. The HLL can then do
> > optimizations with regard to choice of words, and factor, inline or
> > whatever to suit.
> > This would essentially allow multiple interfaces to be built on top of 
> > the same LLL compiler.
>    Well, this too. However, I'm convinced that *eventually*,
> the HLL should do ->binary directly,
> as this would allow it to do better register allocation
> and instruction scheduling.

Well, in between then and now, there'll be lots of interesting things to 
explore. I'll probably stick with LLL, but maybe around that time, I'll 
start doing another implementation for some other processor. :-)

>    What I meant is that migration needs objects to be as standalone as
> possible, while efficiency needs them to be as inter-optimized as possible.
> You can't have both perfect at the same time, so there's a compromise.
> My point is that by keeping multiple versions of objects,
> you can guarantee the high-level semantics,
> while benefitting from low-level optimizations:
> when you must migrate an object (that is,
> trade some low-level component for another),
> you discard whichever versions were too specialized to migrate efficiently,
> and transpose (and perhaps respecialize) the highest-level version
> that can be.
> Hence discard linked x86 code when migrating to a SPARCstation,
> and recompile/interpret whatever higher-level versions of the object
> are available

Doing that will require that the predigested version of the objects be as 
small as possible, and possibly marked for swapping. (If we can get the 
GC to work without recompiling things.)

> >>>>>[Compiling words in buffers]
> >>    Let's do it with special code primitives,
> >> with an interface like GNU's obstacks (that we could steal).
> > Obstacks is not only written in C, it's also a mess (for C).
> > Reimplementing is probably a good idea, sadly enough.
>    Well, let's not steal it, then.
> > Anyway, I don't get this:
> > why would we allocate space for compiled words using Obstacks?
>    Because we seldom know how much space a compiled procedure will
> take before we compile it, and then, why recompile it from scratch
> instead of storing the relocatable result in obstacks that we
> just copy into a one compact relocated/able object.
>    Obstacks are just the buffers you need: they grow,
> until you've finished, then they shrink to what you'd have had
> if you could know the size beforehand.

That seems to be a worthwhile heap design, as well as one which fits 
better with the Forth model of the universe than malloc. Kind of like a 
better HERE arrangement, with all the glitches taken out.

As for the relocations, those will be tedious. For threaded code, maybe 
we can avoid them entirely? (Could the GC be able to walk the threads? This 
would mainly require the ability to identfy threaded words and literals 
in them.) Relocation info would probably be similar enough as makes no 
difference. This then would mean working infrastructure for the ->binary 

[exactly sized allocations]
>    That's precisely the point !
> Of course, if you can guarantee that compiles can be done
> completely before it is time for another safe point,
> then you can just grow the code on the (upward-growing) heap,
> and frame it when it's finished.

Yeah, but it would need to be downward growing, and if the guess as to 
the size is wrong, it would could copy the chunk downwards in memory.

>    Question: what approximate time granularity shall we expect between
> checkpoints ? half a millisecond ? half of 55 milliseconds ?
> And how to enforce it ?

I think we'll need to have groups of thread that share the same heaps 
cooperative, and preempt between the groups. This would be interesting, 
but might make that question less vital, as well as give more freedom 
about when to cooperate.

> > I was actually
> > toying with a heap implementation that allocates by dwords. It doesn't
> > seem to be practical, though.
>    Why not practical ?
> Anyway, in some future, we might have object-type dependent heaps,
> so the size of an object is known by some tag in the pointer.

That implementation was based on a static sized free list, and no GC. Bad 

> >> [Relocation of Code words:
> >> PIC can be done through relative and indirect addressing,
> >> code relocation with complicated reloc descriptor,
> >> or slow code recompilation/discard is needed...]
> >>    I think we should first implement relocation,
> >> because dynamic linking requires it anyway.
> >> Of course, as you suggest elsewhere,
> >> we could make LLL source the standard "object file",
> >> and support dynamic recompilation rather than relocation,
> >> at least for a moment.
> >> Having only recompilation and no relocation means
> >> that we must know the destination address before we compile,
> >> and allocate compiled code size before next safe point.

Better to do some hybrid relocation/PIC. Avoid recompilation if possible.

> > Words can be made PIC within themselves. It's addressing other words, and
> > subsequently being linked to that is the problem. Now there are only
> > two solutions that I can see to that one:
> >
> > 1. use absolute addressing, and never move anything.
> > 2. use some kind of table of pointers. threading would have to be done
> > through the pointer table. GC could take advantage of this, however. :)
>    Again, pointer tables are run-time ever-rechecked relocation info.
> So more generally, either we have no relocation info, and can't relocate,
> just keep or discard (discard may accompany recompile),
> or we do have relocation info, that is all the more complicated
> as we manage different kinds of relocations.
> It's always a matter of a tradeoff between runtime and GCtime overhead.
>    Now, there can be more than overall time spent involved:
> we might have additional requirements of GC response time,
> consistency with a resiliant persistent data store,
> ability for real-time threads to inspect data during GC,
> ability to prevent or recover from memory overflows, etc.

Interesting point: which is more important, best possible runtime speed, 
or best possible GC speed/accuracy? And you also point out real-time and 
persistancy to be implemented, so the GC can't hog the cpu.

I'm starting to wonder whether heaps should be managed almost as if they 
were address spaces, in the sense that GC might be done in one heap, 
while threads in other heaps go on working (preemptively). The problem is, 
there would then be extra cost in accessing data in a different heap.
(Possibly a lot of cost. Cross-heap pointers would have to be handled 
_very_ carefully.) Would this be worthwhile? One advantage to this, is it 
would allow to test SMP and distributed processing support, since such 
environments would have exactly the same problem.

> >>    Again, remember that:
> >> * indirect addressing (including relative addr) is just *runtime linking*,
> >>  so that to have a fair idea of the tradeoff,
> >>  we should measure total link+execution time for both the tricks.
> >> * routines that return before any safepoint needn't care about all that.
> > 
> > True, but we don't want to be recompiling/relinking any place we can 
> > avoid it, so a solution that minimizes computations is obviously 
> > desirable.
>    Yes, but my point is precisely that we can't know which solution
> minimizes computations unless we experiment them all,
> which has never been done yet. Which is why we be modular enough
> to allow all those tricks to be tried.

We'll have to see what can be done.

[obstacks vs. fixed buffers]
>    That's exactly what obstacks do. The disadvantage of obstacks
> is their code bulk; their advantage is you don't have to care about
> growing objects beyond predicted limits, and surviving safe points.
> I think the advantage far outweights the disadvantage,
> all the more when it allows easy escapes to the full power of the HLL,
> whereas fixed size buffers just don't.
>    As for efficiency, I recall you that if you guessed the right size,
> the obstack behaves just like your fixed size buffer.
> And relocation information may be useful for a moving GC,
> or producing object files anyway, enabling cross-compiling,
> deferred-linking, etc.
>    Really, there's little to gain with going away from relocation, I think.

You've sold me on the obstacks heap.

I have a minimal Forth core almost ready to go, running over Linux, for 
debugging purposes. I'll try to get it ready for posting, so everyone 
interested can check it out.