Tunes LLL/i386

Nathan Hawkins
Thu, 4 Apr 1996 08:02:10 -0500 (EST)

On Thu, 4 Apr 1996, Francois-Rene Rideau wrote:
> Dear Utsl,
> Please note that, contrary to the mailing list's default behavior,
> Tunes requires that you group-reply to messages
> so your reply is sent to the Tunes mailing-list,
> while a simple reply goes only to the author of the replied-to message.

Oops. Sorry.

> > Maybe we can use a pointer at a global address, which contains a value
> > equivalent to what I have in edi, and change it with every context
> > switch. This would achieve the same essential results, but without using edi.
> > Good enough?
>    Good. Let's take care not to put it in a heavily-used cache line
> (like beginning-of-page or end-of-page)...
>    Anyway, we shall try to make most of the code independent of whatever
> hack we use here, by systematically using access macros,
> both for read and for write.

Yes, and preferably find a way to make it more of a mid-level 
construct: a pointer to some object with a static structure at the 
front, followed by run-time, thread-specific/unique data and pointers. 
Presumably, the LLL compiler would use only a tiny part of this (the static 
part), but the HLL would be able to extend it, obviously with words defined
for the purpose.  Sound good?

> >>    Also, swapping esi and esp as stack and code pointers
> >> could be an interesting experience, etc.
> >
> > I mentioned swapping ebp with some other reg. Why swap esi and esp? I
> > can't see what this would do.
>    ESP is fast. ESP as code means RET-based threading,
> which is faster than any other threading method
> (well, it makes things more complicated at other places,
> and we must push/mov code on stack before running it).
> I'm not sure it's a gain,
> but why not make code modular enough so we can try that trick later ?
> I'd love to know how it modifies time/space performance...

Ah, I see. I'll have to think about that a bit. Interesting idea, with 
"complicated" implications. As for modular coding: I'm still trying to 
figure out what kind of architecture to use, whether to go with a 
straight compiler, or use a threaded interpreter and write a compiler later.

> >>[shared stack/heap register]
> > [...] You mean something like alloca?
>    What I meant was that local variables and parameters could be
> passed on the heap space being used as stack space.
> Because we use cooperative multithreading
> (i.e. real-time threads cannot allocate on the shared heap),
> we can safely use this unallocated heap-space as a scratch pad.
> The only requirement is that proper heap frames
> be respected by procedures that yield execution
> (that is, some overhead for non-critical code).
> > [...] the advantages of using guard pages would go away
>    Not at all ! Because the heap dynamically grows,
> the stack guard page would be the same as the heap guard page !
> Real-time threads, of course, would have a separate stack.

Sounds like you're talking about backwards growing heaps. So you're saying
tangle the stack and the data together? Won't that make the GC even harder
to do?  I would have thought separation of data and stack would actually 
help. What you're describing is a lot like the GNU obstacks idea. While 
we could certainly do that, I would think it would get extraordinarily 
complicated to mix a regular stack in with it. I'll have to think about 
that one a while.

> [Garbage Collection]
> > Hmm. That will be complicated. We'll have to work together on that,
> > because my experience doesn't cover that. It sounds like you're talking
> > about memory objects with various attributes, in addition to size. A
> > flags cell might take care of this?
>    Yes, we will have to. But first we must have a basis to work with,
> that is, a heap-less FORTH system.

8( That makes things a lot harder... I'd rather you had said, 
"replaceable-heap Forth-like system." I can't do much without some kind of 
heap. On the other hand, I _can_ write the rest of the system so that the 
GC-less heap code can be replaced. (I think.)

> >>    Well, you need word-alignment for efficiency anyway.
> > Right, I was planning on dword aligning all heap allocations. :)
>    Hum. I meant 32-bit words (16-bit being half-words).
> But you're right. I should have been clearer.
> This is to say we don't lose much addressing power with this GC restriction.

Just makes strings harder. :-( Again, I'll have to think about it, which 
I can't do very well right now. (I have 80+ e-mails to read/answer. I'm 
on too many mailing lists. Maybe I'll drop Roland's OS list... ;)

> > Well, I'll try and implement without GC, to get things working. Then,
> > once some of this stuff is up and running, we'll have to add it, since I
> > don't know enough about how it works, and would rather debug it in a
> > working environment, anyway.
>    Exactly. However, be prepared to modify a lot of things
> when GC arrives :(

That's the problem, I don't quite know which things will need to be 
modified. Otherwise, it wouldn't be a problem. Maybe I should wait, study 
how GC is done, and then write the GC into the heap code from the start. 
Then I'll know exactly how it works, and therefore know what it will break.

> >>[Take some Scheme system as an example]
> > Ok. I've downloaded a copy of MIT-Scheme (5M!)
>    Ouch ! MIT-Scheme might be a bad choice.
> I would rather advise you have a look at elk,
> which is cleanly written, has lots of clear documentation,
> and is much simpler.

I'll go get that. Think I'll dump MIT then. Hd overflow error...

> >>>[guard pages and multiple stacks]
> >>    Guard pages are good.
> >> Again, they show how allocating small or large stuff is different.
> >> They can also be used for dynamically-grown stacks,
> >> for flush-on overflow buffers, etc.
> > Ok. So we use them for large items, but not for small ones?
> No. We use guard-pages for small items, because the use/trap ratio is
> such that traps have little performance impact.
> Large objects will use an explicit bound check.

I'm confused... I would have thought the reverse: large stacks (or 
expected large) would get 4k page granularity, but (with guard pages) 
would be prevented from overflowing into other things. Small stacks 
(known/predicted likely to stay small) would be allocated on a heap 
somewhere and wouldn't grow at all.

Please try and unconfuse me.

> > Problem is,
> > with stacks you can't easily tell how large they possibly will get,
> > making allocation on a heap risky.
> That's why the heap dynamically grows,
> and why GC is here to shrink it when it's too large.

But stacks don't move that well, and when they overflow into other 
objects, it's usually too late to fix them. Now if stacks are grown away 
from other objects, they won't be as likely to break, right?

> > Stacks implemented with a 4k
> > granularity, is admittedly, probably wasteful, but safer.
> They should no doubt be used for real-time and privileged threads.
> > There are, of
> > course, times where a 256 byte stack will be enough, but those aren't
> > common enough, or as predictable.
>    Threads can still allocate such space if they know what they do.
> Not a high-priority feature, though.

Which space? A small, fixed-size heap stack, or a larger-granularity stack? 
I need to know which would be the lower-priority feature.

> >>> Also, I think it is probably a good idea to require
> >>> that stacks be protected
> >>> from other threads, or assume that other threads will not modify them.
> >>    Yeah of course, but for the above.
> > Meaning only on page-granular stacks?
>    Not *only* such stacks, but there sure should be facilities for
> such stacks to exist, particularly for privileged and real-time threads.

Hmm. That was really vague. (Both of us.) Page-based stacks (with guard 
pages) can be protected with the paging hardware, if this were desired. For 
a few very volatile tasks, such as critical device drivers, this could be 
useful, since high-level guarantees are harder to enforce when 
interrupts are involved. Now how you would do anything similar for any other
kind of stack escapes me. Unless you're referring to high-level, software-
security-type protection, I can't see how you'd protect a stack in the 
middle of a heap (un-page-aligned).

> >>> [problems of efficiency vs possibility of migration]
> >>    That's a very difficult question, all the more when migration is wanted.
> >> As I see things, all would be kept abstract by the HLL,
> >> and an interpreter would manage composition of objects at first.
> >> That is, objects, including "code" would be structures on the heap,
> >> and contain little code, but for some optimized routines.
> > Are you talking about source code, or something like intermediate code?
> > And this would be HLL standard behaviour?
>    What I'm thinking about would be like intermediate code and/or
> an interned (preparsed) version of source code,
> which indeed would be standard HLL behavior.
> Facilities would of course be provided in the LLL to hand-build
> and hand-access such constructs, so that the HLL and the LLL
> live in tight cooperation.

This sounds like where I was talking about a replaceable parser. In other 
words, allow the HLL to tell the LLL compiler what words it wants to 
compile directly, without using the LLL parser code, or even FIND.
Let the HLL create its syntax tree, or whatever, and when it wants to 
compile, call the compiler word with a pointer (which the HLL can obtain) 
and let the LLL compiler do the ->binary work. The HLL can then do 
optimizations with regard to choice of words, and factor, inline or 
whatever to suit.
This would essentially allow multiple interfaces to be built on top of 
the same LLL compiler.

> > What you're talking about sounds like compiled objects are disposable, while
> > source/intermediate objects would always be retained. This seems to be
> > saying that the lowest migratable object is LLL source. Am I correct?
>    Well, there could be cooked, preparsed, ready-to-link,
> partially optimized versions of LLL objects available.
> My point is that if an object is too low-level to migrate easily,
> get a higher-level version instead,
> that you may or may not respecialize thereafter.
> What "too low-level" is depends on what kind of migration is done.
> Surely I won't migrate i386 assembly code to a SPARC node on the network,
> but use a higher-level version for the object.

So a predigested LLL source in some HLL internal format, which can be fed 
into an LLL compiler, would be migratable? (See preceding section.)
And no, no sane person would try to migrate x86 -> Sparc. Not even I 
would want to do that. ;)

> >>> Another language detail, I've been toying with, is compiling words in a
> >>> buffer, then copying the buffer into a heap object.
> >>   We sure should make it modular where code is produced by compiling words.
> >
> > Well, compiling words is needed somewhere, and the standard Forth HERE
> > contraption is totally incompatible with a standard heap. Doing it in a
> > buffer allows the copied output to be allocated exactly the right size
> > object in the heap.
>    Let's do it with special code primitives,
> with an interface like GNU's obstacks (that we could steal).

Obstacks is not only written in C, it's also a mess (for C). 
Reimplementing is probably a good idea, sadly enough. Anyway, I don't get 
this: why would we allocate space for compiled words using Obstacks? I 
wanted to use a standard (albeit GC'ed) heap, with allocations being made 
after the exact size of the compiled object is determined. I was actually 
toying with a heap implementation that allocates by dwords. It doesn't 
seem to be practical, though.

> [Relocation of Code words:
> PIC can be done through relative and indirect addressing,
> code relocation with complicated reloc descriptor,
> or slow code recompilation is needed]
> > Doing relocation here would be something I'd like to avoid.
>    Yeah, it's such a mess. There should be primitives to produce
> code-with-relocations, others to (un/re)apply relocations, etc. Ouch.
> However, I feel that we can't do without it,
> unless we choose to discard and recompile (not necessarily in that order ;)
> all movable low-level code during major code GC's (which is a solution).
> I feel that both solutions should be offered to the LLL programmer,
> because again, there are different kinds of code,
> with different migration behavior.
>    I think we should first implement relocation anyway,
> because dynamic linking requires it anyway.
> Of course, as you suggest elsewhere,
> we could make LLL source the standard "object file",
> and support dynamic recompilation rather than relocation,
> at least for a moment.
> Having only recompilation and no relocation means
> that we must know the destination address before we compile,
> and allocate compiled code size before next safe point.
> > [Relative addressing] possible, given the esi=word base pointer convention
>    Well, any general register would do equally (e.g. esp or ebp).

esp for word base pointer?! Not in an x86 binary! Threaded code maybe, 
but it would still be complicated.
Words can be made PIC within themselves. It's addressing other words, and 
subsequently being linked to that is the problem. Now there are only 
two solutions that I can see to that one:

1. Use absolute addressing, and never move anything.
2. Use some kind of table of pointers; threading would have to be done 
through the pointer table. GC could take advantage of this, however. :)

> > [Indirect addressing]
>    The biggest problem about movable code is *return addresses*
> (and more generally continuations, as a return address is just that).
> If these are to be relocated, this means some costly behavior,
> as we must be able to track down the beginning of any return code chunk.
>    The first way to do it is to have some complicated call/receive
> convention, so that return addresses always begin a code chunk,
> or that some indirect or base+offset return address be used.
>    Another way is to have some way to retrieve the beginning
> of a chunk of memory, but that means support in the GC algorithm,
> so either the GC always does it (=cost), or we add a case to it (=cost).
>    As you see, we never get something for nothing.
> My hope is that we can eventually test all possible tradeoffs,
> and choose the best measured system,
> while allowing people to recompile Tunes with whichever they prefer.
> Every address space could have its own GC management, too.

I'll have to think about this one a while. Maybe I'll think of something 
brilliant. (It would be nice...)

>    Again, remember that:
> * indirect addressing (including relative addr) is just *runtime linking*,
>  so that to have a fair idea of the tradeoff,
>  we should measure total link+execution time for both the tricks.
> * routines that return before any safepoint needn't care about all that.

True, but we don't want to be recompiling/relinking any place we can 
avoid it, so a solution that minimizes those computations is obviously 
preferable.

> >>>[Caring for large ALLOTs]
> >>    It is a well-known solution to provide different solutions to
> >> qualitatively different problems: large objects generally have completely
> >> different GC behavior than small objects, thus must be treated differently.
> >> Specialized words will be provided for either, as well as perhaps
> >> a generic multiplexing instance.
> > As long as Forth compatibility in LLL isn't a major issue, no problem.
> > I'm just saying that ALLOT wouldn't be able to be used quite as freely as
> > in Forth, being only usable for small amounts of data. 1000 ALLOT would
> > probably bomb, unless the compile buffer is capable of dynamic expansion,
> > which seems like unnecessary complication.
>    Well, why have such a restriction?
> Again, something like GNU obstack may be used:
> you accumulate data on a backward linked list of buffers,
> and when it's done, it is compacted into a one contiguous buffer.

Because this would mean we'd have to compile and link/relocate in two 
stages, which I was trying to avoid. I was looking for a way to go 
straight from a definition to a binary, which, once complete could be 
copied to an allocated space, and would then work, in place. Preferably 
with minimal overhead/complications. (Nice, clean, fast working compiler, 
which would then be easier to modify/experiment with.)

> > [using segmentation/protection]
> >>    I've been wondering about the feasibility and expediency
> >> of writing everything in Ring 0.
> > My experience has been that it doesn't work nicely. The x86 just doesn't
> > seem to be designed for it. Everything works, except exception handlers, 
> > which break badly, and if there is ever a stack fault, it will probably 
> > shut the machine down. So guard pages, and a lot of other VM tricks would
> > have to go away. Better idea would be to keep a minimal ring 0 section,
> > with very precisely defined tasks, controlled by the rest of the system
> > running in ring 3.
> > Btw, I've tried to do it that way, it just won't work for a featureful
> > system. :( :( :( They _do_ call it protected mode.
>    Hum. So it looked to me.
> However, perhaps making descriptor tables (GDT, LDT, IDT, PDT, IOMAP, etc)
> available to Ring 3 (or 1, or 2, if you prefer) would allow us to
> drastically reduce costly level-switching during mmap()ing and other operations,
> by running most everything in "almost Ring 0" world,
> while preserving the ability to trap guard pages.

So we'd only do interrupts in ring 0? That might work. Do that part in 
assembly, and reflect into real-time threads, which would probably have 
to work from a static heap or something. :( :( :( Oh well, C and assembly 
have the same problem, only worse.

[flat model...]
> >>> At the same time, it has its own complications. A flat model could
> >>> be done with single or multiple address spaces, and with or without page
> >>> protection.
> >
> > I'm talking about the difference between a completely linear model, with
> > flat segments and a virtual address model. Both use flat segments, but
> > not necessarily flat addressing. I favour writing and debugging code to
> > handle it, but leaving the choice of whether to use it up to the HLL.
>    If objects are in different address spaces,
> it means that they do not interact much (else, objects should be migrated),
> so as to minimize barrier-crossing overhead.
> But barrier-crossing among different address spaces sure should be possible:
> special (costly) objects in a space can point to objects in the other.
> That's how file-systems and persistent stores will be handled, for instance
> (the usual "swap-space" being only one persistent store among other ones).

Also, things like page fault and hardware interrupt handlers might work 
better that way, since they would have to deal with more restricted heap 
behaviour, anyway.

> >>>[Page versus segments]
> >>    And [pages] have larger grain.
> >> Perhaps both ways may be provided for different purposes.
> > Probably a good idea, I just have this vendetta against segments. You'll
> > get used to that.
>    One of my early projects was to have one segment per object
> (for protection), until I realized how slow, wasteful, etc, it could be.
> Now, I know that segments are typically the kind of thing that wastes
> chip space and slows down the whole computing world.
>    Moreover, pages are far more portable,
> so let's focus on them, at least to begin with.
> Usually, fs and gs could just be used as thread pointer
> and other such stuff, if expedient.

I don't quite understand. Thread pointer? You mean threads as in 
processes and threads, not like in threaded code, right?
That could be done, but would probably be added later.

> >>>[Independence of some LLL and HLL features with regard to the other L]
> > True, but some things obviously shouldn't be contained within the view of
> > every object. Large portions of the system should neither know nor care
> > about segments vs. pages, just memory that they use. Right? So I would
> > think, given that you have also the OTOP subproject, that there will be
> > at least a few things that can be differentiated only in LLL, and ignored
> > by the HLL, simply because they're different, but not important. This
> > would be things genuinely unimportant to the HLL.
>    Yes: HLL programmers don't have to care about platform-specific stuff.
> Yet, they *can* access it (hence in a non-portable way),
> and the HLL *may* be used to write the heuristics that LLL services need.

Exactly: policy vs. mechanism.

> > Important differences would need to be dealt with, such as the fact that 
> > OTOP won't be able to remap pages easily like raw x86 code can. But will
> > everything fall into that category?
>    Yup. There will be lot of platform-dependent stuff.
> Let's focus on the x86, and not on OTOP,
> so we have more freedom in low-level design.
> Porting to OTOP might be made difficult,
> but OTOP doesn't need any kind of high performance,
> and we could possibly replace system memory faulting
> with software read/write barriers.
> Also, newer "microkernel" implementations of POSIXish stuff
> often allow software pagers to be used...

Something like Tunes over Mach? Or Tunes over Linux (with Linux-specific hooks)?
Anyway, I'll certainly concentrate on the LLL/x86 subproject first, but 
I'm going to try to write and debug some portions under other 
environments, such as Linux, and (possibly) Win32Forth.

[LLL optimizing]
> > True, but it can save time. Also, I'm lazy, and like to do work only
> > once, when possible. And although many issues I've asked about aren't
> > short term worries, it'll help me to know the needs of the HLL sitting on
> > top of LLL, even though I don't need to know exactly how the HLL will work.
> > If I just write a Forth, it wouldn't help you much, would it? Knowing
> > more about the middle of the system (where HLL and LLL meet) will help me
> > to know what I need to do in the LLL.
>    I fully agree. I just wanted to make sure that we wouldn't embark
> in writing overcomplicated things at first,
> and focus on having something work that could be extended later.
> Surely, knowing what kind of further extensions are meant
> can save a lot of time.
> What I meant is that at first, we must implement as little as possible,
> just enough so that it works,
> while following a modular style allows easy extension,
> and providing hooks wherever needed.

Right. But all this is helping immensely. Once I study GC a little more, 
I should be able to get some coding done, I think. (When I can get some 
time again.)