Tunes LLL/i386

Francois-Rene Rideau rideau@ens.fr
Thu, 4 Apr 1996 05:39:53 +0200 (MET DST)


Dear Utsl,
   here is my reply to your last mail.
Please note that contrarily to the osdev@pobox.com mailing list,
Tunes requires that you group-reply messages
so your reply be sent to the Tunes mailing-list,
while simple reply just sends to the replied message author.
   I bounced to the list the message I am answering to.


>>> VERY LOW LEVEL CONVENTIONS:

>> like putting the environment (your edi stuff)
>> in a fixed segment or page whose descriptor would be modified,
>
>[thread-unique, heap-allocated, minimum needed]
> I think I had 5 cells in there, which is 20 bytes.
   Good.

> Maybe we can use a pointer at a global address, which contains a value
> equivalent to what I have in edi, and change it with every context
> switch. This would acheive the same essential results, but without using edi.
> Good enough?
   Good. Let's take care not to put it in a heavily-used cache line
(like beginning-of-page or end-of-page)...
   Anyway, we shall try to make most of the code independent from whatever
hack we use here, by systematically use access-macros,
both for read and for write.


>>    Also, swapping esi and esp as stack and code pointers
>> could be an interesting experience, etc.
>
> I mentioned swapping ebp with some other reg. Why swap esi and esp? I
> can't see what this would do.
   ESP is fast. ESP as code means RET-based threading,
which is faster than any other threading method
(well, it makes things more complicated at other places,
and we must push/mov code on stack before to run it).
I'm not sure it's a gain,
but why not make code modular enough so we can try that trick later ?
I'd love to know how it modifies time/space performance...


>>[shared stack/heap register]
> [...] You mean something like alloca?
   What I meant was that local variables and parameters could be
passed on the heap space being used as stack space.
Because we use cooperative multithreading
(i.e. real-time threads cannot allocate on the shared heap),
we can safely use this unallocated heap-space as a scratch pad.
The only requirement is that safe points require that
proper heap frames be respected by procedures that yield execution
(that is, some overhead for non-critical code).

> [...] the advantages of using guard pages would go away
   Not at all ! Because the heap dynamically grows,
the stack guard page would be the same as the heap guard page !
Real-time threads, of course, would have a separate stack.

>>    You can find some more tricks in the tunes 0.0.0.25 files,
>> particularly the doc/draft document and i386 oriented files.





[Garbe Collection]
> Hmm. That will be complicated. We'll have to work together on that,
> because my experience doesn't cover that. It sounds like you're talking
> about memory objects with various attributes, in addition to size. A
> flags cell might take of this?
   Yes, we will have to. But first we must have a basis to work with,
that is, a heap-less FORTH system.

>>    Well, you need word-alignment for efficiency anyway.
> Right, I was planning on dword aligning all heap allocations. :)
   Hum. I meant 32-bit words (16-bit being half-words).
But you're right. I should have been clearer.
This to say we don't lose much addressing power with this GC restriction.

> Well, I'll try and implement without GC, to get things working. Then,
> once some of this stuff is up and running, we'll have to add it, since I
> don't know enough about how it works, and would rather debug it in a
> working environment, anyway.
   Exactly. However, be prepared to modify a lot of things
when GC arrives :(


>>[Take some Scheme system as an example]
> Ok. I've downloaded a copy of MIT-Scheme (5M!)
   Ouch ! MIT-Scheme might be a bad choice.
I would rather advise you have a look at elk,
which is cleanly written, has lots of clear documentation,
and is much simpler.


>>>[guard pages and multiple stacks]
>>    Guard pages are good.
>> Again, they show how allocating small or large stuff is different.
>> They can also be used for dynamically-grown stacks,
>> for flush-on overflow buffers, etc.
> Ok. So we use them for large items, but not for small ones?
No. We use guard-pages for small items, because the use/trap ratio
such that traps have little performance impact.
Large objects will use an explicit bound check.

> Problem is,
> with stacks you can't easily tell how large they possibly will get,
> making allocation on a heap risky.
That's why the heap dynamically grows,
and why GC is here to shrink it when it's too large.

> Stacks implemented with a 4k
> granularity, is admittedly, probably wasteful, but safer.
They should no doubt be used for real-time and priviledged threads.

> There are, of
> course, times where a 256 byte stack will be enough, but those aren't
> common enough, or as predictable.
   Threads can still allocate such space if they know what they do.
Not a high-priority feature, though.


>>    As for multiple stacks, I suggest that it be *possible* to do everything
>> using shared heaps/stacks as a *transient* space between safe points.
>> People who like to keep more information are free to allocate their own
>> heaps and stacks.
>
> Right, that just means being able to access the lower level memory
> functions which the heaps and stacks sit on top of. Not a problem.
> But Forth or Forth-like code, will still need both a data-stack and an
> r-stack to work properly. How those stacks are allocated, Forth won't
> care about.
   Good.

>>> Also, I think it is probably a good idea to require
>>> that stacks be protected
>>> from other threads, or assume that other threads will not modify them.
>>    Yeah of course, but for the above.
> Meaning only on page-granular stacks?
   Not *only* such stacks, but there sure should be facilities for
such stacks to exist, particularly for priviledged and real-time threads.


>>> Some questions about modules:
>>> How closely linked do we want the modules to be?
>>> [problems of efficiency vs possibility of migration]
>>    That's a very difficult question, all the more when migration is wanted.
>> As I see things, all would be kept abstract by the HLL,
>> and an interpreter would manage composition of objects at first.
>> That is, objects, including "code" would be structures on the heap,
>> and contain little code, but for some optimized routines.
>
> Are you talking about source code, or something like intermediate code?
> And this would be HLL standard behaviour?
   What I'm thinking about would be like intermediate code and/or
an interned (preparsed) version of source code,
which indeed would be standard HLL behavior.
Facilities would of course be provided in the LLL to hand-build
and hand-access such constructs, so that the HLL and the LLL
live in tight cooperation.


>> Then, as they are composed, objects get optimized,
>> and compiled versions are cached, that are "closely linked".
>> But because a "loosely linked" version still exists,
>> there is no problem with persistency, as it's just a matter
>> of discarding cached versions.
>
> What you're talking about sounds like compiled objects are disposable, while
> source/intermediate objects would always be retained. This seems to be
> saying that the lowest migratable object is LLL source. Am I correct?
   Well, there could be cooked, preparsed, ready-to-link,
partially optimized versions of LLL objects available.
My point is that if an object is too low-level to migrate easily,
get a higher-level version instead,
that you may or not respecialize thereafter.
What "too low-level" is depends on what kind of migration is done.
Surely I won't migrate i386 assembly code to a SPARC node on the network,
but use a higher-level version for the object.



>>> LANGUAGE DETAILS

>>> [thread-specific data]
>>    I'm not sure about implementing thread-specific data as a structure
>> of fixed type. Each thread should have its environment specifically
>> adapted to its activity, and common fields should be only the essential
>> ones. Perhaps one field could be a link to a dictionary for self-extension.
>> But for instance a variable like BASE sure doesn't have its place among
>> common thread-specific variables. Instead, procedures to print integers
>> with parameter base would take the parameter from the stack,
>> and the HLL (perhaps self-extension to the LLL) could perhaps
>> eventually implicitly push from a thread-specific variable.
>
> Right. I don't see BASE as being important. (In fact, I was thinking in
> terms of throwing away the complicated multiple base support, and support
> both hex and decimal through separate words.) But I wasn't exactly
> stating I wanted a fixed structure. :) Just that there should be at least
> one, a convenient way to get at it, and a handful of really
> useful/necessary things in it. Of course, careful consideration must go
> into what actually would be included. A dictionary pointer would be 
> needed, just to get the whole thing to work. A path (see below) of some
> kind would be nice. Stack base pointers would be helpful. But no, I
> wouldn't include things like>IN. Obsolete Forth things wouldn't be in
> LLL, anyway.
   Good.


> [vocabularies structured like directories]
>>    Yes. Path access is so useful it must be possible
>> *and* have a simple syntax. Something both FORTH and Scheme lack.
> Good. I'm glad you like it.
   Well, got used to it while using Pascal then C.

>>    Nested definitions require scoping, meta definitions require quoting.
>> Scheme does it quite right (well, there could be progress),
>> and any metasystem should get inspiration from there.
>
> Well, I've looked at it, and it doesn't look like something I want to do
> in LLL. Very complicated to do there. In HLL, it could be "simply" mapped
> to the appropriate words in the LLL, with no serious need to actual nest
> words, since scoping would have been resolved already. Am I right?
   Right.

>>> Another language detail, I've been toying with, is compiling words in a
>>> buffer, then copying the buffer into a heap object.
>>   We sure should make it modular where code is produced by compiling words.
>
> Well, compiling words is needed somewhere, and the standard Forth HERE
> contraption is totally incompatible with a standard heap. Doing it in a
> buffer allows the copied output to be allocated exactly the right size
> object in the heap.
   Let's do it with special code primitives,
with an interface like GNU's obstacks (that we could steal).


>>> Of course,
>>> this buffer would need to be separate for each thread.
>>    Not at all, it could be done on heap objects that be freed when we're
>> done, or on transient space between safe points.
>
> Right, that's just the location of it. I should have said separate for
> each compilation thread, so that if more than one compile is being done,
> nothing gets eaten. Better?
   Yup.

[Relocation of Code words:
PIC can be done through relative and indirect addressing,
code relocation with complicated reloc descriptor,
or slow code recompilation is needed]

> Doing relocation here would be something I'd like to avoid.
   Yeah, it's such a mess. There should be primitives to produce
code-with-relocations, other to (un/re)apply-relocations, etc. Ouch.
However, I feel that we can't do without it,
unless we choose to discard and recompile (not necessarily in that order ;)
all movable low-level code during major code GC's (which is a solution).
I feel that both solutions should be offered to the LLL programmer,
because again, there are different kinds of code,
with different migration behavior.
   I think we should first implement relocation anyway,
because dynamic linking requires it anyway.
Of course, as you suggest elsewhere,
we could make LLL source the standard "object file",
and support dynamic recompilation rather than relocation,
at least for a moment.
Having only recompilation and no relocation means
that we must know the destination address before we compile,
and allocate compiled code size before next safe point.


> [Relative addressing] possible, given the esi=word base pointer convention
   Well, any general register would do equally (e.g. esp or ebp).

> [Indirect addressing]
   The biggest problem about movable code is *return addresses*
(and more generally continuations, as a return address is just that).
If these are to be relocated, this means some costly behavior,
as we must be able to track down the beginning of any return code chunk.
   The first way to do it is to have some complicated call/receive
convention, so that return addresses always begin a code chunk,
or that some indirect or base+offset return address be used.
   Another way is to have some way to retrieve the beginning
of a chunk of memory, but that means support in the GC algorithm,
so either the GC always does it (=cost), or we add a case to it (=cost).
   As you see, we never get something for nothing.
My hope is that we can eventually test all possible tradeoffs,
and choose the best measured system,
while allowing people to recompile Tunes with whichever they prefer.
Every address space could have its own GC management, too.

> In compiled x86 binary, it would be about as elegant
> as an elephant playing piano, but that's normal for x86.
   (-8 (-8

   Again, remember that:
* indirect addressing (including relative addr) is just *runtime linking*,
 so that to have a fair idea of the tradeoff,
 we should measure total link+execution time for both the tricks.
* routines that return before any safepoint needn't care about all that.



>>>[Caring for large ALLOTs]
>>    It is a well-known solution to provide different solutions to
>> qualitatively different problems: large objects generally have completely
>> different GC behavior than small objects, thus must be treated differently.
>> Specialized words will be provided for either, as well as perhaps
>> a generic multiplexing instance.
>
> As long as Forth compatibility in LLL isn't a major issue, no problem.
> I'm just saying that ALLOT wouldn't be able to be used quite as freely as
> in Forth, being only usable for small amounts of data. 1000 ALLOT would
> probably bomb, unless the compile buffer is capable of dynamic expansion,
> which seems like unnecessary complication.

   Well, why have such restriction ?
Again, something like GNU obstack may be used:
you accumulate data on a backward linked list of buffers,
and when it's done, it is compacted into a one contiguous buffer.


>>> A side issue to consider: could the HLL be reasonably implemented as a
>>> close superset of the LLL, such that rather than having a clear two level
>>> system, we could allow programs to be written at arbitrary level.
>>    That was my original idea.
>> However, for cleaner semantics, and not to have to cope with
>> early implementation bugs,
>> we are currently bootstrapping the HLL from Scheme.
> 
> Fine, if this is still the eventual goal.
   It is.


>>> Why build a dichotomy of between the two languages?
>> Because, in a first time, it allows independent development of them.
>> When things are settled down and stable, they can hopefully be merged,
>> unless good reasons appear against it.
> 
> We'll just have to try to prevent those "good reasons" from appearing. :-)
   Yes, we'll try hard.


>>> SYSTEM CONSIDERATIONS

>>> But there will need to be some accounting of resource ownership,
>>> and a design decision, as to what level this is handled at.
>>> The HLL could do this, but I'd like to program in
>>> the LLL, and that would screw things up for me. (Personal gripe.)
>>    Until the HLL eventually does it, some LLL code must be provided,
>> and can be used. Of course, the HLL will hopefully make it far easier,
>> while allowing people who really want to to have the same expressivity
>> as LLL allows, by having more or less elaborate hooks and plugs
>> into the LLL.
>
> Allow calling back up into HLL from LLL, you mean? Good.
   Yup: the HLL and LLL can call each other.

> [resource ownership by what sort of objects?]
>>    I'd rather make things orthogonal:
>> there are objects than have access to other objects,
>> among which are unique resources.
>
> I was only suggesting that to describe the problem... I assumed you had
> some other idea.


> [using segmentation/protection]
>>    I've been wondering about the feasability and expediency
>> of writing everything in Ring 0.
>
> My experience has been that it doesn't work nicely. The x86 just doesn't
> seem to be designed for it. Everything works, except exception handlers, 
> which break badly, and if there is every a stack fault, it will probably 
> shut the machine down. So guard pages, and a lot of other VM tricks would
> have to go away. Better idea would be to keep a minimal ring 0 section,
> with very precisely defined tasks, controlled by the rest of the system
> running in ring 3.
> Btw, I've tried to do it that way, it just won't work for a featureful
> system. :( :( :( They _do_ call it protected mode.
   Hum. So it looked to me.
However, perhaps making descriptor tables (GDT, LDT, IDT, PDT, IOMAP, etc)
available to Ring 3 (or 1, or 2, if you prefer) would allow to drastically
reduce costly level-switching during mmap()ing and other operations,
by running most everything in "almost Ring 0" world,
while preserving the ability to trap guard pages.


>>> [Using segments and pointer relocation: a nightmare]
>>
>>    We could use segments only for objects without moving allocation inside,
>> or objects in general where the overhead for GC is small before the
>> advantages of the segment stuff.

> Still remains the hassle of generating code to deal with it. Best if
> these sorts of objects aren't common. However, code to manipulate the
> DT's is probably necessary.
   Of course such things won't be common, as there is a lot of overhead.
It is a low-priority feature, and we needn't support it at first.
But we just shouldn't preclude it from future development.


>>> At the same time, it has its own complications. A flat model could
>>> be done with single or multiple address spaces, and with or without page
>>> protection.
>
> I'm talking about the difference between a completely linear model, with
> flat segments and a virtual address model. Both use flat segments, but
> not necessarily flat addressing. I favour writing and debugging code to
> handle it, but leaving the choice of whether to use it up to the HLL.
   If objects are in different address spaces,
it means that they do not interact much (else, objects should be migrated),
so as to minimize barrier-crossing overhead.
But barrier-crossing among different address spaces sure should be possible:
special (costly) objects in a space can point to objects in the other.
That's how file-systems and persistent stores will be handled, for instance
(the usual "swap-space" being only one persistent store among other ones).


>>>[Page versus segments]
>>    And [pages] have larger grain.
>> Perhaps both ways may be provided for different purposes.
> Probably a good idea, I just have this vendetta against segments. You'll
> get used to that.
   One of my early projects was to have one segment per object
(for protection), until I realized how slow, wasteful, etc, it could be.
Now, I know that segments are typically the kind of thing that wastes
chip space and slows down the whole computing world.
   Moreover, pages are far more portable,
so let's focus on them, at least to begin with.
Usually, fs and gs could just be used as thread pointer
and other such stuff, if expedient.


>>>[Independence of some LLL and HLL features with regard to the other L]
>>    As I said on the other mailing list,
>> the low-level part of the OS should provide as expedient means as possible
>> to achieve operations of well-known semantics.
>> It is then up to the high-level part to coordinate them.
>> The low-level part should make as little scheduling decisions,
>> contain as little heuristics as possible,
>> and not overspecify any part of the OS.
>
> True, but some things obviously shouldn't be contained within the view of
> every object. Large portions of the system should neither know nor care
> about segments vs. pages, just memory that they use. Right? So I would
> think, given that you have also the OTOP subproject, that there will be
> at least a few things that can be diffentiated only in LLL, and ignored
> by the HLL, simply because they're different, but not important. This
> would be things genuinely unimportant to the HLL.
   Yes: HLL programmers don't have to care about platform-specific stuff.
Yet, they *can* access it (hence in a non-portable way),
and the HLL *may* be used to write the heuristics that LLL services need.

> Important differences would need to be dealt with, such as the fact that 
> OTOP won't be able to remap pages easily like raw x86 code can. But will
> everything fall into that catagory?
   Yup. There will be lot of platform-dependent stuff.
Let's focus on the x86, and not on OTOP,
so we have more freedom in low-level design.
Porting to OTOP might be made difficult,
but OTOP needn't any kind of high-performance,
and we could possibly replace system memory faulting
with software read/write barriers.
Also, newer "microkernel" implementations of POSIXish stuff
often allow software pagers to be used...


>>    I have stated many possible optimizations in this message.
>> However, the highest priority is to have something simple working,
>> even unoptimized (well, not pessimized either).
>> It is always time to optimize later,
>> when we can add high-level support for low-level tricks at the same time.
>> Optimizing at once only makes it more
>> difficult for both implementor and user...
>
> True, but it can save time. Also, I'm lazy, and like to do work only
> once, when possible. And although many issues I've asked about aren't
> short term worries, it'll help me to know the needs of the HLL sitting on
> top of LLL, even though I don't need to know exactly how the HLL will work.
> If I just write a Forth, it wouldn't help you much, would it? Knowing
> more about the middle of the system (where HLL and LLL meet) will help me
> to know what I need to do in the LLL.
   I fully agree. I just wanted to make sure that we wouldn't embark
in writing overcomplicated things at first,
and focus on having something work that could be extended later.
Surely, knowing what kind of further extensions are meant
can save a lot of time.
What I meant is that at first, we must implement as little as possible,
just enough so that it works,
while following a modular style allows easy extension,
and providing hooks whereever needed.

--    ,        	                                ,           _ v    ~  ^  --
-- Fare -- rideau@clipper.ens.fr -- Francois-Rene Rideau -- +)ang-Vu Ban --
--                                      '                   / .          --
Join the TUNES project for a computing system based on computing freedom !
		   TUNES is a Useful, Not Expedient System
WWW page at URL: "http://www.eleves.ens.fr:8080/home/rideau/Tunes/"