Could we use multiple stacks?

Francois-Rene Rideau rideau@ens.fr
Fri, 24 May 1996 13:55:35 +0200 (MET DST)


[Follow-ups to tunes-lll]

Dear Sean,
  please do read the docs about GC and obstacks.
For GC, read the FAQ,
http://www.centerline.com/people/chase/GC/GC-faq.html
For GNU obstack, just do read /usr/include/obstack.h on any linux box
(I can send it to you), and/or read the obstack.[hc] in the GNU libc.

Notes:
   to me, a thread is a unit of execution.
It may consist in just what you say, or be much larger,
include lots of state, etc.
Unix processes are threads to me. Just coarse-grained threads.
   I hope we're not in for word-fight.

>   Now, you assume that GC between threads is impossible (or you don't know
> how to do it) in a preemptive system.
   Compacting GC is indeed impossible in a non-cooperative system.

> The only difference between a
> cooperative and a preemptive system is how other threads gain execution
> time.
> In a cooperative, they voluntarily give up the execution time,
> whereas in a preemptive system, it is forcibly removed from them.
This "mere" difference implies that only cooperative threads can decide
WHEN to yield execution, so that only when the thread is in a safe state
can GC or other migration service happen that require strong consistency
constraints.

> The other
> difference may be the amount of state to be saved, but that's not important
> right now.
It's also a great performance factor.


>   So, we have three threads, Larry, Curly and Moe. [...]
>
>   Does the above change (or in any other way, differ) if the memory space
> between Larry and Curly are shared?  I would assume that if they are shared,
> then doing a GC on Larry will also affect Curly at the same time.
>
They are shared, hence

>   If they differ, then a copy of the memory block is put into Curly's space
> or that block is a segment of shared memory between the two.  Does that
> change anything?
>
*SOMETIMES*, you know your thread does not interact much with others,
so you can avoid to share much GCable data with other threads. Great!
Now, you just can't write an OS by always assuming "best case" situations.
   No thread has any interest if it does not interact somehow
with the external world, hence with other threads
(unless it manages all I/O directly and doesn't really need an OS).
Thus, all threads must abide by GC requirements, even those that
will give little work to the GC.

>   If I'm wrong, please tell me how it DOES work.
It works because when a GC is needed, I know that all threads
are in a satisfactory state. No temporary calculation is done
that will break its assumptions (e.g. registers with non-compliant
values, uninitialized data on heap, etc).

>   Preemptive system: [...]
> 
>   At some point, Tunes comes along and says, Enough!  Let me Collect the
> Garbage.  For each memory address space, all threads currently using that
> space are blocked, then the GC is run in the memory space.
>
>   Where's the problem?
>
>   Or if I'm wrong, please tell me how it DOESN'T work (or what I'm missing).
>
It doesn't work, because threads were preempted in a "dirty" state,
where they could not comply with the inter-thread GC requirements.
A pointer was used in %eax, which the GC does not recognize,
while a value was used in %ebx, during an intermediate calculation,
that the GC will think is a pointer. The shared heap will contain
uninitialized pointers, which will make the GC go to hell,
and Curly and Moe might be in a race for heap space.
Too bad.
   Of course, I could have semaphores for the GC, and require that
every thread will eventually get the GC semaphore,
so I can synchronize them all and the GC can begin.
Hey! That's precisely what is called "cooperative multithreading". Wow!

   BTW, as for efficiency, all your threads use kernel-based message passing,
which in the best case (L4), is ten times slower than procedure call,
which in turn can be much slower than inlined code.
   Of course, in a network of computers, message passing must be used anyway,
but I see no reason to give local communications the price of
international communications.


>> Real-Time and cooperative are NOT contradictory.
>>
>   And I claim they are.
>
We'll see.
Feel free to propose *and implement* a concurrent design for Tunes,
and/or develop another project. But then, you should rather take a look
at VSTa, QNX, and L4.


>> This just means that the time between the triggering event
>> and the execution of the handler is bounded.
>>    Cooperative threads can do this by either having bounded-time "recovery"
>> routines to execute before to jump into the handler,
>> or by ensuring that they will reach a safe point before next
>
>   Example:  My Real-Time requirements require that the handler for event A
> is executed within 10uSecs (10 microseconds).  Event sequence as follows:
>
> 	Thread B just passed a "check point".  It won't get to the next
> 	"check point" until another 50uSecs.
> 
> 	Event A happens.  I have 10uSecs (or less) to handle this situation,
> 	or something bad happens.
>
>   I can't just stop Thread B, because that would be ... oh ...
> PREEMPTION, now wouldn't it?
I can stop thread B, and it's called an interrupt.
The thread high priority interrupt-driven threads that preempt
the normal thread are not of equal priviledge,
but still, normal threads must cooperate with each other,
high-priority threads must all the more to cooperate with each other,
and communication between high-P and normal threads have
even more drastic requirements that threads must cooperate with.
That's why I called my proposal "semi-preemptive". If you have
a better name, give it. Still, threads must cooperate a lot.

> Or do you get around that by calling some
> special code in Thread B that causes it to PREEMPT itself?  Oh, real
> COOPERATIVE there.
If your interrupt-driven routine decides that B should terminate
and switch to C, then it will indeed call a routine that B defined,
that will have B reach next safe point in bounded time and switch thread.
The default routine is just to change a RET into NOP, and wait for
the thread to jump into. Other routines can be supplied.
If you won't call that cooperative, then please give me a better name.

>   Now, upping the "check point" to be every 10uSecs will be a good
> feat.  But assuming you can garuntee that (for instance, is this an option
> to the compiler?  Insert a "check point" every XuSecs?  What if the program
> was compiled for a Pentium and I run it on a slower 386?  Or it was compiled
> with X=50, but I need for X to be 10?
> Or is the code recompiled each time it's run?
If you're using a different computer each time you run it, then yes,
you'll have to recompile each time.
Time-critical routines are hardware-dependent.
Of course, compilation need not be done from source code,
and can be done lightning fast from specially adapted precompiled code.


>[after lots of fanciful figures]
>   To even hope for a response time of XuSecs in a cooperative system, you'll
> have to insert checks every X/2usecs (assuming you can), which doubles the
> overhead.
Exactly. So in optimized code, we only need have two checks per switch,
which ridiculous cost is more than covered by all the state that
switching will not have had to save as a preemptive switch would have to,
plus we gain the ability of having interthread GC.
   Please see that we're talking about human interaction response time,
then 0.01 seconds (10.000 Ás, 330000 cycles on a 33MHz machine,
which makes the 8 cycle overhead per check negligible).
As for device response time, then prioritized real-time threads will do it,
and their buffers will be managed in less than two cooperative cycles.
I don't see how you can be more real-time.
   Time-critical code can be optimized, which removes any problem.
Other code can live with yet another 1-5% overhead.


>>    Anyway Real-Time threads require cooperation *with each other*,
>> as they lose any interest if they are not guaranteed to execute
>> until completion. Of course they have priority over non-real-time threads,
>> and need not cooperate with these.
>>
>   So effectively, you HAVE a preemptive system.
>
Call it as you like it. I fsck names.

>> A kernel-based scheduler requires at least two task switches
>> (400 cycles), and won't allow interthread GC.
>>
>   How do you figure that?  The task switcher under AmigaDOS only required
> one task switch (to the new task), but that's because switching to kernel
> mode didn't require a task switch.
>


>>>>  = a "recovery" routine other than "let me run until next checkpoint"
>>>>   can be declared to the semi-preemptive scheduler,
>>>			   ^^^^^^^^^^^^^^^
>>>   Semi-preemptive?	Trying to have your cake and eat it too?
>>>
>> Not trying. Managing.
>> I don't know any standard word for what I'm doing,
>> so I invent my own.
>>
>   If each process calls yeild(), it's cooperative.  If a process (any
> process) is stopped by another process, it's preemptive.  Period.
>
Then I'd say it's at the same time cooperative and preemptive.
Say as you wish.

>   You seem to want rings, where threads within a ring are cooperative, but
> between rings are preemptive.
>
>   You therefore have a preemptive system.
>
If you say so, it must be true.
I just don't care about the name;
but it sure is not the same "preemptive system" as with Unix,
and threads are sure meant to "cooperate" in many ways.
I'm ready to say what you say, as long as you let me do as I want.

>>>> so that the program can reach a safe state
>>>> from anywhere in the tight loop.
>>>   Say what?  Under what conditions will a program
>>> have to reach a safe state in a tight loop?
>>>
>> Under the condition that an interrupt happens that causes a thread-switch
>> to a lower-or-equal priority thread.
>>
>   So what happens to the state of a thread if a thread-switch goes to a
> higher priority thread?  You'll still have to reach a "safe state" (whatever
> that is) in that case as well.
>
The higher-priority thread does what it wants. When it finishes,
it can either resume the preempted thread, or tell it to reach a safe
state ASAP and switch.


>> Yes, you know, the unix model, where every thread has a TSS,
>
> TSS? What's a TSS? [...]
   Throughout all these messages, I've been using the 486 architecture
as the standard example, as that's what we're trying to initially
implement Tunes for (besides OTOP). Generalizing, adapting, or dropping
statements about the 486 to other architectures is an exercice left to the
reader.


> There's no reason why a
> Unix implemenation for the 386 HAS to use a TSS for every thread (it has to
> use at least one to go into PM, but that's it).
>
Just efficiency, in the case you're preempting coarse-grained threads
(Unix calls them "process"), as you must save all this information,
and much more.

>> plus lots of other tables. The model that wastes space and time,
>
>   Some of those tables provide virtual memory support.
>
No. There are lists of file descriptors, of mmap() tables,
of signal handlers, of semaphores and suches, of FPU registers,
and various device-dependent states, that should be remembered
even though the thread will never use it.

> You CAN get rid of
> those tables, but then you're stuck with only the physical RAM in your
> machine.
>
No. I can have each thread get exactly what it needs, not more, not less,
by being cooperative.

>>>>>	But stacks do their own GC.  Values are pushed onto it for use, and when
>>>>> no longer needed, popped off.  What's the problem (besides keeping track of
>>>>> dynamically allocated globs of memory)?
>>>>>
>>>> The problem is that the stack may contain roots for GC'ed objects.
>>>> Surely you should read about the implementation of LISP/Scheme/ML systems...
>>>>
>> [Comments showing that you don't get the point]
>> The stack may contain pointers to objects that the GC would reclaim
>> otherwise. Hence the GC must scan the stack and mark as used any
>> object pointed to by it.
>>
>   I don't see how copying the stack will solve that.
>
I was explaining that while stacks arguably do "their own GC",
that doesn't release them from the requirements of the system-wide GC.
   Copying the stack is a completely unrelated trick.
It has to do with not having to reserve one stack page for every continuation,
where there can be tens of thousands of continuations,
and where continuations might be activated several times.
Of course continuations that we know are uniquely activated,
and that need fast switching time, can have reserved stack pages.


>> I think you have no idea of what GC is.
> 
>   Well, I think you have no idea how an OS really works, but that hasn't
> stopped you ...
>
I know the grungy details of Apple DOS 3.3, MSDOS, Linux, VSTa,
and the general design of many more OSes, plus all the traditional OS theory,
and lots of ideas stolen from research articles on the net.
Feel free to point me to places where to learn "how an OS really works"...


>   And garbage collection is going through memory looking for allocated
> memory that no longer is referenced, and freeing it.  A secondary objective
> may be to compact the memory so that free space is coalessed into one block,
> but that doesn't seem to be mandatory
That's *tracing* GC that you are describing.
There are other techniques, though only tracing allows compacting,
and it is maximally efficient memory-wise.
   Now, compacting requires that we can
make the difference between a pointer and a raw value,
which means that all GCable threads must be synchronized during a GC,
hence some kind of cooperation whatever it is.
   Actually, compaction is a particular case of migration,
which we WANT our system to support.
Hence this requirement is just impossible to remove anyway,
so let's fully benefit from the compaction it allows.


>   Well, given the example I used, how can it be done more intelligently?
>
> 	String s;
> 	String t;
>
> 	while((s = inputstream.readline()) != null)
> 	{
> 	  t += s;
> 	}
If you really need implement strings as contiguous blocks, use GNU obstack.
However, if you're doing anything high-level where lines count,
I don't see why you would need, and a list of lines might do better.
   Anyway, if you're getting things from a file, then lines do not matter,
and you may as well get its size before slurping it in a block,
or, better mmap() it into memory.
   If you're being interactive, then this is a lame program,
and intelligence is foreign to it whatsoever.


>> you seem not to be aware of the existence of COMPACTING GCs,
>> techniques well mastered since the 70's,
>> which even my plain old HP28 support
>> (actually, that's where I learnt about GC for the first time),
>> and that solve any memory fragmentation problem.
>>
>   They don't completely solve the memory fragmentation problem.
>
They do. Read papers about it.
http://www.centerline.com/people/chase/GC/GC-faq.html


> The problem is, to compact memory, you have to move blocks around.
>
I don't quite see why that's a problem.
There are even real-time compacting GC's
(malloc() seldom is, reference counting never is),
and non-real-time seldom-moving GC's,
incremental GC's that are in the average faster than any malloc(), etc.

> That changes the starting address of the block of memory.
Perhaps you haven't heard about fixing pointers.

> So you HAVE that problem to contend with.
This problem was solved long ago. See all the litterature about GC.
For instance
	ftp://ftp.cs.utexas.edu//pub/garbage/gcsurvey.ps


> You can deal with that by using indirect pointers, and THAT
> can't be compacted (although, since each entry is the same size, the
> fragmentation problem isn't that bad.  The problem comes when you exceed the
> size of the indirect pointer block).
>
And you can fix-up pointers.
You can even do that at the same time that you traverse the pointer
structure to find out what isn't garbage and what is!


>>    Now your point is that dynamic allocation would be inefficient
>> (while it needn't be)
>> and that you want to just forbid it
>> (which makes it infinitely inefficient).
>>
>   Not forbid it, just seeing if there's a better way.
>
Until you find any (please tell us), we'll use that proven way
(see the FAQ, and H.G. Baker's papers).
   Compacting GC's are, in that worst case, infinitely faster than
any non-compacting GC, because in that worst case,
these never succeed due to swiss chess syndrom.

--    ,                                         ,           _ v    ~  ^  --
-- Fare -- rideau@clipper.ens.fr -- Francois-Rene Rideau -- +)ang-Vu Ban --
--                                      '                   / .          --
Join the TUNES project for a computing system based on computing freedom !
                 TUNES is a Useful, Not Expedient System
WWW page at URL: "http://www.eleves.ens.fr:8080/home/rideau/Tunes/"