Could we use multiple stacks?

Tue, 28 May 1996 21:52:20 +0200 (MET DST)

[Note: this belongs to, not]


>> Dear Sean,
>>   please do read the docs about GC and obstacks.
>> For GC, read the FAQ,
>> For GNU obstack, just do read /usr/include/obstack.h on any linux box
>> (I can send it to you), and/or read the obstack.[hc] in the GNU libc.
>   Okay I will (I haven't had time to since I received this).
Also, obstack is thoroughly documented in the tex/info pages for the glibc.

>> Notes:
>>    to me, a thread is a unit of execution.
>> It may consist in just what you say, or be much larger,
>> include lots of state, etc.
>> Unix processes are threads to me. Just coarse-grained threads.
>>    I hope we're not in for word-fight.
>   No, now that you have actually defined what you mean by thread (everything
> from what I consider a thread to what I consider a process).
I thought this has always been obvious.
Just added an entry for it in the Tunes Glossary.

>>> The only difference between a
>>> cooperative and a preemptive system is how other threads gain execution
>>> time.
>>> In a cooperative, they voluntarily give up the execution time,
>>> whereas in a preemptive system, it is forcibly removed from them.
>> This "mere" difference implies that only cooperative threads can decide
>> WHEN to yield execution, so that only when the thread is in a safe state
>> can GC or other migration service happen that require strong consistency
>> constraints.
>   A thread in a preemptive system reaches a "safe state" when making a
> "system" call (and don't take me to mean the typical Unix implementation of
> a system call into a monolothic kernel - I'm talking about the standard,
> provided routines that come with Tunes, for example).
In no way can the state before a system call be considered to be safe
just because the call was made.
Preemption guarantees that the state of preempted threads will never be safe.
   Guaranteeing that a call will be eventually made within a bounded
time interval, and that thread state will be safe,
so that threads can synchronize,
is precisely what cooperation is about.
   Safe means that many assumptions can be made about the layout
of registers, memory, FPU, I/O devices, etc,
so that GC, checkpointing, and various other kinds of object migration
can safely happen.
   Even though you seemingly have no clue about GC,
you might agree that assumptions such as
  * %eax, %ecx, %edx contain no pointer, while
  * %ebx, %ebp, %esi, %edi, %esp contain valid GC-able values
   (either an recognizable unboxed object or a block-aligned pointer), and
  * all memory is properly initialized and framed
just CANNOT be given by a preemptive system,
least you put most everything inside frequent cli-sti statements,
which makes the sti-cli sequence the PAUSE statement of a cooperative system.

>   You may not realize this, but it's the actual RARE program that IS
> preempted on a preemptive system (most times, doing a 'system' call causes
> the equivilent of a 'yeild').  Those that ARE preempted are usually in a
> long involved calculation, or hung.
This isn't exact or relevant: in a preemptive system,
every tick causes a thread to be preempted out of a system call.
Then, it suffices that you can't be sure that all threads can be synchronized
so that no inter-thread compacting GC is possible,
which is just not bearable on a persistent system.

>>> The other
>>> difference may be the amount of state to be saved, but that's not important
>>> right now.
>> It's also a great performance factor.
>   And I suspect that you don't have an idea of how LITTLE information
> actually needs to saved in a preemptive system.
I know exactly how much information needs to be saved,
and how much costs the n-fold indirect addressing
(file handle/pointer table/file structure/buffer/etc)
used so that only "head pointers" to that information need to be saved
at thread switches, which means pointer restore at every access
instead of restore/save at switches only,
which means centralization of accesses for all such resources, etc.
That's paying the cost of parallel synchronization on sequential systems.

>>>   If I'm wrong, please tell me how it DOES work.
>> It works because when a GC is needed, I know that all threads
>> are in a satisfactory state. No temporary calculation is done
>> that will break its assumptions (e.g. registers with non-compliant
>> values, uninitialized data on heap, etc).
>   See above.
Yes, see above.

>>>   Where's the problem?
>>>   Or if I'm wrong, please tell me how it DOESN'T work (or what I'm missing).
>> It doesn't work, because threads were preempted in a "dirty" state,
>> where they could not comply with the inter-thread GC requirements.
>> A pointer was used in %eax, which the GC does not recognize,
>> while a value was used in %ebx, during an intermediate calculation,
>> that the GC will think is a pointer. The shared heap will contain
>> uninitialized pointers, which will make the GC go to hell,
>> and Curly and Moe might be in a race for heap space.
>> Too bad.
>>    Of course, I could have semaphores for the GC, and require that
>> every thread will eventually get the GC semaphore,
>> so I can synchronize them all and the GC can begin.
>> Hey! That's precisely what is called "cooperative multithreading". Wow!
>   But you still WANT the ability for higher priority threads to PREEMPT
> lower priority threads.  And I proved that one can do cooperative in a
> preemptive system.  
>   So it's best for you to do a preemptive system, because you'll end up
> doing that anyway.

>>>> Real-Time and cooperative are NOT contradictory.
>>>   And I claim they are.
>> We'll see.
>> Feel free to propose *and implement* a concurrent design for Tunes,
>> and/or develop another project. But then, you should rather take a look
>> at VSTa, QNX, and L4.
>   I've taken a look at QNX (used it at a company I did some work for) and I
> really liked it (microkernel message based, and one of the fastest OSes I've
> used.  And guess what?  It's preemptive, and even running X, was faster than
> MS-Windows).
Then what? Does QNX provide GC, orthogonal persistence,
fine-grained migration, protection through scoping and typing?
Again, if you want a free QNX-like OS, get VSTa, the HURD, not Tunes.
Eumel, L3, L4 are also a free uK msg based system that do persistence.
You might like them.
If you don't like the way Tunes is designed,
find a project that suits you better,
and everyone will be much happier.

>   I've also done work under cooperative environments, and the amount of
> overhead I had as a programmer apalled me, and under one system, the amount
> of hacking required to get decent performance (actually, it was impressive)
> lead to spagetti code the likes I haven't seen since BASIC (and this wasn't
> coded in BASIC), only it was worse, poorly understood, hard to change, and
> with NO documentation other than what (sparse) comments existed in the code.
Those environments required the programmer to cooperate without
providing compilers generating cooperating code.
Exactly why no C-based OS can do secure persistent GC.
Because the cooperative environments you used were poor
doesn't mean they must be.

>   Granted, you state that I'll be hidden from such details by the compiler,
> but the performance goes to hell,
Why should it go to hell ???
Compiler technology is well understood, and compiler-generated code
easily competes with hand-generated code without the same debugging problems.
Surely it is allowed to fine-tune critical sections into assembly, too.

> and in a pathological case (it's the only
> process running) it STILL has the overhead of checking.
Programs running in preemptive environment also STILL suffer
the overhead of the interrupt-driven scheduler when alone.
And I explained already how any checking cost could be circumvented
in a cooperative way with a "recovery routine".
If you have a performance-critical process
meant to run alone for a long time,
then it means it doesn't interact with other processes,
and runs a tight loop for which such a recovery routine
will have been defined anyway.
If there's no switch, the routine won't be called,
and there's no more overhead than with a preemptive system.

>>>   I can't just stop Thread B, because that would be ... oh ...
>>> PREEMPTION, now wouldn't it?
>> I can stop thread B, and it's called an interrupt.
>> The thread high priority interrupt-driven threads that preempt
>> the normal thread are not of equal priviledge,
>> but still, normal threads must cooperate with each other,
>> high-priority threads must all the more to cooperate with each other,
>> and communication between high-P and normal threads have
>> even more drastic requirements that threads must cooperate with.
>> That's why I called my proposal "semi-preemptive". If you have
>> a better name, give it. Still, threads must cooperate a lot.
>   But it's still preemption, which implies saving state on behalf of the
> preempted thread.  If you have that, then why even bother with the
> cooperative crap?  You can STILL do cooperative tasking under a preemptive
> system (and I did present work to this very list that did just that).
It's interruption, as Jecel coined out, and you agreed to in another mail.
It's NOT preemption, as exists in Unix.
Higher-Priority threads INTERRUPT lower-priority ones,
and do not preempt them;
they may SIGNAL that it's time to switch,
but will let any actual switching up to the lower-priority thread.
That's still cooperative.
And even if you don't like it, again,

>   Then that's more work for the programmer to do when the system can do that
> for me automattically (and correctly).
No, because the compiler does that without your bothering.
What the high-level programmer sees is a preemptive world.
If you're doing non-atomic manual low-level programming,
then you're asking for trouble and you're on your own.

>> If you won't call that cooperative, then please give me a better name.
>   "Gross hack"?
Cooperative explains more.

>>>[after lots of fanciful figures]
>>>   To even hope for a response time of XuSecs in a cooperative system, you'll
>>> have to insert checks every X/2usecs (assuming you can), which doubles the
>>> overhead.
>> Exactly. So in optimized code, we only need have two checks per switch,
>> which ridiculous cost is more than covered by all the state that
>> switching will not have had to save as a preemptive switch would have to,
>> plus we gain the ability of having interthread GC.
>   You seem to think that saving state in a preemptive system is exceedingly
> costly?  Why?  Because you have to save tables and tables of information
> (like pagetables)?  I hate to tell you this, but this ain't so.
If there are to be any dynamically-windable objects,
a preemptive system will have to save every single such object,
and to require a critical section wherever an assumption may be broken
about such objects, as well as on all shared objects.

>[example for switching threads,
>saving only general purpose registers and page table]
   Your example assumes that no preempted task will have modified
segment registers or used the FPU registers.
Any shared hardware of software resource will have to be completely saved,
even though the preempted thread didn't use any of them.
   Actually, a real-life cooperative thread switching routine
will be much faster than yours, while a a real-life preemptive
thread switching will be much larger.

> On a 33MHz system, that's 8,848 nSecs, well within
> my 10uSec overhead handling case 8-)
This is completely irrelevant.
Interrupt handling latency is independent
from normal thread switching latency.

> and will allow us to switch 113,013
> tasks/second (if that's all we're doing, which makes for some good bench
> marks).
If we're out with dumb benchmarks, here is my version:
  movl current_thread, %eax		; 4
  movl next_thread_member(%eax), %eax	; 4
  movl %eax, current_thread		; 2
  jmp activate_thread_member(%eax)	; 10+m

For my idle thread,
the activate_thread_member would point to activate_next_thread,
hence m=2, and that's a 22 cycle total switching time,
hence a 1.5 million switches per second benchmark,
far more than your ridiculous 110 thousands.
   Of course, my compiler can detect that all threads are idle,
and not generate code for them, hence 0 us time taken to switch
thread, and a magnificent +infinity thread switch per second benchmark.
   Real-life situations make cooperative gain less overwhelming.
Still, cooperating threads save only what they need.

>   If you can make cooperative tasking work as fast, then more power to you,
I can make it a lot faster, hence much more power to me.

> but seeing how a higher priority thread will "preempt" (ahem) lower priority
> threads, you'll have to do the same thing ANYWAY so why not just go ahead
> and use preemption (seeing how you can do cooperative multitasking in a
> preemptive system anyway!)?
Interrupts will certainly do most of the work you describe,
though not all of it;
higher-priority threads will have altogether different
system-wide assumptions and restrictions than lower-priority threads.

>   And Fare, it's important you realize that 'mov ebx,cr3/mov cr3,ebx'
> fragments preserve the external CPU state of a thread (no "copying kilobytes
> of state, much less bytes of state required").  That's 11 cycles.
Saving page tables IS expensive:
it means lots of cache misses to refill the table.
That's the *hidden cost*, as opposed to the obvious cost.
Benchmarks where everything fits the cache may hide it;
real-life measurements won't.
And even if you don't explicitly copy it,
the bytes of states that make the page tables are required.

>>    Time-critical code can be optimized, which removes any problem.
>> Other code can live with yet another 1-5% overhead.
>   "yet another"?  If we're loosing 1-5% for cooperative overhead, what other
> overhead do you have?
Overhead of preemption:
* explicitly synchronizing on low-level operations, particularly on GC.
* saving lots of unused state.

This overhead is paid for ALL code.
Cooperation does not pay for performance-critical code.

>>>>    Anyway Real-Time threads require cooperation *with each other*,
>>>> as they lose any interest if they are not guaranteed to execute
>>>> until completion. Of course they have priority over non-real-time threads,
>>>> and need not cooperate with these.
>>>   So effectively, you HAVE a preemptive system.
>> Call it as you like it. I fsck names.
>   Pardon?
You're playing words. I hate that.

>>>   How do you figure that?  The task switcher under AmigaDOS only required
>>> one task switch (to the new task), but that's because switching to kernel
>>> mode didn't require a task switch.
>   And you didn't respond to this.
I granted you can optimize task switching into one-and-a-half task switching,
by software emulating the hardware behavior of a task switch, and factoring
out things done by switching to kernel mode (Linux does that).
The 68K architecture does it better with bitmasked store-multiple and
load-multiple instructions, with a simple system stack vs user stack.
No Intel crap there.
   Still, the above arguments for cooperative vs preemptive still hold
for a 68K system as well as for any system, unless you've got an
architecture that does GC and all we need atomically in hardware.

>>>   You seem to want rings, where threads within a ring are cooperative,
>>> but between rings are preemptive.
>>>   You therefore have a preemptive system.
>> If you say so, it must be true.
>> I just don't care about the name;
>   You best care about the name, since you have to communicate to other
> programmers what you are trying to do and to avoid the arguments that we're
> having right now.  If you can't communicate effectively, then it doesn't
> matter if you have the best system in the world, no one will understand you
> to use it.
Calling the system "preemptive" is altogether misleading.
It's "interruptive cooperative".
I don't care about the name *in as much as the concept passes*.
Seamingly you're the only one on the list not to get the concept,
because you stick to inadequate dogmatic naming.

>> but it sure is not the same "preemptive system" as with Unix,
>> and threads are sure meant to "cooperate" in many ways.
>> I'm ready to say what you say, as long as you let me do as I want.
>   I'm still waiting for you to do it (or see examples, like I've done).
Then let me do it.

>>>> Yes, you know, the unix model, where every thread has a TSS,
>>> TSS? What's a TSS? [...]
>>    Throughout all these messages, I've been using the 486 architecture
>> as the standard example, as that's what we're trying to initially
>> implement Tunes for (besides OTOP). Generalizing, adapting, or dropping
>> statements about the 486 to other architectures is an exercice left to the
>> reader.
>   But the tone of your comment lead me to believe that you believe in "all
> the World's a PC (TM)".
To emit such a judgement,
you seemingly haven't read much of the Tunes pages,
mailing-list archive, and sources.

>>> There's no reason why a
>>> Unix implemenation for the 386 HAS to use a TSS for every thread
>>> (it has to
>>> use at least one to go into PM, but that's it).
>> Just efficiency, in the case you're preempting coarse-grained threads
>> (Unix calls them "process"), as you must save all this information,
>> and much more.
>   Well, can you explain to me WHAT exactly has to be saved, and HOW?  Unless
> I'm terribly mistaken, the code I presented above will do just that (and
> doesn't save nearly the amount of state that going through a TSS will).
You don't save segment state, LDT, I/O permissions, FPU state.
If you do, you'll do exactly the job of a TSS, and a TSS will be faster.
If you ask the user to have a critical section around any use of
these resources, then you indeed have the lamest kind of system,
with the defects of both preemptive and cooperative kinds,
instead of their advantages.

>>>> plus lots of other tables. The model that wastes space and time,
>>>   Some of those tables provide virtual memory support.
>> No. There are lists of file descriptors, of mmap() tables,
>> of signal handlers, of semaphores and suches, of FPU registers,
>> and various device-dependent states, that should be remembered
>> even though the thread will never use it.
>   I can't even begin to understand what you are talking about, unless you
> mean that file descriptors, mmap() tables, signal handlers, semaphores, FPU
> registers, device driver states and such are stored in fixed memory
> locations and have to be copied on each task switch.  AND THAT JUST ISN'T
> THE CASE!  All that is external to the CPU and need not be saved (well,
> reference must be saved, but that's hardly the same as copying all that).
You seem to be talking only about pure switching time cost,
which I grant you is not that enormous for preemptive threads,
though already much bigger than for cooperative threads.
I'm talking about space cost as well as time cost
(even with big enough virtual memories,
space cost induces time cost by overly swapping):
even though your thread might not use all resources,
your preemptive system must reserve space for them all.
With 1000 threads and 16KB/thread, that's 16MB of system space overhead,
plus all the associated swap time cost.
Cooperative threads might occupy a less than one hundred bytes each,
and easily fit the computer's RAM.

>> No. I can have each thread get exactly what it needs, not more, not less,
>> by being cooperative.
>   Well, I can have each thread get exactly what it needs, not more, not
> less, by being preemptive.  So?
Your preempter will have to save all potentially used registers
and global variables, including
   Besides, you just can't express a dynamic-wind construct,
and save the dynamical state needed by your thread,
because your preempted threads aren't synchronized enough.

>>>>>>>	But stacks do their own GC.  Values are pushed onto it for use, and when
>>>>>>> no longer needed, popped off.  What's the problem (besides keeping track of
>>>>>>> dynamically allocated globs of memory)?
>>>>>> The problem is that the stack may contain roots for GC'ed objects.
>>>>>> Surely you should read about the implementation of LISP/Scheme/ML systems...
>>>> [Comments showing that you don't get the point]
>>>> The stack may contain pointers to objects that the GC would reclaim
>>>> otherwise. Hence the GC must scan the stack and mark as used any
>>>> object pointed to by it.
>>>   I don't see how copying the stack will solve that.
>> I was explaining that while stacks arguably do "their own GC",
>> that doesn't release them from the requirements of the system-wide GC.
>>    Copying the stack is a completely unrelated trick.
>> It has to do with not having to reserve one stack page for every continuation,
>> where there can be tens of thousands of continuations,
>> and where continuations might be activated several times.
>> Of course continuations that we know are uniquely activated,
>> and that need fast switching time, can have reserved stack pages.
>   I'm still lost.
Sure you are. Learn about continuations,
and uniqueness annotations on types (aka linear programming),
and how to implement garbage collection.

>   ESP.  Stack pointer.  Perhaps you're familiar with it?  Just point ESP off
> to a new region in memory.  Each "continuation" has it's own copy of the
> stack pointer, whether it's ESP, A7, S, R14 or R30 (depending upon the
> architecture).  Am I missing something here?
Yes. You're missing the way continuations can be implemented
so that they can be entered multiple times;
you're missing how to share a single stack space between cooperative
threads instead of reserving page-granular stacks for each;
you're missing how to have 1000 threads safely fit a 4MB computer's memory.

>>>> I think you have no idea of what GC is.
>>>   Well, I think you have no idea how an OS really works, but that hasn't
>>> stopped you ...
>> I know the grungy details of Apple DOS 3.3, MSDOS, Linux, VSTa,
>> and the general design of many more OSes, plus all the traditional OS theory,
>> and lots of ideas stolen from research articles on the net.
>> Feel free to point me to places where to learn "how an OS really works"...
>   You got me.  And here I thought I understood how OSes work, from dealing
> with the internals of OS-9, OS/2, MS-DOS, Linux, AmigaDOS and some passing
> knowledge of other Unix variants, QNX and VMS and concluding that you are
> either:
> 	1. totally ignorant of OS design
> 	2. being intentionally obscure in your terminology
> 	3. being unintentionally obscure in your terminology
> 	4. some of the above
> 	5. all the above
>   I guess there IS a difference between the mindsets of European and
> American hackers.
The difference is rather that
* you never question traditional terminology
 you're ready to accept that there exists appropriate concepts such as
 "multimedia", "object-oriented", and "information superhighway"
 just because everyone uses the terminology.
* you think that because people don't agree with the tradition,
 people don't know the tradition, because the tradition is everything.
* you never question traditional design, nor try to read articles
 by people challenging it. Because Microware, IBM, M$, AT&T, and Commodore
 did it in a way, you won't even think other ways are possible.
* particularly, you think that managing the hardware, GC, persistence,
 security, are independent things.
* You've never read, heard, or thought about
 GC, orthogonal persistence, scoping, dynamic-winding,
 multiply-entered continuations, linear vs non-linear programming,
 and how can be implemented.
* You believe that all languages are equivalently expressive,
 just because they are Turing-equivalent.
* You end up your talk with stupid racist comments that you think clever.

>[Still the string slurping example]
>> If you really need implement strings as contiguous blocks, use GNU obstack.
>   Strings, arrays, who the cares what it is?  It's an example for crying out
> loud.  Lot's of dynamic allocations and deallocations of small (under 1K)
> blocks of memory which tend (in the abscense of GC) to fragment memory to
> hell and back.
If you had read how obstack work, you'd have seen that in no case
such allocation/deallocationi stuff need be done.
Just allocation of fixed size buffers (say all 4K)
until the object stabilizes,
then reallocation into one object if the buffer was big enough,
or else copy into one big object.
No exceptional fragmentation involved.
If you fine-tune block size with respect to the malloc routine,
or give a proper malloc routine to the obstack,
you can gain great performance.
Another solution is to cache obstack buffers,
if you use them often.
   Besides, a compacting GC would solve any fragmentation problem.

>> However, if you're doing anything high-level where lines count,
>> I don't see why you would need, and a list of lines might do better.
>>    Anyway, if you're getting things from a file, then lines do not matter,
>> and you may as well get its size before slurping it in a block,
>> or, better mmap() it into memory.
>>    If you're being interactive, then this is a lame program,
>> and intelligence is foreign to it whatsoever.
>   Two points:  you missed the point, and quit with calling everything
> 'lame'.  It gets old real quick, especially without providing supporting
> evidence as to why it's 'lame'.
I perfectly saw the point, and gave the perfect answer: GNU obstack.
Your only answer was to be upset about such programs,
not giving any way to support them,
and implying that they should be forbidden by imposing limits.
   Then english is not my first language,
and I don't intend to go look for more deprecatives than "lame",
because lame things don't deserve the hassle.
Feel free to teach me more.

>>> The problem is, to compact memory, you have to move blocks around.
>> I don't quite see why that's a problem.
>> There are even real-time compacting GC's
>> (malloc() seldom is, reference counting never is),
>> and non-real-time seldom-moving GC's,
>> incremental GC's that are in the average faster than any malloc(), etc.
>   Memory is allocated somehow.  If I use malloc(), it's in reference to
> grabing memory.  Again you're getting caught on the details of the
> traditional implementation of malloc() on Unix not providing GC.  I never
> said such a thing.  Would it help if I did:
> 		sometype x = new sometype;
>   instead of:
> 		sometype x = malloc(sizeof(sometype));
> ?
>   Or even:
> 		sometype x = allocate_garbage_collectable_memory_big_enough_
> to_store_sometype;
> ?
You completely miss the point,
which was not the name used by the low-level allocator,
but the high-level interface to the allocator.
Again, you're playing words instead of concepts.
   When I say malloc(), I mean manual allocation/disallocation.
If that's the whole memory allocation interface,
then you just CAN'T do any kind of GC not to tell about compacting GC
with it (though you can implement a compacting GC system on top of it
by first grabbing a large chunk of memory),
while to fight the swiss-cheese syndroms,
such allocators often go through worst-cases that have high latency.
   As for Reference-counting, it is the least real-time method ever,
because freeing a deep linked list of objects could mean
traversing all memory.
   Using a tracing GC,
with the interface being that all GC roots should be registered,
compacting becomes possible as well as real-time response.

>>    Compacting GC's are, in that worst case, infinitely faster than
>> any non-compacting GC, because in that worst case,
>> these never succeed due to swiss chess syndrom.
>   They are not infinitely faster.
> They are better in the worst case, since
> they avoid the swiss cheese syndrome. Nothing is "infintely faster" since
> there is a finite speed to everything (186,282.39 mps/298,052 kps).
Yes, they *are* infinitely faster.
Anything that works is infinitely faster than anything that doesn't.
In the worst case, allocators that are subject to the mad cow syndrom
(a name that suits better than swiss-cheese, as far as memory is concerned)
just WON'T WORK. The program will fail, not finish.
Time for the program to finish successfully: +infinity. Speed: 0.
The program with a compacting GC will finish in a finite time. Speed: >0.
The ratio of their speed is in indeed +infinity.

If you wish to be removed from the list,
please tell me by private e-mail to

--    ,                                         ,           _ v    ~  ^  --
-- Fare -- -- Francois-Rene Rideau -- +)ang-Vu Ban --
--                                      '                   / .          --
Join the TUNES project for a computing system based on computing freedom !
                 TUNES is a Useful, Not Expedient System
WWW page at URL: ""