Could we use multiple stacks?

Sean 'Captain Napalm' Conner spc@gate.net
Mon, 27 May 1996 18:50:39 -0400 (EDT)


On some network somewhere in cyberspace, Francois-Rene Rideau transmitted:
> 
> Dear Sean,
>   please do read the docs about GC and obstacks.
> For GC, read the FAQ,
> http://www.centerline.com/people/chase/GC/GC-faq.html
> For GNU obstack, just do read /usr/include/obstack.h on any linux box
> (I can send it to you), and/or read the obstack.[hc] in the GNU libc.
> 
  Okay, I will (I haven't had the time since I received this).

> Notes:
>    to me, a thread is a unit of execution.
> It may consist in just what you say, or be much larger,
> include lots of state, etc.
> Unix processes are threads to me. Just coarse-grained threads.
>    I hope we're not in for word-fight.
> 
  No, now that you have actually defined what you mean by thread (everything
from what I consider a thread to what I consider a process).

> > The only difference between a
> > cooperative and a preemptive system is how other threads gain execution
> > time.
> > In a cooperative, they voluntarily give up the execution time,
> > whereas in a preemptive system, it is forcibly removed from them.
> This "mere" difference implies that only cooperative threads can decide
> WHEN to yield execution, so that only when the thread is in a safe state
> can GC or other migration service happen that require strong consistency
> constraints.
> 
  A thread in a preemptive system reaches a "safe state" when making a
"system" call (and don't take me to mean the typical Unix implementation of
a system call into a monolithic kernel - I'm talking about the standard,
provided routines that come with Tunes, for example).

  You may not realize this, but it's actually the RARE program that IS
preempted on a preemptive system (most of the time, making a 'system' call
causes the equivalent of a 'yield').  Those that ARE preempted are usually
in a long, involved calculation, or hung.

> > The other
> > difference may be the amount of state to be saved, but that's not important
> > right now.
> It's also a great performance factor.
> 
  And I suspect that you don't have an idea of how LITTLE information
actually needs to be saved in a preemptive system.

> >   If I'm wrong, please tell me how it DOES work.
> It works because when a GC is needed, I know that all threads
> are in a satisfactory state. No temporary calculation is done
> that will break its assumptions (e.g. registers with non-compliant
> values, uninitialized data on heap, etc).
> 
  See above.

> >   Where's the problem?
> >
> >   Or if I'm wrong, please tell me how it DOESN'T work (or what I'm missing).
> >
> It doesn't work, because threads were preempted in a "dirty" state,
> where they could not comply with the inter-thread GC requirements.
> A pointer was used in %eax, which the GC does not recognize,
> while a value was used in %ebx, during an intermediate calculation,
> that the GC will think is a pointer. The shared heap will contain
> uninitialized pointers, which will make the GC go to hell,
> and Curly and Moe might be in a race for heap space.
> Too bad.
>    Of course, I could have semaphores for the GC, and require that
> every thread will eventually get the GC semaphore,
> so I can synchronize them all and the GC can begin.
> Hey! That's precisely what is called "cooperative multithreading". Wow!
> 
  But you still WANT the ability for higher priority threads to PREEMPT
lower priority threads.  And I proved that one can do cooperative tasking
in a preemptive system.

  So it's best for you to do a preemptive system, because you'll end up
doing that anyway.
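To make that concrete, here is a rough sketch (my own toy code, in Python, with all the names invented) of cooperative tasks layered on top of any underlying scheduler: each task runs until it voluntarily yields, so between yields every task sits at a known-safe point where a GC could run:

```python
# Sketch: cooperative tasks modeled with Python generators.  Each task
# runs until it voluntarily yields; between yields every task is parked
# at a known-safe point, so a "GC" can run without seeing dirty state.

def task(name, steps, log):
    for i in range(steps):
        log.append((name, i))   # do some work
        yield                   # voluntary yield -- a safe point

def run(tasks, gc_log, gc_every=2):
    ready = list(tasks)
    switches = 0
    while ready:
        t = ready.pop(0)        # take the task at the head of the queue
        try:
            next(t)             # run it until its next yield
            ready.append(t)     # then park it at the tail
        except StopIteration:
            pass                # task finished; drop it
        switches += 1
        if switches % gc_every == 0:
            gc_log.append(switches)  # every task is at a safe point here

log, gc_log = [], []
run([task("A", 2, log), task("B", 2, log)], gc_log)
```

The point being: nothing here cares whether the host system is preemptive or not.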

> >> Real-Time and cooperative are NOT contradictory.
> >>
> >   And I claim they are.
> >
> We'll see.
> Feel free to propose *and implement* a concurrent design for Tunes,
> and/or develop another project. But then, you should rather take a look
> at VSTa, QNX, and L4.
> 
  I've taken a look at QNX (used it at a company I did some work for) and I
really liked it (microkernel, message-based, and one of the fastest OSes I've
used.  And guess what?  It's preemptive, and even running X, it was faster
than MS-Windows).

  I've also done work under cooperative environments, and the amount of
overhead I had as a programmer appalled me.  Under one system, the amount of
hacking required to get decent performance (actually, it was impressive)
led to spaghetti code the likes of which I haven't seen since BASIC (and this
wasn't coded in BASIC), only it was worse: poorly understood, hard to
change, and with NO documentation other than what (sparse) comments existed
in the code.

  Granted, you state that I'll be shielded from such details by the
compiler, but the performance goes to hell, and in a pathological case (it's
the only process running) it STILL has the overhead of checking.
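A sketch of that pathological case (my own toy code, in Python; the counter merely stands in for the real cost of each check): even when this is the only task in the system, the compiled-in yield check still executes on every iteration of the hot loop:

```python
# Sketch: the cost of compiled-in yield checks.  Even when this is the
# only "task" in the system, the check in the inner loop still runs on
# every single iteration.  The counter stands in for the check's cost.

checks = 0
should_yield = False   # never set: we are the only task running

def maybe_yield():
    global checks
    checks += 1        # overhead paid unconditionally
    if should_yield:
        pass           # a real system would switch tasks here

total = 0
for i in range(10_000):
    total += i
    maybe_yield()      # check inserted by the (hypothetical) compiler
```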

> >   I can't just stop Thread B, because that would be ... oh ...
> > PREEMPTION, now wouldn't it?
> I can stop thread B, and it's called an interrupt.
> The thread high priority interrupt-driven threads that preempt
> the normal thread are not of equal priviledge,
> but still, normal threads must cooperate with each other,
> high-priority threads must all the more to cooperate with each other,
> and communication between high-P and normal threads have
> even more drastic requirements that threads must cooperate with.
> That's why I called my proposal "semi-preemptive". If you have
> a better name, give it. Still, threads must cooperate a lot.
> 
  But it's still preemption, which implies saving state on behalf of the
preempted thread.  If you have that, then why even bother with the
cooperative crap?  You can STILL do cooperative tasking under a preemptive
system (and I did present work to this very list that did just that).

> > Or do you get around that by calling some
> > special code in Thread B that causes it to PREEMPT itself?  Oh, real
> > COOPERATIVE there.
> If your interrupt-driven routine decides that B should terminate
> and switch to C, then it will indeed call a routine that B defined,
> that will have B reach next safe point in bounded time and switch thread.
> The default routine is just to change a RET into NOP, and wait for
> the thread to jump into. Other routines can be supplied.

  Then that's more work for the programmer to do, when the system can do it
for me automatically (and correctly).

> If you won't call that cooperative, then please give me a better name.
> 
  "Gross hack"?  

> >[after lots of fanciful figures]
> >   To even hope for a response time of XuSecs in a cooperative system, you'll
> > have to insert checks every X/2usecs (assuming you can), which doubles the
> > overhead.
> Exactly. So in optimized code, we only need have two checks per switch,
> which ridiculous cost is more than covered by all the state that
> switching will not have had to save as a preemptive switch would have to,
> plus we gain the ability of having interthread GC.

  You seem to think that saving state in a preemptive system is exceedingly
costly.  Why?  Because you have to save tables and tables of information
(like page tables)?  I hate to tell you this, but it ain't so.  For a 386:

  A thread/process is running.  The timer interrupt denoting a task switch
occurs, so, being fair, I'll let this take 100 cycles to enter the interrupt
handler.  We pick up from there:

	; EIP:EFLAGS already on stack (as part of the 100 cycles)
	; interrupts disabled
		push	eax			; 2	[1]
		push	ebx			; 2
		push	ecx			; 2
		push	edx			; 2
		push	esi			; 2
		push	edi			; 2
		push	ebp			; 2
	; no need to push esp
	; get task control block
		mov	eax,[thistask]		; 2
		mov	[eax.stack],esp		; 2
	; save paging tables (in effect, save ALL state information
	; saved in memory belonging to this thread/process)
	; NOTE FARE:  THIS ISN'T EXPENSIVE!
		mov	ebx,cr3			; 6
		mov	[eax.pagetable],ebx	; 2
	; get last node in run queue (double link list)
	; link this task to end of run queue
		mov	ebx,[runqueuetail] 	; 4
		mov	ecx,[ebx.node.next]	; 4
		mov	[eax.node.next],ecx	; 2
		mov	[eax.node.prev],ebx	; 2
		mov	[ecx.node.prev],eax	; 2
		mov	[ebx.node.next],eax	; 2
	; get first node in run queue
	; and start that
		mov	eax,[runqueuehead]	; 4
		mov	ebx,[eax.node.prev]	; 4
		mov	ecx,[eax.node.next]	; 4
		mov	[ebx.node.next],ecx	; 2
		mov	[ecx.node.prev],ebx	; 2
	; next two can be commented out for extra speed if required
		xor	[eax.node.prev],ebx	; 6
		xor	[eax.node.next],ecx	; 6
	; make this the running task
		mov	[thistask],eax		; 2
	; restore all state information
		mov	ebx,[eax.pagetable]	; 4
		mov	cr3,ebx			; 5
		mov	esp,[eax.stack]		; 4
		pop	ebp			; 4	[2]
		pop	edi			; 4
		pop	esi			; 4
		pop	edx			; 4
		pop	ecx			; 4
		pop	ebx			; 4
		pop	eax			; 4
	; we're done!  Well, how about that!  All that's left is to
	; return with IRETD, and I'll use the worst-case scenario for
	; that!
		iretd				; 82

[1]	This sequence (pushing of registers) takes 14 cycles.  A PUSHAD
	will take 18.  In reality one might want to time the two for a 
	more accurate result.  The PUSHAD takes longer, but it allows the 
	pre-fetch queue time to catch up, while the individual PUSHes may
	cause it to flush faster, resulting in slower code.

[2]	This sequence (popping of registers) takes 28 cycles, while the POPAD
	takes 24.  While we saved 4 cycles in pushing the registers (if we
	push esp to simulate the effect of PUSHAD, cutting our savings down
	to 2 cycles), it looks to be lost here.  Again, a test is needed
	with the various combinations to get the best real-world timings.

  We now have a total running time (including the extra two instructions
that can (if required) be cut, and including the 100 cycles of interrupt
overhead) of 295 cycles.  On a 33MHz system, that's 8,939 nSecs, well within
my 10uSec overhead handling case 8-) and will allow us to switch 111,864
tasks/second (if that's all we're doing, which makes for some good
benchmarks).
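For anyone who'd rather read the queue juggling at a higher level, here is a model of it (my own sketch, in Python; the node layout mirrors the [eax.node.next]/[eax.node.prev] fields above): the outgoing task is linked onto the tail of the run queue, and the head is unlinked to become the new current task:

```python
# Sketch: the run-queue manipulation the assembly above performs,
# modeled as a doubly linked list with a sentinel node.  A switch
# parks the outgoing task at the tail and takes the task at the head.

class Node:
    def __init__(self, name):
        self.name = name
        self.prev = self.next = self   # initially self-linked

def push_tail(sent, node):
    # link node just before the sentinel (i.e., at the end of the queue)
    node.prev, node.next = sent.prev, sent
    sent.prev.next = node
    sent.prev = node

def pop_head(sent):
    # unlink and return the node just after the sentinel (the head)
    node = sent.next
    sent.next = node.next
    node.next.prev = sent
    node.prev = node.next = node
    return node

def switch(current, sent):
    push_tail(sent, current)   # outgoing task to the back of the queue
    return pop_head(sent)      # head of the queue becomes current

sent = Node(None)              # empty run queue
current = Node("A")            # the running task
for name in ("B", "C"):
    push_tail(sent, Node(name))

order = []
for _ in range(6):             # round-robin through A, B, C
    current = switch(current, sent)
    order.append(current.name)
```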

  If you can make cooperative tasking work as fast, then more power to you,
but seeing how a higher priority thread will "preempt" (ahem) lower priority
threads, you'll have to do the same thing ANYWAY so why not just go ahead
and use preemption (seeing how you can do cooperative multitasking in a
preemptive system anyway!)?

  And Fare, it's important you realize that the 'mov ebx,cr3/mov cr3,ebx'
fragments preserve the external CPU state of a thread (no "copying kilobytes
of state, much less bytes of state" required).  That's 11 cycles.

>    Time-critical code can be optimized, which removes any problem.
> Other code can live with yet another 1-5% overhead.
> 
  "Yet another"?  If we're losing 1-5% to cooperative overhead, what other
overhead do you have?

> >>    Anyway Real-Time threads require cooperation *with each other*,
> >> as they lose any interest if they are not guaranteed to execute
> >> until completion. Of course they have priority over non-real-time threads,
> >> and need not cooperate with these.
> >>
> >   So effectively, you HAVE a preemptive system.
> >
> Call it as you like it. I fsck names.
> 
  Pardon?

> >> A kernel-based scheduler requires at least two task switches
> >> (400 cycles), and won't allow interthread GC.
> >>

  I've just shown a kernel-based scheduler doing it in under 300 cycles,
requiring NO task switches.

> >   How do you figure that?  The task switcher under AmigaDOS only required
> > one task switch (to the new task), but that's because switching to kernel
> > mode didn't require a task switch.
> >
  And you didn't respond to this.

> >   You seem to want rings, where threads within a ring are cooperative, but
> > between rings are preemptive.
> >
> >   You therefore have a preemptive system.
> >
> If you say so, it must be true.
> I just don't care about the name;

  You'd best care about the name, since you have to communicate to other
programmers what you are trying to do, and to avoid the arguments that we're
having right now.  If you can't communicate effectively, then it doesn't
matter if you have the best system in the world; no one will understand you
well enough to use it.

> but it sure is not the same "preemptive system" as with Unix,
> and threads are sure meant to "cooperate" in many ways.
> I'm ready to say what you say, as long as you let me do as I want.
> 
  I'm still waiting for you to do it (or see examples, like I've done).

> >> Yes, you know, the unix model, where every thread has a TSS,
> >
> > TSS? What's a TSS? [...]
>    Throughout all these messages, I've been using the 486 architecture
> as the standard example, as that's what we're trying to initially
> implement Tunes for (besides OTOP). Generalizing, adapting, or dropping
> statements about the 486 to other architectures is an exercice left to the
> reader.
> 
  But the tone of your comment led me to believe that you believe in "all
the World's a PC (TM)".

> > There's no reason why a
> > Unix implemenation for the 386 HAS to use a TSS for every thread (it has to
> > use at least one to go into PM, but that's it).
> >
> Just efficiency, in the case you're preempting coarse-grained threads
> (Unix calls them "process"), as you must save all this information,
> and much more.
> 
  Well, can you explain to me WHAT exactly has to be saved, and HOW?  Unless
I'm terribly mistaken, the code I presented above will do just that (and
doesn't save nearly the amount of state that going through a TSS will).

> >> plus lots of other tables. The model that wastes space and time,
> >
> >   Some of those tables provide virtual memory support.
> >
> No. There are lists of file descriptors, of mmap() tables,
> of signal handlers, of semaphores and suches, of FPU registers,
> and various device-dependent states, that should be remembered
> even though the thread will never use it.
> 
  I can't even begin to understand what you are talking about, unless you
mean that file descriptors, mmap() tables, signal handlers, semaphores, FPU
registers, device driver states and such are stored in fixed memory
locations and have to be copied on each task switch.  AND THAT JUST ISN'T
THE CASE!  All of that is external to the CPU and need not be saved (well, a
reference must be saved, but that's hardly the same as copying all of it).
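A sketch of the point (my own toy code, in Python; all the field names are made up): the "external" per-task state lives in that task's control block, and a task switch just rebinds one reference; nothing gets copied:

```python
# Sketch: per-task "external" state (file descriptors, signal handlers,
# page tables...) lives in the task's own control block.  A switch is
# one reference swap -- none of the state itself is copied.

class Task:
    def __init__(self, name):
        self.name = name
        self.fds = {}              # this task's file descriptor table
        self.handlers = {}         # this task's signal handlers
        self.pagetable = object()  # stands in for a CR3 value

current = None

def switch_to(task):
    global current
    current = task                 # one reference swap; no copying

a, b = Task("A"), Task("B")
a.fds[0] = "console"

switch_to(a)
fds_before = current.fds           # remember A's fd table object
switch_to(b)                       # A's state untouched while B runs
switch_to(a)
```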

> > You CAN get rid of
> > those tables, but then you're stuck with only the physical RAM in your
> > machine.
> >
> No. I can have each thread get exactly what it needs, not more, not less,
> by being cooperative.
> 
  Well, I can have each thread get exactly what it needs, not more, not
less, by being preemptive.  So?

> >>>>>	But stacks do their own GC.  Values are pushed onto it for use, and when
> >>>>> no longer needed, popped off.  What's the problem (besides keeping track of
> >>>>> dynamically allocated globs of memory)?
> >>>>>
> >>>> The problem is that the stack may contain roots for GC'ed objects.
> >>>> Surely you should read about the implementation of LISP/Scheme/ML systems...
> >>>>
> >> [Comments showing that you don't get the point]
> >> The stack may contain pointers to objects that the GC would reclaim
> >> otherwise. Hence the GC must scan the stack and mark as used any
> >> object pointed to by it.
> >>
> >   I don't see how copying the stack will solve that.
> >
> I was explaining that while stacks arguably do "their own GC",
> that doesn't release them from the requirements of the system-wide GC.
>    Copying the stack is a completely unrelated trick.
> It has to do with not having to reserve one stack page for every continuation,
> where there can be tens of thousands of continuations,
> and where continuations might be activated several times.
> Of course continuations that we know are uniquely activated,
> and that need fast switching time, can have reserved stack pages.
> 
  I'm still lost.  

  ESP.  Stack pointer.  Perhaps you're familiar with it?  Just point ESP off
to a new region in memory.  Each "continuation" has its own copy of the
stack pointer, whether it's ESP, A7, S, R14 or R30 (depending upon the
architecture).  Am I missing something here?
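Here's the idea as a toy model (my own sketch, in Python; a list stands in for each stack region): switching contexts just changes which stack, and thus which stack pointer, is active:

```python
# Sketch: each "continuation" owns its own stack region; switching
# contexts just changes which stack is active -- the software analogue
# of reloading ESP to point at a different region of memory.

class Context:
    def __init__(self):
        self.stack = []            # this context's private stack

current = None

def activate(ctx):
    global current
    current = ctx                  # reload the "stack pointer"

def push(v):
    current.stack.append(v)

def pop():
    return current.stack.pop()

a, b = Context(), Context()
activate(a); push(1); push(2)      # work on a's stack
activate(b); push(10)              # b's pushes never touch a's stack
activate(a)
top = pop()                        # a's stack is exactly as it was left
```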

> >> I think you have no idea of what GC is.
> > 
> >   Well, I think you have no idea how an OS really works, but that hasn't
> > stopped you ...
> >
> I know the grungy details of Apple DOS 3.3, MSDOS, Linux, VSTa,
> and the general design of many more OSes, plus all the traditional OS theory,
> and lots of ideas stolen from research articles on the net.
> Feel free to point me to places where to learn "how an OS really works"...
> 
  You got me.  And here I thought I understood how OSes work, from dealing
with the internals of OS-9, OS/2, MS-DOS, Linux, AmigaDOS, and some passing
knowledge of other Unix variants, QNX and VMS, and concluded that you are
either:

	1. totally ignorant of OS design
	2. being intentionally obscure in your terminology
	3. being unintentionally obscure in your terminology
	4. some of the above
	5. all the above

  I guess there IS a difference between the mindsets of European and
American hackers.

> >   Well, given the example I used, how can it be done more intelligently?
> >
> > 	String s;
> > 	String t;
> >
> > 	while((s = inputstream.readline()) != null)
> > 	{
> > 	  t += s;
> > 	}
> If you really need implement strings as contiguous blocks, use GNU obstack.

  Strings, arrays, who cares what it is?  It's an example, for crying out
loud.  Lots of dynamic allocations and deallocations of small (under 1K)
blocks of memory, which tend (in the absence of GC) to fragment memory to
hell and back.
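To spell the pattern out (my own sketch, in Python, standing in for the Java-ish example quoted above): repeated concatenation allocates a fresh, slightly larger block on every line, which is exactly the allocator churn that fragments a non-compacting heap; accumulating the lines and joining once does a single large allocation at the end:

```python
# Sketch of the pattern under discussion: lots of small allocations
# accumulating into one growing block.  Repeated concatenation makes a
# fresh (slightly larger) block on every iteration -- the churn that,
# without GC or compaction, riddles the heap with holes.  Building a
# list and joining once defers to a single big allocation.

lines = [f"line {i}\n" for i in range(100)]   # stand-in for readline()

t = ""
for s in lines:        # the loop from the example above
    t += s             # new, larger block allocated every iteration

u = "".join(lines)     # one sizing pass, one allocation
```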

> However, if you're doing anything high-level where lines count,
> I don't see why you would need, and a list of lines might do better.
>    Anyway, if you're getting things from a file, then lines do not matter,
> and you may as well get its size before slurping it in a block,
> or, better mmap() it into memory.
>    If you're being interactive, then this is a lame program,
> and intelligence is foreign to it whatsoever.
> 
  Two points:  you missed the point, and quit with calling everything
'lame'.  It gets old real quick, especially without providing supporting
evidence as to why it's 'lame'.

> > The problem is, to compact memory, you have to move blocks around.
> >
> I don't quite see why that's a problem.
> There are even real-time compacting GC's
> (malloc() seldom is, reference counting never is),
> and non-real-time seldom-moving GC's,
> incremental GC's that are in the average faster than any malloc(), etc.
> 
  Memory is allocated somehow.  If I use malloc(), it's in reference to
grabbing memory.  Again you're getting caught up in the details of the
traditional implementation of malloc() on Unix not providing GC.  I never
said such a thing.  Would it help if I did:

		sometype x = new sometype;

  instead of:

		sometype x = malloc(sizeof(sometype));

?

  Or even:

		sometype x = allocate_garbage_collectable_memory_big_enough_to_store_sometype;

?

>    Compacting GC's are, in that worst case, infinitely faster than
> any non-compacting GC, because in that worst case,
> these never succeed due to swiss chess syndrom.
> 
  They are not infinitely faster.  They are better in the worst case, since
they avoid the swiss cheese syndrome.  Nothing is "infinitely faster", since
there is a finite speed to everything (186,282.39 mps/299,792 kps).

  -spc (c -- not just a good idea.  It's the law!)