Could we use multiple stacks?

Sean 'Captain Napalm' Conner
Thu, 23 May 1996 17:40:05 -0400 (EDT)

On some network somewhere in cyberspace, Francois-Rene Rideau transmitted:
> >>:Fare
> >:spc
> We'd already discussed all that hundreds of times!
> Let's sum up the point...
> >   Personally, I still don't see why you love the cooperative multitasking
> > model so much.
> Because you just can't do inter-thread GC without.
> Or else please explain me how.

  Well, the working definition I'm using is:

	thread: 	unit of execution, which includes:
		PC	- program counter
		CC	- condition codes
		regs	- other general use registers
		mem	- the memory address space the thread runs in (and
			  includes stacks, heaps, etc used by the thread)
			  Note that the same memory address space MAY be
			  shared with another thread

	process:	a conceptual entity, which includes:
		threads - at least one, possibly more
		resources (other than the CPU and memory space (which is
			  handled in the thread)) files, devices, etc.

  Now, you assume that GC between threads is impossible (or you don't know
how to do it) in a preemptive system.  The only difference between a
cooperative and a preemptive system is how other threads gain execution
time.  In a cooperative, they voluntarily give up the execution time,
whereas in a preemptive system, it is forcibly removed from them.  The other
difference may be the amount of state to be saved, but that's not important
right now.

  So, we have three threads, Larry, Curly and Moe.

  Cooperative system:

  Larry allocates memory, does some calculations and stores the results in
the memory.  It then calls a routine to pass the memory block to Curly then
yeilds the CPU.  Curly then runs.  He does some stuff with the memory block
and then is done with it (he drops the reference to it, say).  Then he
yeilds the CPU.  Moe does his thing (maybe poke Curly in the eyes or
something).  Then Larry comes in.  He doesn't care about the memory block
that he sent to Curly either (he's too busy having his hair pulled out to
really care).  So, that block of memory is a prime candidate for Tunes to
come along and GC it.  When Larry yeilds, Tunes comes along, and goes a
garbage collect on the memory spaces of Larry, Curly and Moe.

  Does the above change (or in any other way, differ) if the memory space
between Larry and Curly are shared?  I would assume that if they are shared,
then doing a GC on Larry will also affect Curly at the same time.

  If they differ, then a copy of the memory block is put into Curly's space
or that block is a segment of shared memory between the two.  Does that
change anything?

  If I'm wrong, please tell me how it DOES work.

  Preemptive system:

  Larry allocates memory, does some calculations and stores the results in
the memory.  He then calls a routine to pass the block to Curly and then
goes on his merry way.	He's oblivious to the fact that passing the memory
block to Curly caused him to yeild (lets say).	In fact, none of them are
aware of the fact they've been yeilded (or preempted).  Anyway, Curly gets
the memory, does stuff with it, then looses interest in it (looses the
reference).  Moe, however, currently does his thing (maybe calling the other
two nincompoops or something).

  At some point, Tunes comes along and says, Enough!  Let me Collect the
Garbage.  For each memory address space, all threads currently using that
space are blocked, then the GC is run in the memory space.

  Where's the problem?

  Or if I'm wrong, please tell me how it DOESN'T work (or what I'm missing).

> >> Remember:
> >> * interrupt-driven threads have their own stack/heaps anyway,
> >> * usual code is required to checkpoint regularly to allow real-time
> >>  cooperative threading
> >
> >   I'm sorry, but this is an oxymoron.  If you have "Real-Time" you have a
> > preemptive system.
> Real-Time and cooperative are NOT contradictory.

  And I claim they are.

> This just means that the time between the triggering event
> and the execution of the handler is bounded.
>    Cooperative threads can do this by either having bounded-time "recovery"
> routines to execute before to jump into the handler,
> or by ensuring that they will reach a safe point before next

  Say what?

  Example:  My Real-Time requirements require that the handler for event A
is executed within 10uSecs (10 microseconds).  Event sequence as follows:

	Thread B just passed a "check point".  It won't get to the next
	"check point" until another 50uSecs.

	Event A happens.  I have 10uSecs (or less) to handle this situation,
	or something bad happens.

  I can't just stop Thread B, because that would be ... oh ...
PREEMPTION, now wouldn't it?  Or do you get around that by calling some
special code in Thread B that causes it to PREEMPT itself?  Oh, real

  Now, upping the "check point" to be every 10uSecs will be a good
feat.  But assuming you can garuntee that (for instance, is this an option
to the compiler?  Insert a "check point" every XuSecs?  What if the program
was compiled for a Pentium and I run it on a slower 386?  Or it was compiled
with X=50, but I need for X to be 10?  Or is the code recompiled each time
it's run?  But I'm giving this to you), you still have large overhead.
Let's see, timings are for a 386 (since that's what I have timings for):

		CMP	[pause],opcode(RET)	; 5 cycles
		JZ	handler 		; 3 no jump, 7 otherwise
						; -
						; 8 (assuming continue)
						; 12 otherwise

  So, 8 cycles.  Let's assume a 33MHz clock, that's one cycle every 33nSecs
(nanoseconds).	264 nSecs.  On the 386, let's assume an average of 3 cycles
for every instruction, for 99 nSeconds.  10uSecs is 10,000nSecs, which gives
us about 100 instructions.  Take two off, and we have 98 instructions
between checkpoints.  2% overhead doesn't sound that bad (2% of the
instructions are checkpoints, but about 2.6% of execution time), but how
much overhead is there when switching threads?	If we do switch at each
"check point", then we get an (execution time) overhead of 4% (about a 150%
increase).  And this is assuming that the overhead to switch is nill!

  Why?	Assume we've just executed the CMP instruction and the event happens
(probably via an interrupt).  On the 386, the IRQ overhead is 59 cycles (if
in the same priviledge level, which I'm giving).  IRET is 38.  the JZ is 3.
The next CMP is 5, then a JZ of 7.  We then have:

		10,000		event happens - 10,000 before we blow
	-	 1,947		INT overhead
	-	 1,254		IRET
	-	    99		first JZ
	-	 9,736		before next "check point"
	-	   165		CMP
	-	   231		JZ to handle event
		-3,432		104 instructions

  We just went over a response time of 20uSecs!  Just the INT/IRET pair
account for 32% of the overhead in a response time of 10uSecs.

  To even hope for a response time of XuSecs in a cooperative system, you'll
have to insert checks every X/2usecs (assuming you can), which doubles the

>    Anyway Real-Time threads require cooperation *with each other*,
> as they lose any interest if they are not guaranteed to execute
> until completion. Of course they have priority over non-real-time threads,
> and need not cooperate with these.
  So effectively, you HAVE a preemptive system.

> >   But the check is per user process.  Bloat.
> >
> A simple checkpoint mechanism is 8 cycles on a 486:
> ...
> ...
> When you want faster code (3 cycles), you can have
> CMP [PAUSE],#opcode(RET)
  See above.

> A kernel-based scheduler requires at least two task switches
> (400 cycles), and won't allow interthread GC.
  How do you figure that?  The task switcher under AmigaDOS only required
one task switch (to the new task), but that's because switching to kernel
mode didn't require a task switch.

> >>  = a "recovery" routine other than "let me run until next checkpoint"
> >>   can be declared to the semi-preemptive scheduler,
> >			   ^^^^^^^^^^^^^^^
> >   Semi-preemptive?	Trying to have your cake and eat it too?
> >
> Not trying. Managing.
> I don't know any standard word for what I'm doing,
> so I invent my own.
  If each process calls yeild(), it's cooperative.  If a process (any
process) is stopped by another process, it's preemptive.  Period.

  You seem to want rings, where threads within a ring are cooperative, but
between rings are preemptive.

  You therefore have a preemptive system.

> >>   so that the program can reach a safe state from anywhere in the tight loop.
> >>
> >   Say what?  Under what conditions will a program have to reach a safe state
> > in a tight loop?
> >
> Under the condition that an interrupt happens that causes a thread-switch
> to a lower-or-equal priority thread.
  So what happens to the state of a thread if a thread-switch goes to a
higher priority thread?  You'll still have to reach a "safe state" (whatever
that is) in that case as well.

> >>>>>:Fare
> >>>>:Eric
> >>>:spc
> >>:Fare
> >:spc

> >>>	Correct me if I'm wrong, but all a continuation is is state information
> >>> such that you can restart the CPU with a saved state.  In a stack based
> >>> system, the only state you have is the stack pointers, so the fewer stacks
> >>> you have, the less state you have to save/restore.
> >>>
> >> You're wrong **in the general case**, that TUNES must implement:
> >> you must save not only the stack pointer, but the whole stack.
> >>	If you have coarse-grained threads, then you can reserve a big-enough
> >> stack for each of them, and all you need is save the pointer.
> >>	But if you allow fine-grained threads, and/or continuations that can be
> >> entered several times, not just once, then saving a stack pointer just
> >> won't work.
> >>
> >   Uh, Fare ... "coarse-grained threads"?
> >
> Yes, you know, the unix model, where every thread has a TSS,

  TSS?	What's a TSS?  <checks references>  Hmmm ... VAX?  No ... Motorola
68K?  .... No ... MIPS? ... nope ... SPARC? ... doesn't seem to be there ...
let me check ... ah!  The Intel 80386 and higher.  Fare, all the world ISN'T
a PCompatible (nor is it a Sun, but I digress).  There's no reason why a
Unix implemenation for the 386 HAS to use a TSS for every thread (it has to
use at least one to go into PM, but that's it).

> plus lots of other tables. The model that wastes space and time,

  Some of those tables provide virtual memory support.	You CAN get rid of
those tables, but then you're stuck with only the physical RAM in your

> that I don't want native Tunes code to follow.
> I wastes so much space that reserving additional stack space for each
> thread becomes cheap.
> This isn't the case if we're doing fine-grained threads
  How does this happen?

> > Multiple executions of a continuation?
> >
> A continuation *is* an unactivated thread.
> There's no reason why it can't be activated several times.
> E.g. every time some event happens (e.g. click on some button,
> clock alarm is reached, etc), you'll want to activate a new instance
> of that thread.
  Okay ... so?

	if (button_clicked == true)

  Nothing hard there ... as long as the code is re-entrant, there are no
problems I see ...

> >>>	But stacks do their own GC.  Values are pushed onto it for use, and when
> >>> no longer needed, popped off.  What's the problem (besides keeping track of
> >>> dynamically allocated globs of memory)?
> >>>
> >> The problem is that the stack may contain roots for GC'ed objects.
> >> Surely you should read about the implementation of LISP/Scheme/ML systems...
> >>
> [Comments showing that you don't get the point]
> The stack may contain pointers to objects that the GC would reclaim
> otherwise. Hence the GC must scan the stack and mark as used any
> object pointed to by it.
  I don't see how copying the stack will solve that.

> I think you have no idea of what GC is.

  Well, I think you have no idea how an OS really works, but that hasn't
stopped you ...

  And garbage collection is going through memory looking for allocated
memory that no longer is referenced, and freeing it.  A secondary objective
may be to compact the memory so that free space is coalessed into one block,
but that doesn't seem to be mandatory.

> [Long example against dynamically sized objects]
> >   But that's still besides the point.
> > The point is, you're fragmenting memory
> > quite bad.	Yes, I'm not loosing memory, but great, the overhead of keeping
> > the free list may be greater than the amount of free space I have.
> Besides the fact that things could be done much more intelligently
> and cleanly than you propose (have you ever heard about obstacks ?),

  Well, given the example I used, how can it be done more intelligently?

	String s;
	String t;

	while((s = inputstream.readline()) != null)
	  t += s;

> you seem not to be aware of the existence of COMPACTING GCs,
> techniques well mastered since the 70's,
> which even my plain old HP28 support
> (actually, that's where I learnt about GC for the first time),
> and that solve any memory fragmentation problem.
  They don't completely solve the memory fragmentation problem.  The problem
is, to compact memory, you have to move blocks around.	That changes the
starting address of the block of memory.  So you HAVE that problem to
contend with.  You can deal with that by using indirect pointers, and THAT
can't be compacted (although, since each entry is the same size, the
fragmentation problem isn't that bad.  The problem comes when you exceed the
size of the indirect pointer block).

>    Now your point is that dynamic allocation would be inefficient
> (while it needn't be)
> and that you want to just forbid it
> (which makes it infinitely inefficient).
  Not forbid it, just seeing if there's a better way.

  -spc (Oh boy!  Virtual memory!  Now I'm gonna make myself a really big
	RAM drive!)