overhead (was: Could we use multiple stacks?)

Sean 'Captain Napalm' Conner spc@gate.net
Wed, 29 May 1996 17:59:55 -0400 (EDT)

On some network somewhere in cyberspace, Jecel Assumpcao Jr transmitted:
> Sean 'Captain Napalm' Conner wrote:
> [here are some parts of your example (I am trying to make this short)]
> >         ; get first node in run queue
> >         ; and start that
> >                 mov     eax,[runqueuehead]      ; 4
> >                 mov     ebx,[eax.node.prev]     ; 4
> >                 mov     ecx,[eax.node.next]     ; 4
> >                 mov     [ebx.node.next],ecx     ; 2
> >                 mov     [ecx.node.prev],ebx     ; 2
> >         ; next two can be commented out for extra speed if required
> >                 xor     [eax.node.prev],ebx     ; 6
> >                 xor     [eax.node.next],ecx     ; 6
> >         ; make this the running task
> >                 mov     [thistask],eax
> >         ; restore all state information
> >                 mov     ebx,[eax.pagetable]     ; 4
> >                 mov     cr3,ebx                 ; 5
> >                 mov     esp,[eax.stack]         ; 4
> >                 pop     ebp                     ; 4     [2]
> >                 pop     edi                     ; 4
> >                 pop     esi                     ; 4
> >                 pop     edx                     ; 4
> >                 pop     ecx                     ; 4
> >                 pop     ebx                     ; 4
> >                 pop     eax                     ; 4
> >         ; we're done!  Well, how about that!  All that's left is to
> >         ; return with IRETD, and I'll use the worst-case scenario for
> >         ; that!
> >                 iretd                           ; 82
> > [2]     This sequence (popping of registers) takes 28 cycles, while the POPAD
> >         takes 24.  While we saved 4 cycles in pushing the registers (if we
> >         push esp to simulate the effect of PUSHAD (and cut our savings down
> >         to 2 cycles)) it looks to be lost here.  Again, a test is needed
> >         with the various combinations to get the best real-world timings.
> > 
> >   We now have a total running time (including the extra two instructions
> > that can (if required) be cut and including the 99 cycles interrupt
> > overhead) of 292 cycles.  On a 33MHz system, that's 8,848 nSecs, well within
> > my 10uSec overhead handling case 8-) and will allow us to switch 113,013
> > tasks/second (if that's all we're doing, which makes for some good
> > benchmarks).
> This is what I would like to comment about. The sequence of
> pops seems pretty cheap, but imagine that they are happening
> in a 133MHz Pentium and are causing cache misses (which would
> be typical if the thread hasn't been one of the last ones
> executed). I don't have any manuals with me so can't give you
> the numbers, but I'll bet that the number of simple instructions
> (like ADD EBX,ECX) that you could execute in the same number
> of cycles (which will certainly be way over 28 clocks) will
> be rather large.
  On the 386 (or higher) there are only two ways to pop registers off the
stack:

		pop	<register>

or

		mov	<register>,[esp]
		add	esp,4

(forgive me, but all my 386 references are at home, and I'm not.  I think I
got the order correct)

  But it doesn't matter which method I use, because there may still be a
cache miss.  And if there's a cache miss, it doesn't matter if the
instruction is a POP or an ADD; there's still a penalty, so I'm not sure
what point you are making.

  If you are saying that cache misses are more frequent in a preemptive
system than in a cooperative one, then I don't think that's true, as I can
blow the cache all to hell just as easily in a cooperative system as in a
preemptive system by randomly referencing memory (and in a cooperative
system, when you yield, it's still possible that the thread that gets
control next isn't in the cache).

> And certain processors have special registers or register
> banks for interrupt routines (Alpha and ARM, for example)
> and so the overhead of handling interrupts via full threads
> is even greater.
  But the routine I presented above IS the interrupt handler.  All it did
was cause a task switch to handle the event (the interrupt handler doesn't
handle it; it's just there to make sure I start a response within 10uSecs).

> >   ESP.  Stack pointer.  Perhaps you're familiar with it?  Just point ESP off
> > to a new region in memory.  Each "continuation" has its own copy of the
> > stack pointer, whether it's ESP, A7, S, R14 or R30 (depending upon the
> > architecture).  Am I missing something here?
> You are right, in general. But if you allow multiple executions
> of a continuation then you have to copy the stack so one execution
> doesn't mess up the rest. Some systems use the MMU to create
> copy-on-write stacks so you don't have the overhead of copying
> except in the cases where you really need it.
  I get the feeling that a continuation is similar to (although not quite
the same as) fork() under Unix (only in this case, it's more fine-grained).
I'm still not sure what a continuation exactly is, so either a formal
definition (with an example) or a pointer to such would be nice, so I at
least know what is being talked about.

  -spc (But wouldn't you just give each continuation its own stack?)