overhead (was: Could we use multiple stacks?)
Sean 'Captain Napalm' Conner
spc@gate.net
Wed, 29 May 1996 17:59:55 -0400 (EDT)
On some network somewhere in cyberspace, Jecel Assumpcao Jr transmitted:
> Sean 'Captain Napalm' Conner wrote:
>
> [here are some parts of your example (I am trying to make this short)]
> > ; get first node in run queue
> > ; and start that
> > mov eax,[runqueuehead] ; 4
> > mov ebx,[eax.node.prev] ; 4
> > mov ecx,[eax.node.next] ; 4
> > mov [ebx.node.next],ecx ; 2
> > mov [ecx.node.prev],ebx ; 2
> > ; next two can be commented out for extra speed if required
> > xor [eax.node.prev],ebx ; 6
> > xor [eax.node.next],ecx ; 6
> > ; make this the running task
> > mov [thistask],eax
> > ; restore all state information
> > mov ebx,[eax.pagetable] ; 4
> > mov cr3,ebx ; 5
> > mov esp,[eax.stack] ; 4
> > pop ebp ; 4 [2]
> > pop edi ; 4
> > pop esi ; 4
> > pop edx ; 4
> > pop ecx ; 4
> > pop ebx ; 4
> > pop eax ; 4
> > ; we're done! Well, how about that! All's that left is to
> > ; return with IRETD, and I'll use the worse case senario for
> > ; that!
> > iretd ; 82
>
> > [2] This sequence (poping of registers) takes 28 cycles, while the POPAD
> > takes 24. While we saved 4 cycles in pushing the registers (if we
> > push esp to simulate the effect of PUSHAD (and cut our savings down
> > to 2 cycles)) it looks to be lost here. Again, a test is needed
> > with the various combinations to get the best real-world timings.
> >
> > We now have a total running time (including the extra two instructions
> > that can (if required) be cut and including the 99 cycles interrupt
> > overhead) of 292 cycles. On a 33MHz system, that's 8,848 nSecs, well within
> > my 10uSec overhead handling case 8-) and will allow us to switch 113,013
> > tasks/second (if that's all we're doing, which makes for some good bench
> > marks).
>
> This is what I would like to comment about. The sequence of
> pops seems pretty cheap, but imagine that they are happening
> in a 133MHz Pentium and are causing cache misses (which would
> be typical if the thread hasn't been one of the last ones
> executed). I don't have any manuals with me so can't give you
> the numbers, but I'll bet that the number of simple instructions
> (like ADD EBX,ECX) that you could execute in the same number
> of cycles (which will certainly be way over 28 clocks) will
> be rather large.
>
On the 386 (or higher) there are only two ways to pop registers off the
stack:
pop <register>
or
mov <register>,[esp]
add esp,4
(forgive, but all my 386 references are at home, and I'm not. I think I got
the order correct)
But it doesn't matter which method I use, because there may still be a
cache miss. And if there's a cache miss, it doesn't matter if the
instruction is a POP or an ADD, there still a penalty, so I'm not sure what
point you are making.
If you are saying that cache misses are more frequent in a preemptive
system than cooperative, then I don't think that's true, as I can blow the
cache all to hell just as easily in a cooperative system as in a preemptive
system by randomly referencing memory (and in a cooperative system, you
yeild, it's still possible that the thread that gets control next isn't in
the cache).
> And certain processors have special registers or register
> banks for interrupt routines (Alpha and ARM, for example)
> and so the overhead of handling interrupts via full threads
> is even greater.
>
But the routine I presented above IS the interrupt handler. All it did
was cause a task switch to handle the event (the interrupt handler doesn't
handle it, it's just there to make sure I start a response within 10uSecs).
> > ESP. Stack pointer. Perhaps you're familiar with it? Just point ESP off
> > to a new region in memory. Each "continuation" has it's own copy of the
> > stack pointer, whether it's ESP, A7, S, R14 or R30 (depending upon the
> > architecture). Am I missing something here?
>
> You are right, in general. But if you allow multiple executions
> of a continuation then you have to copy the stack so one execution
> doesn't mess up the rest. Some systems use the MMU to create
> copy-on-write stacks so you don't have the overhead of copying
> except in the cases where you really need it.
>
I get the feeling that a continuation is similar to (although not quite
the same as) fork() under Unix (only in this case, it's more fine grained).
I'm still not sure what a continuation exactly is, so either a formal
definition (with example) or a pointer to such would be nice, so I at least
know what is being talked about.
-spc (But wouldn't you just give each continaution its own stack?)