[gclist] A problem with the Baker treadmill.

Paul R. Wilson wilson@cs.utexas.edu
Fri, 17 Jan 1997 11:18:45 -0600


>From majordom@iecc.com Tue Jan 14 17:06:18 1997
>
>James McCartney wrote:
>> >I can confirm that writing your own spinlocks is the
>> >right thing to do.
>> 
>> It doesn't matter if the lock is only for a few instructions if the
>> scheduler decides your time slice is up at that moment.
>> If you are trying to do anything real time on a multitasking system
>> you can't afford a spin lock. If a suspended thread holds the lock
>> then you will wind up spinning for your entire CPU time slice.
>> The only place where I could see using a spin lock would be on
>> a multiprocessor system where each processor runs single threaded.

Point of clarification:

I think what David is referring to is "competitive spinning", where
you spin briefly and then use a regular lock.  Anna Karlin did
some work on this approach and published some papers a couple
of years ago.

The "competitive argument" is that if your max spin time is say, equal
to the general lock time, then you'll never more than double the cost of
locking.  That is, the cost of doing it this way is always "competitive"
(within a constant factor) with general locking, and in the good cases
it's much cheaper.

In the worst case, where your lock is held by a non-running
thread, you uselessly spin for your full spin budget (one general-lock
time) and then really do the general locking, so you double your
worst-case cost.  In the good cases,
where it's a real parallel thread competing for the lock, and the lock 
comes free quickly, you'll spin very briefly and get the lock much faster.

(If I've misunderstood the issue, please correct me.)

The above scheme can be refined to adjust the tradeoffs.  E.g.,
if your locks only spin for a max of 1/3 the time of a general
lock, then you only increase your worst-case cost by 1/3.
If most of the time locks come free within that amount of time
you'll still get most of the benefit.  Where to set this tradeoff
depends on what you're trying to optimize, and what the empirical
distribution of lock wait times is in your system.
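
For concreteness, here's a rough sketch of how this might look in C
with POSIX threads.  The SPIN_TRIES bound is just a placeholder for
whatever max spin time you calibrate against your general lock cost:

    /* Minimal sketch of competitive spinning with POSIX threads.
     * SPIN_TRIES is a hypothetical tuning knob standing in for "max
     * spin time as a fraction of the general lock cost"; in practice
     * you'd calibrate it against measured lock times. */
    #include <pthread.h>

    #define SPIN_TRIES 100

    static void competitive_lock(pthread_mutex_t *m)
    {
        int i;

        /* Spin briefly, hoping the current holder releases the lock soon. */
        for (i = 0; i < SPIN_TRIES; i++) {
            if (pthread_mutex_trylock(m) == 0)
                return;             /* got the lock cheaply, no blocking */
        }

        /* Give up and pay for the general (blocking) lock. */
        pthread_mutex_lock(m);
    }

If a suspended thread holds the lock, the trylock loop fails SPIN_TRIES
times and you fall through to the blocking path, which is the bounded
worst case described above.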

>The devil is in the details, and I agree the tradeoffs can be made
>to go the other way.  Here are some microbenchmark results for a 4
>cpu SS10 @ 40Mhz, times in usecs:
>
>   Uncontested lock                   1.33
>   Uncontested semaphore              6.29
>   Uncontested spinlock               0.31
>   Lock ping-pong                   256.06
>   Semaphore ping-pong              254.04
>   Spinlock ping-pong                 2.03

Wow.  This looks pretty bleak.

These are the kinds of numbers it's good to see in instructions
or instruction cycles.  Microsecond numbers usually seem small,
but when you consider that a fast new machine executes 200-500
instructions per microsecond, these costs start to look very
expensive for a write barrier.

I'd guess the old 40 MHz SS10 processors only execute very roughly
50 or so instructions per microsecond.  Still, that means that the 
smallest number here is on the order of 15 instructions---not cheap 
to be doing in a write barrier.  (At least not in the usual-case
code.)  If one in 100 instructions (say) is a store requiring
this write barrier synch, you've lost 15 percent of your system
performance right there.  If these locks ping-pong very often,
it's much worse.
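
For the record, here's the back-of-the-envelope arithmetic, using the
same rough guesses as above (the 50 insns/usec figure and the 1-in-100
store frequency are just those guesses, not measurements):

    #include <stdio.h>

    int main(void)
    {
        double insns_per_usec = 50.0;  /* rough guess for one 40 MHz SS10 cpu */
        double spinlock_usec  = 0.31;  /* uncontested spinlock, from above    */
        double barrier_insns  = spinlock_usec * insns_per_usec;  /* ~15 insns */
        double store_freq     = 1.0 / 100.0;  /* one barriered store per 100 insns */

        /* Added barrier work as a fraction of total instructions: ~15%. */
        printf("write-barrier overhead ~= %.1f%%\n",
               barrier_insns * store_freq * 100.0);
        return 0;
    }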
  
The cost of anything but an uncontested spinlock appears to
be in the scores of instructions, or the hundreds, or even
thousands.  This suggests that you want to avoid synchronization
in the write barrier as much as possible, or it will be the main
cost of your GC---and a big fraction of your overall runtime.
(This is exactly what we observed with a preliminary, unoptimized
port of our GC to a high-performance multithreaded commercial
implementation of a strongly-typed language.)  Without a 
low-synchronization GC interface, you lose more in synch costs
than you gain in parallelism, so the whole thing's a waste.
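
(Just to illustrate the kind of thing I mean, here's a rough sketch of
one way to keep locking out of the usual-case barrier path: give each
mutator thread a private store buffer, and only synchronize when it
fills.  The names and sizes are made up, and this isn't necessarily
how our own GC interface does it.)

    /* Rough sketch: per-thread sequential store buffer, so the common-case
     * write barrier needs no locks or atomic operations.  SSB_SIZE and the
     * function names are illustrative only. */
    #include <stddef.h>

    #define SSB_SIZE 1024

    struct ssb {                    /* one of these per mutator thread */
        void **slots[SSB_SIZE];     /* locations that were stored into */
        size_t n;
    };

    static void flush_ssb_to_gc(struct ssb *b)
    {
        /* Slow path: a real collector would take a lock here and hand
         * the logged slots over; this stub just empties the buffer. */
        b->n = 0;
    }

    /* Usual-case write barrier: the store plus a few unsynchronized
     * instructions, with synchronization only on the rare overflow. */
    static void write_barrier(struct ssb *b, void **slot, void *newval)
    {
        *slot = newval;             /* the actual store */
        b->slots[b->n++] = slot;    /* thread-private log; no lock needed */
        if (b->n == SSB_SIZE)
            flush_ssb_to_gc(b);
    }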

Ping-ponging on general locks seems like a potentially horrendous 
problem given the above numbers---over 10,000 instruction cycles. (Can
that be right?)  Reasoning about locality seems tremendously important
for this system.  For real-time systems, this kind of consideration
can be paramount.  Uniprocessor locality is hard enough to reason
about, but sharing memory makes things very quirky.  

(For non-life-critical real-time systems, you can usually use a 
combination of simple reasoning, empirical testing, and conservatism
to get a reasonable assurance that your cache works well enough to count
on.  But with shared-memory multiprocessors, weird little accidents of
synchronization could make dramatic and unpredictable differences to
cache performance.  This applies to the actual program data as well
as the GC metadata.  Bleah.  Anybody out there have experience with 
such things?)
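
(One concrete accident of this flavor: two logically unrelated lock
words that happen to land on the same cache line will bounce that line
between processors even though no two threads ever contend for the
same lock.  Padding each lock out to its own line is the usual fix;
the 64-byte line size below is just an assumption about the machine.)

    /* Keep each lock on its own cache line so unrelated locks don't
     * ping-pong a shared line between processors.  CACHE_LINE is an
     * assumed line size, and the array holding these is assumed to
     * start on a line boundary. */
    #include <pthread.h>

    #define CACHE_LINE 64

    struct padded_lock {
        pthread_mutex_t m;
        char pad[CACHE_LINE - sizeof(pthread_mutex_t) % CACHE_LINE];
    };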

It would be very interesting to see similar numbers for newer
machines, like the Be Box and especially SMP Pentium Pro systems.
Has the shared memory hardware improved a whole lot compared to the
SS-10's, or have the processor speeds improved even more?

I've always had my doubts about automatic caching working acceptably
for anything but coarse-grained parallelism, unless processor speeds
top out and interconnect speeds get a chance to catch up.  Even then,
there's the speed of light in the way if it doesn't all end up on 
one chip.