[gclist] synchronization cost (was: Garbage collection and XML)

Chris Dodd chrisd@reservoir.com
Sat, 10 Mar 2001 00:43:24 -0800 (Pacific Standard Time)

> (1)    A few years ago, I had opportunity to do some measurements
> on CMPXCHG and if I remember correctly, the preceding figure is
> pretty close to what I got -- I was reading about 30-40 instructions per
> cmpxchg
> and more on cmpxchg8b on pentium II, 200 Mhz.  (Windows NT4.0).
Was this a simple cmpxchg or a lock+cmpxchg?  The former is much faster,
but of course isn't atomic on a multiprocessor machine.

> (2)   If one is trying to use a faster locking mechanism
> for the garbage collector on Windows NT (single process,
> multithreaded), one might consider EnterCriticalSection.
> For many cases, it is MUCH faster than
> using mutexes and other synchronization mechanisms.
> (likely to be based on CMPXCHG).
>     However, see
> http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
> (3) Does anyone know how EnterCriticalSeciton is implemented?
> I tried writing semaphores, mutexses, shared semaphores,
> based on CMPXCHG, CMPXCHG8, but my implementations
> were always much slower than EnterCriticalSection.  I had
> suspicion that it was not using CMPXCHG, and
> that was the reason why it could be so
> fast.  But I could never be sure.

Well, one important "optimization" that WinNT does is to have TWO
DIFFERENT versions of EnterCriticalSection -- one for uniprocessor
machines and one for multiprocessors.  The UP version is considerably
faster.  I'm pretty sure the way they do that is to not use lock
prefixes in the UP version, since on a UP x86 machine, individual
instructions are (mostly) atomic.  On an MP machine, you need a
lock prefix to make them atomic.  I certainly had no difficultly
writing a lock+cmpxchg based mutex that was faster than the MP
version of EnterCriticalSection.  The exact same code without the
lock prefix was faster than the UP version of EnterCriticalSection.

Chris Dodd