[gclist] RE: synchronization cost (was: Garbage collection and XML)

Jan-Willem Maessen jmaessen@mit.edu
Wed, 14 Mar 2001 11:15:08 -0500


Much discussion of CMPXCHG on Intels has gone by.  However, it's worth
pointing out that Pentium-class processors allow you to do all sorts
of locked operations.  From the IA-64 manual, Pentium ISA section [p
5-261, LOCK, if you care; the older Pentium manuals agree on this
information]: 

  The LOCK prefix can be prepended only to the following instructions
  and to those forms of the instructions that use a memory operand:
  ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, DEC, INC, NEG, NOT, OR, SBB,
  SUB, XOR, XADD, and XCHG. ...

That's a pretty long list, and one of the above instructions is
usually what you actually want.  For example (getting back to GC
here), BTC/BTR/BTS allow you to atomically update a shared
allocation/mark bitmap efficiently.

As a datapoint, when I was first designing the synchronization for
Eager Haskell (lots of shared updates and a global GC'd heap on an
SMP), I took a look at the synchronization in the Linux kernel.  The
result was enlightening:

$ cat /proc/version
Linux version 2.2.12-20 (root@lauzeta.mit.edu) [...]
$ cd /usr/src/linux
$ find . -name '*.[ch]' -print | xargs fgrep 'cmpxchg'
./drivers/usb/uhci.c:		asm volatile("lock ; cmpxchg %4,%2 ; sete %0"
./drivers/usb/uhci.c:	asm volatile("lock ; cmpxchg %0,%1"

Only one file with a cmpxchg---in the [I believe] then-experimental
usb code.  There are tons of calls to xchg, and many atomic
increments, bit set/clear operations, and the like (you can find these
using a similar command, though it's a bit more work separating wheat
from chaff).

For me, the result of this observation was simple.  I wrote interfaces
which provide the atomic operations I actually require for my system:
exchange, compare and swap, bit vector operations, and so forth.  On a
Pentium, these each have their own inline assembly.  On another
architecture, such as SPARC, I roll them using compare and swap.

I'm in the odd situation of doing some synchronization directly in
compiler-generated code (I'm generating C); I suspect most similar
systems restrict synchronization to run-time routines.  In my case gcc
seems to do a noticeably better job of optimization if I avoid "asm
volatile" entirely in favor of "asm" and a correct set of instruction
effects.  This isn't surprising, really; what is surprising is how
infrequently it seems to be done in others' code.

I'll close with a question I have not yet managed to answer.  Our GC
uses an unshared nursery.  Right now, I test for nusery-ness on some
paths in order to determine whether to perform a write barrier.  The
same test allows me to perform synchronization (in this case a
Store/Store fence) only on shared objects.  The open question: is it
worth checking for locality before _every_ such synchronization?  If
so, it is not worth eliminating the test from my write barrier code,
and I should add a similar test to other code.  Otherwise, I should
use a test-free write barrier (blind card marking of some sort),
perform synchronization all over the place and rely on the fact that
the local stuff will happen in cache.  Has anyone done an experiment
like this on a multiprocessor?

-Jan-Willem Maessen
Eager Haskell project
jmaessen@mit.edu