[gclist] RE: synchronization cost (was: Garbage collection and XML)

David Chase chase@world.std.com
Wed, 14 Mar 2001 13:32:03 -0500

At 11:15 AM 3/14/2001 -0500, Jan-Willem Maessen wrote:
>Much discussion of CMPXCHG on Intels has gone by.  However, it's worth
>pointing out that Pentium-class processors allow you to do all sorts
>of locked operations.

>Only one file with a cmpxchg---in the [I believe] then-experimental
>usb code.  There are tons of calls to xchg, and many atomic
>increments, bit set/clear operations, and the like (you can find these
>using a similar command, though it's a bit more work separating wheat
>from chaff).

I can only speak for myself, but designing a VM for Java,
I worked with the instructions that I knew were available
on most architectures (Sparc V9, Pentium, PowerPC, M68040,
MIPS, notably NOT Sparc V8) and not all of those have a
general class of locked instructions.  It's possible
to simulate everything else with CAS or LL/SC, but in those
cases you probably would have been better off in the first
place designing algorithms that worked directly with CAS
or LL/SC.

I was also interested in building a system that was
starvation-free, and also that would scale efficiently.
Given limited intellectual bandwidth, and these constraints,
I decided to use wait-free internal data structures
designed around CAS or LL/SC.  I'd read the Herlihy papers
on this, and given the tricky world I was working on, this
seemed like the most practical choice.  It's possible I could
have come up with something faster that used (for instance)
exchange or atomic add, but what we've got now is quite
fast (we benchmark against the competition, obviously)
and quite scalable (synchronization never, ever, requires
allocation of additional anything, a huge simplification,
and a decoupling of two pieces of the VM that I am really
want to be decoupled).

The use of CAS also allows us to maintain always-globally-
consistent data structures, which is a very good thing --
for instance, our VM runs with deadlock-detection always
enabled for Java locks.  (In fact, this sped things
up a little on most benchmarks, which indicates that we
ought to do just a hair more busy-waiting before putting
a thread properly to sleep.)  In addition, using CAS
we were able to design synchronization that is slow only
in the following cases:

1) contended first acquisition
2) contended release

Anything else goes fast (single CAS) or faster (no CAS).

The internals of a parallel garbage collector are another
animal of course, as is the card-marking code.  That, we
generate directly from a compiler.  On Pentium, we simply
store bytes with no additional synchronization.  On a
machine with much weaker memory consistency, we might do
something else altogether.  My assumption is that the
conditionals and comparisons necessary to determine that
a card mark could be avoided are more expensive than simply
doing the mark blind.  I could be wrong, but modern
processors aren't very happy about conditional branches in
general.  This decision might be different on processors that
didn't provide consistency of byte stores across processors
(that is, if one cache line could overwrite another).
I don't care about apparent ordering, as long as a write
never disappears (because card-marking is monotonic
until a GC -- only the GC unmarks cards).

BUT, if you can avoid a memory bus lock, at least on
Pentium, you can afford the conditionals.  This I 
know from benchmarks, though I also know that the
branches are all extremely well-predicted.  
The conditional branch seems to add 5-10% to the cost
of the locked CAS, but subtracts 90-95% if the lock
can be avoided.  We detect when we are running on a
uniprocessor, and simply avoid using the locked
version of CAS in that case.  A JIT could compile
this in-line; we impose what is essentially a
small tax on the MP case.

>[ telegraphically described experiment elided]

>Has anyone done an experiment
>like this on a multiprocessor?

Not yet :-).  One experiment we could run, but have not yet,
is to determine the overhead of (optimized in various ways)
card marking.  We support two garbage collectors; we could
take a set of benchmarks, run them under full-copying with
card marks enabled, then recompile them w/o card marks and
rerun and measure the difference.

David Chase