[gclist] Garbage collection and XML

David Chase chase@world.std.com
Thu, 08 Mar 2001 19:43:10 -0500

At 02:31 PM 3/8/2001 -0800, Boehm, Hans wrote:
>> (*) non-recursive synchronization costs two bus locks,
>> or (my machine, with cpu clocked 6x memory bus) 120 cycles.
>> Heap memory allocation is load, add, compare, conditional
>> branch (predicted not taken), store, store, followed by
>> field initialization.
>Is that an X86 machine?

An x86 machine.  I was given to understand that the cost
was 10 memory bus cycles per lock-cmpxchgl, and I thought
that was what I measured on my old 200Mhz PPro.  Not sure
if it was clocking at 2x or 3x bus speed.  New machine is
(two) 800Mhz P-II, 133Mhz bus.

It's also somewhat chip(set) dependent; I've been told
that the Xeon chips lock only a subset of memory (a "line",
for some definition of the word) versus the entire bus.

>I just timed a Pentium III/500/100 machine at
>something near 25 cycles per "lock; cmpxchgl".
>I'm interested because I've sometimes heard the claim
>that X86 is particularly bad at this, but that
>hasn't really been consistent with my experience.

Maybe all the other chips are bad, too, but 25
cycles seems pretty horrible to me.  We made some
benchmarks to measure two versions of a tight
little loop (one always doing the cmpxchgl, the
other always branching around it) and it seemed
like a pair of locked instructions cost two or
three times as much as the entire rest of the loop,
which was not completely empty.

You're probably right about the cost of abandoning
that long-lived spine.  What wins depends on the
relative mix of probes versus updates, since the
applicative data structure let you avoid locking
on probes.  Don't forget, in some systems, if you
are modifying data structures in place, you are
probably doing some card-marking and creating
old-to-young pointers.  Those carry their own
extra costs.  (Avoiding the card-marking on newly
allocated memory is a cute trick, if you can
manage to do it.)

David Chase