[gclist] synchronization cost (was: Garbage collection and XML)

Boehm, Hans hans_boehm@hp.com
Fri, 9 Mar 2001 10:07:08 -0800

Does anyone know if this is documented somewhere?

My experience with using "lock; cmpxchgl" to atomically set a mark bit in a
bit vector was that it didn't seem to have that much of an impact.  But the
only measurement I made was a comparison to using mark bytes instead, which
appeared to be slower on a Pentium III, presumably as a result of the larger
data structure and hence added cache misses.
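
For concreteness, here is a rough sketch of the two marking schemes being
compared. It is not the collector's actual code: the helper names
(mark_bit_cas, mark_byte) and the use of C11 <stdatomic.h> are assumptions
made for illustration. On x86, compilers typically turn the
compare-exchange below into a "lock; cmpxchgl", while the mark-byte store
needs no locked instruction at all.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of mark bits held in one word of the bit vector. */
#define WORD_BITS (sizeof(unsigned) * CHAR_BIT)

/* Hypothetical helper: atomically set mark bit i in the bit vector.
 * Returns true if this call set the bit, false if it was already set. */
static bool mark_bit_cas(_Atomic unsigned *bits, size_t i)
{
    _Atomic unsigned *word = &bits[i / WORD_BITS];
    unsigned mask = 1u << (i % WORD_BITS);
    unsigned old = atomic_load_explicit(word, memory_order_relaxed);

    do {
        if (old & mask)
            return false;   /* already marked: skip the locked instruction */
        /* On failure, old is reloaded with the current word and we retry. */
    } while (!atomic_compare_exchange_weak_explicit(
                 word, &old, old | mask,
                 memory_order_acq_rel, memory_order_relaxed));
    return true;
}

/* Hypothetical mark-byte alternative: one byte per object, so a plain
 * (relaxed) store suffices, at the cost of a mark table 8 times larger. */
static void mark_byte(_Atomic unsigned char *bytes, size_t i)
{
    atomic_store_explicit(&bytes[i], 1, memory_order_relaxed);
}
```

The bit-vector version pays for a locked instruction per newly marked
object; the byte version avoids that but touches more cache lines, which
is the tradeoff described above.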

Since these are out-of-order machines, the other question is whether
subsequent instructions that don't depend on later memory references will
continue to execute during the wait.  If so, this might explain some of the
differences in measurements.


> -----Original Message-----
> From: Emery Berger [mailto:emery@cs.utexas.edu]
> Sent: Thursday, March 08, 2001 6:51 PM
> To: Boehm, Hans; 'David Chase'; gclist@iecc.com
> Cc: icis-developers@bbn.com
> Subject: RE: [gclist] Garbage collection and XML
>
> > Is that an X86 machine?  I just timed a Pentium III/500/100 machine
> > at something near 25 cycles per "lock; cmpxchgl".  I'm interested
> > because I've sometimes heard the claim that X86 is particularly bad
> > at this, but that hasn't really been consistent with my experience.
> > Is this chipset dependent, perhaps?
>
> Timing just the "lock; cmpxchgl" doesn't give you the whole picture.
> The problem is that the Pentium flushes the pipeline when it
> encounters a locked instruction. The performance penalty is pretty
> spectacular. I'm told the P4 has a 24-stage pipeline, so locked
> instructions will become effectively even more expensive.
>
> Regards,
> -- Emery
> --
> Emery Berger
> emery@cs.utexas.edu
> http://www.cs.utexas.edu/users/emery