[gclist] RE: synchronization cost (was: Garbage collection and XML)

Boehm, Hans hans_boehm@hp.com
Wed, 14 Mar 2001 09:21:02 -0800


> Much discussion of CMPXCHG on Intels has gone by.  However, it's worth
> pointing out that Pentium-class processors allow you to do all sorts
> of locked operations.  From the IA-64 manual, Pentium ISA section [p
> 5-261, LOCK, if you care; the older Pentium manuals agree on this
> information]: 
> 
>   The LOCK prefix can be prepended only to the following instructions
>   and to those forms of the instructions that use a memory operand:
>   ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, DEC, INC, NEG, NOT, OR, SBB,
>   SUB, XOR, XADD, and XCHG. ...
> 
> ...
>
> For me, the result of this observation was simple.  I wrote interfaces
> which provide the atomic operations I actually require for my system:
> exchange, compare and swap, bit vector operations, and so forth.  On a
> Pentium, these each have their own inline assembly.  On another
> architecture, such as SPARC, I roll them using compare and swap.

Extrapolating from the posted results, I would guess that, assuming all
cache hits, implementing, say, fetch-and-add using CMPXCHG costs somewhere
in the 25-27 cycle range, whereas LOCK; ADD is probably around 20.  I would
guess that on architectures using LL/SC, the difference is roughly
comparable.  And it's tiny compared to the difference between either of
these and wrapping the operation in a lock.  It's probably still worthwhile
in many cases, and I will probably go back and do it in my collector code,
but it currently seems to be an engineering tradeoff between less
machine-dependent code and saving a few cycles.  Now if there were a
standard library that implemented all the variants efficiently, so that
everyone didn't have to reimplement them ...

> I'm in the odd situation of doing some synchronization directly in
> compiler-generated code (I'm generating C); I suspect most similar
> systems restrict synchronization to run-time routines.  In my case gcc
> seems to do a noticeably better job of optimization if I avoid "asm
> volatile" entirely in favor of "asm" and a correct set of instruction
> effects.  This isn't surprising, really; what is surprising is how
> infrequently it seems to be done in others' code.

Does it really make much difference if one of the effects is "memory"?
For updating a mark bit, it wouldn't need to include that.  In many other
cases, you do want the compiler to treat it as a memory barrier.

I have recently tended to specify both "memory" and "volatile", largely
out of paranoia.  I think linuxthreads is similar.  Is there a cheaper
way to ensure that gcc preserves the memory barrier semantics?  (I agree
that it's not needed when you don't need the barrier.)

Hans