[gclist] RE: synchronization cost (was: Garbage collection an
d XML)
Boehm, Hans
hans_boehm@hp.com
Wed, 14 Mar 2001 09:21:02 -0800
> Much discussion of CMPXCHG on Intels has gone by. However, it's worth
> pointing out that Pentium-class processors allow you to do all sorts
> of locked operations. From the IA-64 manual, Pentium ISA section [p
> 5-261, LOCK, if you care; the older Pentium manuals agree on this
> information]:
>
> The LOCK prefix can be prepended only to the following instructions
> and to those forms of the instructions that use a memory operand:
> ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, DEC, INC, NEG, NOT, OR, SBB,
> SUB, XOR, XADD, and XCHG. ...
>
> ...
>
> For me, the result of this observation was simple. I wrote interfaces
> which provide the atomic operations I actually require for my system:
> exchange, compare and swap, bit vector operations, and so forth. On a
> Pentium, these each have their own inline assembly. On another
> architecture, such as SPARC, I roll them using compare and swap.
Extrapolating from the posted results, I would guess that, assuming all
cache
hits, implementing say fetch-and-add using CMPXCHG costs somewhere in the
25-27
cycle range, where LOCK; ADD is probably around 20. I would guess that on
architectures using LL-SC, the difference is vaguely comparable. And it's
tiny compared to the difference between either of these and wrapping the
operation in a lock. It's probably still worthwhile in many cases, and I
will probably go back and do it in my collector code, but it currently seems
to be an engineering tradeoff between less machine-dependent code and saving
a few cycles. Now if there were a standard library that implemented all the
variants efficiently so that everyone didn't have to reimplement them ...
> I'm in the odd situation of doing some synchronization directly in
> compiler-generated code (I'm generating C); I suspect most similar
> systems restrict synchronization to run-time routines. In my case gcc
> seems to do a noticeably better job of optimization if I avoid "asm
> volatile" entirely in favor of "asm" and a correct set of instruction
> effects. This isn't surprising, really; what is surprising is how
> infrequently it seems to be done in others' code.
Does it really make much difference if one of the effects is "memory"?
For updating a mark bit it wouldn't need to include that. In many other
cases, you do want the compiler to treat it as a memory barrier.
I have recently tended to specify both "memory" and "volatile", largely
out of paranoia. I think linuxthreads is similar. Is there a cheaper
way to ensure that gcc preserves the memory barrier semantics? (I agree
that it's not needed when you don't need the barrier.)
Hans