[gclist] synchronization cost (was: Garbage collection and XML)

Boehm, Hans hans_boehm@hp.com
Fri, 9 Mar 2001 14:31:23 -0800

> -----Original Message-----
> From: Emery Berger [mailto:emery@cs.utexas.edu]
> http://developer.intel.com/design/pentium4/manuals/24547203.pdf
> See Chapter 7.1. "For the P6 family processors, locked operations serialize
> all outstanding load and store operations (that is, wait for them to
> complete). This rule is also true for the Pentium 4 processor, with one
> exception: load operations that reference weakly ordered memory types (such
> as the WC memory type) may not be serialized."

Thanks for the pointer.  This is very interesting.

I read the above statement as describing the memory model more than the
implementation.  The processor is normally allowed to move a read ahead of
logically earlier writes, provided the result is locally consistent.  It may
not do so if the read is part of the atomic operation or follows it.  Thus I
assume it basically has to wait for any store buffers to drain to the cache
before beginning the read.  That seems like an unavoidable cost given the
way the operation is defined.  It doesn't imply to me that the rest of the
processor necessarily has to be idle.
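
The reordering in question can be made concrete with the classic "store
buffering" test.  What follows is only a minimal sketch in Rust, not anything
taken from the Intel manual.  The names write_then_read and both_zero_rounds
are made up for illustration, and it assumes that a sequentially consistent
swap is compiled to XCHG on x86, which is implicitly locked.  Each of two
threads writes its own flag and then reads the other's.  With plain stores and
loads both threads may read 0, since each write can still be sitting in a
store buffer when the read is satisfied.  With the locked write, the
language's memory model rules that outcome out.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Barrier;
use std::thread;

/// One side of the store-buffering test: write our own flag, then read the
/// other thread's flag and return what we saw.
fn write_then_read(mine: &AtomicU32, theirs: &AtomicU32, locked: bool) -> u32 {
    if locked {
        // Assumption: on x86 a SeqCst swap is typically an XCHG, which is
        // implicitly locked, so the later read cannot move ahead of the write.
        mine.swap(1, Ordering::SeqCst);
        theirs.load(Ordering::SeqCst)
    } else {
        // Plain store followed by plain load.  The hardware may satisfy the
        // load while the store is still buffered, and with Relaxed ordering
        // the compiler is also free to reorder the two.
        mine.store(1, Ordering::Relaxed);
        theirs.load(Ordering::Relaxed)
    }
}

/// Runs the test `rounds` times and returns the number of rounds in which
/// both threads read 0 from the other's flag.
fn both_zero_rounds(rounds: usize, locked: bool) -> usize {
    let mut count = 0;
    for _ in 0..rounds {
        let x = AtomicU32::new(0);
        let y = AtomicU32::new(0);
        // The barrier releases both threads at roughly the same moment.
        let start = Barrier::new(2);
        let (r1, r2) = thread::scope(|s| {
            let a = s.spawn(|| {
                start.wait();
                write_then_read(&x, &y, locked)
            });
            let b = s.spawn(|| {
                start.wait();
                write_then_read(&y, &x, locked)
            });
            (a.join().unwrap(), b.join().unwrap())
        });
        if r1 == 0 && r2 == 0 {
            count += 1;
        }
    }
    count
}
```

both_zero_rounds(n, true) should always return 0.  With false, the count may
or may not be nonzero on a given run; it depends on timing and on the
hardware.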

The following statement is also enlightening:

"Because frequently used memory locations are often cached in a processor's
L1 or L2 caches, atomic operations can often be carried out inside a
processor's caches without asserting the bus lock. Here the processor's
cache coherency protocols insure that other processors that are caching the
same memory locations are managed properly while atomic operations are
performed on cached memory locations."

Later text is explicit that for P6 and later processors, the bus is NOT
locked for atomic operations if the processor already has exclusive access to
the cache line.  I believe most other recent processors behave similarly.
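
For anyone who wants to poke at this, here is a rough Rust sketch of the two
cases.  The function name hammer_counter is hypothetical, and I am assuming
that fetch_add on an AtomicU64 is compiled to a LOCK-prefixed add on x86.
With one thread, the counter's cache line can stay exclusive to that
processor.  With several threads, the line has to move between caches under
the coherency protocol.  The count comes out right in both cases.  The elapsed
time is returned only so the two cases can be compared; I am not claiming any
particular numbers.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Increments one shared counter from `threads` threads, `per_thread` times
/// each, and returns the final count together with the elapsed wall-clock
/// time.
fn hammer_counter(threads: usize, per_thread: u64) -> (u64, Duration) {
    let counter = AtomicU64::new(0);
    let started = Instant::now();
    // Scoped threads may borrow `counter`; all of them are joined before
    // `thread::scope` returns.
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    // Atomic read-modify-write on a single memory location.
                    counter.fetch_add(1, Ordering::SeqCst);
                }
            });
        }
    });
    (counter.load(Ordering::SeqCst), started.elapsed())
}
```

Comparing hammer_counter(1, n) against hammer_counter(4, n) on a
multiprocessor gives a feel for the uncontended cost versus the cost once the
line is shared.  The timings will vary a lot from machine to machine.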