I also forgot to mention that EnterCriticalSection takes (from what I
have read) about 6 CPU cycles in the optimal case.
I have seen implementations of non-reentrant spin locks
that take 5 cycles per lock. They are implemented using
LOCK, XCHG and MOV: LOCK takes 1 CPU cycle, XCHG takes
3 CPU cycles and MOV takes 1 cycle.
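
For reference, here is a rough sketch of what such a non-reentrant spin lock might look like in portable C11, not the exact implementation I saw. The names (xchg_spinlock, xchg_spin_lock and so on) are made up for illustration. It assumes the compiler turns the atomic exchange into XCHG on x86, which is typical but not guaranteed. Note that XCHG with a memory operand is implicitly locked on x86, so an explicit LOCK prefix is not strictly needed. Actual cycle counts depend heavily on the CPU and on contention, so I make no claim about them here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical type for illustration: 0 = free, 1 = held. */
typedef struct {
    atomic_int locked;
} xchg_spinlock;

static void xchg_spin_init(xchg_spinlock *l)
{
    atomic_init(&l->locked, 0);
}

/* Try once: atomically write 1 and look at the previous value.
 * A previous value of 0 means we took the lock.
 * On x86 this exchange is typically compiled to XCHG (assumption). */
static bool xchg_spin_trylock(xchg_spinlock *l)
{
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

/* Busy-wait until the exchange succeeds. Non-reentrant: a thread that
 * already holds the lock and calls this again will spin forever. */
static void xchg_spin_lock(xchg_spinlock *l)
{
    while (!xchg_spin_trylock(l)) {
        /* spin */
    }
}

/* Release with a plain store (the MOV part on x86). */
static void xchg_spin_unlock(xchg_spinlock *l)
{
    /* Debug-only sanity check; compiled out when NDEBUG is defined. */
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Usage is the usual pattern: call xchg_spin_init once, then wrap the protected code between xchg_spin_lock and xchg_spin_unlock. Unlike EnterCriticalSection, this never puts the thread to sleep, so it only makes sense for very short critical sections.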