atomic operation cost
What is the cost of the atomic operation (any of compare-and-swap or atomic add/decrement)? How much cycles does it consume? Will it pause other processors on SMP or NUMA, or will it block memory accesses? Will it flush reorder buffer in out-of-order CPU?
What effects will be on the cache?
I'm interested in modern, popular CPUs: x86, x86_64, PowerPC, SPARC, Itanium.
I have looked for actual data for the past days, and found nothing. However, I did some research, which compares the cost of atomic ops with the costs of cache misses.
The cost of the x86 LOCK prefix, or CAS, before PentiumPro (as described in the doc), is a memory access (like a cache miss), + stopping memory operations by other processors, + any contention with other processors trying to LOCK the bus. However, since PentiumPro, for Writeback (i.e. cacheable) memory (all memory an app deals with, unless you talk directly with hardware), instead of blocking all memory operations, only the relevant cacheline is blocked (based on the link posted above).
Actually, the CAS case can be more complicated, as explained on this page, with no timings but an insightful description by a trustworthy engineer.
Before going into too much detail, I'll say that a LOCKed operation costs one cache miss + the possible contention with other processor on the same cacheline, while CAS + the preceding load (which is almost always required except on mutexes, where you always CAS 0 and 1) can cost two cache misses.
He explains that a load + CAS on a single location can actually cost two cache misses, like Load-Linked/Store-Conditional (see there for the latter). His explaination relies on knowledge of the MESI cache coherence protocol. It uses 4 states for a cacheline: M(odified), E(xclusive), S(hared), I(nvalid) (and therefore it's called MESI), explained below where needed. The scenario, explained, is the following:
- the LOAD causes a cache miss - the relevant cacheline is loaded from memory in Shared state (i.e. other processors are still allowed to keep that cacheline in memory; no changes are allowed in this state). If the location is in memory, this cache miss is skipped. Possible cost: 1 cache miss. (skipped if the cacheline is in Shared, Exclusive or Modified state, i.e. the data is in this CPU's L1 cache).
- the program calculates the new values to store,
- and it runs an atomic CAS instruction.
- It has to avoid concurrent modification, so it must remove copies of the cacheline from the cache of other CPUs, to move the cacheline to the Exclusive state. Possible cost: 1 cache miss. This is not needed if it is already exclusively owned, i.e. in the Exclusive or Modified state. In both states, no other CPUs hold the cacheline, but in the Exclusive state it has not been modified (yet).
- After this communication, the variable is modified in our CPU's local cache, at which point it is globally visible to all other CPUs (because their caches are coherent with ours). It will eventually be written to main memory according to usual algorithms.
- Other processors trying to read or modify that variable will first have to obtain that cacheline in Shared or Exclusive mode, and to do so will contact this processor and receive the updated version of the cacheline. A LOCKed operation, instead, can only cost a cache miss (because the cacheline will be requested directly in Exclusive state).
In all cases, a cacheline request can be stalled by other processors already modifying the data.
I did some profiling with the following setup: The test machine (AMD Athlon64 x2 3800+) was booted, switched to long mode (interrupts disabled) and the instruction of interest was executed in a loop, 100 iterations unrolled and 1,000 loop cycles. The loop body was aligned to 16 bytes. The time was measured with an rdtsc instruction before and after the loop. Additionally a dummy loop without any instruction was executed (which measured 2 cycles per loop iteration and 14 cycles for the rest) and the result was substracted from the result of the instruction profiling time.
The following instructions were measured:
- "lock cmpxchg [rsp - 8], rdx" (both with comparison match and mismatch),
- "lock xadd [rsp - 8], rdx",
- "lock bts qword ptr [rsp - 8], 1"
In all cases the time measured was about 310 cycles, the error was about +/- 8 cycles
This is the value for repeated execution on the same (cached) memory. With an additional cache miss, the times are considerably higher. Also this was done with only one of the 2 cores active, so the cache was exclusively owned, and no cache synchonisation was required.
To evaluate the cost of a locked instruction on a cache miss, I added a wbinvld instruction before the locked instruction and put the wbinvld plus an add [rsp - 8], rax into the comparison loop. In both cases the cost was about 80,000 cycles per instruction pair! In case of lock bts the time difference was about 180 cycles per instruction.
Note that this is the reciprocal throughput, but since locked operations are serializing operations, there is probably no difference to latency.
Conclusion: a locked operation is heavy, but a cache miss can be much heavier. Also: a locked operation does not cause cache misses. It can only cause cache synchronisation traffic, when a cacheline is not owned exclusively.
To boot the machine, I used an x64 version of FreeLdr from the ReactOS project. Here's the asm source code:
#define LOOP_COUNT 1000 #define UNROLLED_COUNT 100 PUBLIC ProfileDummy ProfileDummy: cli // Get current TSC value into r8 rdtsc mov r8, rdx shl r8, 32 or r8, rax mov rcx, LOOP_COUNT jmp looper1 .align 16 looper1: REPEAT UNROLLED_COUNT // nothing, or add something to compare against ENDR dec rcx jnz looper1 // Put new TSC minus old TSC into rax rdtsc shl rdx, 32 or rax, rdx sub rax, r8 ret PUBLIC ProfileFunction ProfileFunction: cli rdtsc mov r8, rdx shl r8, 32 or r8, rax mov rcx, LOOP_COUNT jmp looper2 .align 16 looper2: REPEAT UNROLLED_COUNT // Put here the code you want to profile // make sure it doesn't mess up non-volatiles or r8 lock bts qword ptr [rsp - 8], 1 ENDR dec rcx jnz looper2 rdtsc shl rdx, 32 or rax, rdx sub rax, r8 ret
On bus-based SMP, atomic prefix LOCK does assert (turn on) a bus wire signal LOCK#. It will prohibit other cpus/devices on the bus for using it.
Locked instructions are serializing, synchronizing operations .... /about Out-of-order/ locked RMW/read-modify-write = atomic itself/ instruction ensures that the processor will execute all instructions before the locked instruction prior to executing it. /about yet not flushed writes/ it forces all posted writes within the processor to be flushed to external memory before executing the next instruction.
/about SMP/ semaphore is in cache in S state... issuing a read and invalidate transaction for 0 bytes of date (this is a kill/of shared copies of the cache line in adjacent CPUs/)