CPU fun
Introduction
In this post we’ll look at the performance of a simple atomic operation on a couple of Arm® AArch64 machines. In particular, we’ll show the improvement that comes from using the single-instruction atomics in the Armv8.1-A architecture in preference to the more general Load-Locked, Store-Conditional (LL-SC) sequences that earlier versions of the architecture require. The improved performance of the newer architecture was mentioned in a tweet, and since I already had a benchmark for this for “The Book”, re-running it and writing up the results seemed worthwhile.
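To make the kind of operation concrete, here is a minimal sketch of an atomic increment shared between threads. It is not the benchmark from “The Book”; the function name `countWithThreads` and its parameters are hypothetical, chosen only for illustration. On AArch64, GCC and Clang will typically compile the `fetch_add` to a single `LDADD`-family instruction when targeting `-march=armv8.1-a` or later, and to an `LDXR`/`STXR` retry loop (or a call to an out-of-line helper) when targeting plain Armv8.0-A, though the exact code depends on the compiler version and flags.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper, for illustration only: each of numThreads threads
// atomically adds 1 to a shared counter itersPerThread times.
// Returns the final value of the counter.
std::uint64_t countWithThreads(int numThreads, std::uint64_t itersPerThread) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> workers;
    workers.reserve(numThreads);

    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&counter, itersPerThread] {
            for (std::uint64_t i = 0; i < itersPerThread; ++i) {
                // The atomic read-modify-write whose cost is of interest.
                // Relaxed ordering is enough for a pure counter.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }

    // Wait for every thread to finish before reading the result.
    for (auto &w : workers) {
        w.join();
    }
    return counter.load();
}
```

Because every increment is atomic, no updates are lost: calling `countWithThreads(4, 100000)` returns 400000 whichever instruction sequence the compiler chooses. Only the time taken differs.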
The Problem
Atomics
In a parallel program there are occasions when different threads need to update shared state in a safe way. At a high level that can be achieved using locks and critical sections. However, that just pushes the problem down a level since the locks themselves must be implemented. That leads us (and hardware architects!) to realise that the hardware must provide instructions which