# FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

I'm confused on how many flops per cycle per core can be done with Sandy-Bridge and Haswell. As I understand it with SSE it should be 4 flops per cycle per core for SSE and 8 flops per cycle per core for AVX/AVX2.

This seems to be verified here, How do I achieve the theoretical maximum of 4 FLOPs per cycle? ,and here, Sandy-Bridge CPU specification.

However the link below seems to indicate that Sandy-bridge can do 16 flops per cycle per core and Haswell 32 flops per cycle per core http://www.extremetech.com/computing/136219-intels-haswell-is-an-unprecedented-threat-to-nvidia-amd.

Can someone explain this to me?

Edit: I understand now why I was confused. I thought the term FLOP only referred to single floating point (SP). I see now that the test at How do I achieve the theoretical maximum of 4 FLOPs per cycle? are actually on double floating point (DP) so they achieve 4 DP FLOPs/cycle for SSE and 8 DP FLOPs/cycle for AVX. It would be interesting to redo these test on SP.

## Answers

Here are theoretical max FLOPs counts (**per core**) for a number of recent processor microarchitectures and explanation how to achieve them.

In general, to calculate this look up the throughput of the FMA instruction(s) e.g. on https://agner.org/optimize/ or any other microbenchmark result, and multiply
(FMAs per clock) * (vector elements / instruction) * 2 (FLOPs / FMA).
Note that achieving this in real code requires very careful tuning (like loop unrolling), and near-zero cache misses, and no bottlenecks on anything *else*. Modern CPUs have such high FMA throughput that there isn't much room for other instructions to store the results, or to feed them with input. e.g. 2 SIMD loads per clock is also the limit for most x86 CPUs, so a dot product will bottleneck on 2 loads per 1 FMA. A carefully-tuned dense matrix multiply can come close to achieving these numbers, though.

If your workload includes any ADD/SUB or MUL that can't be contracted into FMAs, the theoretical max numbers aren't an appropriate goal for your workload. Haswell/Broadwell have 2-per-clock SIMD FP multiply (on the FMA units), but only 1 per clock SIMD FP add (on a separate vector FP add unit with lower latency). Skylake dropped the separate SIMD FP adder, running add/mul/fma the same at 4c latency, 2-per-clock throughput, for any vector width.

##### Intel

Note that Celeron/Pentium versions of recent microarchitectures don't support AVX or FMA instructions, only SSE4.2.

Intel Core 2 and Nehalem (SSE/SSE2):

- 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
- 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication

Intel Sandy Bridge/Ivy Bridge (AVX1):

- 8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
- 16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication

Intel Haswell/Broadwell/Skylake/Kaby Lake/Coffee/... (AVX+FMA3):

- 16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
- 32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
- (Using 256-bit vector instructions can reduce max turbo clock speed on some CPUs.)

Intel Skylake-X/Skylake-EP/Cascade Lake/etc (**AVX512F**) with **1 FMA units**: some Xeon Bronze/Silver

- 16 DP FLOPs/cycle: one 8-wide FMA (fused multiply-add) instruction
- 32 SP FLOPs/cycle: one 16-wide FMA (fused multiply-add) instruction
- Same computation throughput as with narrower 256-bit instructions, but speedups can still be possible with AVX512 for wider loads/stores, a few vector operations that don't run on the FMA units like bitwise operations, and wider shuffles.
- (Having 512-bit vector instructions in flight shuts down the vector ALU on port 1. Also
**reduces the max turbo clock speed**, so "cycles" isn't a constant in your performance calculations.)

Intel Skylake-X/Skylake-EP/Cascade Lake/etc (**AVX512F**) with **2 FMA units**: Xeon Gold/Platinum, and i7/i9 high-end desktop (HEDT) chips.

- 32 DP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
- 64 SP FLOPs/cycle: two 16-wide FMA (fused multiply-add) instructions
- (Having 512-bit vector instructions in flight shuts down the vector ALU on port 1. Also reduces the max turbo clock speed.)

Future: Intel Cooper Lake (successor to Cascade Lake) is expected to introduce Brain Float, a float16 format for neural-network workloads, with support for actual SIMD computation on it, unlike the current F16C extension that only has support for load/store with conversion to float32. This should double the FLOP/cycle throughput vs. single-precision on the same hardware.

Current Intel chips only have actual computation directly on standard float16 in the iGPU.

##### AMD

AMD K10:

- 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
- 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication

AMD Bulldozer/Piledriver/Steamroller/Excavator, per module (two cores):

- 8 DP FLOPs/cycle: 4-wide FMA
- 16 SP FLOPs/cycle: 8-wide FMA

AMD Ryzen

- 8 DP FLOPs/cycle: 4-wide FMA
- 16 SP FLOPs/cycle: 8-wide FMA

##### x86 low power

Intel Atom (Bonnell/45nm, Saltwell/32nm, Silvermont/22nm):

- 1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
- 6 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication every other cycle

AMD Bobcat:

- 1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
- 4 SP FLOPs/cycle: 4-wide SSE addition every other cycle + 4-wide SSE multiplication every other cycle

AMD Jaguar:

- 3 DP FLOPs/cycle: 4-wide AVX addition every other cycle + 4-wide AVX multiplication in four cycles
- 8 SP FLOPs/cycle: 8-wide AVX addition every other cycle + 8-wide AVX multiplication every other cycle

##### ARM

ARM Cortex-A9:

- 1.5 DP FLOPs/cycle: scalar addition + scalar multiplication every other cycle
- 4 SP FLOPs/cycle: 4-wide NEON addition every other cycle + 4-wide NEON multiplication every other cycle

ARM Cortex-A15:

- 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
- 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add

Qualcomm Krait:

- 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
- 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add

##### IBM POWER

IBM PowerPC A2 (Blue Gene/Q), per core:

- 8 DP FLOPs/cycle: 4-wide QPX FMA every cycle
- SP elements are extended to DP and processed on the same units

IBM PowerPC A2 (Blue Gene/Q), per thread:

- 4 DP FLOPs/cycle: 4-wide QPX FMA every other cycle
- SP elements are extended to DP and processed on the same units

##### Intel MIC / Xeon Phi

Intel Xeon Phi (Knights Corner), per core:

- 16 DP FLOPs/cycle: 8-wide FMA every cycle
- 32 SP FLOPs/cycle: 16-wide FMA every cycle

Intel Xeon Phi (Knights Corner), per thread:

- 8 DP FLOPs/cycle: 8-wide FMA every other cycle
- 16 SP FLOPs/cycle: 16-wide FMA every other cycle

Intel Xeon Phi (Knights Landing), per core:

- 32 DP FLOPs/cycle: two 8-wide FMA every cycle
- 64 SP FLOPs/cycle: two 16-wide FMA every cycle

The reason why there are per-thread and per-core datum for IBM Blue Gene/Q and Intel Xeon Phi (Knights Corner) is that these cores have a higher instruction issue rate when running more than one thread per core.

The throughput for Haswell is lower for addition than for multiplication and FMA. There are two multiplication/FMA units, but only one f.p. add unit. If your code contains mainly additions then you have to replace the additions by FMA instructions with a multiplier of 1.0 to get the maximum throughput.

The latency of FMA instructions on Haswell is 5 and the throughput is 2 per clock. This means that you must keep 10 parallel operations going to get the maximum throughput. If, for example, you want to add a very long list of f.p. numbers, you would have to split it in ten parts and use ten accumulator registers.

This is possible indeed, but who would make such a weird optimization for one specific processor?