#stdsimd — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #stdsimd, aggregated by home.social.

  1. Two more results, this time without using std::simd. One uses a plain loop over C[i, j] += A[i, k] * B[k, j] in the inner kernel (it is still blocked over all levels of the cache hierarchy); a scalar sketch is below.

    This is ~10–30x slower.

    1/2

    #stdsimd #cpp26 #simd
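
    A minimal sketch of the plain-loop kernel the post describes, assuming row-major float matrices; the function name and signature are illustrative, and the real benchmark additionally blocks this kernel over the cache hierarchy:

    ```cpp
    #include <cstddef>

    // Scalar micro-kernel: C[i, j] += A[i, k] * B[k, j] for row-major N x N
    // matrices. The i-k-j loop order keeps the B and C accesses contiguous,
    // which at least gives the compiler a chance to auto-vectorize; per the
    // post, this is still ~10-30x slower than the std::simd version.
    void matmul_kernel(float* C, const float* A, const float* B, std::size_t N) {
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t k = 0; k < N; ++k)
                for (std::size_t j = 0; j < N; ++j)
                    C[i * N + j] += A[i * N + k] * B[k * N + j];
    }
    ```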

  2. mdspan rocks! A simple switch from layout_right to layout_right_padded and performance for larger matrices goes 📈 up! (e.g. 4096×4096 from 76 GFLOP/s to 100 GFLOP/s) I introduced one cache line of padding between rows so that limited cache associativity doesn't virtually shrink the caches; a sketch is below.
    For small matrices the extra padding is counterproductive, though. But mdspan abstracts it all away: the matrix-mul function is unchanged.

    #stdsimd #mdspan #cpp26 #optimization
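
    A minimal sketch of the layout switch, assuming a C++26 <mdspan> that implements layout_right_padded (P2642); the 4096×4096 size matches the post, while names and the exact padding arithmetic are illustrative:

    ```cpp
    #include <mdspan>   // C++26; layout_right_padded is P2642
    #include <vector>

    int main() {
        using ext2d = std::dextents<std::size_t, 2>;
        constexpr std::size_t n = 4096;
        // Pad each 4096-float row by one 64-byte cache line (16 floats), so
        // consecutive rows stop mapping to the same cache sets; without the
        // padding, limited associativity makes the caches act smaller.
        constexpr std::size_t row_stride = n + 64 / sizeof(float);  // 4112
        using layout = std::layout_right_padded<std::dynamic_extent>;
        layout::mapping<ext2d> map{ext2d{n, n}, row_stride};
        std::vector<float> storage(map.required_span_size());
        std::mdspan<float, ext2d, layout> A{storage.data(), map};
        // A kernel indexes A[i, j] exactly as with layout_right; only this
        // setup changes, so the matrix-mul function itself stays unchanged.
        A[0, 0] = 1.0f;
    }
    ```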

  3. I've been looking into matrix multiplication using std::simd and std::mdspan/submdspan (all single-threaded).
    I got to 86% of peak FLOP/s. x86_64 with AVX2 peaks at 32 single-precision / 16 double-precision FLOP/cycle (2 FMAs per cycle); see the breakdown below.
    I suspect better performance needs a more cache-friendly layout mapping. This is using layout_right.

    #stdsimd #simd #mdspan #cpp26 #cpp
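
    For reference on the peak number: with AVX2, a 256-bit register holds 8 floats, an FMA counts as 2 FLOP, and 2 FMA units per cycle give 8 × 2 × 2 = 32 single-precision (16 double-precision) FLOP/cycle. Below is a sketch of the FMA step of such a kernel; since C++26 <simd> is not widely shipped yet, it uses GCC's Parallelism TS 2 implementation (<experimental/simd>) as a stand-in, and all names are illustrative:

    ```cpp
    #include <experimental/simd>
    namespace stdx = std::experimental;

    using floatv = stdx::native_simd<float>;  // 8 lanes with AVX2

    // One rank-1 update of a register-blocked micro-kernel:
    // acc += broadcast(A[i, k]) * B[k, j .. j+lanes).
    inline floatv fma_step(floatv acc, float a_ik, const float* b_row) {
        floatv b(b_row, stdx::element_aligned);  // load one vector of B's row
        return acc + floatv(a_ik) * b;           // typically fused to one vfmadd
    }
    ```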

  4. I'm a bit sad today. Yesterday I pushed forge.sourceware.org/gcc/gcc-m, which makes a simple `x + 1` ill-formed: compiler-explorer.com/z/4rYx87. Now, in generic code, you write `+ std::cw<1>` instead (sketch below). If you know the value-type (`float` in this case), just use the appropriate literal (if it exists): `x + 1.f`.

    #stdsimd #cpp26
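
    A hedged illustration of the change; the includes and the exact home of std::cw / std::constant_wrapper (P2781) are assumptions, as C++26 naming is still settling:

    ```cpp
    #include <simd>         // C++26 std::simd, assumed
    #include <type_traits>  // std::cw (std::constant_wrapper), assumed header

    template <class V>      // V: some std::simd specialization, value_type unknown
    V increment(V x) {
        // return x + 1;        // ill-formed after the linked change: a plain
        //                      // int no longer broadcasts to every value_type
        return x + std::cw<1>;  // OK in generic code: the wrapped constant is
                                // checked against V's value_type at compile time
    }
    // With a known value_type, prefer the matching literal, e.g. x + 1.f.
    ```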