#stdsimd — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #stdsimd, aggregated by home.social.

  1. Two more results, this time without using std::simd. One uses a plain loop over C[i, j] += A[i, k] * B[k, j] in the inner kernel (it is still blocked over all levels of the cache hierarchy); a scalar sketch is below.

    This is ~10–30x slower.

    1/2

    #stdsimd #cpp26 #simd
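
    A minimal sketch of the plain-loop kernel the post describes, assuming row-major float matrices; the function name and signature are illustrative, and the real benchmark additionally blocks this kernel over the cache hierarchy:

    ```cpp
    #include <cstddef>

    // Scalar micro-kernel: C[i, j] += A[i, k] * B[k, j] for row-major N x N
    // matrices. The i-k-j loop order keeps the B and C accesses contiguous,
    // which at least gives the compiler a chance to auto-vectorize; per the
    // post, this is still ~10-30x slower than the std::simd version.
    void matmul_kernel(float* C, const float* A, const float* B, std::size_t N) {
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t k = 0; k < N; ++k)
                for (std::size_t j = 0; j < N; ++j)
                    C[i * N + j] += A[i * N + k] * B[k * N + j];
    }
    ```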

  2. mdspan rocks! A simple switch from layout_right to layout_right_padded and performance for larger matrices goes 📈 up! (e.g. 4096×4096 from 76 GFLOP/s to 100 GFLOP/s) I introduced one cache line of padding between rows so that limited cache associativity doesn't virtually shrink the caches; a sketch is below.
    For small matrices the extra padding is counterproductive, though. But mdspan abstracts it all away: the matrix-mul function is unchanged.

    #stdsimd #mdspan #cpp26 #optimization
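
    A minimal sketch of the layout switch, assuming a C++26 <mdspan> that implements layout_right_padded (P2642); the 4096×4096 size matches the post, while names and the exact padding arithmetic are illustrative:

    ```cpp
    #include <mdspan>   // C++26; layout_right_padded is P2642
    #include <vector>

    int main() {
        using ext2d = std::dextents<std::size_t, 2>;
        constexpr std::size_t n = 4096;
        // Pad each 4096-float row by one 64-byte cache line (16 floats), so
        // consecutive rows stop mapping to the same cache sets; without the
        // padding, limited associativity makes the caches act smaller.
        constexpr std::size_t row_stride = n + 64 / sizeof(float);  // 4112
        using layout = std::layout_right_padded<std::dynamic_extent>;
        layout::mapping<ext2d> map{ext2d{n, n}, row_stride};
        std::vector<float> storage(map.required_span_size());
        std::mdspan<float, ext2d, layout> A{storage.data(), map};
        // A kernel indexes A[i, j] exactly as with layout_right; only this
        // setup changes, so the matrix-mul function itself stays unchanged.
        A[0, 0] = 1.0f;
    }
    ```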

  3. I've been looking into matrix multiplication using std::simd and std::mdspan/submdspan (all single-threaded).
    I got to 86% of peak FLOP/s. x86_64 with AVX2 peaks at 32 single-precision / 16 double-precision FLOP/cycle (2 FMAs per cycle); see the breakdown below.
    I suspect better performance needs a more cache-friendly layout mapping. This is using layout_right.

    #stdsimd #simd #mdspan #cpp26 #cpp
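
    For reference on the peak number: with AVX2, a 256-bit register holds 8 floats, an FMA counts as 2 FLOP, and 2 FMA units per cycle give 8 × 2 × 2 = 32 single-precision (16 double-precision) FLOP/cycle. Below is a sketch of the FMA step of such a kernel; since C++26 <simd> is not widely shipped yet, it uses GCC's Parallelism TS 2 implementation (<experimental/simd>) as a stand-in, and all names are illustrative:

    ```cpp
    #include <experimental/simd>
    namespace stdx = std::experimental;

    using floatv = stdx::native_simd<float>;  // 8 lanes with AVX2

    // One rank-1 update of a register-blocked micro-kernel:
    // acc += broadcast(A[i, k]) * B[k, j .. j+lanes).
    inline floatv fma_step(floatv acc, float a_ik, const float* b_row) {
        floatv b(b_row, stdx::element_aligned);  // load one vector of B's row
        return acc + floatv(a_ik) * b;           // typically fused to one vfmadd
    }
    ```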

  4. I'm a bit sad today. Yesterday I pushed forge.sourceware.org/gcc/gcc-m, which makes a simple `x + 1` ill-formed: compiler-explorer.com/z/4rYx87. Now, in generic code, you write `+ std::cw<1>` instead (sketch below). If you know the value-type (`float` in this case), just use the appropriate literal (if it exists): `x + 1.f`.

    #stdsimd #cpp26
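
    A hedged illustration of the change; the includes and the exact home of std::cw / std::constant_wrapper (P2781) are assumptions, as C++26 naming is still settling:

    ```cpp
    #include <simd>         // C++26 std::simd, assumed
    #include <type_traits>  // std::cw (std::constant_wrapper), assumed header

    template <class V>      // V: some std::simd specialization, value_type unknown
    V increment(V x) {
        // return x + 1;        // ill-formed after the linked change: a plain
        //                      // int no longer broadcasts to every value_type
        return x + std::cw<1>;  // OK in generic code: the wrapped constant is
                                // checked against V's value_type at compile time
    }
    // With a known value_type, prefer the matching literal, e.g. x + 1.f.
    ```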