home.social

#mdspan — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #mdspan, aggregated by home.social.

  1. I've been looking into matrix multiplication using std::simd and std::mdspan/submdspan (all single-threaded).
    I got to 86% of peak FLOP. x86_64 AVX2 has 32/16 FLOP/cycle peak (2 FMAs per cycle).
    I suspect better performance needs a more cache-friendly layout mapping. This is using layout_right.

    #stdsimd #simd #mdspan #cpp26 #cpp

  2. I've been looking into matrix multiplication using std::simd and std::mdspan/submdspan (all single-threaded).
    I got to 86% of peak FLOP. x86_64 AVX2 has 32/16 FLOP/cycle peak (2 FMAs per cycle).
    I suspect better performance needs a more cache-friendly layout mapping. This is using layout_right.

    #stdsimd #simd #mdspan #cpp26 #cpp

  3. I've been looking into matrix multiplication using std::simd and std::mdspan/submdspan (all single-threaded).
    I got to 86% of peak FLOP. x86_64 AVX2 has 32/16 FLOP/cycle peak (2 FMAs per cycle).
    I suspect better performance needs a more cache-friendly layout mapping. This is using layout_right.

    #stdsimd #simd #mdspan #cpp26 #cpp

  4. I've been looking into matrix multiplication using std::simd and std::mdspan/submdspan (all single-threaded).
    I got to 86% of peak FLOP. x86_64 AVX2 has 32/16 FLOP/cycle peak (2 FMAs per cycle).
    I suspect better performance needs a more cache-friendly layout mapping. This is using layout_right.

    #stdsimd #simd #mdspan #cpp26 #cpp

  5. I've been looking into matrix multiplication using std::simd and std::mdspan/submdspan (all single-threaded).
    I got to 86% of peak FLOP. x86_64 AVX2 has 32/16 FLOP/cycle peak (2 FMAs per cycle).
    I suspect better performance needs a more cache-friendly layout mapping. This is using layout_right.

    #stdsimd #simd #mdspan #cpp26 #cpp