home.social

#cudatiles — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #cudatiles, aggregated by home.social.

  1. New benchmark shows that larger CUDA tiles can cut Flash Attention throughput by 18‑43 % across sequence lengths. The study dives into kernel design, TFLOPS loss, and what it means for transformer model efficiency on NVIDIA GPUs. Open‑source researchers can use these insights to tune their kernels and reclaim performance. #FlashAttention #CUDATiles #GPUPerformance #TFLOPS

    🔗 aidailypost.com/news/large-cud