#pagedattention — Public Fediverse posts on home.social

TechLİfe @techlife_blog · 2026-02-16 · 06:05 UTC

The Hidden Engineering Behind Fast AI: How LLM Inference Actually Works

https://techlife.blog/posts/llm-inference-optimization/

#LLM #Inference #PagedAttention #vLLM #FlashAttention #SpeculativeDecoding #MachineLearning #GPUOptimization #KVCache

#llm #inference #pagedattention #vllm #flashattention #speculativedecoding

Habr @[email protected] · 2026-01-12 · 11:42 UTC

[Translation] How prompt caching works: PagedAttention and automatic prefix caching, plus practical tips

Prompt caching is often discussed as a "bonus option" on the API price list: hit the cache and the request is cheaper and faster. This article looks at what actually stands behind it: why the cache is not "conversation memory" but reuse of KV tensors across identical prefixes, how PagedAttention/vLLM with its blocks and hash chains grows out of that idea, and which small but fatal details (a dynamic system prompt, non-deterministic JSON, reordered tool defs) instantly turn the cache back into a pumpkin.
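The block-and-hash-chain idea mentioned above can be illustrated with a minimal sketch. This is not vLLM's actual implementation: the block size, the SHA-256 hashing scheme, and the names `block_hashes` and `shared_prefix_blocks` are all assumptions chosen for illustration. Each block's key depends on its own tokens and on the key of the block before it, so two requests can share cached blocks only up to the first point where they differ.

```python
import hashlib
import json

BLOCK_SIZE = 4  # tokens per block; hypothetical value for illustration


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of tokens (illustrative scheme)."""
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        # The parent hash is mixed in, so a block's key encodes its whole prefix.
        payload = parent + json.dumps(block).encode("utf-8")
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent.hex())
    return hashes


def shared_prefix_blocks(a, b):
    """Count the leading blocks whose chained hashes match in both requests."""
    count = 0
    for x, y in zip(block_hashes(a), block_hashes(b)):
        if x != y:
            break
        count += 1
    return count


system = list(range(8))                            # stable system prompt: 2 full blocks
req1 = system + [100, 101, 102, 103]
req2 = system + [200, 201, 202, 203]               # same prefix, different user turn
req3 = [999] + system[1:] + [100, 101, 102, 103]   # only the first token differs

print(shared_prefix_blocks(req1, req2))  # 2
print(shared_prefix_blocks(req1, req3))  # 0

# Non-deterministic JSON: same tool definition, different key order.
tool_a = {"name": "search", "args": {"q": "str", "k": "int"}}
tool_b = {"args": {"k": "int", "q": "str"}, "name": "search"}
unsorted_equal = json.dumps(tool_a) == json.dumps(tool_b)
sorted_equal = json.dumps(tool_a, sort_keys=True) == json.dumps(tool_b, sort_keys=True)
print(unsorted_equal, sorted_equal)  # False True
```

In this sketch, a single changed token at the start of the prompt leaves no reusable blocks, which is why a dynamic system prompt is so costly. The last lines show one way to keep serialized tool definitions byte-stable, assuming the prompt is built from JSON on the client side: serialize with `sort_keys=True`.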

https://habr.com/ru/companies/otus/articles/984434/

#prompt_caching #префилл #декодинг #инференс_LLM #vLLM #PagedAttention #prefix_caching #фрагментация_памяти #планировщик_инференса

#планировщик_инференса #фрагментация_памяти #prefix_caching #pagedattention #vllm #инференс_llm