#multi-token-prediction — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #multi-token-prediction, aggregated by home.social.
-
🔥 Gemma 4 cuts latency by up to 3x with Multi-Token drafters: speculative decoding with no loss of quality
https://gomoot.com/gemma-4-accelera-linferenza-grazie-ai-drafter-multi-token/
-
Accelerating Gemma 4: faster inference with multi-token prediction drafters
https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/
#HackerNews #Gemma4 #Accelerated #Inference #MultiTokenPrediction #AI
-
Researchers have discovered a clever trick: by embedding a mask token directly into the weight matrix, they can bypass the costly embedding lookup and generate token streams up to three times faster. The method is compatible with parallel computation and speculative decoding, promising big gains for open‑source LLMs. Read on to see how ConfAdapt powers this speed‑up. #LLMinference #SpeculativeDecoding #MultiTokenPrediction #ModelAcceleration
🔗 https://aidailypost.com/news/researchers-embed-mask-token-llm-weights-achieve-3-faster-inference
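The draft-and-verify loop that speculative decoding builds on can be sketched in a few lines. This is a minimal illustration, not the paper's method: `draft_fn` and `target_fn` are hypothetical stand-ins for the cheap multi-token drafter and the full model, and a real implementation verifies all draft positions in one batched forward pass rather than one call per token.

```python
def speculative_step(prefix, draft_fn, target_fn, k=3):
    """Propose k draft tokens, keep the longest prefix the full model
    agrees with, and append one corrected token on the first mismatch."""
    accepted = []
    for tok in draft_fn(prefix, k):
        expected = target_fn(prefix + accepted)  # exact next token
        if tok != expected:
            accepted.append(expected)            # fall back, stop drafting
            break
        accepted.append(tok)                     # draft token verified
    return accepted

# Toy demo: the "true" continuation is a-b-c-d-e; the drafter guesses
# a-b-x, so two tokens are accepted and the third is corrected.
truth = ["a", "b", "c", "d", "e"]
target_fn = lambda prefix: truth[len(prefix)]
draft_fn = lambda prefix, k: ["a", "b", "x"][len(prefix):len(prefix) + k]
print(speculative_step([], draft_fn, target_fn))  # → ['a', 'b', 'c']
```

Because every emitted token is checked against the full model, the output is identical to plain decoding; the speed-up comes only from how many draft tokens survive verification.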
-
Alibaba's new Qwen 3.5 397B-A17 outperforms even larger rivals by using multi-token prediction and a sparse mixture-of-experts architecture. It cuts inference cost while keeping top-tier performance, hinting at a new era for multimodal AI. Curious how 397 billion parameters can be cheaper? Read the full story. #Qwen3_5 #AlibabaAI #MixtureOfExperts #MultiTokenPrediction
🔗 https://aidailypost.com/news/alibabas-qwen-35-397b-a17-beats-larger-model-via-multitoken
-
Apple speeds up AI models by up to 5x
Apple has published research describing a new technique that lets language models (LLMs) generate responses up to five times faster with no loss of quality.
Traditionally, LLMs produce text token by token (autoregression), which slows the process down. Apple found that models, despite being trained to predict a single token, carry knowledge about several upcoming tokens. This insight led to the Multi-Token Prediction (MTP) framework, in which the model predicts several tokens at once.
The researchers insert special mask tokens into the prompt (e.g., "The cat is "), which the model fills in a single step ("very fluffy"). If a prediction disagrees with classic autoregressive decoding, the system falls back to the standard method, so high accuracy is preserved.
Tests with the open-source model Tulu3-8B showed:
- 2–3x faster generation on typical tasks (Q&A, chat)
- up to 5x faster in predictable domains such as coding and math
- no loss of quality, thanks to gated LoRA adaptation
The full paper is available on arXiv.
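The fill-and-verify scheme this post describes can be sketched as a short loop. This is a toy illustration under stated assumptions, not Apple's code: `fill_masks` and `next_token` are hypothetical stand-ins for the model's parallel mask-filling pass and its standard autoregressive mode.

```python
MASK = "<mask>"

def mtp_step(prefix, fill_masks, next_token, n_masks=2):
    """Have the model fill n mask slots in one pass, then keep only
    the draft tokens the one-at-a-time decoder would also produce."""
    draft = fill_masks(prefix + [MASK] * n_masks)  # parallel guess
    out = []
    for tok in draft:
        ar = next_token(prefix + out)  # standard autoregressive token
        out.append(ar)                 # output always matches AR mode
        if tok != ar:                  # draft rejected: stop drafting
            break
    return out

# Toy demo: autoregressive decoding continues "the cat is" with
# "very fluffy"; the mask filler guesses the same two words.
truth = ["the", "cat", "is", "very", "fluffy"]
next_token = lambda prefix: truth[len(prefix)]
fill_masks = lambda toks: ["very", "fluffy"]
print(mtp_step(["the", "cat", "is"], fill_masks, next_token))  # → ['very', 'fluffy']
```

Note that the loop always emits the autoregressive token, so the output is bit-identical to standard decoding; a wrong mask guess only costs the speed-up, never the quality.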
#aiApple #Apple #AppleIntelligence #badaniaApple #gatedLoRAAdaptation #generowanieTekstu #LLM #modeleJęzykowe #MTP #MultiTokenPrediction #optymalizacjaAI #przyspieszenieAI #sztucznaInteligencja #szybkieAI #Tulu38B
-
Explore the sophisticated mechanisms driving multi-token prediction. This section rigorously explains its edge via information-theoretic mutual information https://hackernoon.com/decoding-the-magic-multi-token-predictions-information-theoretic-edge-and-beyond #multitokenprediction
-
Discover how multi-token prediction improves LLM algorithmic reasoning, potentially by learning to allocate computational resources more efficiently https://hackernoon.com/multi-token-prediction-mastering-algorithmic-reasoning-with-enhanced-resource-use #multitokenprediction
-
This figure illustrates the profound impact of training scale on multi-token prediction models' performance on GSM8K, highlighting critical data efficiency https://hackernoon.com/strategic-llm-training-multi-token-predictions-data-efficiency-in-mathematical-reasoning #multitokenprediction
-
Explore Table S5 revealing multi-token prediction's remarkable training efficiency across LLM sizes (0.3B-13B) https://hackernoon.com/unleashing-llm-training-efficiency-multi-token-predictions-near-zero-overhead #multitokenprediction
-
Concluding our work, we find multi-token prediction to be a superior method for training LLMs, delivering enhanced performance on generative and reasoning tasks https://hackernoon.com/unlocking-generative-power-multi-token-prediction-for-next-gen-llms #multitokenprediction
-
Explore the landscape of language modeling losses, multi-token prediction, and self-speculative decoding https://hackernoon.com/defining-the-frontier-multi-token-predictions-place-in-llm-evolution #multitokenprediction
-
Dive into the design space beyond our core multi-token prediction architecture, comparing approaches like replicated unembeddings and linear heads https://hackernoon.com/exploring-alternative-architectures-for-multi-token-llm-prediction #multitokenprediction
-
Dive into the core reasons behind multi-token prediction's superior LLM performance, exploring how it mitigates distributional discrepancy https://hackernoon.com/unraveling-multi-token-prediction-bridging-training-inference-gaps-with-lookahead #multitokenprediction
-
Explore how multi-token prediction fundamentally alters LLM capabilities, dramatically improving induction and algorithmic reasoning https://hackernoon.com/unveiling-llm-intelligence-multi-token-prediction-drives-qualitative-reasoning-shifts #multitokenprediction
-
Witness multi-token prediction's transformative power across seven large-scale experiments: unlocking exponential gains with model size, 3x faster inference https://hackernoon.com/unrivaled-llm-efficacy-multi-token-prediction-revolutionizes-performance-across-domains #multitokenprediction