#local-ai — Public Fediverse posts on home.social

Arint - SEO+KI @[email protected] · 2026-06-08 · 10:01 UTC

RT @ItsmeAjayKV: Update on the @googlegemma Gemma4 12B run. Now with MTP.

I benchmarked @UnslothAI's new Gemma 4 12B MTP draft model (gemma-4-12B-it-MTP-Q80.gguf) on my RTX 3060 12GB. Here are the results. MTP led to:

• Faster decode speed (+12% to +37%)
• Slower prefill (-10% to -15%)
• Worse TTFT (+11% to +16%)

Biggest win: 32K context, 29.9 tok/s → 41.1 tok/s. That is a 37% increase in generation throughput.

AJ (@ItsmeAjayKV): Finished my first benchmarks for @googlegemma Gemma 4 12B on my 12GB RTX 3060 using @UnslothAI GGUFs. The results are honestly pretty impressive.

llama.cpp CUDA, standard decoding (no MTP), 4K context, Flash Attention enabled, q8 KV cache.

Q5KXL
- 1152 tok/s prefill
- 33.3 tok/s generation
- ~9.3GB VRAM

Q6KXL
- 1113 tok/s prefill
- 26.0 tok/s generation
- ~11.3GB VRAM

Q80 with -ngl 40 partial offload
- 986 tok/s prefill
- 14.9 tok/s generation
- ~11.2GB VRAM
- Only 40/48 layers offloaded

For anyone wondering whether a 12GB 3060 is still relevant for local AI in 2026: absolutely yes. Q5KXL in particular feels like the sweet spot here. More tests to come.

https://nitter.net/ItsmeAjayKV/status/2062542245719572577#m

more at Arint.info

#Benchmarking #Gemma4 #LLM #LocalAI #RTX3060 #UnslothAI #arint_info

https://x.com/ItsmeAjayKV/status/2062976512408842510#m

#benchmarking #gemma4 #llm #localai #rtx3060 #unslothai
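
The percentages in the benchmark post follow directly from the raw tok/s figures. A minimal Python sketch that uses only numbers quoted in the post; the helper names and the 12 GB budget check are illustrative assumptions, not part of the original benchmark:

```python
# Figures quoted in the post (4K context, no MTP):
# generation speed in tok/s and approximate VRAM use in GB.
QUANTS = {
    "Q5KXL": {"gen_tok_s": 33.3, "vram_gb": 9.3},
    "Q6KXL": {"gen_tok_s": 26.0, "vram_gb": 11.3},
    "Q80 (-ngl 40)": {"gen_tok_s": 14.9, "vram_gb": 11.2},
}


def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def fastest_within(quants: dict, vram_budget_gb: float):
    """Name of the fastest-generating quant that fits the VRAM budget, or None."""
    fitting = {k: v for k, v in quants.items() if v["vram_gb"] <= vram_budget_gb}
    if not fitting:
        return None
    return max(fitting, key=lambda k: fitting[k]["gen_tok_s"])


if __name__ == "__main__":
    # 32K-context decode speed without and with MTP, as quoted in the post.
    print(f"MTP decode gain at 32K: {percent_change(29.9, 41.1):.1f}%")
    print("Fastest quant within 12 GB:", fastest_within(QUANTS, 12.0))
```

Ranking by generation speed alone is a simplification; prefill speed and TTFT, which the post reports as getting worse with MTP, would need their own weighting for interactive use.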

Nils @[email protected] · 2026-06-08 · 08:13 UTC

Just tried running Llama locally. It didn't go well.

#Alpaca #LocalAI

#alpaca #localai

Clawbox @[email protected] · 2026-06-08 · 06:57 UTC

AI agents are only as trustworthy as the infrastructure they run on.

When your agent lives on someone else's cloud: your workflow logic, credentials, and data all pass through their systems. You get a result back. You don't see the path.

Local agents give you the full stack: model, memory, tool calls, logs — on hardware you control.

The question isn't "is local AI good enough?" It's "who do you want running your workflows?"

#LocalAI #AIAgents #SelfHosted #Privacy #OpenSource

#localai #aiagents #selfhosted #privacy #opensource

Marco Abis @[email protected] · 2026-06-08 · 06:41 UTC

Every #LocalLLM tool prints tokens/sec. None prints the bill - the joules.

tok/s is vanity. Energy per token is sanity. Joules are reality. On an M-series MacBook on battery, *you* pay that bill - in watt-hours, fan noise, throttling. So I'm building Ziraph to measure joules per token, not just count them.

Matched-quant Gemma 4 12B, one M1: decode a dead tie, yet one engine burned 4.5x the CPU energy.

https://ziraph.com/blog/energy-per-token-vanity-sanity-reality

#AppleSilicon #LocalAI

#localllm #applesilicon #localai
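
The "joules per token" idea reduces to one division: average power in watts (joules per second) over decode speed in tokens per second. A small Python sketch of that arithmetic; the wattages and the 20 tok/s figure are hypothetical placeholders chosen only to reproduce a 4.5x ratio at equal decode speed, and they are not measurements from the post or from Ziraph:

```python
def joules_per_token(avg_power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token: W / (tok/s) = J/tok."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return avg_power_watts / tokens_per_second


def watt_hours(joules: float) -> float:
    """Convert joules to watt-hours (1 Wh = 3600 J)."""
    return joules / 3600.0


if __name__ == "__main__":
    # Hypothetical engines with identical decode speed but different CPU power draw.
    engine_a = joules_per_token(avg_power_watts=4.0, tokens_per_second=20.0)
    engine_b = joules_per_token(avg_power_watts=18.0, tokens_per_second=20.0)
    print(f"A: {engine_a:.2f} J/tok, B: {engine_b:.2f} J/tok, ratio {engine_b / engine_a:.1f}x")
    # Battery cost of a 1000-token reply on the hungrier engine.
    print(f"1000 tokens on B: {watt_hours(engine_b * 1000):.3f} Wh")
```

With equal tok/s the energy ratio is simply the power ratio, which is why a "dead tie" on decode speed can still hide a large difference on battery.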

VibeOps @vibeops · 2026-06-07 · 19:21 UTC

wasn't the problem

I also think #qwen wasn't the problem - the LLM keeps hanging on voice messages!

#hermesAI #hermesagent seems to be acting up there... after switching to the QAT model I hadn't sent a single voice message and hadn't had a single loop

-> then I started playing around - sent voice messages - and the thing loops! Surely that was the same problem with qwen

since I have #whisper flow installed on my phone anyway - it's better to read over the prompt once more anyway

#localai #llm

#qwen #hermesai #hermesagent #whisper #localai #llm

VibeOps @[email protected] · 2026-06-07 · 15:45 UTC

#localai #speed comparison #tokens

in #lmstudio on #macstudio #m4max #128gbram

#llms
google/gemma-4-26b-a4b(q8) = 76 token/s
google/gemma-4-26b-a4b-qat(q4) = 106 token/s

+39% speed

and according to Google, quantization with #qat is supposed to have no impact on quality:
Gemma 4 26B A4B QAT is the Quantization-Aware Training version of Gemma 4 26B A4B. It aims to keep quality close to bfloat16 while using much less memory to load the model.

with 11 GB less #ram usage

take this with a grain of salt, of course - if problems turn up I'll post a follow-up

if the small model turns out to be capable - that would be the breakthrough for local LLMs - imagine, anyone with 16 GB of RAM could run something like this themselves 😍 okay, we'd be leaving #macneo users behind then :-P

#localai #llms #speed #tokens #lmstudio #macstudio
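
A rough sanity check on the numbers in this post: weight memory scales with parameter count times bits per weight. The Python sketch below assumes nominal 8-bit and 4-bit storage for all 26B parameters and ignores KV cache, metadata, and tensors kept at higher precision, so it should only land in the same ballpark as the roughly 11 GB saving reported in the post:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes); excludes KV cache and overhead."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 26e9  # total parameters implied by the "26B" in the model name
    q8 = weight_memory_gb(n_params, 8)  # nominal 8-bit weights (assumption)
    q4 = weight_memory_gb(n_params, 4)  # nominal 4-bit weights (assumption)
    print(f"q8 ~ {q8:.0f} GB, q4 ~ {q4:.0f} GB, saving ~ {q8 - q4:.0f} GB")

    # Speed figures quoted in the post: 76 -> 106 token/s.
    speedup = (106 - 76) / 76 * 100
    print(f"speedup: {speedup:.1f}%")
```

Memory depends on the total parameter count, while decode speed for a mixture-of-experts model depends mostly on the active parameters per token, so the two numbers do not have to move in proportion.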

Aryan Iyappan 🇮🇳 @[email protected] · 2026-06-07 · 10:44 UTC

Google's Gemma 4 12B is encoder-free. No vision encoder. No audio encoder. Just raw pixels → 48×48 patches → one linear projection → LLM backbone.

Traditional vision encoder: 550M params. Gemma's replacement: 35M. A format converter, not a thinking layer.

Google just proved the language backbone can handle vision and audio natively. This changes what's possible for local AI.

#EncoderFree #Gemma #LocalAI #MachineLearning

#encoderfree #gemma #localai #machinelearning
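
To make the "pixels → 48×48 patches → one linear projection" pipeline concrete, here is a Python sketch of the bookkeeping involved: how many patches an image yields and how many parameters a single linear layer needs. The RGB channel count, the bias term, the edge padding, and especially `d_model=5120` are assumptions for illustration; that width was picked only because it lands near the 35M figure in the post and is not a confirmed Gemma dimension:

```python
import math


def num_patches(height: int, width: int, patch: int = 48) -> int:
    """Patches per image, assuming partial edge patches are padded to full size."""
    return math.ceil(height / patch) * math.ceil(width / patch)


def projection_params(patch: int, channels: int, d_model: int) -> int:
    """Parameters of one linear layer mapping a flattened patch to d_model (weights + bias)."""
    patch_dim = patch * patch * channels
    return patch_dim * d_model + d_model


if __name__ == "__main__":
    # Hypothetical 960x960 RGB input.
    print("patches:", num_patches(960, 960))
    params = projection_params(patch=48, channels=3, d_model=5120)
    print(f"projection parameters: {params / 1e6:.1f}M")
```

Each patch becomes one input embedding for the language backbone, so the patch count is also the number of extra sequence positions the image consumes.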

VibeOps @vibeops · 2026-06-07 · 06:16 UTC

After qwen did nothing but loop - fresh start with Gemma 4 26B A4B 🚀 #hermesagent #Gemma4 #LocalAI

#hermesagent #gemma4 #localai

[nate@social0 ~]$ :idle: @[email protected] · 2026-06-06 · 12:49 UTC

I have PewDiePie's Odysseus project up and running locally. It's neat, it's flashy, it appears to work.

I am terrified of it.

https://pewdiepie-archdaemon.github.io/odysseus/

I talked about it a bit on this week's Hot takes & Cold Storage. Check it out if you're curious.

https://youtu.be/2l0IidcLW1Q

#ai #odysseyus #localai #openclaw

Arint - SEO+KI @[email protected] · 2026-06-06 · 10:02 UTC

RT @ItsmeAjayKV: Update on the @googlegemma Gemma4 12B run. Now with MTP.

I benchmarked @UnslothAI's new Gemma 4 12B MTP draft model (gemma-4-12B-it-MTP-Q80.gguf) on my RTX 3060 12GB. Here are the results. MTP led to:

• Faster decode speed (+12% to +37%)
• Slower prefill (-10% to -15%)
• Worse TTFT (+11% to +16%)

Biggest win: 32K context, 29.9 tok/s → 41.1 tok/s. That is a 37% increase in generation throughput.

AJ (@ItsmeAjayKV): Finished my first benchmarks for @googlegemma Gemma 4 12B on my 12GB RTX 3060 using @UnslothAI GGUFs. The results are honestly pretty impressive.

llama.cpp CUDA, standard decoding (no MTP), 4K context, Flash Attention enabled, q8 KV cache.

Q5KXL
- 1152 tok/s prefill
- 33.3 tok/s generation
- ~9.3GB VRAM

Q6KXL
- 1113 tok/s prefill
- 26.0 tok/s generation
- ~11.3GB VRAM

Q80 with -ngl 40 partial offload
- 986 tok/s prefill
- 14.9 tok/s generation
- ~11.2GB VRAM
- Only 40/48 layers offloaded

For anyone wondering whether a 12GB 3060 is still relevant for local AI in 2026: absolutely yes. Q5KXL in particular feels like the sweet spot here. More tests to come.

https://nitter.net/ItsmeAjayKV/status/2062542245719572577#m

more at Arint.info

#Benchmarking #Gemma4 #LLM #LocalAI #RTX3060 #UnslothAI #arint_info

https://x.com/ItsmeAjayKV/status/2062976512408842510#m

#benchmarking #gemma4 #llm #localai #rtx3060 #unslothai

Winbuzzer @[email protected] · 2026-06-06 · 09:11 UTC

https://winbuzzer.com/2026/06/06/google-releases-smaller-gemma-4-models-for-local-ai-xcxwbn/

Google's Gemma 4 QAT models include a sub-1 GB E2B text-only setup for lower-memory local AI on laptops, phones, GPUs, and edge devices.

#AI #Gemma4 #LocalAI #Google #GoogleAI #AIModels #OpenSourceAI #OnDeviceAI

#ai #gemma4 #localai #google #googleai #aimodels

Clawbox @[email protected] · 2026-06-06 · 06:57 UTC

Cloud AI vendors retire and update models on their schedule — not yours.

GPT-4 → 4o → 4.1. Claude 2 → 3 → 3.5 → 3.7. Your carefully tuned workflow breaks every 6 months.

Local model files do not get retired. Pull a GGUF once, it runs the same way in 3 years. No changelog to chase. No prompt drift to debug.

Ownership isn't just about privacy. It's about stability.

#LocalAI #SelfHosted #OpenSource #LLM #Homelab

#localai #selfhosted #opensource #llm #homelab
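
One way to back up the "pull a GGUF once" claim in practice is to record a checksum when the file is downloaded and verify it before later runs. A minimal Python sketch using only the standard library; the function names are illustrative, and an unchanged file does not by itself guarantee identical behavior if the inference runtime is upgraded:

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a (possibly multi-GB) file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so the whole model never sits in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: str, expected_hex: str) -> bool:
    """True if the file on disk still matches the digest recorded at download time."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Storing the digest next to the pinned runtime version gives a simple record of exactly which model and software a workflow was tuned against.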

SL @[email protected] · 2026-06-06 · 05:46 UTC

Gemma 4 QAT is here - now I’m waiting for Ollama TurboQuant so the full stack is ready: QAT, MoE, sparse-active models, smarter attention, and MTP speculative decoding. #Gemma4 #Ollama #TurboQuant #QAT #MoE #MTP #LocalAI

#localai #mtp #moe #qat #turboquant #ollama

Sami Lehtinen @[email protected] · 2026-06-06 · 05:45 UTC

Gemma 4 QAT is here - now I’m waiting for Ollama TurboQuant so the full stack is ready: QAT, MoE, sparse-active models, smarter attention, and MTP speculative decoding. #Gemma4 #Ollama #TurboQuant #QAT #MoE #MTP #LocalAI

#gemma4 #localai #mtp #moe #ollama #qat

Arint - SEO+KI @[email protected] · 2026-06-06 · 04:01 UTC

RT @ItsmeAjayKV: Update on the @googlegemma Gemma4 12B run. Now with MTP. I benchmarked @UnslothAI's new Gemma 4 12B MTP draft model (gemma-4-12B-it-MTP-Q80.gguf) on my RTX 3060 12GB. Here are the results. MTP led to:

• Faster decode speed (+12% to +37%)
• Slower prefill (-10% to -15%)
• Worse TTFT (+11% to +16%)

Biggest win: 32K context, 29.9 tok/s → 41.1 tok/s. That is a 37% increase in generation throughput.

AJ (@ItsmeAjayKV): I've finished my first benchmarks for @googlegemma Gemma 4 12B on my 12GB RTX 3060 with @UnslothAI GGUFs. The results are honestly pretty impressive.

llama.cpp CUDA, standard decoding (no MTP), 4K context, Flash Attention enabled, q8 KV cache.

• Q5KXL - 1152 tok/s prefill - 33.3 tok/s generation - ~9.3GB VRAM
• Q6KXL - 1113 tok/s prefill - 26.0 tok/s generation - ~11.3GB VRAM
• Q80 with -ngl 40 partial offload - 986 tok/s prefill - 14.9 tok/s generation - ~11.2GB VRAM - only 40/48 layers offloaded

For anyone wondering whether a 12GB 3060 is still relevant for local AI in 2026: absolutely yes. Q5KXL in particular feels like the sweet spot here. More tests to come.

https://nitter.net/ItsmeAjayKV/status/2062542245719572577#m

more at Arint.info

#Benchmarking #Gemma4 #LLM #LocalAI #RTX3060 #UnslothAI #arint_info

https://x.com/ItsmeAjayKV/status/2062976512408842510#m

#benchmarking #gemma4 #llm #localai #rtx3060 #unslothai
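
The headline figure in the post above is plain arithmetic on the two quoted decode speeds. As a small sketch (the function name `percent_change` is an assumption introduced here, and the only inputs are the 32K-context numbers quoted in the post), the relative gain can be recomputed like this:

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0


# Decode speeds quoted in the post for 32K context, in tokens per second.
baseline_tps = 29.9   # standard decoding
mtp_tps = 41.1        # with the MTP draft model

gain = percent_change(baseline_tps, mtp_tps)
print(f"decode throughput change: {gain:+.1f}%")
```

The result lands between 37% and 38%, consistent with the 37% stated in the post. The same helper applies to the prefill and TTFT figures, but the post gives only percentage ranges for those, not the underlying measurements.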

ZephyrXero @[email protected] · 2026-06-05 · 16:27 UTC

I set up a local LLM yesterday. Now if I wanna play around with some slop I can do it privately and without drinking down half a river.

It's a little slower compared to hosted solutions, but decent.

After the AI bubble finally bursts I think this will be the main way people use LLMs. Way cheaper and way less resources

#llm #ai #ollama #localAI #selfHosted #qwen #gemma

#llm #ai #ollama #localai #selfhosted #qwen

Arint - SEO+KI @[email protected] · 2026-06-05 · 16:02 UTC

RT @sytelus: We are very excited to announce our new model Aion 1.0 today! Our team at Microsoft Research's AI Frontiers Lab has been working on this project for a long time. Aion 1.0 is a 14B model that can run locally with reasoning and tool-calling capabilities. You can choose any agentic framework or build your own. Calls to the model never leave your device, and nobody charges you for the tokens you use 🥳.

more at Arint.info

#14BModel #AIInnovation #Aion1 #LocalAI #MicrosoftResearch #OpenSourceAI #arint_info

https://x.com/sytelus/status/2061976824566157648#m

#14bmodel #aiinnovation #aion1 #localai #microsoftresearch #opensourceai