#llmbenchmark — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #llmbenchmark, aggregated by home.social.
-
Antigravity 2.0 Tops the OpenSCAD Architectural 3D LLM Benchmark
https://modelrift.com/blog/openscad-llm-benchmark/
#HackerNews #Antigravity #2.0 #OpenSCAD #3DPrinting #LLMBenchmark #Architecture
-
That's what I call a «meaningful» LLM benchmark. 😉
(... or how to debunk the German meaning of «Intelligenz».)
https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html
-
Google’s new Gemini 3.1 Pro claims to double its reasoning scores on the latest benchmark, pushing LLM capabilities further. Curious how this stacks up against other open‑source models? Dive into the details and see what the numbers reveal. #GoogleGemini #Gemini3_1 #LLMbenchmark #GenerativeAI
🔗 https://aidailypost.com/news/google-gemini-31-pro-doubles-reasoning-performance-benchmark
-
Open vs. closed models: the gap between benchmark scores and real-world performance 🤖 #AI #MôHìnhLLM #DeepSeek #Grok #Claude
Open: rank high on the SWE benchmark but tend to get instructions wrong and need close supervision.
Closed (Claude 4.5 Haiku): works independently, handles long documents, and carries out complex tasks smoothly.
Question: is everyone running into the same problem, or is it just me? #MôHìnhKín #HiệuNăngThựcTế #AIResearch #OpenSource #LLMBenchmark
https://www.reddit.com/r/LocalLLaMA/comments/1qrl0j9/open_models_vs_closed_models_discrepancy_in/
-
Discussion of the performance of a dual RTX PRO 6000 AI workstation with 1.15TB of RAM: comparing GPU-only (INT4) vs CPU+GPU (fp8) inference on the MiniMax-M2.1 model. Results: GPU-only is 2–4x faster at prefill but handles at most ~3 concurrent requests because of KV-cache limits. fp8 is slower but scales better to 10+ users, especially with long contexts. Queue time is a key bottleneck. Suitable for in-house agentic coding. #AIWorkstation #LLMBenchmark #MultiUserAI #GPUvsCPU #LocalLLM #HPC #MachineLearning #Tín
-
New benchmark shows top LLMs struggle in real mental health care
https://swordhealth.com/newsroom/sword-introduces-mindeval
#HackerNews #LLMbenchmark #MentalHealth #AIinHealthcare #MentalHealthTech #HealthcareInnovation
-
The proof that #benchmarks on #LLM models are utterly useless.
Maybe it's time to focus on real-world performance and practical applications instead of chasing numbers?
#llm #ai #aibenchmarks #llmbenchmark #machinelearning #artificialintelligence #openai #gpt5 #chatgpt
-
🚀 Featured in L'Usine Digitale!
Our independent multilingual LLM benchmark Phare was highlighted in an article detailing some key insights from our research.
🔎 Key finding: LLMs perpetuate biases in their own content while recognizing those same biases when asked directly.
Thanks to L'Usine Digitale and Célia Séramour for this coverage.
Read here: https://gisk.ar/4lCHoUB
-
🚀 Claude 4 didn’t just assist—it outperformed.
In a 7-hour live dev session, it refactored legacy Java with zero hallucinations, full memory, and enterprise-grade precision.
🔍 We compared Claude 4 vs ChatGPT across 5 key metrics, and the results will surprise you.
📖 Read the full breakdown:
👉 https://medium.com/@rogt.x1997/claude-4-vs-chatgpt-the-5-metrics-that-prove-its-not-just-an-assistant-48f82384c69f
📌 #Claude4 #LLMbenchmark #AIengineering #Anthropic
-
Thanks to Kyle Wiggers for this article. We're honored to see our research covered by TechCrunch. 🤝
Read the article here: https://techcrunch.com/2025/05/08/asking-chatbots-for-short-answers-can-increase-hallucinations-study-finds/
-
The article presents some key findings from our benchmark:
- Most widely used models aren't necessarily the most reliable
- Some models tend to agree with users regardless of factual accuracy
- The way questions are phrased impacts response reliability
Thanks to Les Echos and Joséphine Boone for this coverage 🤝
Read the article here: https://www.lesechos.fr/tech-medias/intelligence-artificielle/desinformation-rumeurs-influences-quelles-ia-hallucinent-le-plus-2163628
-
Phare is developed by Giskard with Google DeepMind, the European Commission and Bpifrance as research & funding partners.
👉 Full analysis: https://www.giskard.ai/knowledge/good-answers-are-not-necessarily-factual-answers-an-analysis-of-hallucination-in-leading-llms
Benchmark results: https://phare.giskard.ai
-
The replay of our session at Forum INCYBER Europe (FIC) is now online 🎬
Watch our CTO present the initial results of Phare, our multilingual and independent LLM benchmark that evaluates hallucination, factual accuracy, bias, and harm potential.
The session features Matteo Dora and Elie Bursztein (Google DeepMind).
Full recording linked below 👇