home.social

#llmbenchmark — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #llmbenchmark, aggregated by home.social.

  1. That's what I call a «meaningful» LLM benchmark. 😉

    (... or how to debunk the German meaning of «Intelligenz».)

    petergpt.github.io/bullshit-be

    #ai #llm #llmbenchmark #benchmark

  6. Google’s new Gemini 3.1 Pro claims to double its reasoning scores on the latest benchmark, pushing LLM capabilities further. Curious how this stacks up against open‑source models? Dive into the details and see what the numbers reveal. #GoogleGemini #Gemini3_1 #LLMbenchmark #GenerativeAI

    🔗 aidailypost.com/news/google-ge

  9. Open vs. closed models: the gap between benchmark scores and real-world performance 🤖 #AI #MôHìnhLLM #DeepSeek #Grok #Claude

    Open: rank high on the SWE benchmark but often get instructions wrong and need close supervision.
    Closed (Claude 4.5 haiku): works independently, handles long documents and carries out complex tasks smoothly.
    Question: does everyone run into the same problem, or is it just me?

    #MôHìnhKín #HiệuNăngThựcTế #AIResearch #OpenSource #LLMBenchmark

    reddit.com/r/LocalLLaMA/commen

  10. On the performance of a dual RTX PRO 6000 AI workstation with 1.15 TB of RAM: a comparison of GPU-only (INT4) and CPU+GPU (FP8) inference on the MiniMax-M2.1 model. Results: GPU-only is 2–4x faster at prefill but handles at most ~3 concurrent requests because of KV-cache limits. FP8 is slower but scales better to 10+ users, especially with long contexts. Queue time is the key bottleneck. Suitable for in-house agent coding. #AIWorkstation #LLMBenchmark #MultiUserAI #GPUvsCPU #LocalLLM #HPC #MachineLearning #Tín

  11. The proof that #benchmarks on #LLM models are utterly useless.

    Maybe it's time to focus on real-world performance and practical applications instead of chasing numbers?

    #llm #ai #aibenchmarks #llmbenchmark #machinelearning #artificialintelligence #openai #gpt5 #chatgpt

  16. 🚀 Featured in L'Usine Digitale!

    Our independent multilingual LLM benchmark Phare was highlighted in an article detailing some key insights from our research.

    🔎 Key finding: LLMs perpetuate biases in their own content while recognizing those same biases when asked directly.

    Thanks to L'Usine Digitale and Célia Séramour for this coverage.
    Read here: gisk.ar/4lCHoUB

    #LLMBenchmark #AISafety #AISecurity

  21. 🚀 Claude 4 didn’t just assist—it outperformed.
    In a 7-hour live dev session, it refactored legacy Java with zero hallucinations, full memory, and enterprise-grade precision.

    🔍 We compared Claude 4 vs ChatGPT across 5 key metrics — and the results will surprise you.

    📖 Read the full breakdown:
    👉 medium.com/@rogt.x1997/claude-

    📌 #Claude4 #LLMbenchmark #AIengineering #Anthropic

  23. Thanks to Kyle Wiggers for this article. We're honored to see our research covered by TechCrunch. 🤝

    Read the article here: techcrunch.com/2025/05/08/aski

  24. The article presents some key findings from our benchmark:
    - Most widely used models aren't necessarily the most reliable
    - Some models tend to agree with users regardless of factual accuracy
    - The way questions are phrased impacts response reliability

    Thanks to Les Echos and Joséphine Boone for this coverage 🤝

    Read the article here: lesechos.fr/tech-medias/intell

    #AISecurity #LLMBenchmark #LesEchos

  29. Phare is developed by Giskard with Google DeepMind, the European Commission and Bpifrance as research & funding partners.

    👉 Full analysis: giskard.ai/knowledge/good-answ
    Benchmark results: phare.giskard.ai

    #AISecurity #LLMBenchmark #LLMs

  34. The replay of our session at Forum INCYBER Europe (FIC) is now online 🎬

    Watch our CTO present the initial Phare results - our multilingual and independent LLM benchmark that evaluates hallucination, factual accuracy, bias, and harm potential.

    The session features Matteo Dora and Elie Bursztein (Google DeepMind).

    Full recording linked below 👇

    #LLMBenchmark #AISecurity #ForumINCYBER #Research