#aibenchmarking — Public Fediverse posts on home.social

Analyst207 @[email protected] · 2026-05-14 · 21:39 UTC

AI Optimism Outpaces Evidence as Few Track Results

Most executives claim their AI initiatives are exceeding expectations, but surprisingly, fewer than half actually measure their results, leaving a gap between AI optimism and real-world impact. A new benchmarking framework aims to separate hype from reality, helping companies identify genuine AI success stories.

https://osintsights.com/ai-optimism-outpaces-evidence-as-few-track-results?utm_source=mastodon&utm_medium=social

#ArtificialIntelligence #AiBenchmarking #AiTracking #BusinessLeadership #EnterpriseTechnology

#artificialintelligence #aibenchmarking #aitracking #businessleadership #enterprisetechnology

AI Daily Post @[email protected] · 2026-02-08 · 13:43 UTC

New benchmark reveals that top multimodal models still stumble below 50% accuracy on basic visual entity tasks. The gap highlights limits in current vision‑language training and raises questions about real‑world reliability. Dive into the findings and what they mean for future AI research. #MultimodalLearning #VisionLanguage #EntityRecognition #AIBenchmarking

🔗 https://aidailypost.com/news/top-multimodal-models-fail-exceed-50-accuracy-basic-visual-entity

#multimodallearning #visionlanguage #entityrecognition #aibenchmarking

Mind Lude @[email protected] · 2025-09-25 · 12:51 UTC

Samsung just dropped TRUEBench, a new benchmark designed to actually measure how useful enterprise AI models are in the real world, not just how smart they sound on paper. Multilingual, real-task focused, and even co-developed by AI.

Finally, a benchmark that speaks fluent business. Check out the details: https://www.artificialintelligence-news.com/news/samsung-benchmarks-real-productivity-enterprise-ai-models/

What's your biggest AI productivity hurdle? #AIBenchmarking #EnterpriseAI #Samsung #LLMs #TechNews

#aibenchmarking #enterpriseai #samsung #llms #technews

Dining & Cooking @[email protected] · 2025-03-20 · 17:55 UTC

Nvidia Benchmark Recipes Bring Deep Insights In AI Performance https://www.diningandcooking.com/1968162/nvidia-benchmark-recipes-bring-deep-insights-in-ai-performance/ #ai #AIBenchmarking #DGXCloud #DGXCloudBenchmarkRecipes #GTC #GTC2025 #Nvidia #RecipeTopics #Recipes #tco

#ai #aibenchmarking #dgxcloud #dgxcloudbenchmarkrecipes #gtc #gtc2025

ResearchBuzz: Firehose @[email protected] · 2025-01-28 · 08:10 UTC

ZDNet: ‘Humanity’s Last Exam’ benchmark is stumping top AI models – can you do any better? “On Thursday, Scale AI and the Center for AI Safety (CAIS) released Humanity’s Last Exam (HLE), a new academic benchmark aiming to ‘test the limits of AI knowledge at the frontiers of human expertise,’ Scale AI said in a release. The test consists of 3,000 text and multi-modal questions on more than […]

https://rbfirehose.com/2025/01/28/zdnet-humanitys-last-exam-benchmark-is-stumping-top-ai-models-can-you-do-any-better/

#ai #aibenchmarking #benchmarking #largelanguagemodelsllm_ #llm

Miguel Afonso Caetano @[email protected] · 2024-07-20 · 13:24 UTC

#AI #GenerativeAI #LLMs #AIBenchmarking: "Technology companies are locked in a frenzied arms race to release ever-more powerful artificial intelligence tools. To demonstrate that power, firms subject the tools to question-and-answer tests known as AI benchmarks and then brag about the results.

Google’s CEO, for example, said in December that a version of the company’s new large language model Gemini had “a score of 90.0%” on a benchmark known as Massive Multitask Language Understanding, making it “the first model to outperform human experts” on it. Not to be upstaged, Meta CEO Mark Zuckerberg was soon bragging that the latest version of his company’s Llama model “is already around 82 MMLU.”

The problem, experts say, is that this test and others like it don’t tell you much, if anything, about an AI product — what sorts of questions it can reliably answer, when it can safely be used as a substitute for a human expert, or how often it avoids “hallucinating” false answers. “The yardsticks are, like, pretty fundamentally broken,” said Maarten Sap, an assistant professor at Carnegie Mellon University and co-creator of a benchmark. The issues with them become especially worrisome, experts say, when companies advertise the results of evaluations for high-stakes topics like health care or law."

https://themarkup.org/artificial-intelligence/2024/07/17/everyone-is-judging-ai-by-these-tests-but-experts-say-theyre-close-to-meaningless

#ai #generativeai #llms #aibenchmarking