home.social

#aibenchmark — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #aibenchmark, aggregated by home.social.

  1. Tested Cogito V1 14B Qwen on my Linux server: 45 t/s, 9.7GB VRAM, and the same IDA self-awareness trick its 8B sibling pulled. Run 2 deliberately stepped back to brute force because a beginner probably needed the simpler version first. Run 3 came back stronger, with a nice candy analogy. That's DeepCogito's IDA training turning Qwen into something way better.

    Read the full breakdown below.

    #LocalAI #Ollama #HomeLabAI #LLM #AIBenchmark

    goarcherdynamics.com/2026/04/0

  2. Tested Cogito V1 8B on my Linux server: 83 t/s, 5.4GB VRAM, 131k context. The real story is that it deliberately wrote worse code because it decided a beginner needed simplicity over efficiency -- and admitted it! That's IDA self-reflection making a live call.
    I guess a 5GB model with a conscience is worth more than a 70B model with none?

    Read the full breakdown below.

    #LocalAI #Ollama #HomeLabAI #LLM #AIBenchmark

    goarcherdynamics.com/2026/04/0

  3. Meituan Longcat has just released AMO Bench, a benchmark for evaluating AI on mathematics. According to it, Kimi k2 Thinking is the best AI at solving math problems. AMO Bench consists of 50 new problems at IMO-level difficulty, with highly accurate automatic grading.

    #AIBenchmark #MathAI #KimiK2Thinking #MeituanLongcat #TríTuệNhânTạo #ToánHọc

    reddit.com/r/LocalLLaMA/commen

  4. Testing the limits of AI

    Reuters & the New York Times report on a new test: Humanity's Last Exam. With 3,000 questions from more than 100 subject areas, it probes the limits of modern AI systems. Thorben Jansen of the IPN was involved in its development.

    🔗 More: lastexam.ai

    New York Times: reuters.com/technology/artific

    Reuters: reuters.com/technology/artific

    #AI #AIBenchmark #KI #HumanitysLastExam