home.social

#evaluation — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #evaluation, aggregated by home.social.

  1. On whether LLMs can abstain effectively and whether chain-of-thought can help, two recent papers seem at odds on the surface. COLING 2025 finds prompted CoT raises abstention on instruct models. AbstentionBench (NeurIPS 2025) finds extending the reasoning budget lowers it on a trained reasoner. What gives?

    benjaminhan.net/posts/20260527

    #Metacognition #LLMs #Reasoning #Evaluation #AI

  6. Given a problem queue and a token budget, can an LLM plan which to attempt, in what order, and how much to spend on each — before any execution feedback? TRIAGE tests 20 frontier and open-source LLMs. Most plan worse than random. Reasoning-trained modes systematically lose to standard ones. Even when shown its own per-problem budget, the best complier respects it on 37% of attempts.

    benjaminhan.net/posts/20260523

    #Paper #AI #LLMs #Metacognition #Evaluation #AgenticSystems

  11. Are some frontier LLMs better than others at knowing when they're wrong? And is some knowledge harder to self-monitor than other knowledge? An atlas of 33 models × 6 MMLU domains: Anthropic clusters at the top with tight ranges, Gemma trails widely. Applied/Professional is reliably the easiest domain across the panel; Formal Reasoning and Natural Science the hardest. Looking at only aggregate scores per model would hide this.

    benjaminhan.net/posts/20260522

    #Metacognition #LLMs #Evaluation #AI

  16. Registrations remain open for the EDA Test & Evaluation Community Days 2026, taking place from 29 Sept. to 1 Oct. in Kiel 🇩🇪

    Experts from government, armed forces, industry, research and academia will discuss EU cooperation in #Test and #Evaluation.

    tecd.eda.europa.eu
    ---
    nitter.net/EUDefenceAgency/sta

  19. China stops updating a 22-year journal ranking system once used to evaluate and fund researchers.

    A shift away from “where you publish” toward “what you contribute”.

    🔗 nature.com/articles/d41586-026

    #SciencePolicy #Research #Academia #Evaluation #Publishing

  24. The Green Party's parliamentary group has filed a motion demanding that the announced restructuring of the "Demokratie leben" (Democracy Lives) program be suspe... news.osna.fm/?p=46947 | #news #demand #democracy #evaluation #greens

  29. Two new instruments on the topics “Schutzkonzept” (safeguarding concept) and “Demokratische Schulkultur” (democratic school culture) have been added to the EVA Schule survey portal. EVA Schule can be used to run surveys in the context of school quality development. Respondents can include teachers, students, parents, and pedagogical staff.

    #Evaluation #schulischequalitätsentwicklung #Schutzkonzept #Demokratiebildung #EVASchule #ines #interneEvaluationinSchulen

  33. Combining #water basins appears to allow the inter-basin #transfer of water without #environmental #assessment, #evaluation of cumulative effects, adequate watershed #management or #public #consultation beyond directly affected parties. In other words, the #legislation deems the two distinct water basins to be one. Expanding Ministerial #power for decision-making obviates the necessity for public consultation, environmental assessment and #parliamentary #debate.

    5/24

  38. Evaluation is a measurement problem. If you can't define what success looks like operationally, your evaluation framework is measuring noise.

    #Evaluation #Measurement #AI

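  39. US$100 ultra-old card: NVIDIA V100 AI performance test turns out faster than the RTX 3060
      In hands-on LLM tests, the NVIDIA V100 unexpectedly beat several consumer graphics cards. Hardware Have […]
    #人工智能 #評測 #LLM #NVIDIA
    unwire.hk/2026/05/11/v100-llm-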

  42. Most evaluations begin with questions, a pattern reflected in typical ToRs. Routine questions, to be answered one by one, delivering recommendations.
    But now AI does this faster, cheaper. Plausible and convincing.
    If evaluation remains a standardised Q&A, humans offer little added value.
    The problem predates AI. We should have known that a sharper question, a reframed problem, is itself a finding.
    AI is good at answers. The human contribution is the question worth asking.

    #evaluation #AI
