#evaluation — Public Fediverse posts on home.social

Benjamin Han @[email protected] · 2026-05-28 · 06:49 UTC

On whether LLMs can abstain effectively and whether chain-of-thought can help, two recent papers seem at odds on the surface. COLING 2025 finds prompted CoT raises abstention on instruct models. AbstentionBench (NeurIPS 2025) finds extending the reasoning budget lowers it on a trained reasoner. What gives?

https://benjaminhan.net/posts/20260527-prompted-vs-trained-cot-abstention/?utm_source=mastodon&utm_medium=social

#Metacognition #LLMs #Reasoning #Evaluation #AI

Benjamin Han @[email protected] · 2026-05-24 · 00:06 UTC

Given a problem queue and a token budget, can an LLM plan which to attempt, in what order, and how much to spend on each — before any execution feedback? TRIAGE tests 20 frontier and open-source LLMs. Most plan worse than random. Reasoning-trained modes systematically lose to standard ones. Even when shown its own per-problem budget, the best complier respects it on 37% of attempts.

https://benjaminhan.net/posts/20260523-triage-metacognitive-control/?utm_source=mastodon&utm_medium=social

#Paper #AI #LLMs #Metacognition #Evaluation #AgenticSystems

Benjamin Han @[email protected] · 2026-05-23 · 01:51 UTC

Are some frontier LLMs better than others at knowing when they're wrong? And is some knowledge harder to self-monitor than other knowledge? An atlas of 33 models × 6 MMLU domains: Anthropic clusters at the top with tight ranges, Gemma trails widely. Applied/Professional is reliably the easiest domain across the panel; Formal Reasoning and Natural Science the hardest. Looking at only aggregate scores per model would hide this.

https://benjaminhan.net/posts/20260522-metacognition-atlas/?utm_source=mastodon&utm_medium=social

#Metacognition #LLMs #Evaluation #AI

European Defence Agency @[email protected] · 2026-05-21 · 17:02 UTC

Registrations remain open for the EDA Test & Evaluation Community Days 2026, taking place from 29 Sept. to 1 Oct. in Kiel 🇩🇪

Experts from government, armed forces, industry, research and academia will discuss EU cooperation in #Test and #Evaluation.

https://tecd.eda.europa.eu
---
https://nitter.net/EUDefenceAgency/status/2057011589195190396#m

LeidenForce @[email protected] · 2026-05-21 · 11:58 UTC

China stops updating a 22-year journal ranking system once used to evaluate and fund researchers.

A shift away from “where you publish” toward “what you contribute”.

🔗 https://www.nature.com/articles/d41586-026-01216-1

#SciencePolicy #Research #Academia #Evaluation #Publishing
