#evaluation — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #evaluation, aggregated by home.social.
-
On whether LLMs can abstain effectively and whether chain-of-thought can help, two recent papers seem at odds on the surface. COLING 2025 finds prompted CoT raises abstention on instruct models. AbstentionBench (NeurIPS 2025) finds extending the reasoning budget lowers it on a trained reasoner. What gives?
-
Given a problem queue and a token budget, can an LLM plan which to attempt, in what order, and how much to spend on each — before any execution feedback? TRIAGE tests 20 frontier and open-source LLMs. Most plan worse than random. Reasoning-trained modes systematically lose to standard ones. Even when shown its own per-problem budget, the best complier respects it on 37% of attempts.
-
Are some frontier LLMs better than others at knowing when they're wrong? And is some knowledge harder to self-monitor than other knowledge? An atlas of 33 models × 6 MMLU domains: Anthropic clusters at the top with tight ranges, Gemma trails widely. Applied/Professional is reliably the easiest domain across the panel; Formal Reasoning and Natural Science the hardest. Looking at only aggregate scores per model would hide this.
https://benjaminhan.net/posts/20260522-metacognition-atlas/?utm_source=mastodon&utm_medium=social
-
Registrations remain open for the EDA Test & Evaluation Community Days 2026, taking place from 29 Sept. to 1 Oct. in Kiel 🇩🇪
Experts from government, armed forces, industry, research and academia will discuss EU cooperation in #Test and #Evaluation.
https://tecd.eda.europa.eu
---
https://nitter.net/EUDefenceAgency/status/2057011589195190396#m
-
China stops updating a 22-year journal ranking system once used to evaluate and fund researchers.
A shift away from “where you publish” toward “what you contribute”.
-
The Green Party's parliamentary group has filed a motion demanding that the announced restructuring of the "Demokratie leben" (Democracy Lives) program be suspe... https://news.osna.fm/?p=46947 | #news #demand #democracy #evaluation #greens
-
https://www.europesays.com/ch-fr/135867/ WHO maintains its assessment of the hantavirus at "low risk" #Actualités #ÉconomieEtFinances #épidémie #épidémieDeHantavirusMVHondius #évaluation #MaladiesContagieuses #MaladiesEtétatDeSanté #Monde #News #OMS #passager #risque #Santé #Suisse #surveillance #Transports #virus
-
Two new instruments on the topics "Schutzkonzept" (safeguarding concept) and "Demokratische Schulkultur" (democratic school culture) have been added to the EVA Schule survey portal. EVA Schule can be used to run surveys in the context of school quality development. Teachers, students, parents, and educational staff can each be surveyed.
#Evaluation #schulischequalitätsentwicklung #Schutzkonzept #Demokratiebildung #EVASchule #ines #interneEvaluationinSchulen
-
https://www.europesays.com/ch-fr/135540/ Hantavirus: the World Health Organization maintains its assessment at "low risk" #Actualités #ÉconomieEtFinances #épidémie #épidémieDeHantavirusMVHondius #évaluation #MaladiesContagieuses #MaladiesEtétatDeSanté #Monde #News #OMS #passager #risque #Santé #Suisse #surveillance #Transports #virus
-
Combining #water basins appears to allow the inter-basin #transfer of water without #environmental #assessment, #evaluation of cumulative effects, adequate watershed #management or #public #consultation beyond directly affected parties. In other words, the #legislation deems the two distinct water basins to be one. Expanding Ministerial #power for decision-making obviates the necessity for public consultation, environmental assessment and #parliamentary #debate.
5/24
-
Evaluation is a measurement problem. If you can't define what success looks like operationally, your evaluation framework is measuring noise.
-
https://www.europesays.com/ch-fr/129521/ Driving tests for people over 75 lack reliability, according to a study #étude #évaluation #OFROU #Santé #ScienceEtTechnologie #Suisse #test #Test(Q1003030)(#98)
-
https://www.europesays.com/afrique/101428/ Funding of associations: a lot of money, little equity and transparency ##Audit ##Comptabilite ##courdescomptes ##Evaluation ##FinancementEtranger ##JusticeSociale ##Reddition ##SocieteCivile #Associations #budget #contrôle #Démocratie #financement #gouvernance #Maroc #ONG #Politique #régulation #subventions #transparence
-
US$100 ultra-old card: NVIDIA V100 AI performance test turns out faster than the RTX 3060
In LLM tests, the NVIDIA V100 unexpectedly beat several consumer graphics cards. Hardware Have […]
#人工智能 #評測 #LLM #NVIDIA
https://unwire.hk/2026/05/11/v100-llm-test-results/ai/?utm_source=rss&utm_medium=rss&utm_campaign=v100-llm-test-results
-
https://www.europesays.com/ch-fr/125628/ Physiotherapists and occupational therapists: key roles in the rehabilitation of severe burn patients #autonomie #CentreDeSoins #cicatrisation #collaboration #constellation #CransMontana #DésastresEtAccidents #évaluation #famille #hôpital #HôpitauxEtCliniques #incendie #MaladiesEtétatDeSanté #Mobilité #patient #Peau #Santé #SoinsDeSanté #Suisse #VoiesRespiratoires
-
Most evaluations begin with questions, a pattern reflected in typical ToRs. Routine questions, to be answered one by one, delivering recommendations.
But now AI does this faster, cheaper. Plausible and convincing.
If evaluation remains a standardised Q&A, humans offer little added value.
The problem predates AI. We should have known that a sharper question, a reframed problem, is itself a finding.
AI is good at answers. The human contribution is the question worth asking.