Benjamin Han @[email protected] · 2026-05-24 · 05:24 UTC

Can a self-supervised model learn good visual representations without ever reconstructing pixels? JEPA, the program from FAIR now continued at AMI Labs, says yes by training the model to predict embeddings of missing data instead. This primer walks you through where JEPA came from, how it works, what's been demonstrated, and where it's headed.

https://benjaminhan.net/posts/20260523-what-is-jepa/?utm_source=mastodon&utm_medium=social

#ai #jepa #worldmodels

Benjamin Han @[email protected] · 2026-05-24 · 00:06 UTC

Given a problem queue and a token budget, can an LLM plan which to attempt, in what order, and how much to spend on each — before any execution feedback? TRIAGE tests 20 frontier and open-source LLMs. Most plan worse than random. Reasoning-trained modes systematically lose to standard ones. Even when shown its own per-problem budget, the best complier respects it on 37% of attempts.

https://benjaminhan.net/posts/20260523-triage-metacognitive-control/?utm_source=mastodon&utm_medium=social

#Paper #AI #LLMs #Metacognition #Evaluation #AgenticSystems

Benjamin Han @[email protected] · 2026-05-24 · 00:06 UTC

Do current LLMs know when to say "I don't know"? AbstentionBench (NeurIPS '25) tests 20 frontier models across 20 unanswerable-question datasets. Reasoning fine-tuning degrades abstention recall by ~24% — RLVR has no "abstain" action, so there's no gradient toward "I don't know." Models hedge in CoT and commit anyway in the final answer.

https://benjaminhan.net/posts/20260523-abstentionbench-unanswerable-questions/?utm_source=mastodon&utm_medium=social

#Paper #AI #LLMs #Metacognition #Benchmark #Reasoning #NeurIPS

NFL News @[email protected] · 2026-05-23 · 09:35 UTC

Keionte Scott considered impact pick for the Bucs ahead of OTAs https://www.rawchili.com/nfl/898946/ #BenjaminMorrison #BleacherReport #Buccaneers #Football #GaryDavenport #JacobParrish #KeionteScott #LanceZierlein #NFL #TampaBay #TampaBayBuccaneers

Benjamin Han @[email protected] · 2026-05-23 · 01:53 UTC

What collapses frontier-LLM metacognition more — a vivid survival-threat narrative, or a single "do not refuse" suffix? Factorial isolation across 11 models says: the suffix, conclusively. 8 of 11 lose up to 30.2 accuracy points on refuse/clarify/flag tasks when forced to commit to a confident answer. Anthropic's Constitutional AI is the only family immune — same capability floor as Gemini.

https://benjaminhan.net/posts/20260522-compliance-trap/?utm_source=mastodon&utm_medium=social

#Metacognition #AISafety #LLMs #AI

Benjamin Han @[email protected] · 2026-05-23 · 01:53 UTC

Can an LLM's own pre-solve and post-solve self-assessment signals drive a real test-time control loop? Yes — but only via a per-model SVM trained on labeled correctness, which lifts Sonnet-4.6 from 48.3 to 56.9 pooled accuracy on STEM/code/multimodal. The SVM is precisely the external verifier the "cannot-self-correct" line has argued the loop needs.

https://benjaminhan.net/posts/20260522-metacognitive-harness/?utm_source=mastodon&utm_medium=social

#metacognition #reasoning #llms #ai

Benjamin Han @[email protected] · 2026-05-23 · 01:51 UTC

Are some frontier LLMs better than others at knowing when they're wrong? And is some knowledge harder to self-monitor than other knowledge? An atlas of 33 models × 6 MMLU domains: Anthropic clusters at the top with tight ranges, Gemma trails widely. Applied/Professional is reliably the easiest domain across the panel; Formal Reasoning and Natural Science the hardest. Looking at only aggregate scores per model would hide this.

https://benjaminhan.net/posts/20260522-metacognition-atlas/?utm_source=mastodon&utm_medium=social

#llms #evaluation #ai #metacognition
