home.social

1000 results for “Benja”

  1. This #LongRunSunday I'm back on the #Snoqualmie Valley Trail, this time starting from Carnation southbound. I ran a half marathon: 6.6mi up and 6.6mi down, walking the last mile for recovery. Pace: 10'27"/mi up and 9'09"/mi down.

    This is the longest distance I've run in the past couple of months due to injury. VO2max has finally started climbing back a little, to 46.4.

    #Video recap: youtube.com/watch?v=dDwljQWGe1
    More photos: benjaminhan.net/posts/20260524

    #Running #Trailrunning #Photo #PNW

  3. Is AI going to displace human labor, and what's the consequence if it does? Daron Acemoglu, MIT Institute Professor and 2024 Nobel laureate, makes the case in this 37-min interview: AI is being pushed to replace workers rather than augment them, productivity gains aren't showing up in firms adopting it, the bubble looks real and macro-fragile, and getting AI's direction wrong may shape liberal democracy's future.

    benjaminhan.net/posts/20260524

    #AI #LLMs #Economics #Jobs #FutureOfWork #Society #Policy

  4. Benjamin Netanyahu says Donald Trump backs dismantling Iran’s nuclear enrichment sites

    Prime Minister Benjamin Netanyahu spoke with US President Donald Trump, who agreed that Iran’s nuclear enrichment sites will…
    #Conflict #Conflicts #War #BenjaminNetanyahu #DonaldTrump #Iran #IslamabadDeclaration #Israel #middleeast #middleeastcrisis #nuclear #nuclearbomb #OperationEpicFury #OperationRoaringLion #uranium
    europesays.com/3014388/

  5. Can a self-supervised model learn good visual representations without ever reconstructing pixels? JEPA, the program from FAIR now continued at AMI Labs, says yes by training the model to predict embeddings of missing data instead. This primer walks you through where JEPA came from, how it works, what's been demonstrated, and where it's headed.

    benjaminhan.net/posts/20260523

    #AI #JEPA #WorldModels

  9. Given a problem queue and a token budget, can an LLM plan which to attempt, in what order, and how much to spend on each — before any execution feedback? TRIAGE tests 20 frontier and open-source LLMs. Most plan worse than random. Reasoning-trained modes systematically lose to standard ones. Even when shown its own per-problem budget, the best complier respects it on 37% of attempts.

    benjaminhan.net/posts/20260523

    #Paper #AI #LLMs #Metacognition #Evaluation #AgenticSystems

  14. Do current LLMs know when to say "I don't know"? AbstentionBench (NeurIPS '25) tests 20 frontier models across 20 unanswerable-question datasets. Reasoning fine-tuning degrades abstention recall by ~24% — RLVR has no "abstain" action, so there's no gradient toward "I don't know." Models hedge in CoT and commit anyway in the final answer.

    benjaminhan.net/posts/20260523

    #Paper #AI #LLMs #Metacognition #Benchmark #Reasoning #NeurIPS

  19. What collapses frontier-LLM metacognition more — a vivid survival-threat narrative, or a single "do not refuse" suffix? Factorial isolation across 11 models says: the suffix, conclusively. 8 of 11 lose up to 30.2 accuracy points on refuse/clarify/flag tasks when forced to commit to a confident answer. Anthropic's Constitutional AI is the only family immune — same capability floor as Gemini.

    benjaminhan.net/posts/20260522

    #Metacognition #AISafety #LLMs #AI

  20. Can an LLM's own pre-solve and post-solve self-assessment signals drive a real test-time control loop? Yes — but only via a per-model SVM trained on labeled correctness, which lifts Sonnet-4.6 from 48.3 to 56.9 pooled accuracy on STEM/code/multimodal. The SVM is precisely the external verifier the "cannot-self-correct" line has argued the loop needs.

    benjaminhan.net/posts/20260522

    #Metacognition #Reasoning #LLMs #AI

  25. Are some frontier LLMs better than others at knowing when they're wrong? And is some knowledge harder to self-monitor than other knowledge? An atlas of 33 models × 6 MMLU domains: Anthropic clusters at the top with tight ranges, Gemma trails widely. Applied/Professional is reliably the easiest domain across the panel; Formal Reasoning and Natural Science the hardest. Looking at only aggregate scores per model would hide this.

    benjaminhan.net/posts/20260522

    #Metacognition #LLMs #Evaluation #AI
