home.social

#agenticsystems — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #agenticsystems, aggregated by home.social.

  1. Can frontier coding agents rebuild a program from scratch given only its executable and docs? No: a new 200-task benchmark finds that across nine models none fully resolves any task. The best model passes 95% of tests on just 3% of tasks. The same models score well on bug-fix benchmarks but resolve zero tasks here, so headline progress numbers don't extrapolate.

    benjaminhan.net/posts/20260526

    #Paper #LLMs #AgenticSystems #SoftwareEngineering #AI

  6. Recursive Superintelligence emerged earlier this month with $650M+ at $4B+, eight founders from OpenAI/Meta/Salesforce, and Peter Norvig on board. The recursive-self-improvement category — Anthropic, OpenAI, AMI Labs, Ineffable, SSI, plus a $4B namesake — is now consolidating before any of them has a public technical milestone.

    benjaminhan.net/posts/20260526

    #AGI #LLMs #AgenticSystems #AI

  7. Given a problem queue and a token budget, can an LLM plan which to attempt, in what order, and how much to spend on each — before any execution feedback? TRIAGE tests 20 frontier and open-source LLMs. Most plan worse than random. Reasoning-trained modes systematically lose to standard ones. Even when shown its own per-problem budget, the best complier respects it on 37% of attempts.

    benjaminhan.net/posts/20260523

    #Paper #AI #LLMs #Metacognition #Evaluation #AgenticSystems

  12. 🚀 Oh great, just what we needed—yet another "revolutionary" software stack from a self-proclaimed tech messiah. 🎉 Wes McKinney leads a "small team of veterans" to invent the wheel, again, but this time with *agentic systems* and lots of 🚀 emojis. Get ready for a wild ride of #buzzwords and imaginary breakthroughs! 🙄
    kenn.io/ #techinnovation #softwaredevelopment #agenticsystems #WesMcKinney #HackerNews #ngated

  17. A multi-agent LLM where each agent learns when to defer to a human, trained with GRPO on a cost-aware reward. Each defer event becomes SFT data, so the model gradually absorbs the human's expertise. Tunable cost knob trades accuracy against human-call budget at deployment, no retraining.

    benjaminhan.net/posts/20260520

    #ICLR #HumanInTheLoop #AgenticSystems #Metacognition #RL #AI

  22. MemSkill reframes LLM-agent memory operations as a learnable skill bank: an RL controller selects Top-K skills per span, an LLM designer periodically rewrites them from hard cases. But "self-evolving" overstates the test-time story — both controller and bank are trained offline and frozen at deployment; only per-trace memory updates online.

    benjaminhan.net/posts/20260519

    #LLMs #AgenticSystems #RL #Metacognition #AI

  27. Simon Willison has stopped reviewing every line Claude Code writes, on production systems! He's candid about the trust drift.

    Going forward, it will be the harness and guardrails around the agent that prevent disasters: sandboxes, write-fences, rollback paths.

    benjaminhan.net/posts/20260514

    #Coding #AgenticSystems #AIEngineering #AI
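
One of the guardrails named in the post above, a write-fence, can be sketched as follows. This is a minimal illustration under assumptions, not anything described in the linked post: the names `fenced_path` and `WriteFenceError` and the workspace path are hypothetical, and a real harness would pair this with a sandbox and a rollback path.

```python
from pathlib import Path


class WriteFenceError(PermissionError):
    """Raised when a requested write target falls outside the fence (hypothetical name)."""


def fenced_path(root: Path, requested: str) -> Path:
    """Return the resolved target path, or raise if it escapes `root`.

    Both paths are resolved first, so `..` segments and symlinks are
    collapsed before the containment check. An absolute `requested`
    path replaces `root` when joined, so it fails the check as well.
    """
    root = root.resolve()
    target = (root / requested).resolve()
    if not target.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise WriteFenceError(f"{target} is outside {root}")
    return target


if __name__ == "__main__":
    workspace = Path("/tmp/agent-workspace")  # hypothetical workspace root
    print(fenced_path(workspace, "notes/todo.txt"))
    try:
        fenced_path(workspace, "../outside.txt")
    except WriteFenceError as err:
        print("blocked:", err)
```

The check runs before any file is opened, so an agent's tool call can be rejected without touching the filesystem.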

  32. Agent token cost grows quadratically in turns without caching, roughly linearly with caching. A new post fits those curves to SWE-bench traces on three models. One cross-model finding stands out: Gemini 3 Flash takes 2× as many turns as GPT-5.2 or Opus 4.6, so despite its leaner per-turn output (~300 tokens vs ~1,000) it still burns more total tokens.

    benjaminhan.net/posts/20260513

    #AI #AgenticSystems #LLMs
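
A minimal sketch of the scaling claim in the post above, under a toy model that is an assumption here rather than the post's fitted curves: every turn appends a fixed number of tokens, an uncached turn re-sends the whole context so far, and a cached turn is charged only for its new tokens (cached reads are treated as free, whereas real pricing typically discounts them). The turn counts 20 and 40 are hypothetical; only the ~300 vs ~1,000 tokens per turn and the 2× turn ratio come from the post.

```python
def total_input_tokens(turns: int, tokens_per_turn: int, cached: bool) -> int:
    """Input tokens charged over a whole run under the toy model.

    Uncached: turn t re-sends t * tokens_per_turn tokens, so the total is
    tokens_per_turn * (1 + 2 + ... + turns), which is quadratic in turns.
    Cached: only the newly appended tokens are charged, which is linear.
    """
    if cached:
        return turns * tokens_per_turn
    return tokens_per_turn * turns * (turns + 1) // 2


if __name__ == "__main__":
    # Hypothetical runs: a verbose model with 20 turns, a lean one with 40.
    for label, turns, per_turn in [("verbose", 20, 1000), ("lean", 40, 300)]:
        uncached = total_input_tokens(turns, per_turn, cached=False)
        cached = total_input_tokens(turns, per_turn, cached=True)
        print(f"{label}: uncached={uncached} cached={cached}")
    # verbose: uncached=210000 cached=20000
    # lean: uncached=246000 cached=12000
```

In this toy model the lean, longer run costs more only in the uncached case, because the quadratic term depends on turn count rather than per-turn size. The post's fitted curves may behave differently.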

  37. The Inference Shift: Ben Thompson splits "inference" into two workloads. Answer inference (human waiting) stays on premium GPUs; agentic inference (no human waiting) migrates to a commodity memory hierarchy. Familiar shape: the 1970s batch-off-mainframes migration may rerun on today's GPU clusters.

    benjaminhan.net/posts/20260511

    #AI #AgenticSystems #Inference #tech

  42. Singapore Researchers Harmonize Diverse SIEMs with Agentic Rule Translation

    Imagine having multiple Security Information and Event Management platforms working in perfect harmony - Singapore researchers have made this a reality by developing a game-changing approach called agentic rule translation, enabling seamless interoperability between diverse SIEMs.…

    osintsights.com/singapore-rese

    #SiemInteroperability #AgenticSystems #SecurityInformationAndEventManagement #Singapore #ResearchAndDevelopment

  43. Is AI displacement of white-collar work happening, or is the narrative ahead of the data?

    Two late-February pieces gave opposite answers. Citrini Research imagined a 2028 market crash from white-collar AI substitution. Citadel Securities answered two days later: adoption flat, job postings up, and demand collapse needs too many things to go wrong at once.

    Which side is right? Perhaps a 160-year-old paradox about coal can help answer that.

    benjaminhan.net/posts/20260502

    #AI #FutureOfWork #AgenticSystems

  48. Whoa, hold onto your propeller hats, folks! 🤓 We've got a #GitHub project that's apparently a self-modifying, open-sourced "agentic system" (whatever that means) living in consumer hardware. Because clearly what we all need is more inscrutable tech jargon disguised as innovation! 🤖🔧
    github.com/ninjahawk/hollow-ag #Innovation #SelfModifyingTech #AgenticSystems #ConsumerHardware #TechJargon #HackerNews #ngated
