#agenticsystems — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #agenticsystems, aggregated by home.social.
-
Can frontier coding agents rebuild a program from scratch given only its executable and docs? No: a new 200-task benchmark finds that across nine models none fully resolves any task. The best passes 95% of tests on just 3% of them. Same models score well on bug-fix benchmarks but zero here, so headline progress numbers don't extrapolate.
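To make the two headline numbers concrete, here is a minimal sketch of how "fully resolves" and "passes 95% of tests" could be tallied from per-task test pass rates. The function name, the threshold parameter, and the sample data are illustrative assumptions, not the benchmark's actual scoring code.

```python
def summarize(pass_rates: list[float], near_threshold: float = 0.95) -> dict[str, float]:
    """Share of tasks fully resolved and share at or above a pass-rate threshold.

    Assumption: each entry is the fraction of a task's tests that passed (0.0 to 1.0).
    """
    if not pass_rates:
        raise ValueError("need at least one task")
    n = len(pass_rates)
    fully = sum(1 for r in pass_rates if r == 1.0)          # every test passed
    near = sum(1 for r in pass_rates if r >= near_threshold)  # includes fully resolved
    return {"fully_resolved": fully / n, "near_resolved": near / n}


# Hypothetical data: 200 tasks, 6 of them at 96% of tests, none at 100%.
rates = [0.96] * 6 + [0.50] * 194
summary = summarize(rates)
print(summary)
```

With made-up data shaped like this, a model can clear the 95% bar on 3% of tasks while still fully resolving none of them.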
-
Recursive Superintelligence emerged earlier this month with $650M+ at $4B+, eight founders from OpenAI/Meta/Salesforce, and Peter Norvig on board. The recursive-self-improvement category — Anthropic, OpenAI, AMI Labs, Ineffable, SSI, plus a $4B namesake — is now consolidating before any of them has a public technical milestone.
-
Given a problem queue and a token budget, can an LLM plan which to attempt, in what order, and how much to spend on each — before any execution feedback? TRIAGE tests 20 frontier and open-source LLMs. Most plan worse than random. Reasoning-trained modes systematically lose to standard ones. Even when shown its own per-problem budget, the best complier respects it on 37% of attempts.
-
🚀 Oh great, just what we needed—yet another "revolutionary" software stack from a self-proclaimed tech messiah. 🎉 Wes McKinney leads a "small team of veterans" to invent the wheel, again, but this time with *agentic systems* and lots of 🚀 emojis. Get ready for a wild ride of #buzzwords and imaginary breakthroughs! 🙄
https://kenn.io/ #techinnovation #softwaredevelopment #agenticsystems #WesMcKinney #HackerNews #ngated
-
A multi-agent LLM where each agent learns when to defer to a human, trained with GRPO on a cost-aware reward. Each defer event becomes SFT data, so the model gradually absorbs the human's expertise. Tunable cost knob trades accuracy against human-call budget at deployment, no retraining.
#ICLR #HumanInTheLoop #AgenticSystems #Metacognition #RL #AI
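A toy sketch of what a cost-aware defer reward and a deployment-time cost knob might look like. This is not the paper's reward or training setup: `reward`, `should_defer`, `defer_cost`, and the assumption that the human is always right by default are all hypothetical simplifications.

```python
def reward(correct: bool, deferred: bool, defer_cost: float) -> float:
    """Toy reward: 1 for a correct final answer, minus a fee whenever a human was called."""
    base = 1.0 if correct else 0.0
    return base - (defer_cost if deferred else 0.0)


def should_defer(p_correct: float, defer_cost: float, human_accuracy: float = 1.0) -> bool:
    """Defer when the expected toy reward of asking the human beats answering alone.

    Assumptions: p_correct is the agent's own estimate of being right, and
    human_accuracy is the chance the human answers correctly.
    """
    return human_accuracy - defer_cost > p_correct


# Raising the cost at deployment makes the same agent call the human less often.
for cost in (0.1, 0.25, 0.5):
    print(cost, should_defer(p_correct=0.7, defer_cost=cost))
```

Under this toy rule the cost value acts as the knob: a higher `defer_cost` shrinks the set of cases where deferring pays off, with no change to the policy itself.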
-
MemSkill reframes LLM-agent memory operations as a learnable skill bank: an RL controller selects Top-K skills per span, an LLM designer periodically rewrites them from hard cases. But "self-evolving" overstates the test-time story — both controller and bank are trained offline and frozen at deployment; only per-trace memory updates online.
https://benjaminhan.net/posts/20260519-memskill/?utm_source=mastodon&utm_medium=social
-
Simon Willison has stopped reviewing every line Claude Code writes, on production systems! He's candid about the trust drift.
Going forward, it's the harness and guardrails around the agent that will have to prevent disasters: sandboxes, write-fences, rollback paths.
-
Agent token cost grows quadratically in turns without caching, roughly linearly with caching. A new post fits those curves to SWE-bench traces on three models. One cross-model finding stands out: Gemini 3 Flash takes 2× as many turns as GPT-5.2 or Opus 4.6, so despite its leaner per-turn output (~300 tokens vs ~1,000) it still burns more total tokens.
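The quadratic-versus-linear shape falls out of a simple toy model, sketched below. It assumes every turn adds a fixed number of new tokens and re-reads the whole prior context; `total_tokens`, `cached_discount`, and the turn counts (50 vs 25) are hypothetical and not taken from the post's fitted curves.

```python
def total_tokens(turns: int, tokens_per_turn: int, cached_discount: float = 1.0) -> float:
    """Token-equivalents billed over an agent run under a toy model.

    Assumptions: each turn adds `tokens_per_turn` new tokens and re-reads all
    earlier context. `cached_discount` is the fraction of full price paid for
    re-read tokens (1.0 = no caching, 0.0 = cache hits are free).
    """
    total = 0.0
    context = 0  # tokens accumulated from earlier turns
    for _ in range(turns):
        total += cached_discount * context + tokens_per_turn
        context += tokens_per_turn
    return total


# No caching: cost is k*T*(T+1)/2, so doubling the turns roughly quadruples it.
lean = total_tokens(turns=50, tokens_per_turn=300)      # many short turns
verbose = total_tokens(turns=25, tokens_per_turn=1000)  # fewer long turns
print(lean, verbose)

# Free cache hits: cost collapses to k*T, linear in turns.
print(total_tokens(turns=50, tokens_per_turn=300, cached_discount=0.0))
```

In this sketch the short-turn run costs more than the long-turn run when nothing is cached, because the turn count enters squared. How the comparison comes out in practice depends on the real cache discount and turn counts.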
-
The Inference Shift: Ben Thompson splits "inference" into two workloads. Answer inference (human waiting) stays on premium GPUs; agentic inference (no human waiting) migrates to commodity memory hierarchy. Familiar shape: the 70s batch-off-mainframes migration may rerun on today's GPU clusters.
https://benjaminhan.net/posts/20260511-the-inference-shift/?utm_source=mastodon&utm_medium=social
-
Singapore Researchers Harmonize Diverse SIEMs with Agentic Rule Translation
Imagine having multiple Security Information and Event Management platforms working in perfect harmony - Singapore researchers have made this a reality by developing a game-changing approach called agentic rule translation, enabling seamless interoperability between diverse SIEMs.…
#SiemInteroperability #AgenticSystems #SecurityInformationAndEventManagement #Singapore #ResearchAndDevelopment
-
Is AI displacement of white-collar work happening, or is the narrative ahead of the data?
Two late-February pieces gave opposite answers. Citrini Research imagined a 2028 market crash from white-collar AI substitution. Citadel Securities answered two days later: adoption flat, job postings up, and demand collapse needs too many things to go wrong at once.
Which side is right? Perhaps a 160-year-old paradox about coal can help answer that.
-
Whoa, hold onto your propeller hats, folks! 🤓 We've got a #GitHub project that's apparently a self-modifying, open-sourced "agentic system" (whatever that means) living in consumer hardware. Because clearly what we all need is more inscrutable tech jargon disguised as innovation! 🤖🔧
https://github.com/ninjahawk/hollow-agentOS #Innovation #SelfModifyingTech #AgenticSystems #ConsumerHardware #TechJargon #HackerNews #ngated -