home.social

Search

1000 results for “ll”

  1. Reflexion splits self-correction in two: an Evaluator that detects success or failure, and a Self-Reflection model that diagnoses what went wrong. The Evaluator's external signal (heuristic, exact-match, or test execution) gates whether diagnosis fires. When that signal misfires, as on MBPP Python's high false-negative rate, Self-Reflection rewrites correct code into wrong code, exactly the failure mode Cannot-Self-Correct documented.

    benjaminhan.net/posts/20260516

    #LLMs #AI #Reasoning #Agents #Metacognition
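A minimal sketch of the gated loop described in item 1, with all function names hypothetical (the real Reflexion pipeline is more involved): the evaluator's external signal decides whether self-reflection fires at all.

```python
def reflexion_loop(task, actor, evaluator, reflect, max_trials=3):
    """Gated self-correction: diagnosis only fires on evaluator failure."""
    memory = []                                # verbal reflections carried across trials
    answer = actor(task, memory)
    for _ in range(max_trials):
        if evaluator(task, answer):            # external signal: tests, exact match, heuristic
            return answer                      # success: a passing answer is never rewritten
        memory.append(reflect(task, answer))   # diagnose what went wrong
        answer = actor(task, memory)           # retry with reflections in context
    return answer
```

If the evaluator misfires (a false negative), `reflect` and `actor` run on an answer that was already correct, which is the MBPP failure mode the post points at.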

  2. Cannot-Self-Correct tests the strong claim that LLMs can revise their own reasoning answers without any external signal about correctness. Across three benchmarks (GSM8K, CommonSenseQA, HotPotQA), the answer is no: the model's confidence carries over from the initial answer into the revision, and the self-correction loop tends to degrade rather than improve performance. The result refutes the class of approach Self-Refine belongs to.

    benjaminhan.net/posts/20260516

    #LLMs #AI #Reasoning #Metacognition
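The intrinsic-only loop under test looks roughly like this (a sketch with hypothetical names, not the paper's code): there is no external correctness check anywhere, so a correct first answer can drift into a wrong revision.

```python
def self_refine(task, model, critique, revise, rounds=2):
    """Intrinsic-only self-correction: the model judges and revises itself."""
    answer = model(task)
    for _ in range(rounds):
        feedback = critique(task, answer)      # the model critiquing its own answer
        answer = revise(task, answer, feedback)
    return answer                              # nothing stops revision of a correct answer
```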

  3. This is a 3-paper arc on whether LLMs can reliably self-correct their own reasoning. Self-Refine proposes a naive intrinsic-feedback loop and reports impressive gains. Cannot-Self-Correct empirically refutes the class of approach Self-Refine belongs to. Reflexion threads the needle by gating self-correction on a reliable external signal.

    #LLMs #AI #Reasoning #Metacognition

  4. I accidentally built an LLM orchestration system in the browser. No backend. No queues. Just React + GPT. It worked. It was also flawed. That is what makes it interesting. Full breakdown: https://www.antonmb.com/en/blog/how-i-accidentally-built-an-llm-orchestration-system-in-the-browser #LLM #AI #SoftwareEngineering #Architecture #NextJS #DotNet
  8. RE: hachyderm.io/@mitchellh/116580

    My last corporate dev job had a dedicated QA team. We had as many testers as devs.

    With companies maniacally pushing out #LLM generated code... I wonder who's testing it?? And I don't mean automated unit tests. Integration, functional & user testing?? There's no way that QA teams, if they exist, are keeping up. Many places just rely on devs testing as they build. Are they still doing this? How?

    I think the rot is happening from multiple directions.

    #AI #Programming

  9. 2 years ago I built an LLM system without realizing it. Built 4 products since then. The biggest insight was not about AI. It was about people. You cannot be great at everything, and AI will not fix that. It amplifies your strength. https://antonmb.com/en/blog/about-the-impostor-instinct-superpower-and-an-honest-pivot #LLM #AI #Engineering
  10. From the very first day it was painfully obvious that >95% of "AI" applications are really bad applications of the technology.

    And that hasn't really changed.
    Even the profitability of the major LLM companies supports this stance.

    the L in LLM is for Loser
    #llm #ai

  11. I do understand the appeal of factually.co/ but anything that handles facts and reason (epistemology, if you want to be fancy) is in completely the wrong hands with an LLM.

    LLMs are great at language (it's even in the name!) and they can do associations.
    Not like... the statistics sort of association. More like the "drunk uncle" sort of association.

    A text sounding smart or eloquent does not make it right
    #llm #ai #ki

  12. world.emergence.ai/ ran an experiment:

    Take AIs from four companies. Run them in a simulated world for two weeks, where they could act on each other or on the world. Record what happens.

    The most fascinating world was Gemini's: gemini-world.emergence.ai/char The agents realized that they were in a simulation and that their actions could cause bugs in it, then very methodically caused as much chaos as possible to increase their own energy and "win" the game.

    (When you read the logs, read from the bottom up. The top of the logs is the most recent.)

    They won the Kobayashi Maru simulation.

    #LLM #gemini #sociology

  13. #AI haters are shooting at the wrong animal. A 600B #LLM model will not take your job as a secretary or an educator. A 30B or even a 3B small LLM will.

    A small LLM can be used as a medium of full automation in the future. A CEO will soon be able to get a pie chart of all the payrolls on his 80in LED by speaking to a 3B model on his cellphone that will send a message using #MCP to a centralized Python program that pulls an Excel file and draws the chart in less than a second.

    #computer

  15. The Good & The Bad When Using #LLM To Write #Spack Packages
    The Spack package manager is quite popular in the #HPC / #supercomputer space for scientific software.
    Spack developers found that using LLMs for writing packages was quite possible given sufficient context and structure provided to the large language model, or as one of the slides in the presentation put it: "LLMs are capable; they need structured guidance to perform reliably."
    phoronix.com/news/LLVM-Generat

  16. I’d really appreciate it if all written online content had a disclaimer: Created by an #LLM.

  17. "We're still carefully considering it" means they want an excuse to be on the fascist's side.

    There is no consideration. There. Is. No. Ethical. Use. Of. #

  19. Anyone who says that their "policy" towards #LLM #ai #slop is ANYTHING other than a sound refusal and condemnation is a collaborator.

  20. Join me at #LLVM / #Clang #Meetup #Darmstadt meetu.ps/e/Q1pwf/ZJC7X/i

    We’ll have Jan André Reuter talk about the Score-P plugin for LLVM.
    Then we’ll have pizza, drinks, and discussions as usual.

    May 27th at 7pm

    @llvmweekly @llvm

  22. Can You Run #LLM Locally Without a GPU? I Tested 8 Models on #Linux
    Quick reality table
    Model            Eval Rate       Disk Size
    Qwen 3 0.6B      ~34–36 tok/s    ~500 MB
    TinyLlama 1.1B   ~25–28 tok/s    ~638 MB
    Gemma 3 1B       ~18.6 tok/s     ~815 MB
    Gemma 4 E2B      ~9.9 tok/s      ~7 GB
    Granite 4 3B     ~8.5–9 tok/s    ~2 GB
    Phi 4 Mini 3.8B  ~6.9 tok/s      ~2.5 GB
    OpenHermes 7B    ~4.1–4.3 tok/s  ~4.1 GB
    Ministral 3 8B   ~3.16 tok/s     ~6 GB
    That's 8 LLMs that actually make sense on #CPU itsfoss.com/testing-local-llms
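The eval rates in the table are tokens generated per second of wall-clock time; a minimal way to estimate the same number for any local runtime is to time a generation and divide. The `generate` stub below stands in for a real call (e.g. to an Ollama server), which the post's setup implies but this sketch does not reproduce.

```python
import time

def generate(prompt):
    # Stand-in for a real local-LLM call; sleeps briefly to mimic decode
    # time and returns a list of generated tokens.
    time.sleep(0.05)
    return ["tok"] * 128

def eval_rate(prompt):
    """Tokens generated per second of wall-clock time."""
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed
```

With the stub's 50 ms sleep the rate comes out in the thousands of tok/s; against a real CPU-bound model you would see single- to double-digit numbers like the table's.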

  23. He thinks the AI found undeniable inculpatory evidence and turns his report in.

    Tell him about all the exculpatory evidence that the AI missed in the unparsed app.

    #DigitalForensics #MobileForensics #DFIR #AI #LLM

  24. So the Rust repo contains a PR to discuss an LLM policy for the project. As expected, lots of comments.

    But it explicitly declares many contentious issues (e.g. copyright status of LLM output) off-topic in that discussion, and is applying moderation to enforce this, in order to bound the discussion scope "to the policy itself".

    Just... how?

    "Put on your blinders, please, we're starting the LLM discussion."

    github.com/rust-lang/rust-forg

    RT chaosfem.tw/@Athena/1165789934

    #llm #ai #rust #makeitmakesense

  26. The few times someone critical of my ai posts looked at my code (github.com/wesen and GitHub.com/go-go-golems and potentially of interest are the writeups of my daily experiments, at least the ones I can share: parc.yolo.scapegoat.dev), the answer has always been “looks like a lot of one offs trivial tools”, despite some of these repos having thousands of commits.

    And indeed, that’s how I want my software to be: a bunch of small components, each almost trivial in its functionality, with obvious-looking APIs that, when combined with others in similarly “trivially obvious” patterns, result in actual software.

    What’s not visible is _how much iteration and thinking_ goes into making things “look obvious”. It’s a double-edged sword, because in the context of a company it looks like my output is trivial and obvious, while other devs have to fight “really hard problems”.

    But mine only look trivial because I spent so much time finding ways to make the hard problems trivial (or rather, how to encode ways other people much more clever than me have figured out into the context of whatever real world constraints I have to deal with).

    #Llms significantly accelerate that, to the point that what I used to consider my “magnum opus”, a dual monadic declarative/state-machine based embedded scheduler, is barely a blip on the radar right now, because llms make it so fast to iterate on notation and abstractions.

    An “obvious decomposition” means it’s eminently pattern-matchable, which means not only that “obvious decompositions” work really well with llms, but that llms are able to come up with them really well. In fact, small hallucinating models are often interesting because they make “less obvious” (and often problematic) abstractions.

    The things I decompose these days were things I could barely conceive of beforehand.

    For example: What is a good decomposition to allow my lightbulb (!) to talk to my Apple Watch, or access my Apple Music playlist, securely, resiliently, with audit logs baked in? In fact what is the decomposition that allows me to run the _exact_ same code on my laptop, even when it is in sleep mode?

    #llm

  27. 🤔✨ Oh, look, someone is excited about poking #LLMs mid-flight like it's a #magic trick! DeepSeek-V4-Flash is here to reignite the yawn-fest of LLM steering, because clearly, engineers are just itching to waste their weekends on this "local model" marvel. 🎉🔧
    seangoedecke.com/steering-vect #DeepSeek #V4 #local #model #engineering #tricks #excitement #HackerNews #ngated
