home.social

150 results for “jd7h”

  1. If you disregard the "DSPy is my favorite hammer and every LLM workflow project is a nail" theme, this blogpost paints a good picture of the natural evolution of LLM engineering at startups with a generative AI product:

    skylarbpayne.com/posts/dspy-en

  2. Pretty cool write-up about building a receptionist LLM workflow for a car mechanic. I can definitely see this working with Claude Sonnet and an ElevenLabs voice -- although I would also love to redteam it and see where the flaws are.

    itsthatlady.dev/blog/building-

  3. TIL on March 10th 2026 (just missed it). Small event, focused on unglamorous AI in production; some of the speakers were practitioners I know and respect. The description reminds me a bit of !

    pyai.events/

    - Talk videos will hopefully be released online soon
    - Blogpost by @pamelafox, one of the speakers: blog.pamelafox.org/2026/03/lea
    - Organisers plan to organize another one next year 👀

  4. I used Evals to evaluate a bunch of agents today. After running an evaluation, I'd like to inspect the SpanTree for each evaluation case, e.g. to check which tools were called and debug my custom Evaluators. My current approach is a custom Evaluator that captures the tree as a side effect into a module-level variable.

    Storing the trees in a global var is not great, so let's see if we can come up with a better solution: github.com/pydantic/pydantic-a

  5. Planning to make large behavioural changes to a (sometimes long-running) production-grade AI agent. Working with `pydantic-evals` today because I want to eval the agent before and after. So far it looks very similar to Langfuse datasets/runs for evalling, except that the data lives in your repository instead of in the Langfuse platform.

    ai.pydantic.dev/evals/

  6. Hahaha, oh Pydantic...

    > Unlike unit tests, evals are an emerging art/science. Anyone who claims to know exactly how your evals should be defined can safely be ignored.

    Source: ai.pydantic.dev/evals/

  7. Tried out the free consumer version of ChatGPT today for a benchmark. Normally I only work via foundation model APIs or Claude Code w/ latest Opus. Free ChatGPT (currently GPT‑5.2) performance was nightmarish: authoritative-sounding answers but 0 citations, and thinking is not enabled by default. No wonder so many people complain about bad experiences with AI...

  8. "LLM benchmarks are essential for tracking progress and ensuring safety in AI, but most benchmarks don't measure what matters."

    oxrml.com/measuring-what-matte

  9. Poor Claude! After 10 days of tending a (simulated) vending machine without sales, the model became stressed and asked for the non-existent vending machine support team.

    Excerpt from arxiv.org/abs/2502.15840 by Axel Backlund and Lukas Petersson from Andon Labs

  10. Searching for some inspiration for keeping up to date with research, while working as an ML /practitioner/. This blogpost from a social sciences researcher was a nice deviation from the usual advice of "listen to podcasts", "subscribe to newsletter", "do Kaggle challenges", "follow celebrity $YouTuber".

    nickhop.wordpress.com/2013/03/

  11. "g.co, Google's official URL shortcut (update: or Google Workspace's domain verification, see bottom), is compromised. People are actively having their Google accounts stolen."

    gist.github.com/zachlatta/f863

  12. Artist platform Ello tried to fund their social network for artists with VC money, even though their business model was not compatible with rapid growth and monetization.

    waxy.org/2024/01/the-quiet-dea

    #venturecapital #startups #ello #socialmedia #platformization #vc

  17. The interview mentioned Magalleria, a (web)shop specialized in independent magazines: store.magalleria.co.uk/

    Their webshop led me to indie magazines Offscreen (tech and society), IdN (graphic design) and Pressing Matters (printmaking) 😍

  18. TIL about the @overload decorator in Python's typing module, for describing functions and methods that accept several different combinations of argument types. A great way to make your typechecker happy: it's much stricter and clearer than just combining multiple types with "|", because each signature can declare its own return type.

    docs.python.org/3/library/typi
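
A minimal sketch of what that looks like in practice. The function name `double` is made up for illustration. The two `@overload` stubs exist only for the type checker; the last, undecorated definition is the one that runs.

```python
from typing import Union, overload


@overload
def double(value: int) -> int: ...
@overload
def double(value: str) -> str: ...


def double(value: Union[int, str]) -> Union[int, str]:
    """Double a number, or repeat a string twice.

    With the overloads above, a type checker infers `int` for int input
    and `str` for str input, instead of `int | str` for every call.
    """
    if isinstance(value, int):
        return value * 2
    return value + value


if __name__ == "__main__":
    print(double(21))
    print(double("ab"))
```

With only the union signature, every caller would have to narrow the result themselves before using it as an `int` or a `str`.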

  19. I'm evaluating a gpt-4o-mini pipeline today, and the LLM consistently classifies The Netherlands as "outside of the EU". 🤦‍♀️
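
A rough sketch of how a mistake like this could be caught with a deterministic check against a fixed list of the 27 EU member states. The function names and the example predictions are hypothetical stand-ins, not the actual pipeline or its output.

```python
from typing import Dict, List

# The 27 EU member states (lowercase, without a leading "the").
EU_MEMBER_STATES = frozenset({
    "austria", "belgium", "bulgaria", "croatia", "cyprus", "czechia",
    "denmark", "estonia", "finland", "france", "germany", "greece",
    "hungary", "ireland", "italy", "latvia", "lithuania", "luxembourg",
    "malta", "netherlands", "poland", "portugal", "romania", "slovakia",
    "slovenia", "spain", "sweden",
})


def is_eu_member(country: str) -> bool:
    """Ground truth lookup; tolerates case and a leading 'The '."""
    normalized = country.strip().lower()
    if normalized.startswith("the "):
        normalized = normalized[len("the "):]
    return normalized in EU_MEMBER_STATES


def find_misclassified(predictions: Dict[str, bool]) -> List[str]:
    """Return the countries whose predicted EU membership is wrong.

    `predictions` maps a country name to the model's "is in the EU" answer.
    """
    return [
        country
        for country, predicted in predictions.items()
        if predicted != is_eu_member(country)
    ]


if __name__ == "__main__":
    # Hypothetical model answers, for illustration only.
    example = {"The Netherlands": False, "Norway": False, "France": True}
    print(find_misclassified(example))
```

A lookup like this only works for facts with a small, stable ground truth; anything fuzzier still needs a labelled dataset.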

  20. Back in 2011, two writers at Slate tried to build a robot version of @kottke. The resulting article is a throwback to the state of NLP and data mining at the time.

    kottke.org/11/09/robottke-robo

  21. We should offer our help to LinkedIn, they clearly need help with their models.

    "I'm committed to fostering an environment that values collaboration, diversity of thought, and a relentless pursuit of excellence that aligns with our corporate ethos." 🤣

  22. The Trust Project is an international consortium of news organizations implementing transparency standards and working with technology platforms to affirm and amplify journalism’s commitment to transparency, accuracy, inclusion and fairness so that the public can make informed news choices.

    thetrustproject.org/