home.social

#evaluation — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #evaluation, aggregated by home.social.

  1. Most evaluations begin with questions, a pattern reflected in typical terms of reference (ToRs). Routine questions, answered one by one, leading to recommendations.
    But now AI does this faster, cheaper. Plausible and convincing.
    If evaluation remains a standardised Q&A, humans offer little added value.
    The problem predates AI. We should have known that a sharper question, a reframed problem, is itself a finding.
    AI is good at answers. The human contribution is the question worth asking.

    #evaluation #AI

  2. @sayzard This looks like a tool that could actually make LLM evaluation less painful. Cache-aware, cost-aware, and graph-based? Sign me up for the future of AI dev workflows. #LLM #Evaluation #RAG #AIDev

  5. SelfReflect measures whether an LLM's text summary of its uncertainty matches its actual answer distribution. Across 20 modern models: it doesn't, unless the model sees samples of its own answers first.

    The negative result does more work than the metric itself. It fits a growing line of work suggesting that LLM self-reports shouldn't be trusted as introspection. The practical workaround isn't cheap: N forward passes to sample, then a summarize pass.

    benjaminhan.net/posts/20260430

    #LLMs #AI #Evaluation #Apple #ICLR
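
The sample-then-summarize workaround mentioned in the post above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SelfReflect authors' implementation: `generate` is a hypothetical callable standing in for one forward pass of whatever model is under test, and the prompt wording is invented for the example.

```python
from collections import Counter
from itertools import cycle


def sample_then_summarize_prompt(generate, question, n=20):
    """Sample n answers, then build a prompt that shows the model its own samples.

    `generate` is a hypothetical stand-in for one forward pass: str -> str.
    """
    # Step 1: n forward passes to approximate the model's answer distribution.
    answers = [generate(question) for _ in range(n)]
    counts = Counter(answers)
    distribution = {answer: count / n for answer, count in counts.most_common()}

    # Step 2: build the input for one extra pass that summarizes uncertainty
    # with the sampled answers in view (prompt text is an assumption).
    sample_lines = "\n".join(f"- {a} ({c}/{n})" for a, c in counts.most_common())
    prompt = (
        f"Question: {question}\n"
        f"Your sampled answers:\n{sample_lines}\n"
        "Summarize how uncertain you are, based on these samples."
    )
    return distribution, prompt


# Toy stand-in model that cycles through fixed answers (assumption for the demo).
_fake_answers = cycle(["Paris", "Paris", "Lyon", "Paris"])
distribution, prompt = sample_then_summarize_prompt(
    lambda q: next(_fake_answers), "What is the capital of France?", n=8
)
print(distribution)
```

The cost is n + 1 model calls per question, which is where the "isn't cheap" remark comes from.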

  6. KGLens turns a knowledge graph into test questions and uses Thompson sampling to zero in on a model's weakest facts.

    The interesting bit is the output shape: a per-relation map of where the model is and isn't reliable, against a graph matched to your deployment. Sampling trick should generalize to red-teaming, jailbreak coverage, capability probing too.

    benjaminhan.net/posts/20260430

    #LLMs #AI #Evaluation #Apple #KnowledgeGraphs #ACL
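
The idea of using Thompson sampling to home in on a model's weakest relations can be sketched with a plain Beta-Bernoulli bandit. This is only an illustration of the general technique, not KGLens itself: the relation names, the `ask_model` callable, and the simulated error rates are all hypothetical.

```python
import random


def thompson_probe(relations, ask_model, rounds=200, seed=0):
    """Beta-Bernoulli Thompson sampling where the 'reward' is an observed error.

    `ask_model(relation, rng)` is a hypothetical callable returning True when
    the model answers a question about that relation correctly.
    """
    rng = random.Random(seed)
    # Beta(1, 1) prior on each relation's error rate: [errors + 1, correct + 1].
    stats = {r: [1, 1] for r in relations}
    for _ in range(rounds):
        # Draw a plausible error rate per relation, probe the worst-looking one.
        pick = max(relations, key=lambda r: rng.betavariate(stats[r][0], stats[r][1]))
        correct = ask_model(pick, rng)
        stats[pick][1 if correct else 0] += 1
    # Per-relation map: how often it was probed and its posterior mean error rate.
    return {
        r: {"probes": a + b - 2, "est_error": a / (a + b)}
        for r, (a, b) in stats.items()
    }


# Simulated model with made-up per-relation error rates (assumption for the demo).
TRUE_ERROR = {"born_in": 0.05, "spouse_of": 0.20, "capital_of": 0.60}


def simulated_model(relation, rng):
    return rng.random() >= TRUE_ERROR[relation]


report = thompson_probe(list(TRUE_ERROR), simulated_model, rounds=300, seed=0)
for relation, row in report.items():
    print(relation, row)
```

Most of the probing budget ends up on the relation with the highest error rate, while the others still receive enough probes to give a rough reliability estimate.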

  7. NEW BIML Bibliography entry

    arxiv.org/abs/2510.23166

    Common Task Framework For a Critical Evaluation of Scientific Machine Learning Algorithms

    Philippe Martin Wyder, et al

    Overly focused on two particular problems, and the authors build unique metrics for each one (forecasting). Polluted by ImageNet paradigm and protein folding...both closed domains. Bottom line: lots of deep confusion about models.

    #DON'TBOTHER #Evaluation #MLsec

    berryvilleiml.com/bibliography/

  8. 🥰 Ever the romantic, I look under the covers at the calculus of matchmaking.

    philosophics.blog/2026/04/16/d

    It should be no surprise to most that dating is not an economic endeavour (with some obvious exceptions). In this blog post and associated podcast, I articulate why.

    #philosophy #economics #psychology #signal #noise #markets #attributes #efficiency #optimisation #dating #shopping #wants #needs #menus #heuristics #evaluation #romance #preferences #blog #podcast #expectations #value #reality

  9. SECUSO Research @SECUSO_Research@bawü.social

    📰 The article "S/MIME verstehen in 5 Minuten - Entwicklung und Evaluation eines Erklärvideos" ("Understanding S/MIME in 5 Minutes: Development and Evaluation of an Explainer Video") by Fabian Ballreich and Melanie Volkamer has been published in the journal #Datenschutz und #Datensicherheit (DuD). The article describes the development of an explainer video on S/MIME and its subsequent #Evaluation with experts from the field of #Informationssicherheit (information security): link.springer.com/journal/1162

  10. Even though the response, with 4 of 29 participants, was unfortunately not really representative, I am always especially pleased when students take the trouble to give personal feedback in the course evaluation. /1 #bluelz #games #Literatur #Lehramt #Lehre #Rückmeldung #Evaluation #Feedback

  11. Requests for proposals (RFPs) let you see which companies are willing to provide a product or service by shopping the request around to them. The winning bid is not always the most competitive price, as other factors can matter. #business #supplier #evaluation #rfp #proposal

  12. [Training]

    📅 Sign up now for our new training course on evaluation issues, to be held in person on Thursday 28 May 2026 at the F3E headquarters in Paris (17 rue de Châteaudun, 75009, Paris).

    👉 Information and registration: lnkd.in/eabuWB2M

    #Formation #Evaluation #ONG #Solidarité #SecteurAssociatif #ESS

  13. Reminder that the deadlines for the IEEE Engineering Reliable Autonomous Systems Conference 2026 in Zagreb, Croatia (May 28-29, just before ICRA in Vienna) are coming up!

    Feb 21: Regular and short papers
    Feb 28: Workshop and tutorial proposals
    Mar 31: Late-breaking reports

    Stakeholders across all autonomous system domains and practices are welcome!

    2026-erasrobotics.org/index.ht

    #ERAS2026 #ReliableSystems #AutonomousSystems #Robotics #Conference #CfP #Verification #Testing #Specification #Evaluation #Reliability #Autonomy #Zagreb #IEEE

  14. On 4 February, join the presentation of findings from the study of the effects and impacts of the "Mendihuaca" programme run by Tchendukua – Ici et Ailleurs, covering the restitution of #TerresAncestrales (ancestral lands), the protection of #Biodiversité (biodiversity), and South/North dialogue (Colombia – France).

    🗓️ 4 February
    🕑 15:00-17:00 (GMT+1)
    📍 Online
    👉 Information and registration: reseauf3e.org/activite/restitu

    #Restitution #Etude #Evaluation #ONG #SolidaritéInternationale

  15. State and local leaders say
    💥they do not believe that the #FBI #investigation of the shooting death of #Renee #Nicole #Good will be fair and impartial,

    -- and are sounding alarms about the impact of federal officials
    🆘 holding onto evidence in a potential prosecution of the ICE agent who killed her.

    Minnesota’s lead investigative agency,
    the "Bureau of Criminal Apprehension" ( #BCA ),
    initially began investigating the shooting in conjunction with the FBI.

    But the BCA issued a statement Thursday morning saying that
    “the US attorney’s office had reversed course:
    the investigation would now be led solely by the FBI,
    and the BCA would no longer have access to the case materials,
    scene evidence or investigative interviews necessary to complete a thorough and independent investigation”.

    Hennepin county attorney
    #Mary #Moriarty,
    an elected Democrat and the county’s prosecutor, clarified at a press conference Friday that
    the BCA
    – which was established in 1927
    – has a very high investigative standard
    💥and that this standard can’t be met when the organization doesn’t have access to all the evidence.
    It does not preclude an investigation, she said.
    But a lack of access to evidence hampers the investigation.

    “When the BCA came to the scene,
    the evidence had been taken by the FBI,” she said.
    “They collected the car and took it wherever the BCA does not have access to the car.
    And the problem isn’t that the FBI took the car,
    it’s that the BCA doesn’t have access to the car, or right now,
    even access to the #forensic #evaluation that happens as a result of the investigation with that car.”

    theguardian.com/us-news/2026/j

  16. As always, #OpenData and #OpenCode are persistently available:
    Havrylash, J., & Schöch, C. (2025). Syntetic texts evaluation with #pydistinto. Zenodo. 10.5281/zenodo.15525428.
    And the article: doi.org/10.48694/jcls.4209
    #JCLS #CCLS205 #LiteraryComputing #NLG #Evaluation

  17. btw.:
    ChatGPT on the 1st #Zwischenbericht (interim report) of the #Evaluation of the
    #Konsumcannabisgesetzes (Consumer Cannabis Act)
    (#EKOCAN):

    📊 Conclusion

    The interim report sees no acute need for reform, but stresses that the legislative goals, in particular displacing the black market, have not yet been achieved. Partial legalisation has had many positive effects on health protection, decriminalisation and prevention, but still shows structural weaknesses in market and regulatory design.

    #weedmob #legalized

  18. A quotation from Marcus Aurelius

    Never regard something as doing you good if it makes you betray a trust, or lose your sense of shame, or makes you show hatred, suspicion, ill will, or hypocrisy, or a desire for things best done behind closed doors.
     
    [Μὴ τιμήσῃς ποτὲ ὡς συμφέρον σεαυτοῦ, ὃ ἀναγκάσει σέ ποτε τὴν πίστιν παραβῆναι, τὴν αἰδῶ ἐγκαταλιπεῖν, μισῆσαί τινα, ὑποπτεῦσαι, καταράσασθαι, ὑποκρίνασθαι, ἐπιθυμῆσαί τινος τοίχων καὶ παραπετασμάτων δεομένου.]

    Marcus Aurelius (AD 121-180) Roman emperor (161-180), Stoic philosopher
    Meditations [To Himself; Τὰ εἰς ἑαυτόν], Book 3, ch. 7 (3.7) [tr. Hays (2003)]

    Sourcing, notes, alternate translations: wist.info/marcus-aureleus/2675…

    #quote #quotes #quotation #advantage #benefit #betrayal #corruption #dishonesty #embarrassment #evaluation #hatred #hypocrisy #immorality #insincerity #integrity #lying #profit #secrecy #selfrespect #suspicion #vice

  21. In case you are into Christmas and estimands, maybe this article is for you?
    link.springer.com/article/10.1

    "In reality, the statistical elf would have found the sample size was too small (nine reindeers) to estimate these estimands reliably using the discussed methods."

    #RCT #HRQL #ResearchDesign #Evaluation

  22. Back in the days of 2021
    there was a lovely evaluation paper:
    Automatically identifying label errors
    Improving score's reliability
    Finding example's difficulty
    Active Learning

    aclanthology.org/2021.acl-long

    @par @hoyle
    #machinelearning #evaluation #IRT #LLM #deepRead

  23. 'Comprehensive Algorithm Portfolio Evaluation using Item Response Theory', by Sevvandi Kandanaarachchi, Kate Smith-Miles.

    jmlr.org/papers/v24/20-1318.ht

    #classification #irt #evaluation