home.social

#adversarialml — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #adversarialml, aggregated by home.social.

  1. ----------------

    🎯 AI
    ===================

    AI red teaming applies adversarial methodology to large language models, exposing vulnerabilities that traditional security testing misses. The core problem: models like GPT, Claude, and Gemini reason in ways that fail unpredictably, without triggering alerts.

    Why traditional testing falls short

    Standard application security focuses on code vulnerabilities. LLMs introduce a different risk category. The model interprets language, and an attacker manipulates that interpretation rather than exploiting a logic bug. A simple prompt modification can bypass safety controls, extract training data, or produce harmful outputs. No alert fires.

    The Microsoft Copilot example

    Researchers demonstrated that Microsoft Copilot could be compromised through a single malicious email. This shows how AI-integrated business tools inherit model vulnerabilities and expose them to external manipulation. The model's ability to process email content becomes an attack vector.

    Red teaming methodology

    1. Scope definition: Establish rules of engagement. Specify in-scope targets and off-limits areas.

    2. Scenario design: Map the AI attack surface. Identify adversary paths, from data pipelines to prompt interfaces.

    3. Attack planning: Select tactics based on threat analysis. Options include prompt injection, data poisoning, and adversarial inputs.

    4. Execution: Launch attacks in sandboxed environments. Combine manual probing with automation. Monitor anomalies and document evidence.

    5. Reporting: Deliver comprehensive assessment with attack narratives. This provides organizations with a prioritized remediation roadmap.

    Common techniques
    • Prompt injection: Embedding malicious instructions in user input to hijack model control logic and override system prompts.
    • Data exfiltration: Tricking the model into revealing training data, user information, or system prompts.
    • Jailbreaks: Crafting inputs that bypass safety filters and ethical boundaries.
    • Data poisoning: Corrupting training data or context to manipulate model outputs.

    Observations

    The article frames AI red teaming as essential before deployment. This is reasonable, but the source does not independently verify all claims about vulnerability scope. The methodology is standard red team practice adapted for AI specifics. The field still lacks standardized frameworks.

    The distinction between code vulnerabilities and intent exploitation is operationally significant. Traditional fuzzing and penetration testing do not cover the language interpretation attack surface. Organizations integrating LLMs into critical infrastructure should treat red team assessment as a deployment prerequisite.

    🔹 AI #RedTeaming #LLMSecurity #PromptInjection #AdversarialML

    🔗 Source: blog.securelayer7.net/ai-red-t

  2. Quite fascinating. If confirmed, this may reveal a structural weakness in how refusal is implemented in some LLMs. The accept/refuse mechanism may be relatively isolated in internal representations and therefore observable and manipulable — tools like Heretic make this visible.

    A possible mitigation might be cryptographic signing of model weights, making unauthorized modifications detectable when the model is loaded for inference.

    #AISafety #LLMSecurity #CyberSecurity #AIRedTeaming #AdversarialML #LLM

  3. Quite fascinating. If confirmed, this may reveal a structural weakness in how refusal is implemented in some LLMs. The accept/refuse mechanism may be relatively isolated in internal representations and therefore observable and manipulable — tools like Heretic make this visible.

    A possible mitigation might be cryptographic signing of model weights, making unauthorized modifications detectable when the model is loaded for inference.

    #AISafety #LLMSecurity #CyberSecurity #AIRedTeaming #AdversarialML #LLM

  4. Quite fascinating. If confirmed, this may reveal a structural weakness in how refusal is implemented in some LLMs. The accept/refuse mechanism may be relatively isolated in internal representations and therefore observable and manipulable — tools like Heretic make this visible.

    A possible mitigation might be cryptographic signing of model weights, making unauthorized modifications detectable when the model is loaded for inference.

    #AISafety #LLMSecurity #CyberSecurity #AIRedTeaming #AdversarialML #LLM

  5. Quite fascinating. If confirmed, this may reveal a structural weakness in how refusal is implemented in some LLMs. The accept/refuse mechanism may be relatively isolated in internal representations and therefore observable and manipulable — tools like Heretic make this visible.

    A possible mitigation might be cryptographic signing of model weights, making unauthorized modifications detectable when the model is loaded for inference.

    #AISafety #LLMSecurity #CyberSecurity #AIRedTeaming #AdversarialML #LLM

  6. Quite fascinating. If confirmed, this may reveal a structural weakness in how refusal is implemented in some LLMs. The accept/refuse mechanism may be relatively isolated in internal representations and therefore observable and manipulable — tools like Heretic make this visible.

    A possible mitigation might be cryptographic signing of model weights, making unauthorized modifications detectable when the model is loaded for inference.

    #AISafety #LLMSecurity #CyberSecurity #AIRedTeaming #AdversarialML #LLM

  7. ----------------

    🔒 AI Pentesting Roadmap — LLM Security and Offensive Testing
    ===================

    Overview

    This roadmap provides a structured learning path for practitioners aiming to assess and attack AI/ML systems, with a focus on LLMs and related pipelines. It organizes topics into progressive phases: foundations in ML and APIs, core AI security concepts, prompt injection and LLM-specific attacks, hands-on labs, advanced exploitation techniques, and real-world research/bug bounty work.

    Phased Structure

    Phase 1 (Foundations) covers machine learning fundamentals and LLM internals, including model architectures and tokenization concepts. Phase 2 (AI/ML Security Concepts) anchors the curriculum on standards and frameworks such as OWASP LLM Top 10, MITRE ATLAS, and NIST AI risk guidance. Phase 3 focuses on prompt injection and LLM adversarial vectors, describing attack surfaces like context manipulation, instruction-following bypasses, and RAG pipeline poisoning. Phase 4 emphasizes hands-on practice through CTFs, sandboxed labs, and safe testing methodologies. Phase 5 explores advanced exploitation: model poisoning, data poisoning, backdoor techniques, and chaining vulnerabilities across API/authentication layers. Phase 6 targets real-world research, disclosure workflows, and bug bounty engagement.

    Technical Coverage

    The roadmap lists practical tooling and repositories for experiment design and testing concepts without prescribing deployment steps. It calls out necessary foundations—Python programming, HTTP/API mechanics, and web security basics (XSS, SSRF, SQLi) to support end-to-end attack scenarios against AI systems. Notable conceptual risks include RAG poisoning, adversarial ML perturbations, prompt injection, and leakage through augmented memory or external tool integrations.

    Limitations & Considerations

    The guide is educational and emphasizes conceptual descriptions of capabilities and use cases rather than operational recipes. It highlights standards and references rather than prescriptive mitigations. Practical exploration should respect ethical boundaries and responsible disclosure norms.

    🔹 OWASP #MITRE_ATLAS #RAG #prompt_injection #adversarialML

    🔗 Source: github.com/anmolksachan/AI-ML-

  8. ----------------

    🛠️ Tool
    ===================

    Opening:
    BlackIce is a containerized red‑teaming toolkit that aggregates 14 open‑source AI security tools into a single, version‑pinned runtime image. The release focuses on reproducibility, dependency isolation, and a unified command‑line surface to simplify assessments of LLMs and ML artifacts.

    Key Features:
    • Aggregation of 14 OSS projects spanning Responsible AI, adversarial ML, and model testing, including lm_eval_harness, promptfoo, cleverhans, garak, ART, Giskard, and CyberSecEval.
    • Mapping of capabilities to MITRE ATLAS techniques and the Databricks AI Security Framework (DASF), explicitly covering prompt injection (AML.T0051), data leakage (AML.T0057), and hallucination discovery (AML.T0062).
    • Dual tool model: static tools installed in isolated virtual environments and dynamic tools available in the global Python environment with a global_requirements management approach.

    Technical Implementation:
    • The toolkit ships as a version‑pinned container image intended to provide a reproducible compute environment; static tools run from isolated Python venvs or Node.js projects while dynamic tools integrate into a shared Python environment for extensibility.
    • Capability mapping was performed to show coverage across attack classes such as prompt injection, indirect injection via external content (RAG/email), adversarial example generation, and supply‑chain artifact checks (malicious pickles/artifacts).

    Use Cases:
    • Red teams validating model behavior against prompt injection and jailbreak vectors.
    • Privacy testing focusing on model outputs that may leak sensitive training data.
    • Analysts stress‑testing hallucination rates and evaluating hallucination detection tooling.
    • Supply‑chain reviewers scanning artifacts for unsafe or malicious components.

    Limitations:
    • The release bundles a curated selection of tools (14 in the initial image); coverage depends on those projects’ individual capabilities and update cadence.
    • Dependency isolation strategy splits static vs. dynamic tools which can limit some integrated workflows that assume a single interpreter for all components.
    • No single integrated orchestration layer for multi‑tool scenarios was described; users must coordinate tool workflows within the container environment.

    References:
    • Tool list and organization attribution provided by the release notes; mappings reference MITRE ATLAS technique IDs and Databricks DASF controls.

    🔹 tool #AIsecurity #MITRE #adversarialML #supplychain

    🔗 Source: databricks.com/blog/announcing

  9. European researchers report that poetic prompts can bypass safety guardrails in multiple LLMs, exposing gaps in classifier-based moderation.

    A good reminder that safety systems must evolve alongside generative models - especially as adversarial creativity becomes easier to automate.

    What direction should improvements take?

    Source: wired.com/story/poems-can-tric

    Follow us for more neutral and security-focused AI updates.

    #AISafety #LLMSecurity #AdversarialML #CyberSecurity #MLResearch #TechNadu

  10. European researchers report that poetic prompts can bypass safety guardrails in multiple LLMs, exposing gaps in classifier-based moderation.

    A good reminder that safety systems must evolve alongside generative models - especially as adversarial creativity becomes easier to automate.

    What direction should improvements take?

    Source: wired.com/story/poems-can-tric

    Follow us for more neutral and security-focused AI updates.

    #AISafety #LLMSecurity #AdversarialML #CyberSecurity #MLResearch #TechNadu

  11. European researchers report that poetic prompts can bypass safety guardrails in multiple LLMs, exposing gaps in classifier-based moderation.

    A good reminder that safety systems must evolve alongside generative models - especially as adversarial creativity becomes easier to automate.

    What direction should improvements take?

    Source: wired.com/story/poems-can-tric

    Follow us for more neutral and security-focused AI updates.

    #AISafety #LLMSecurity #AdversarialML #CyberSecurity #MLResearch #TechNadu

  12. 🎯 AI
    ===================

    Executive summary:
    The reported incident describes the use of an LLM-based assistant to craft inputs that evaded an AI-driven web application firewall called NeuroShield. The same report details exploitation of an overlooked API rate limiting control which culminated in a full account takeover. The narrative is a first-person account of testing and bypassing layered AI defenses.

    Technical details:
    • Target product: NeuroShield (described as an "AI-powered WAF" claiming "stop 99.9% of attacks").
    • Evasion technique: Use of an LLM (ChatGPT) to reformulate attack payloads that the WAF misclassified as benign (the article references SQL injection payloads framed as "friendly database compliments").
    • Secondary weakness: An unprotected or misconfigured API rate limit that allowed amplification or repeated requests leading to account compromise.
    • Outcomes stated: Successful bypass of the WAF and a full account takeover (author-reported).

    Analysis:
    The incident highlights two intertwined failure modes: firstly, reliance on model-derived classifiers as a primary gate can be undermined by adversarially generated inputs that exploit model decision boundaries; secondly, conventional API hardening (rate limiting) remains critical and, when absent or misapplied, can enable escalation from bypass to account compromise. The report does not publish IoCs, payload samples, or specific API endpoints.

    Detection:
    The original account does not provide detection signatures or IOCs. Observables that could be relevant (but are not supplied in the report) include anomalous request rates from a single client, repetitive similar payload variants that differ at the surface level but test model boundaries, and unusual session token usage patterns.

    Mitigation / Author guidance:
    The article provides an experiential narrative and does not include formal mitigation playbooks or defensive configurations. No concrete mitigations or patch details were published alongside the write-up.

    References / notes:
    • Product named in the report: NeuroShield
    • Methodology: LLM-assisted payload generation; exploitation of API rate limiting

    🔹 AI #WAF #adversarialML #account_takeover #infosec

    🔗 Source: infosecwriteups.com/how-i-made

  13. In related news, I need to recruit 1-2 new PhD students starting next fall!!

    Likely research topics: Adversarial and explainable ML for large models of text and code.

    (And maybe probabilistic and relational models if another project gets funded.)

    If you want to email me about this, please include “capybara” in the subject line so I know it’s a specific response and not a blanket query.

    #recruiting #adversarialML

  14. In related news, I need to recruit 1-2 new PhD students starting next fall!!

    Likely research topics: Adversarial and explainable ML for large models of text and code.

    (And maybe probabilistic and relational models if another project gets funded.)

    If you want to email me about this, please include “capybara” in the subject line so I know it’s a specific response and not a blanket query.

    #recruiting #adversarialML

  15. In related news, I need to recruit 1-2 new PhD students starting next fall!!

    Likely research topics: Adversarial and explainable ML for large models of text and code.

    (And maybe probabilistic and relational models if another project gets funded.)

    If you want to email me about this, please include “capybara” in the subject line so I know it’s a specific response and not a blanket query.

    #recruiting #adversarialML

  16. In related news, I need to recruit 1-2 new PhD students starting next fall!!

    Likely research topics: Adversarial and explainable ML for large models of text and code.

    (And maybe probabilistic and relational models if another project gets funded.)

    If you want to email me about this, please include “capybara” in the subject line so I know it’s a specific response and not a blanket query.

    #recruiting #adversarialML

  17. In related news, I need to recruit 1-2 new PhD students starting next fall!!

    Likely research topics: Adversarial and explainable ML for large models of text and code.

    (And maybe probabilistic and relational models if another project gets funded.)

    If you want to email me about this, please include “capybara” in the subject line so I know it’s a specific response and not a blanket query.

    #recruiting #adversarialML

  18. Since some people have been asking, here's a preprint:
    arxiv.org/abs/2208.13904

    TL;DR: You can get certified guarantees on robust regression against poisoning and other training set attacks. The trick is to use a voting based predictor (like an ensemble or k-NN) and median.

    We made some revisions during the author feedback and discussion period which haven’t yet been incorporated into the arXiv version. I’ll post again when we have the camera-ready version.
    #SaTML #AdversarialML #NewPaper

  19. Since some people have been asking, here's a preprint:
    arxiv.org/abs/2208.13904

    TL;DR: You can get certified guarantees on robust regression against poisoning and other training set attacks. The trick is to use a voting based predictor (like an ensemble or k-NN) and median.

    We made some revisions during the author feedback and discussion period which haven’t yet been incorporated into the arXiv version. I’ll post again when we have the camera-ready version.
    #SaTML #AdversarialML #NewPaper

  20. Since some people have been asking, here's a preprint:
    arxiv.org/abs/2208.13904

    TL;DR: You can get certified guarantees on robust regression against poisoning and other training set attacks. The trick is to use a voting based predictor (like an ensemble or k-NN) and median.

    We made some revisions during the author feedback and discussion period which haven’t yet been incorporated into the arXiv version. I’ll post again when we have the camera-ready version.
    #SaTML #AdversarialML #NewPaper

  21. Since some people have been asking, here's a preprint:
    arxiv.org/abs/2208.13904

    TL;DR: You can get certified guarantees on robust regression against poisoning and other training set attacks. The trick is to use a voting based predictor (like an ensemble or k-NN) and median.

    We made some revisions during the author feedback and discussion period which haven’t yet been incorporated into the arXiv version. I’ll post again when we have the camera-ready version.
    #SaTML #AdversarialML #NewPaper

  22. Since some people have been asking, here's a preprint:
    arxiv.org/abs/2208.13904

    TL;DR: You can get certified guarantees on robust regression against poisoning and other training set attacks. The trick is to use a voting based predictor (like an ensemble or k-NN) and median.

    We made some revisions during the author feedback and discussion period which haven’t yet been incorporated into the arXiv version. I’ll post again when we have the camera-ready version.
    #SaTML #AdversarialML #NewPaper

  23. Our paper on adversarially-robust regression was accepted to SaTML 2023 (satml.org) -- the first ever IEEE Conference on Secure and Trustworthy Machine Learning!

    I'm really excited about this conference and hoping to see it take off. There's so much important work to do in this area.
    #SaTML #AdversarialML

  24. Our paper on adversarially-robust regression was accepted to SaTML 2023 (satml.org) -- the first ever IEEE Conference on Secure and Trustworthy Machine Learning!

    I'm really excited about this conference and hoping to see it take off. There's so much important work to do in this area.
    #SaTML #AdversarialML

  25. Our paper on adversarially-robust regression was accepted to SaTML 2023 (satml.org) -- the first ever IEEE Conference on Secure and Trustworthy Machine Learning!

    I'm really excited about this conference and hoping to see it take off. There's so much important work to do in this area.
    #SaTML #AdversarialML

  26. Our paper on adversarially-robust regression was accepted to SaTML 2023 (satml.org) -- the first ever IEEE Conference on Secure and Trustworthy Machine Learning!

    I'm really excited about this conference and hoping to see it take off. There's so much important work to do in this area.
    #SaTML #AdversarialML

  27. Our paper on adversarially-robust regression was accepted to SaTML 2023 (satml.org) -- the first ever IEEE Conference on Secure and Trustworthy Machine Learning!

    I'm really excited about this conference and hoping to see it take off. There's so much important work to do in this area.
    #SaTML #AdversarialML

  28. In #AdversarialML, targeted training set attacks are one of the biggest threats to #MachineLearning -- highly effective and hard to detect!

    In a #NewPaper at #CCS2022 this week, Zayd Hammoudeh and I show how you can use #InfluenceEstimation to detect, understand, and stop these attacks!

    Our methods work against backdoor and poisoning attacks, in vision/test/audio domains, and against adaptive attackers.

    dl.acm.org/doi/10.1145/3548606

  29. In #AdversarialML, targeted training set attacks are one of the biggest threats to #MachineLearning -- highly effective and hard to detect!

    In a #NewPaper at #CCS2022 this week, Zayd Hammoudeh and I show how you can use #InfluenceEstimation to detect, understand, and stop these attacks!

    Our methods work against backdoor and poisoning attacks, in vision/test/audio domains, and against adaptive attackers.

    dl.acm.org/doi/10.1145/3548606

  30. In #AdversarialML, targeted training set attacks are one of the biggest threats to #MachineLearning -- highly effective and hard to detect!

    In a #NewPaper at #CCS2022 this week, Zayd Hammoudeh and I show how you can use #InfluenceEstimation to detect, understand, and stop these attacks!

    Our methods work against backdoor and poisoning attacks, in vision/test/audio domains, and against adaptive attackers.

    dl.acm.org/doi/10.1145/3548606

  31. In #AdversarialML, targeted training set attacks are one of the biggest threats to #MachineLearning -- highly effective and hard to detect!

    In a #NewPaper at #CCS2022 this week, Zayd Hammoudeh and I show how you can use #InfluenceEstimation to detect, understand, and stop these attacks!

    Our methods work against backdoor and poisoning attacks, in vision/test/audio domains, and against adaptive attackers.

    dl.acm.org/doi/10.1145/3548606

  32. In #AdversarialML, targeted training set attacks are one of the biggest threats to #MachineLearning -- highly effective and hard to detect!

    In a #NewPaper at #CCS2022 this week, Zayd Hammoudeh and I show how you can use #InfluenceEstimation to detect, understand, and stop these attacks!

    Our methods work against backdoor and poisoning attacks, in vision/test/audio domains, and against adaptive attackers.

    dl.acm.org/doi/10.1145/3548606

  33. @simon @parasbhargava there’s a while literature on adversarial machine learning — if you search for that term, you’ll find lots of info on automated attacks against machine models. Tutorials, blog posts, books. The field is moving very quickly, and different overviews will focus on different aspects, so I don’t have a single recommended overview offhand.

    I’m hoping we can use the hashtag #adversarialML for discussing this topic on Mastodon.

  34. @simon @parasbhargava there’s a while literature on adversarial machine learning — if you search for that term, you’ll find lots of info on automated attacks against machine models. Tutorials, blog posts, books. The field is moving very quickly, and different overviews will focus on different aspects, so I don’t have a single recommended overview offhand.

    I’m hoping we can use the hashtag #adversarialML for discussing this topic on Mastodon.

  35. @simon @parasbhargava there’s a while literature on adversarial machine learning — if you search for that term, you’ll find lots of info on automated attacks against machine models. Tutorials, blog posts, books. The field is moving very quickly, and different overviews will focus on different aspects, so I don’t have a single recommended overview offhand.

    I’m hoping we can use the hashtag #adversarialML for discussing this topic on Mastodon.

  36. @simon @parasbhargava there’s a while literature on adversarial machine learning — if you search for that term, you’ll find lots of info on automated attacks against machine models. Tutorials, blog posts, books. The field is moving very quickly, and different overviews will focus on different aspects, so I don’t have a single recommended overview offhand.

    I’m hoping we can use the hashtag #adversarialML for discussing this topic on Mastodon.

  37. Some thoughts on #adversarialML #machinelearning attacks in different domains (partly in response to comments by @simon):

    IMAGES:
    Attacks against image classifiers are quite effective because image classification is so hard to begin with! A deep network needs to use every little scrap of signal just to distinguish a dog from a dogwood tree — when the signal is ambiguous, something as small as fur texture vs. bark texture might be the deciding “vote.”

  38. Some thoughts on #adversarialML #machinelearning attacks in different domains (partly in response to comments by @simon):

    IMAGES:
    Attacks against image classifiers are quite effective because image classification is so hard to begin with! A deep network needs to use every little scrap of signal just to distinguish a dog from a dogwood tree — when the signal is ambiguous, something as small as fur texture vs. bark texture might be the deciding “vote.”

  39. Some thoughts on #adversarialML #machinelearning attacks in different domains (partly in response to comments by @simon):

    IMAGES:
    Attacks against image classifiers are quite effective because image classification is so hard to begin with! A deep network needs to use every little scrap of signal just to distinguish a dog from a dogwood tree — when the signal is ambiguous, something as small as fur texture vs. bark texture might be the deciding “vote.”

  40. Some thoughts on #adversarialML #machinelearning attacks in different domains (partly in response to comments by @simon):

    IMAGES:
    Attacks against image classifiers are quite effective because image classification is so hard to begin with! A deep network needs to use every little scrap of signal just to distinguish a dog from a dogwood tree — when the signal is ambiguous, something as small as fur texture vs. bark texture might be the deciding “vote.”