#adversarialml — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #adversarialml, aggregated by home.social.
Quite fascinating. If confirmed, this may reveal a structural weakness in how refusal is implemented in some LLMs: the accept/refuse mechanism may be relatively isolated in the internal representations, and therefore observable and manipulable. Tools like Heretic make this visible.
One possible mitigation is cryptographic signing of model weights, so that unauthorized modifications are detected when the model is loaded for inference.
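A rough sketch of what such a load-time check could look like, using only the Python standard library. The function names (`sign_weights`, `verify_weights`, `load_weights_checked`) are hypothetical and not taken from any existing tool. HMAC-SHA256 is used here as a symmetric stand-in. A real deployment would more likely use a public-key signature so that the loader never holds the signing key.

```python
import hashlib
import hmac
from pathlib import Path


def sign_weights(path: Path, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag computed over the weight file's bytes."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large weight files need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            mac.update(chunk)
    return mac.hexdigest()


def verify_weights(path: Path, key: bytes, expected_tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign_weights(path, key), expected_tag)


def load_weights_checked(path: Path, key: bytes, expected_tag: str) -> bytes:
    """Hypothetical loader: refuse to return weights that fail the check."""
    if not verify_weights(path, key, expected_tag):
        raise ValueError(f"integrity check failed for {path}")
    return path.read_bytes()
```

A check like this only detects tampering where the loader actually enforces it. It does nothing against someone who controls their own copy of the weights and simply skips verification.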
#AISafety #LLMSecurity #CyberSecurity #AIRedTeaming #AdversarialML #LLM