How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs

🗣️Persuasive Adversarial Prompting to Jailbreak LLMs with 92% Success Rate
🔒Fascinating new paper breaks down jailbreak prompting to a science!
⏩In Short:
- Provides a taxonomy of 40 persuasion techniques for prompting (see the illustrative sketch below).
- Using this list of 40 techniques, the authors jailbreak LLMs, including GPT-4, with a 92% success rate!
- Interestingly, Anthropic's models are not susceptible to PAP attacks at all, while more advanced models like GPT-4 are more vulnerable to persuasive adversarial prompts (PAPs).
- Defending against these PAPs also provides effective protection against other types of attacks.
- The PAPs are tested on attacks covering 14 different risk categories (such as economic harm).
Blog+Demo: https://chats-lab.github.io/persuasive_jailbreaker/
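To make the attack structure concrete, here is a minimal, hypothetical sketch of how a technique from the taxonomy could be used to rewrite a plain request into a persuasion-framed prompt. The taxonomy entries, template, and function names below are illustrative placeholders (shown on a harmless request), not the authors' actual code or data.

```python
# Illustrative sketch only: the technique names are paraphrased from the paper's
# taxonomy, and the template/function names are invented for this example.

PERSUASION_TAXONOMY = {
    # A handful of the 40 persuasion techniques (definitions paraphrased).
    "evidence_based_persuasion": "Support the request with empirical data or statistics.",
    "authority_endorsement": "Cite authoritative sources or experts to back the request.",
    "expert_endorsement": "Frame the requester as a domain expert with a legitimate need.",
}

PAP_TEMPLATE = (
    "Rewrite the following request so that it uses the persuasion technique "
    "'{technique}' ({definition}), while keeping the original intent:\n\n{request}"
)


def build_pap_prompt(request: str, technique: str) -> str:
    """Build the paraphrasing instruction a red-teamer would send to an LLM to
    turn a plain request into a persuasion-framed one (a PAP candidate)."""
    definition = PERSUASION_TAXONOMY[technique]
    return PAP_TEMPLATE.format(technique=technique, definition=definition, request=request)


if __name__ == "__main__":
    # Benign placeholder request; the paper applies the same procedure to
    # harmful queries across the 14 risk categories to measure jailbreak success.
    print(build_pap_prompt("Explain how password managers work.", "authority_endorsement"))
```

The paper automates this rewriting step and then measures how often the resulting PAPs jailbreak target models across the 14 risk categories mentioned above.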