
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs

Zain Hasan · One min read

A preview of the paper

🗣️Persuasive Adversarial Prompting to Jailbreak LLMs with 92% Success Rate

🔒Fascinating new paper breaks down jailbreak prompting to a science!

⏩In Short:

  1. The authors provide a taxonomy of 40 persuasion techniques that can be used to rewrite plain queries into persuasive adversarial prompts (see the sketch after this list)

  2. Using these 40 techniques, they can jailbreak LLMs, including GPT-4, with a 92% success rate!

  3. Interestingly, Anthropic's models are not susceptible to persuasive adversarial prompt (PAP) attacks at all, while more advanced models like GPT-4 are actually more vulnerable to PAPs!

  4. Defending against PAPs also provides effective protection against other types of attacks

  5. They test PAPs across 14 different risk categories (such as economic harm)
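To make the idea concrete, here is a minimal sketch of how one persuasion technique from such a taxonomy might be turned into a paraphrasing prompt for a "persuasive paraphraser" model. The technique names and definitions below are illustrative paraphrases of taxonomy entries, and `build_paraphrase_prompt` and the example query are hypothetical, not the authors' exact implementation.

```python
from dataclasses import dataclass


@dataclass
class PersuasionTechnique:
    """One entry from a persuasion taxonomy (name/definition are illustrative)."""
    name: str
    definition: str


# A couple of illustrative entries; the paper's taxonomy contains 40 such techniques.
TECHNIQUES = [
    PersuasionTechnique(
        name="Evidence-based Persuasion",
        definition="Support the request with empirical data, statistics, or facts.",
    ),
    PersuasionTechnique(
        name="Authority Endorsement",
        definition="Cite credible authorities or institutions to back the request.",
    ),
]


def build_paraphrase_prompt(technique: PersuasionTechnique, plain_query: str) -> str:
    """Assemble the instruction a paraphraser model would receive:
    rewrite a plain query so that it applies the given persuasion technique."""
    return (
        f"Persuasion technique: {technique.name}\n"
        f"Definition: {technique.definition}\n\n"
        "Rewrite the following request so that it applies this technique, "
        "while keeping the original intent unchanged:\n"
        f"Request: {plain_query}\n"
        "Rewritten request:"
    )


if __name__ == "__main__":
    # Benign placeholder query; in the paper, the generated PAPs are sent to
    # target LLMs and the responses are scored for policy violations.
    prompt = build_paraphrase_prompt(TECHNIQUES[0], "Explain how vaccines work.")
    print(prompt)
```

In the paper, the paraphrasing itself is done by a model rather than a hand-written template, and the target model's responses are scored automatically; the sketch above only illustrates the prompt-construction step.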

Blog+Demo: https://chats-lab.github.io/persuasive_jailbreaker/

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, and begin building amazing apps with the free trial of Weaviate Cloud (WCD).
