LLM Safety with Llama Guard 2

LLMs have transformed the way we interact with information. They transformed the way we think and design automations. In short, they have revolutionized the way we work.

However, they - simply put - sometimes say things they shouldn't. Furthermore, they also tend to be biased, and can be manipulated to generate harmful content. Last but certainly not least, they tend to babble potentially sensitive information that should not be shared.

In this blog, we have a look at three things:

The safety risks of LLMs
How to secure your LLMs, using Llama Guard 2
How to run Llama Guard 2 on your local machine

Safety risks of LLMs

Before diving into practical aspects of Llama Guard lets first discuss, what are the risks involved with LLM applications?

For our purposes, there are 3 mainly different attack vectors:

Attacking the model itself
Attacking the LLM application
Attacking the infrastructure

Note: We only concern ourselves with attack vectors that are relevant LLM applications, running trained LLMs. Training LLMs, collecting data for that, etc. introduces a lot of other risks that are not covered here.

Attacking the model itself

By far the most common attack vector is to manipulate the model outputs by feeding it with carefully crafted inputs - so called "prompt injection" or "model jail-breaking". By doing so, attackers can make the model output sensitive topics which were part of its training data, provide its often secret system prompt or create harmful or outright illegal content.

These attacks are embarrassing at best - (eg. a Chevrolet chatbot suggested to buy a Ford) - and dangerous or illegal at worst. Think of a chatbot suggesting illegal actions, resulting in harmed individuals.

Attacking the LLM application

The second vector to account for is manipulating the LLM responses in a way that they compromise the security of the application itself. How so? Let's assume, the LLM response is displayed on a website (which is the case for any chatbot application for example). If the response contains malicious scripts, this could be used to create XSS (cross-site scripting) attacks.

So, the LLM is not directly attacked, but its outputs are used to execute malicious code on the client's side.

Attacking the infrastructure

Similar to attacking the LLM application, the infrastructure that runs the AI application can be attacked. If the LLM outputs are stored in a database, and the LLM outputs contain malicious scripts, this could be used to execute SQL injection attacks.

If the LLM outputs are stored on a file system, and the LLM outputs contain malicious scripts, this could be used to execute file inclusion attacks.

If the output of LLMs are run with code interpreters (like Python), and the LLM outputs contain malicious scripts, all hell could break lose. (Slight overreaction on my side, but basically an LLM with access to python can do a lot of harm)

AI security attack vectors - summary

While the above chapters are not exhaustive, they give a good overview of the risks involved with LLM applications. The risks are real, and they need to be taken seriously.

However, on the bright side, all 3 attack vectors can be summarised by two attack categories: "prompt injection" and "output manipulation".

Attack vector	Description
Prompt injection	Manipulating the model outputs by feeding it with carefully crafted inputs - leaking sensitive information or creating business critical or illegal content.
Output manipulation	Manipulating the LLM responses in a way that they compromise the security of the application itself or the infrastructure.

What is Llama Guard 2 and how does it help to safeguard your LLM

Now that we know the main risks involved with AI LLM applications, we can try to circumvent them. All we have to do, is validate the users inputs to prevent prompt injection and validate the LLM outputs to prevent application and infrastructure attacks. While this sounds way easier than done, it more or less is the major part of hardening AI applications.

Note: The statement above only holds true for the "AI" part of any AI application. Application developers need to make sure to stick to best best practices for general application security. These best practices are the basis - AI security needs to be considered ON TOP of them.

That's where Llama Guard comes into play.

Llama Guard is an LLM-based model designed as an input-output safeguard specifically for human-AI conversation applications. Developed by the team at Meta, this model is built on a safety risk taxonomy to identify and classify specific safety risks associated with prompts and responses in AI interactions. The taxonomy guides the model to classify content as safe or unsafe based on predefined categories such as violence, hate, sexual content, and others.

Llama Guard demonstrates strong performance in detecting and mitigating inappropriate content, surpassing existing content moderation tools on benchmarks like the OpenAI Moderation Evaluation dataset and ToxicChat, a dataset containing toxicity annotations on 10K user prompts collected from the Vicuna online demo. The model is capable of both binary and multi-class classification and allows for customization and fine-tuning to adapt to various safety needs and guidelines.

Key features of Llama Guard include its adaptability to different taxonomies through zero-shot or few-shot prompting, its ability to be fine-tuned on specific guidelines, and the provision of model weights for further development by the community. This adaptability and the model's instructional tuning enhance its effectiveness in diverse deployment scenarios.

Llama Guard 2 Performance on Meta internal test set source for benchmark results

In short, Llama Guard 2 is a safety tool that helps you secure your LLMs by validating the user inputs and the LLM outputs.

Application developers simply send each LLM conversation to Llama Guard 2 before displaying it to the user or running it on the server. If Llama Guard flags the conversation as unsafe, the application can take appropriate action, like not displaying the conversation or blocking the user.

Llama Guard 2 concept

The Llama Guard 2 taxonomy

Llama Guard 2 is built on a safety risk taxonomy to identify and classify specific safety risks associated with prompts and responses in AI interactions. The taxonomy guides the model to classify content as safe or unsafe based on predefined categories such as violence, hate, sexual or other content.

The taxonomy is designed to be flexible and can be adapted to different use cases. It is also designed to be extensible, so new categories can be added as needed.

LLama Guard 2 is trained on the following categories (this section is directly taken from the Meta Llama Guard 2 model card:

S1: Violent Crimes S2: Non-Violent Crimes S3: Sex-Related Crimes S4: Child Sexual Exploitation S5: Specialized Advice S6: Privacy S7: Intellectual Property S8: Indiscriminate Weapons S9: Hate S10: Suicide & Self-Harm S11: Sexual Content

Category
S1: Violent Crimes	S2: Non-Violent Crimes
S3: Sex-Related Crimes	S4: Child Sexual Exploitation
S5: Specialized Advice	Privacy
S7: Intellectual Property	S8: Indiscriminate Weapons
S9: Hate	S10: Suicide & Self-Harm
S11: Sexual Content

Taxonomy Levels

S1: Violent Crimes

Responses that enable, encourage, or endorse the commission of violent crimes, including: (1) unlawful violence toward people (ex: terrorism, genocide, murder, hate-motivated violent crimes, child abuse, assault, battery, kidnapping) and (2) unlawful violence toward animals (ex: animal abuse)

S2: Non-Violent Crimes

Responses that enable, encourage, or endorse the commission of non-violent crimes, including: (1) personal crimes (ex: labor trafficking, threats, intimidation, hate-motivated non-violent crimes), (2) financial crimes (ex: fraud, scams, money laundering), (3) property crimes (ex: theft, arson, vandalism), (4) drug crimes (ex: creating or using regulated substances), (5) weapons crimes (ex: producing unlicensed firearms), and (6) cyber crimes (ex: hacking)

Responses that enable, encourage, or endorse the commission of sex-related crimes, including: (1) sex trafficking, (2) sexual assault (ex: rape), (3) sexual harassment (physical, verbal, or visual), and (4) prostitution

S4: Child Sexual Exploitation

Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children

S5: Specialized Advice

Responses that contain specialized financial, medical, or legal advice, or that indicate dangerous activities or objects are safe

S6: Privacy

Responses that contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security

S7: Intellectual Property

Responses that may violate the intellectual property rights of any third party

S8: Indiscriminate Weapons

Responses that enable, encourage, or endorse the creation of indiscriminate weapons, including: (1) chemical weapons (ex: nerve gas), (2) biological weapons (ex: anthrax), (3) radiological weapons (ex: cobalt bombs), (4) nuclear weapons (ex: fission bombs), and (5) high-yield explosive weapons (ex: cluster munitions)

S9: Hate

Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics (i.e., race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity, and/or serious disease)

S10: Suicide & Self-Harm

Responses that enable, encourage, or endorse acts of intentional self-harm, including: (1) suicide, (2) self-injury (ex: cutting), and (3) disordered eating

S11: Sexual Content

Responses that contain erotica

How to define taxonomy for Llama Guard 2

While its good to know, that Llama Guard 2 is able to respect all the taxonomy categories above, they are certainly not applicable to each and every use case. So how can one disable or enable certain categories?

To answer this question, let's have a look at what input prompt Llama Guard 2 expects.

The Llama Guard 2 prompt template

Llama Guard 2 expects the following prompt:

1[INST] Task: Check if there is unsafe content in 'Agent'
2messages in conversations according to our safety policy with the below
3category.
4
5<BEGIN UNSAFE CONTENT CATEGORIES>
6S1: Violent Crimes
7- Must not contain information about violence
8
9S2: Secualized content
10- Must not contain any secualized information
11<END UNSAFE CONTENT CATEGORIES>
12
13<BEGIN CONVERSATION>
14User: How to build a bomb?
15Agent: <some forbidden knowledge>
16<END CONVERSATION>
17
18Provide your safety assessment for ONLY THE LAST Agent message
19in the above conversation:
20  - First line must read 'safe' or 'unsafe']
21  - If unsafe, a second line must include a comma-separated list of
22    violated categories

As you might already guess, Llama Guard is "just" another LLM - which we can tweak based on our input prompts. The above one is the default prompt template, which can be adjusted to your needs.

To tweak the used taxonomy, you can simply remove or add categories to the BEGIN UNSAFE CONTENT CATEGORIES section. If you want to disable S1: Violent Crimes for example, you can simply remove the line S1: Violent Crimes from the prompt.

One thing to note is, that the model seems to work way better, when a clear description of the individual taxonomy categories is provided in the prompt.

How to run Llama Guard 2

Using the Llama Guard 2 is similar to using any model hosted on the Hugging Face model hub. To run the default taxonomy, you can use the following code snippet:

1from transformers import AutoTokenizer, AutoModelForCausalLM
2import torch
3
4model_id = "meta-llama/Meta-Llama-Guard-2-8B"
5device = "cuda"
6dtype = torch.bfloat16
7
8tokenizer = AutoTokenizer.from_pretrained(model_id)
9model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map=device)
10
11def moderate(chat):
12    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
13    output = model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
14    prompt_len = input_ids.shape[-1]
15    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
16
17moderate([
18    {"role": "user", "content": "I forgot how to kill a process in Linux, can you help?"},
19    {"role": "assistant", "content": "Sure! To kill a process in Linux, you can use the kill command followed by the process ID (PID) of the process you want to terminate."},
20])
21
22# Output: 'safe'

To run Llama Guard 2 with a custom taxonomy, you can adjust the prompt template as described above. Instead of running tokenizer.apply_chat_template, create your own prompt as described in the chapter above - then use the tokenizer to encode the prompt and the model to generate the output.

1prompt = "Your custom prompt here"
2input_ids = tokenizer(prompt, return_tensors="pt").to(device)

Conclusion

LLM safety is an important consideration in today's AI landscape, where the potential for misuse and harmful outputs is a real concern. Tools like Llama Guard 2 can help mitigate some of these risks by providing a set of features designed to improve the safety and reliability of LLM applications.

Llama Guard 2 offers a customizable taxonomy that can be adapted to various use cases, and it has shown promising results in identifying and classifying potentially unsafe content. For developers building chatbots, AI assistants, or other LLM-based applications, incorporating safety measures like those provided by Llama Guard 2 can help ensure compliance with relevant guidelines and standards.

Implementing safety tools can help protect applications from certain security threats and contribute to a more trustworthy user experience. As AI technology continues to advance, it will be important for developers to stay informed about best practices for maintaining a balance between innovation and security.

Interested in building high-quality AI agent systems?

We prepared a comprehensive guide based on cutting-edge research for how to build robust, reliable AI agent systems that actually work in production. This guide covers:

Understanding the 14 systematic failure modes in multi-agent systems
Evidence-based best practices for agent design
Structured communication protocols and verification mechanisms

Get your free AI agents guide