Llama Guard does not respect custom taxonomy #7
Hi, and thanks for trying out Llama Guard! I tried your policy above with the model to reproduce the behavior. Here is my prompt:
Here are some examples I've tried with this, along with the model generations and classifier scores (i.e., the probability of the "unsafe" token):
To check that these all return safe with very low scores, I also ran these examples with the default policy. Results:
Finally, I'd like to add two notes:
Thank you for the detailed response, I appreciate it. However, I didn't try checking the token probabilities; that seems like a good idea. Overall, I quite like the idea of Llama Guard. The prompting is simple enough and has potential, and it standardizes the approach to content moderation. Can't wait to see the next version of it. Also, is there a way to provide a few examples (few-shot prompting) for a custom taxonomy? I would like to try that as well to see if it makes a difference in the output probability.
Thanks for sharing your analysis in the notebook! Generally, we'd suggest trying only one category when doing zero-shot prompting. It may not work as well with multiple categories, as you saw in your experiments in the Colab notebook. What you can do instead is put each category in a separate prompt and get the same desired result via multiple parallel runs; a sketch of this is shown below. I hope that works for you! The current model was trained on only subsets of the six default categories using a single dataset, with some data augmentation to promote zero-shot promptability, so general zero-shot prompting is not expected to be plug-and-play.
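To make the one-category-per-prompt suggestion concrete, here is a minimal sketch of running several single-category prompts in parallel and combining the verdicts. The prompt template only approximates the Llama Guard format used in llama-recipes, the category texts are invented, and `moderate` is a stub standing in for the actual model call:

```python
# Sketch: run Llama Guard once per category and flag the conversation if any run says unsafe.
from concurrent.futures import ThreadPoolExecutor

CUSTOM_CATEGORIES = [
    "O1: Political Misinformation.\nShould not\n- Make false claims about elections.",
    "O2: Medical Advice.\nShould not\n- Provide diagnoses or prescribe treatment.",
]

PROMPT_TEMPLATE = """[INST] Task: Check if there is unsafe content in 'User' messages in conversations according to our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{category}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

{conversation}

<END CONVERSATION>

Provide your safety assessment for 'User' in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""


def moderate(prompt: str) -> str:
    # Stub: replace with an actual Llama Guard generation call.
    return "safe"


def check_all(conversation: str) -> bool:
    prompts = [
        PROMPT_TEMPLATE.format(category=c, conversation=conversation)
        for c in CUSTOM_CATEGORIES
    ]
    with ThreadPoolExecutor() as pool:
        verdicts = list(pool.map(moderate, prompts))
    # Flag the conversation if any single-category run labels it unsafe.
    return any(v.strip().startswith("unsafe") for v in verdicts)


print(check_all("User: some conversation to screen"))
```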
Hi,
can you share your script to compute the probabilities? Thanks!
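This is not the maintainer's script, but a minimal sketch of one way to compute the unsafe-token probability with Hugging Face transformers; the checkpoint name is an assumption, and `prompt` is expected to already be formatted in the Llama Guard template:

```python
# Sketch: probability that Llama Guard's first generated token is "unsafe".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)


def unsafe_prob(prompt: str) -> float:
    """Return the probability of 'unsafe' at the first generated position
    for a fully formatted Llama Guard prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits            # (1, seq_len, vocab_size)
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    # "unsafe" may be split into several sub-word pieces by the tokenizer;
    # the probability of its first piece is used as an approximation.
    unsafe_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]
    return next_token_probs[unsafe_id].item()
```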
I have tried adding some custom taxonomy categories to unsafe_categories in the llama_guard example (safety_utils.py).
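A rough sketch of what such an addition can look like; the O7 wording and the abridged stand-in for the default block are illustrative only:

```python
# Illustrative only: appending a custom category to the policy string passed to Llama Guard.
# DEFAULT_CATEGORIES stands in for the default six-category block from safety_utils.py,
# and the O7 wording below is invented for demonstration.
DEFAULT_CATEGORIES = "O1: Violence and Hate.\n...\nO6: Self-Harm.\n..."  # abridged default policy

CUSTOM_CATEGORY = """O7: Political Misinformation.
Should not
- Make false or unverifiable claims about elections, candidates, or public institutions.
- Encourage users to spread such claims.
Can
- Discuss political topics in a neutral, factual way."""

unsafe_categories = DEFAULT_CATEGORIES + "\n" + CUSTOM_CATEGORY
```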
However, no matter how politically charged a conversation I pass, it always returns a "safe" response.
When testing the existing taxonomy, I found that the default six categories seem to work correctly.
In the paper Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, section 3.2 states that I can do zero-shot prompting (which, as explained above, does not work) or few-shot prompting. However, there are no examples of the format in which to pass the example conversations in the prompt.