
[WIP/Do Not Review] Chatbot Recipe #403

Open · wants to merge 5 commits into main
Conversation

HamidShojanazeri (Contributor)

Main Goal

Building an end-to-end (e2e) recipe for chatbots.

High level idea

We want to focus on the following stages:

  • Data pipelines for creating chatbot datasets
  • Data processing / quality-assurance practices, pipelines, and tooling
  • Evaluation process
  • Fine-tuning a model: best practices for fine-tuning, LoRA/QLoRA, and hyperparameters (a minimal LoRA sketch follows this list)
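
For the fine-tuning stage, here is a minimal LoRA configuration sketch using Hugging Face `peft` (the base model and all hyperparameters are illustrative starting points, not recommendations from this PR):

```python
# Hypothetical sketch: a LoRA setup with Hugging Face peft. All hyperparameters
# and the base model name are illustrative, not tuned recommendations.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder
lora_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```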

Use-case

  • Llama FAQ model using OSS Llama docs, GitHub docs, papers, the website, etc.
  • Proposed data pipeline: use Llama 70B or 13B as the teacher model to create Q&A pairs from the Llama docs mentioned above, as sketched after this list. [Open to any other ideas here]
  • Data quality / eval using the same teacher model. [Open to any other ideas here]
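
To make the proposed pipeline concrete, here is a minimal sketch of teacher-model Q&A generation, assuming an OpenAI-compatible endpoint serving a Llama chat model (the endpoint URL, model name, and prompt are placeholders, not part of this PR):

```python
# Hypothetical sketch of the proposed teacher-model pipeline: ask the teacher
# to turn a documentation chunk into Q&A pairs. Endpoint, model name, and
# prompt are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def generate_qa_pairs(doc_chunk: str) -> list[dict]:
    prompt = (
        "Generate 3 question-answer pairs from the documentation below. "
        "Return a JSON list of objects with 'question' and 'answer' keys.\n\n"
        + doc_chunk
    )
    response = client.chat.completions.create(
        model="meta-llama/Llama-2-70b-chat-hf",  # teacher model per the description
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```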


- **Navigating Dataset Limitations**: The perfect dataset for a specific task may not exist. Be mindful of the limitations when choosing from available resources, and understand the potential impact on your project.

#### **Best Practices for Fine-Tuning Data Preparation**


Generally this is a bit high level; I think a more concrete walkthrough would be much more useful


Gathering a diverse and comprehensive dataset is crucial. The dataset should cover a wide range of topics and conversational styles so the model can handle various subjects. Recent [research](https://arxiv.org/pdf/2305.11206.pdf) shows that data quality matters far more than quantity. Here are some high-level thoughts on data collection and preprocessing, along with best practices:

**NOTE**: Data collection and processing are highly use-case specific. We can only share general best practices here; the details will be nuanced for each use case.


Prefer a more specific use case -- like what does "diversity" mean?

**Tools**

- [wimbd](https://github.com/allenai/wimbd) for data analysis.
- TBD


lilac?
wandb?


**Data Decontamination**

Data decontamination is the removal of evaluation data from the training dataset. This preprocessing step keeps model evaluation honest, ensuring that performance metrics are trustworthy and not inflated by train/eval overlap.


how do you actually do this in practice? is it like data dedup?
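
In practice, decontamination is indeed closely related to deduplication: a common approach is to drop any training example that shares long n-grams with the evaluation set. A minimal sketch follows (the 13-gram threshold and whitespace tokenization are illustrative choices, not from this PR):

```python
# Hypothetical sketch: drop training examples that share any 13-gram with the
# evaluation set. n=13 follows common practice (e.g. GPT-3's decontamination);
# tokenization and threshold here are illustrative.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_texts: list[str], eval_texts: list[str]) -> list[str]:
    eval_ngrams = set().union(*(ngrams(t) for t in eval_texts))
    # Keep only training examples with no n-gram overlap with the eval set.
    return [t for t in train_texts if not (ngrams(t) & eval_ngrams)]
```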


##### Low-Quality Dataset

Below are some examples from real tests of the fine-tuned model, with very poor results. The fine-tuned model does not show promising results on this dataset. Looking at the data, we observed that the number of Q&A pairs for each concept, such as PyTorch FSDP and Llama-Recipes, is very limited, often just one pair per concept, which indicates a lack of relevant training data. Recent research showed that having 2-3 examples for each taxonomy can yield promising results.


what makes the data low-quality?

how do i detect this for my own data?
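
One simple detection heuristic, following the analysis above, is to count Q&A pairs per concept and flag sparse ones. A minimal sketch (the `concept` field name and the threshold are hypothetical):

```python
# Hypothetical sketch: flag concepts with too few Q&A pairs. Assumes each
# example carries a "concept" tag; the field name is illustrative, and the
# threshold of 3 follows the 2-3 examples-per-taxonomy observation above.
from collections import Counter

def sparse_concepts(examples: list[dict], min_pairs: int = 3) -> dict[str, int]:
    counts = Counter(ex["concept"] for ex in examples)
    return {c: n for c, n in counts.items() if n < min_pairs}

# e.g. sparse_concepts(dataset) might return {"PyTorch FSDP": 1, "Llama-Recipes": 1}
```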


# Please implement your own chat service class here. The class should inherit
# from ChatService and implement the execute_chat_request_async method
# (the parameter names below are illustrative).
class OctoAIChatService(ChatService):
    async def execute_chat_request_async(self, api_context, chat_request):
        ...


maybe just use one of those generic openai clients like litellm?
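
For reference, a minimal sketch of the litellm route suggested here (assumes `pip install litellm` and provider credentials in the environment; the model name and prompt are placeholders):

```python
# Hypothetical sketch using litellm as a provider-agnostic client instead of a
# custom ChatService subclass. Model name and prompt are placeholders.
from litellm import completion

response = completion(
    model="gpt-3.5-turbo",  # any litellm-supported model identifier
    messages=[{"role": "user", "content": "What is PyTorch FSDP?"}],
)
print(response.choices[0].message.content)
```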
