
[WIP/Do Not Review] Chatbot Recipe #403

Open · wants to merge 5 commits into main
Conversation

HamidShojanazeri (Contributor)

Main Goal

Building an end-to-end (e2e) recipe for chatbots.

High level idea

We want to focus on the following stages:

  • Data pipelines for creating chatbot datasets
  • Data processing / quality-assurance practices, pipelines, and tooling
  • Evaluation process
  • Fine-tuning a model: best practices for fine-tuning, LoRA/QLoRA, and hyperparameters (a minimal LoRA sketch follows this list)
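
For the fine-tuning stage, here is a minimal LoRA configuration sketch using Hugging Face `peft` (the base model and all hyperparameters are illustrative starting points, not recommendations from this PR):

```python
# Hypothetical sketch: a LoRA setup with Hugging Face peft. All hyperparameters
# and the base model name are illustrative, not tuned recommendations.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder
lora_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```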

Use-case

  • Llama FAQ model using OSS Llama docs, GitHub docs, papers, the website, etc.
  • Proposed data pipeline: use Llama 70B or 13B as the teacher model to create Q&A pairs from the Llama docs mentioned above, as sketched after this list. [Open to any other ideas here]
  • Data quality / eval using the same teacher model. [Open to any other ideas here]
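
To make the proposed pipeline concrete, here is a minimal sketch of teacher-model Q&A generation, assuming an OpenAI-compatible endpoint serving a Llama chat model (the endpoint URL, model name, and prompt are placeholders, not part of this PR):

```python
# Hypothetical sketch of the proposed teacher-model pipeline: ask the teacher
# to turn a documentation chunk into Q&A pairs. Endpoint, model name, and
# prompt are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def generate_qa_pairs(doc_chunk: str) -> list[dict]:
    prompt = (
        "Generate 3 question-answer pairs from the documentation below. "
        "Return a JSON list of objects with 'question' and 'answer' keys.\n\n"
        + doc_chunk
    )
    response = client.chat.completions.create(
        model="meta-llama/Llama-2-70b-chat-hf",  # teacher model per the description
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```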


- **Navigating Dataset Limitations**: The perfect dataset for a specific task may not exist. Be mindful of the limitations when choosing from available resources, and understand the potential impact on your project.

#### **Best Practices for Fine-Tuning Data Preparation**


Generally this is a bit high level; I think a more concrete walkthrough would be much more useful


Gathering a diverse and comprehensive dataset is crucial. The dataset should cover a wide range of topics and conversational styles so the model can handle various subjects. Recent [research](https://arxiv.org/pdf/2305.11206.pdf) shows that data quality matters far more than quantity. Here are some high-level thoughts on data collection and preprocessing, along with best practices:

**NOTE**: Data collection and processing are highly use-case specific. We can only share general best practices here; the details will be nuanced for each use case.


Prefer a more specific use case -- like what does "diversity" mean?

**Tools**

- [wimbd](https://github.com/allenai/wimbd) for data analysis.
- TBD


lilac?
wandb?


**Data Decontamination**

Data decontamination is the removal of evaluation data from the training dataset. This preprocessing step keeps model evaluation honest, ensuring that performance metrics are trustworthy and not inflated by train/eval overlap.


how do you actually do this in practice? is it like data dedup?
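
In practice, decontamination is indeed closely related to deduplication: a common approach is to drop any training example that shares long n-grams with the evaluation set. A minimal sketch follows (the 13-gram threshold and whitespace tokenization are illustrative choices, not from this PR):

```python
# Hypothetical sketch: drop training examples that share any 13-gram with the
# evaluation set. n=13 follows common practice (e.g. GPT-3's decontamination);
# tokenization and threshold here are illustrative.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_texts: list[str], eval_texts: list[str]) -> list[str]:
    eval_ngrams = set().union(*(ngrams(t) for t in eval_texts))
    # Keep only training examples with no n-gram overlap with the eval set.
    return [t for t in train_texts if not (ngrams(t) & eval_ngrams)]
```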


##### Low-Quality Dataset

Below are some examples from real tests of the fine-tuned model, with very poor results. The fine-tuned model does not show promising results on this dataset. Looking at the data, we observed that the number of Q&A pairs for each concept, such as PyTorch FSDP and Llama-Recipes, is very limited, often just one pair per concept, which indicates a lack of relevant training data. Recent research showed that having 2-3 examples for each taxonomy can yield promising results.


what makes the data low-quality?

how do i detect this for my own data?
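
One simple detection heuristic, following the analysis above, is to count Q&A pairs per concept and flag sparse ones. A minimal sketch (the `concept` field name and the threshold are hypothetical):

```python
# Hypothetical sketch: flag concepts with too few Q&A pairs. Assumes each
# example carries a "concept" tag; the field name is illustrative, and the
# threshold of 3 follows the 2-3 examples-per-taxonomy observation above.
from collections import Counter

def sparse_concepts(examples: list[dict], min_pairs: int = 3) -> dict[str, int]:
    counts = Counter(ex["concept"] for ex in examples)
    return {c: n for c, n in counts.items() if n < min_pairs}

# e.g. sparse_concepts(dataset) might return {"PyTorch FSDP": 1, "Llama-Recipes": 1}
```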


# Please implement your own chat service class here. The class should inherit
# from ChatService and implement the execute_chat_request_async method
# (the parameter names below are illustrative).
class OctoAIChatService(ChatService):
    async def execute_chat_request_async(self, api_context, chat_request):
        ...


maybe just use one of those generic openai clients like litellm?
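
For reference, a minimal sketch of the litellm route suggested here (assumes `pip install litellm` and provider credentials in the environment; the model name and prompt are placeholders):

```python
# Hypothetical sketch using litellm as a provider-agnostic client instead of a
# custom ChatService subclass. Model name and prompt are placeholders.
from litellm import completion

response = completion(
    model="gpt-3.5-turbo",  # any litellm-supported model identifier
    messages=[{"role": "user", "content": "What is PyTorch FSDP?"}],
)
print(response.choices[0].message.content)
```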
