Finetune-ESM

Scalable Protein Language Model Finetuning with Distributed Learning and Advanced Training Techniques such as LoRA.


Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project

This project explores scalable and efficient finetuning of protein language models such as ESM-2 using advanced training techniques, including FSDP (Fully Sharded Data Parallel) and LoRA (Low-Rank Adaptation).
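To make the LoRA idea concrete, below is a minimal sketch that attaches low-rank adapters to an ESM-2 checkpoint using Hugging Face transformers and peft. This is an illustration rather than the project's exact training code; the rank and alpha values are assumed, and target_modules relies on the names of ESM-2's attention projections in transformers.

from transformers import EsmForSequenceClassification
from peft import LoraConfig, get_peft_model

# Load the smallest ESM-2 checkpoint with a 100-label classification head
# (multi_label_classification makes the head use per-label binary losses).
model = EsmForSequenceClassification.from_pretrained(
    "facebook/esm2_t6_8M_UR50D",
    num_labels=100,
    problem_type="multi_label_classification",
)

# Wrap the attention projections with low-rank adapters; only these small
# matrices are trained while the base model stays frozen.
config = LoraConfig(
    r=8,                                # adapter rank (assumed value)
    lora_alpha=16,                      # scaling factor (assumed value)
    target_modules=["query", "value"],  # ESM-2 attention projection names
    lora_dropout=0.1,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()      # typically well under 1% trainable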

Highlights

  • Distributed training: Leverage distributed computing for finetuning large protein language models on multiple GPUs (see the FSDP sketch after this list).
  • Advanced techniques: Explore LoRA and other methods to improve finetuning efficiency and performance.
  • Reproducibility: Track and manage finetuning experiments using tools like MLflow.
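Because the project is built on Lightning, multi-GPU runs can be expressed through a Trainer strategy. The sketch below shows sharded training with Lightning's FSDPStrategy; it assumes a Lightning 2.x install, and the model/datamodule objects are placeholders rather than the project's actual classes.

import lightning.pytorch as pl
from lightning.pytorch.strategies import FSDPStrategy

# FSDP shards parameters, gradients, and optimizer state across GPUs,
# so models that do not fit on a single device can still be finetuned.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                 # one worker per GPU
    strategy=FSDPStrategy(),
    max_epochs=5,
)
# trainer.fit(model, datamodule)  # model/datamodule defined elsewhere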

(back to top)

Built With

  • Python
  • PyTorch
  • ESM
  • Lightning
  • Ray
  • MLflow
  • Transformers

(back to top)

Getting Started

  1. Clone the repo:
git clone https://github.com/naity/finetune-esm.git
  2. From the repo root, run the train.py script to see a list of available parameters:
python finetune_esm/train.py --help

Prerequisites

The requirements.txt file lists the Python packages required to run the scripts. Install them with:

pip install -r requirements.txt

(back to top)

Usage

In this example, we will finetune ESM-2 for the CAFA 5 Protein Function Prediction Challenge to predict the biological function of a protein from its primary sequence. I have already preprocessed the data and framed the task as a multi-label classification problem: for a given protein sequence, we predict whether it is positive for each of 100 preselected Gene Ontology (GO) terms. The target for each protein sequence is therefore a binary vector of length 100.
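In code, each training example pairs a sequence with a 100-dimensional 0/1 vector, and the natural loss treats every GO term as an independent binary prediction. A small PyTorch sketch (the term indices and logits are made up for illustration):

import torch

# Target for one protein: 1 where it is annotated with the corresponding
# GO term, 0 elsewhere (indices here are purely illustrative).
target = torch.zeros(1, 100)
target[0, [3, 17, 42]] = 1.0

# The model emits one logit per GO term; BCEWithLogitsLoss scores each of
# the 100 terms as a separate binary classification.
logits = torch.randn(1, 100)  # stand-in for the model's output
loss = torch.nn.BCEWithLogitsLoss()(logits, target)
print(loss.item())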

The processed datasets can be downloaded from here. Details about the preprocessing steps can be found in the notebooks/cafa5_data_processing.ipynb notebook.
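After downloading, a quick way to sanity-check the files (a sketch; the file paths match the training command below, but the parquet's column names depend on the preprocessing notebook and are not assumed here):

import numpy as np
import pandas as pd

# Paths match the --dataset-loc and --targets-loc flags used below.
df = pd.read_parquet("data/cafa5/top100_train_split.parquet")
targets = np.load("data/cafa5/train_bp_top100_targets.npy")

print(df.shape, targets.shape)  # expect one 100-dim target row per sequence
print(df.columns.tolist())      # columns come from the preprocessing notebook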

Run the following example command to finetune ESM-2 models with the processed datasets. Here, we are using the smallest model, esm2_t6_8M_UR50D, with 1 GPU and the LoRA approach. If you want to finetune a larger model and have multiple GPUs, adjust --num-workers and/or --num-devices accordingly.

python finetune_esm/train.py \
  --experiment-name esm2_t6_8M_UR50D_lora \
  --dataset-loc data/cafa5/top100_train_split.parquet \
  --targets-loc data/cafa5/train_bp_top100_targets.npy \
  --esm-model esm2_t6_8M_UR50D \
  --num-workers 1 \
  --num-devices 1 \
  --training-mode lora \
  --learning-rate 0.0001 \
  --num-epochs 5

Once training is done, we can use MLflow to view the experiment using:

mlflow server --host 127.0.0.1 --port 8080 --backend-store-uri ./finetune_results/mlflow
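The same runs can also be queried programmatically with the mlflow client instead of the UI. A sketch (the tracking URI mirrors --backend-store-uri above, the experiment name is the one passed via --experiment-name, and search_runs with experiment_names assumes a recent MLflow release):

import mlflow

# Point the client at the same store the server command reads from.
mlflow.set_tracking_uri("./finetune_results/mlflow")

# Fetch every run logged under the example experiment as a DataFrame.
runs = mlflow.search_runs(experiment_names=["esm2_t6_8M_UR50D_lora"])
print(runs[["run_id", "status"]])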

Below are screenshots of the example experiment, where we can inspect parameters and artifacts and visualize the logged metrics.


(back to top)

Roadmap

  • Data Processing
  • Training
  • Serving

See the open issues for a full list of proposed features (and known issues).

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

ytiancompbio · @yuan_tian

(back to top)

Acknowledgments

(back to top)
