SFT LLM News Articles Telugu

This repository contains Python, Node.js, and Jupyter Notebook files for creating an instruct-style dataset of Telugu news articles for supervised fine-tuning (SFT) of Large Language Models (LLMs).

The Telugu News Articles dataset was created with the code in this repository and is open-sourced on Hugging Face Datasets under the Apache 2.0 License. You can access the dataset here: aya-telugu-news-articles.

The repository is useful for users who want to:

- Reproduce the Telugu News Articles dataset creation workflow.
- Extend the existing Telugu News Articles dataset.
- Integrate parts of the Telugu News Articles dataset creation workflow into their own dataset creation workflow.

Note: Scraping a copyrighted website without permission is unethical and not advisable. Please check the website's terms and conditions regarding scraping before proceeding with the workflow.

Installation

Python

Make sure you have Python version 3.9.13 or higher installed. You can check your Python version by running:

python --version

If you don't have Python installed or have an older version, you can download the latest version from the official Python website: https://www.python.org
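The same check can also be done from inside Python, which is handy at the top of a setup script. The following is a minimal sketch, assuming the 3.9.13 minimum stated above; the name `python_is_supported` is made up for illustration and is not part of this repository.

```python
import sys

# Minimum version stated in this README.
MIN_PYTHON = (3, 9, 13)


def python_is_supported(version=None, minimum=MIN_PYTHON):
    """Return True when `version` (default: the running interpreter) meets `minimum`."""
    if version is None:
        version = sys.version_info
    # Compare only (major, minor, micro); tuples compare element by element.
    return tuple(version[:3]) >= tuple(minimum)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    needed = ".".join(str(part) for part in MIN_PYTHON)
    if python_is_supported():
        print(f"Python {found} satisfies the {needed} minimum")
    else:
        print(f"Python {needed} or higher is required, found {found}")
```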

Virtual Environment

It is recommended to create a virtual environment to isolate the project dependencies. To create a virtual environment, run:

python -m venv venv

Activate the virtual environment:

For Windows:
```
venv\Scripts\activate
```
For macOS and Linux:
```
source venv/bin/activate
```

Dependencies

To install the required dependencies for the Python files in the virtual environment, run:

pip install -r requirements.txt

Node.js

Make sure you have Node.js version 18.13.0 or higher installed. You can check your Node.js version by running:

node --version

If you don't have Node.js installed or have an older version, you can download the latest version from the official Node.js website: https://nodejs.org

To install the required dependencies for the Node.js files, run:

npm install

Usage

The dataset creation workflow consists of the following three steps, which need to be performed sequentially.

1. Scraping

Edit the src/utils/scraper-constants.js file according to your specifications, such as the timeout and the links to be scraped.
To scrape the content specified in the previous step, run:
```
node index.js
```
After successful execution, you can find the scraped content JSON file at the SCRAPED_CONTENT_FILE_PATH defined in the scraper-constants.js file.
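Before moving on to the next step, it can help to sanity-check the scraped file. The sketch below is only an illustration: it assumes the JSON file holds a top-level array of objects, which this README does not confirm, and the function names and the example path are hypothetical.

```python
import json
from pathlib import Path


def parse_scraped_articles(text):
    """Parse scraped JSON text and return its records as a list.

    Assumption: the top level of the file is a JSON array.
    """
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array, got {type(data).__name__}")
    return data


def load_scraped_articles(path):
    """Read the scraped content file as UTF-8 (Telugu text) and parse it."""
    return parse_scraped_articles(Path(path).read_text(encoding="utf-8"))


def summarize(records):
    """Return the record count and the sorted set of keys seen across records."""
    keys = sorted({key for record in records if isinstance(record, dict) for key in record})
    return {"count": len(records), "keys": keys}
```

For example, `summarize(load_scraped_articles("path/to/scraped.json"))` reports how many records were scraped and which fields they carry, with the path replaced by your own SCRAPED_CONTENT_FILE_PATH value.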

2. Exploratory Data Analysis

Edit the src/utils/sft_constants.py file according to your specifications.
Run the notebooks/exploratory_data_analysis.ipynb notebook.
The notebook contains detailed steps that perform exploratory data analysis, dataset cleaning, and outlier removal.
After the notebook runs successfully, you can find the cleaned scraped content CSV file at the FINAL_SCRAPED_DATASET_PATH defined in sft_constants.py.
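To give a feel for what outlier removal can look like, here is a minimal sketch that drops articles whose length falls outside Tukey's interquartile-range fences. This is one common technique and not necessarily the one the notebook uses; the `content` key and the function names are assumptions made for illustration.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a sequence of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def drop_length_outliers(rows, text_key="content", k=1.5):
    """Keep only rows whose text length lies within the Tukey fences.

    `text_key` is a hypothetical column name; adjust it to your data.
    """
    lengths = [len(row[text_key]) for row in rows]
    lower, upper = iqr_bounds(lengths, k)
    return [row for row, length in zip(rows, lengths) if lower <= length <= upper]
```

A larger `k` keeps more of the long and short articles; `k=1.5` is the conventional default for this rule.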

3. SFT Dataset Creation

Edit the src/utils/sft_constants.py file according to your specifications.
To create the instruct-style SFT dataset from the scraped content, run:
```
python main.py
```
After successful execution, you can find the final SFT dataset with prompts and completions at the SFT_DATASET_PATH defined in sft_constants.py.
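The idea behind this step, turning each cleaned article into a prompt and completion pair, can be sketched as follows. This is not the code in main.py: the `title` and `content` field names, the prompt template, and the function names are all hypothetical and chosen only to illustrate the shape of an instruct-style record.

```python
import csv
import io

# Hypothetical template; the repository's actual prompts are not shown in this README.
PROMPT_TEMPLATE = "Write a Telugu news article with the following title: {title}"


def build_sft_records(rows, template=PROMPT_TEMPLATE):
    """Turn article rows into prompt/completion pairs, skipping incomplete rows."""
    records = []
    for row in rows:
        title = (row.get("title") or "").strip()
        content = (row.get("content") or "").strip()
        if title and content:
            records.append({"prompt": template.format(title=title), "completion": content})
    return records


def to_csv(records):
    """Serialize the records as CSV text with a prompt,completion header."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["prompt", "completion"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Rows with a missing title or body are skipped rather than emitted with an empty prompt or completion, since such pairs add noise to fine-tuning data.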

Repository Structure

The repository has the following structure:

├── src/
│   ├── python/
│   │   ├── sft.py
│   │   ├── post_processor.py
│   │   └── ...
│   ├── nodejs/
│   │   ├── content-scraper.js
│   │   ├── links-scraper.js
│   │   └── ...
│   ├── data/
│   │   ├── content/tmp
│   │   └── links/
│   └── utils/
│       ├── scraper-constants.js
│       └── sft_constants.py
├── notebooks/
│   └── exploratory_data_analysis.ipynb
├── index.js
├── main.py
├── requirements.txt
├── package.json
└── README.md

The src/ directory contains the main source code files.
- The python/ directory contains the Python files.
- The nodejs/ directory contains the Node.js files.
- The data/ directory contains data files.
  - The content/ directory contains content-related data files.
  - The links/ directory contains link-related data files.
- The utils/ directory contains utility files for both Python and Node.js.
The notebooks/ directory contains the exploratory data analysis notebook.
The main.py file creates the instruct-style SFT dataset.
The index.js file scrapes the content from the website.
The requirements.txt file lists the Python dependencies.
The package.json file lists the Node.js dependencies.

Contributing

Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.

When contributing, please follow the existing code style and conventions used in the project.

License

This code is licensed under the MIT License.

Citation

If you use this code in your work, please cite it as follows:

@software{Guthikonda_SFT_LLM_News_2024,
author = {Guthikonda, Surya},
license = {MIT},
month = apr,
title = {{SFT LLM News Articles Telugu}},
url = {https://github.com/SuryaKrishna02/sft-llm-news-articles-telugu},
version = {1.0.0},
year = {2024}
}
