Code repository for "Fine-tuning GPT-3 for Synthetic Danish News Generation" (Almasi & Schiønning, 2023) @MinaAlmasi @drasbaek


Fine-tuning GPT-3 for Synthetic Danish News Generation

This repository contains the code written for the paper "Fine-tuning GPT-3 for Synthetic Danish News Generation" (Almasi & Schiønning, 2023).

The project involved fine-tuning GPT-3 to produce synthetic news articles in Danish and evaluating the model in binary classification tasks, i.e., whether its output could be distinguished from real news. The evaluation relied on both human participants (Experiment A) and machine classifiers (Experiment B).

To read the details of this evaluation, please refer to (Almasi & Schiønning, 2023).

Reproducibility

Due to copyright and GDPR constraints, only the test data and the synthetically generated GPT-3 data are uploaded to this GitHub repository. For all other purposes, dummy data is provided so that the pipelines can be reproduced (see also Project Structure). To run any of the pipelines, follow the instructions in the Pipeline section.

For any other questions regarding the project, please contact the authors.

Project Structure

The repository is structured as follows:

| | Description |
|---|---|
| `dummy_data` | Dummy data to run the GPT-3 pipeline, reproduce the plots from Experiment A (human participants) and the technical pipelines from Experiment B (machine classifiers). Created to mimic the actual data to the extent possible. |
| `dummy_results` | Files produced by running the dummy scripts in `src`. Due to the limited dummy data, these may not contain any intelligible information. |
| `data` | The 96 test articles used in both Experiment A and B (i.e., for evaluating both human participants and machine classifiers) and the 609 articles generated by GPT-3 for fine-tuning BERT. |
| `plots` | Plots used in (Almasi & Schiønning, 2023). |
| `results` | Results from the machine classifiers presented in (Almasi & Schiønning, 2023). |
| `src` | All code, organised in the folders `process_articles`, `gpt3` and `classifiers`. |
| `tokens` | Empty folder in which to place `openai_token.txt` (for the GPT-3 pipeline) and `hf_token.txt` (to push the model to the HF Hub, optional). |
| `setup.sh` | Run to install general requirements and packages in a virtual environment. Note that additional setup may be required for the individual pipelines. |
| `simple_classifier.sh` | Run to reproduce the simple classifier pipelines. |
| `bert_classifier.sh` | Run to reproduce the BERT pipeline. |
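As a sketch, the expected token files could be created like this (the token value shown is a placeholder, not a real key):

```python
from pathlib import Path

# Create the tokens/ folder and the OpenAI token file.
# "sk-..." is a placeholder; use your own OpenAI API key.
tokens_dir = Path("tokens")
tokens_dir.mkdir(exist_ok=True)
(tokens_dir / "openai_token.txt").write_text("sk-...")
```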

Please note that the files in `results`, `plots` and `data` contain actual data pertaining to (Almasi & Schiønning, 2023), while the files in `dummy_data` and `dummy_results` do not.

Pipeline

For this project, Python (version 3.10) and R were used. Python's venv module needs to be installed for the setup to work.

General setup

To install the necessary requirements in a virtual environment (env), run setup.sh in the terminal:

bash setup.sh

The individual technical pipelines may require extra setup. These steps are explained in their respective READMEs.

[1] Article Preprocessing

Refer to README.md located in src/process_articles to reproduce the article preprocessing.

[2] Fine-Tuning and Text Generation with GPT-3

To fine-tune and/or generate text with GPT-3 with dummy data, refer to the README.md located in src/gpt3.

⚠️ NOTE! The current script fine-tunes "text-davinci", but this model will be deprecated on 4 January 2024. You can read more about this at https://openai.com/blog/gpt-4-api-general-availability.
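For reference, the legacy OpenAI fine-tuning endpoint expects training data as JSONL prompt/completion pairs (one JSON object per line). A minimal sketch of that format; the Danish example text and filename are illustrative, not taken from the actual training data:

```python
import json

# Illustrative prompt/completion pair in the JSONL format used by the
# legacy OpenAI fine-tuning endpoint.
examples = [
    {"prompt": "Overskrift: Nyt museum åbner i Aarhus ->",
     "completion": " Et nyt museum slår dørene op i Aarhus til sommer. END"},
]

# Write one JSON object per line (JSONL).
with open("train_dummy.jsonl", "w", encoding="utf-8") as f:
    for record in examples:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```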

[3] Experiment A: Analysis of Human Participants

To run the analysis, please refer to the Rmarkdown exp-a-analysis.Rmd in the src folder.

[4] Experiment B: Constructing Machine Classifiers

To construct the machine classifiers (BOW, TF-IDF, fine-tuned BERT), follow the instructions in the README.md located in src/classifiers.
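To illustrate the idea behind the simple classifiers: a bag-of-words representation counts word occurrences per document, and a classifier is then trained on those count vectors. A minimal pure-Python sketch (the project's actual pipeline in src/classifiers may use different tooling):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count word occurrences in a lowercased, whitespace-tokenised text."""
    return Counter(text.lower().split())

# Two toy documents; a classifier would be trained on such count vectors.
doc_a = bag_of_words("Folketinget vedtog i dag en ny lov")
doc_b = bag_of_words("Folketinget vedtog i dag en ny lov om lov")

print(doc_b["lov"])  # → 2, "lov" occurs twice in the second text
```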


⚠️ NOTE! While the fine-tuning of NbAiLab/nb-bert-large is done on dummy data, the inference is done with the actual fine-tuned classifier on the real test data.


The fine-tuned BERT can be accessed from the Hugging Face Hub:

MinaAlmasi/dknews-NB-BERT-AI-classifier

Authors

For any questions regarding the paper or the reproducibility of the project, you can contact us on GitHub: @MinaAlmasi and @drasbaek.
