Code repository for "Fine-tuning GPT-3 for Synthetic Danish News Generation" (Almasi & Schiønning, 2023) @MinaAlmasi @drasbaek


Fine-tuning GPT-3 for Synthetic Danish News Generation

This repository contains the code written for the paper "Fine-tuning GPT-3 for Synthetic Danish News Generation" (Almasi & Schiønning, 2023).

The project involved fine-tuning GPT-3 to produce synthetic news articles in Danish and evaluating the model in binary classification tasks, i.e., whether its output could be distinguished from real news. The evaluation relied on both human participants (Experiment A) and machine classifiers (Experiment B).

To read the details of this evaluation, please refer to (Almasi & Schiønning, 2023).

Reproducibility

Due to copyright and GDPR constraints, only the test data and the synthetically generated GPT-3 data are uploaded to this GitHub repository. For all other purposes, dummy data is provided so that the pipelines can be reproduced (see also Project Structure). To run any of the pipelines, follow the instructions in the Pipeline section.

For any other questions regarding the project, please contact the authors.

Project Structure

The repository is structured as follows:

| | Description |
|---|---|
| `dummy_data` | Dummy data to run the GPT-3 pipeline, reproduce the plots from Experiment A (human participants) and the technical pipelines from Experiment B (machine classifiers). Created to mimic the actual data to the extent possible. |
| `dummy_results` | Files produced by running the dummy scripts in `src`. Due to the limited dummy data, these may not contain any intelligible information. |
| `data` | The 96 test articles used in both Experiment A and B (i.e., for evaluating both human participants and machine classifiers) and the 609 articles generated by GPT-3 for fine-tuning BERT. |
| `plots` | Plots used in (Almasi & Schiønning, 2023). |
| `results` | Results from the machine classifiers presented in (Almasi & Schiønning, 2023). |
| `src` | All code, organised in the folders `process_articles`, `gpt3` and `classifiers`. |
| `tokens` | Empty folder in which to place `openai_token.txt` (for the GPT-3 pipeline) and `hf_token.txt` (to push the model to the HF Hub, optional). |
| `setup.sh` | Run to install general requirements and packages in a virtual environment. Note that additional setup may be required for the individual pipelines. |
| `simple_classifier.sh` | Run to reproduce the simple classifier pipelines. |
| `bert_classifier.sh` | Run to reproduce the BERT pipeline. |
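As a sketch, the expected token files could be created like this (the token value shown is a placeholder, not a real key):

```python
from pathlib import Path

# Create the tokens/ folder and the OpenAI token file.
# "sk-..." is a placeholder; use your own OpenAI API key.
tokens_dir = Path("tokens")
tokens_dir.mkdir(exist_ok=True)
(tokens_dir / "openai_token.txt").write_text("sk-...")
```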

Please note that the files in `results`, `plots` and `data` contain actual data pertaining to (Almasi & Schiønning, 2023), while the files in `dummy_data` and `dummy_results` do not.

Pipeline

For this project, Python (version 3.10) and R were used. Python's venv module needs to be installed for the setup to work.

General setup

To install the necessary requirements in a virtual environment (env), run setup.sh in the terminal:

bash setup.sh

The individual technical pipelines may require extra setup. These steps are explained in their respective READMEs.

[1] Article Preprocessing

Refer to README.md located in src/process_articles to reproduce the article preprocessing.

[2] Fine-Tuning and Text Generation with GPT-3

To fine-tune and/or generate text with GPT-3 with dummy data, refer to the README.md located in src/gpt3.

⚠️ NOTE! The current script fine-tunes "text-davinci", but this model will be deprecated on 4 January 2024. You can read more about this at https://openai.com/blog/gpt-4-api-general-availability.
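For reference, the legacy OpenAI fine-tuning endpoint expects training data as JSONL prompt/completion pairs (one JSON object per line). A minimal sketch of that format; the Danish example text and filename are illustrative, not taken from the actual training data:

```python
import json

# Illustrative prompt/completion pair in the JSONL format used by the
# legacy OpenAI fine-tuning endpoint.
examples = [
    {"prompt": "Overskrift: Nyt museum åbner i Aarhus ->",
     "completion": " Et nyt museum slår dørene op i Aarhus til sommer. END"},
]

# Write one JSON object per line (JSONL).
with open("train_dummy.jsonl", "w", encoding="utf-8") as f:
    for record in examples:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```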

[3] Experiment A: Analysis of Human Participants

To run the analysis, please refer to the Rmarkdown exp-a-analysis.Rmd in the src folder.

[4] Experiment B: Constructing Machine Classifiers

To construct the machine classifiers (BOW, TF-IDF, fine-tuned BERT), follow the instructions in the README.md located in src/classifiers.
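To illustrate the idea behind the simple classifiers: a bag-of-words representation counts word occurrences per document, and a classifier is then trained on those count vectors. A minimal pure-Python sketch (the project's actual pipeline in src/classifiers may use different tooling):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count word occurrences in a lowercased, whitespace-tokenised text."""
    return Counter(text.lower().split())

# Two toy documents; a classifier would be trained on such count vectors.
doc_a = bag_of_words("Folketinget vedtog i dag en ny lov")
doc_b = bag_of_words("Folketinget vedtog i dag en ny lov om lov")

print(doc_b["lov"])  # → 2, "lov" occurs twice in the second text
```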


⚠️ NOTE! While the fine-tuning of NbAiLab/nb-bert-large is done on dummy data, the inference is done with the actual fine-tuned classifier on the real test data.


The fine-tuned BERT can be accessed from the Hugging Face Hub:

MinaAlmasi/dknews-NB-BERT-AI-classifier

Authors

For any questions regarding the paper or the reproducibility of the project, you can contact us on GitHub: @MinaAlmasi and @drasbaek.
