Greek Words Evolution

Overview

A systematic framework that uses diachronic word embeddings to trace semantic shifts or variations in the context of words over time in the Greek language.

Research paper

This repository accompanies the paper "Studying the Evolution of Greek Words via Word Embeddings" by V. Barzokas, E. Papagiannopoulou and G. Tsoumakas, published in the proceedings of the 11th Hellenic Conference on Artificial Intelligence (SETN 2020). It contains the tools developed and the data prepared for that paper. The paper is available at https://doi.org/10.1145/3411408.3411425

If you use this code and/or data in your research, please cite the following:

@inproceedings{10.1145/3411408.3411425,
	author    = {Barzokas, Vasileios and
	             Papagiannopoulou, Eirini and
	             Tsoumakas, Grigorios},
	title     = {Studying the Evolution of Greek Words via Word Embeddings},
	booktitle = {11th Hellenic Conference on Artificial Intelligence},
	pages     = {118--124},
	year      = {2020},
	location  = {Athens, Greece},
	series    = {SETN 2020},
	isbn      = {9781450388788},
	publisher = {Association for Computing Machinery},
	address   = {New York, NY, USA},
	url       = {https://doi.org/10.1145/3411408.3411425},
	doi       = {10.1145/3411408.3411425}
}

Example visualized result

The most relevant word is highlighted; the remaining words are shown in descending order of relevance.

Requirements

Python 3.6.9
fastText - a library for efficient learning of word representations and sentence classification.

Installation

Clone this repository by running:

git clone git@github.com:intelligence-csd-auth-gr/greek-words-evolution.git

Clone the required fastText repository by running:
```
git submodule init
git submodule update
```
Install the fastText library for your system as described in its documentation: https://github.com/facebookresearch/fastText

Note: Normally, the following is all that is required:
```
cd fastText
make
pip install .
```
Install the required Python libraries by running:
```
pip install -r requirements.txt
```

Running

First steps

If you are running the tool for the first time, create the text files per period by running:
```
python gws.py text --action exportByPeriod
```
Then create the models from those text files by running:
```
python gws.py model --action create
```

Once the models have been generated, you can list the nearest neighbours of a word with a command like this:

python gws.py model --action getNN --word ποντίκι --period 2010

output:

['ποντικι', 'φακα', 'πιασμενο', 'ταμπλετ', 'κατσαβιδι', 'μπιλη', 'γατα', 'ποντικοπαγιδα', 'αραχνη', 'βιντεοκασετα', 'κοριο', 'πληκτρολογιο', 'ποντικο', 'κλακ', 'κατεβασεις', 'μιξερ', 'ποντικακι', 'τσιπακι', 'μεγαλουτσικο', 'συνδεθω', 'μυγοσκοτωστρα']
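
In the example above, the query ποντίκι comes back as ποντικι, which suggests that words are lowercased and stripped of accents before lookup. The snippet below is a minimal sketch of such a normalization, assuming that is what happens; `normalize_greek` is a hypothetical helper name and is not taken from this repository.

```python
import unicodedata


def normalize_greek(word):
    """Lowercase a word and strip its accents (hypothetical helper).

    Assumption: the pipeline normalizes words this way; the repository
    may apply additional or different steps.
    """
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word.lower())
    # Drop the combining marks (accents, diaeresis) and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


normalized = normalize_greek("ποντίκι")
```

Here `normalized` holds the unaccented, lowercase form that matches the first entry of the output list.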

Get the 10 words with the highest semantic change, based on their cosine distance:
```
python gws.py model --action getCD --fromYear 1980 --toYear 2020
```
Get the 10 words ranked by cosine similarity instead (the cosine distance list sorted in the opposite order):
```
python gws.py model --action getCS --fromYear 1980 --toYear 2020
```
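Cosine distance is one minus cosine similarity, so ranking words by distance in descending order is the same as ranking them by similarity in ascending order. The snippet below is a minimal sketch of that ranking on toy vectors. It assumes the vectors of the two periods are already comparable; the function names and the data are illustrative and do not reproduce the code in `gws.py`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


def rank_by_distance(early, late, top_n=10):
    """Rank words shared by both periods, largest distance first."""
    shared = set(early) & set(late)
    scored = [(word, cosine_distance(early[word], late[word])) for word in shared]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy two-dimensional vectors standing in for two period models.
early = {"mouse": [1.0, 0.0], "cat": [0.0, 1.0], "book": [1.0, 1.0]}
late = {"mouse": [0.0, 1.0], "cat": [0.0, 2.0], "book": [1.0, 0.0]}
ranking = rank_by_distance(early, late)
```

In this toy data "mouse" rotates by 90 degrees and ranks first, while "cat" only changes length, not direction, and ranks last.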

Options

The script accepts one of the following positional arguments:

website - allows actions on the websites, such as URL extraction and file downloading.
metadata - allows actions on the metadata, such as metadata display or export.
text - allows actions on the text, such as text extraction and metadata display or export.
model - allows actions on the trained models, such as training, evaluation through nearest neighbours, or tracking shifts of word meanings across periods.

To see the full list of available options with a short description of each, run:

python gws.py --help

The snippets below display a brief description of each of the options that the positional arguments accept.

argument: website

usage: gws.py website [-h] [--target {openbook}]
                      [--action {fetchLinks,fetchMetadata,fetchFiles}]

optional arguments:
  -h, --help            show this help message and exit
  --target {openbook}   Target website to scrap data from
  --action {fetchLinks,fetchMetadata,fetchFiles}
                        The action to execute on the selected website

argument: metadata

usage: gws.py metadata [-h] [--corpus {all,openbook,project_gutenberg}]
                       [--action {printStandard,printEnhanced,exportEnhanced}]
                       [--fromYear FROMYEAR] [--toYear TOYEAR]
                       [--splitYearsInterval SPLITYEARSINTERVAL]

optional arguments:
  -h, --help            show this help message and exit
  --corpus {all,openbook,project_gutenberg}
                        The name of the target corpus to work with
  --action {printStandard,printEnhanced,exportEnhanced}
                        Action to perform against the metadata of the selected
                        text corpus
  --fromYear FROMYEAR   The target starting year to extract data from
  --toYear TOYEAR       The target ending year to extract data from
  --splitYearsInterval SPLITYEARSINTERVAL
                        The interval to split the years with and export the
                        extracted data

argument: text

usage: gws.py text [-h] [--corpus {all,openbook,project_gutenberg}]
                   [--action {combineByPeriod,extractFromPDF}]
                   [--fromYear FROMYEAR] [--toYear TOYEAR]
                   [--splitYearsInterval SPLITYEARSINTERVAL]

optional arguments:
  -h, --help            show this help message and exit
  --corpus {all,openbook,project_gutenberg}
                        The name of the target corpus to work with
  --action {combineByPeriod,extractFromPDF}
                        Action to perform against the selected text corpus
  --fromYear FROMYEAR   The target starting year to extract data from
  --toYear TOYEAR       The target ending year to extract data from
  --splitYearsInterval SPLITYEARSINTERVAL
                        The interval to split the years with and export the
                        extracted data
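
The `--fromYear`, `--toYear` and `--splitYearsInterval` options together describe a range of years cut into fixed-size periods. The snippet below is a minimal sketch of one way such a split could work. The inclusive boundaries and the function names are assumptions made for illustration; the script itself may choose its period boundaries differently.

```python
def split_into_periods(from_year, to_year, interval):
    """Cut [from_year, to_year] into consecutive inclusive periods.

    Assumption: periods start at from_year and the last one is
    truncated at to_year.
    """
    if interval <= 0:
        raise ValueError("interval must be a positive number of years")
    periods = []
    start = from_year
    while start <= to_year:
        end = min(start + interval - 1, to_year)
        periods.append((start, end))
        start += interval
    return periods


def period_for_year(year, periods):
    """Return the (start, end) period containing year, or None."""
    for start, end in periods:
        if start <= year <= end:
            return (start, end)
    return None


periods = split_into_periods(1980, 2020, 10)
```

With these assumed boundaries, a text published in 2015 would fall into the period starting in 2010.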

argument: model

usage: gws.py model [-h] [--action {create,getNN,getCS,getCD}] [--word WORD]
                    [--period PERIOD] [--textsFolder TEXTSFOLDER]
                    [--fromYear FROMYEAR] [--toYear TOYEAR]

optional arguments:
  -h, --help            show this help message and exit
  --action {create,getNN,getCS,getCD}
                        Action to perform against the selected model
  --word WORD           Target word to get nearest neighbours for
  --period PERIOD       The target period to load the model from
  --textsFolder TEXTSFOLDER
                        The target folder that contains the texts files
  --fromYear FROMYEAR   the target starting year to create the model for
  --toYear TOYEAR       the target ending year to create the model for

License

Apache License 2.0
