corpus

HausaHate is a benchmark dataset for the Hausa hate speech detection task. It was extracted from West African Facebook pages and comprises 2,000 comments annotated with a binary class (offensive or non-offensive) and a hate speech target (race, gender, or none).

benchmark machine-learning natural-language-processing corpus dataset nlp-machine-learning offensive-language hate-speech low-resource-languages hausa-nlp

Updated Jun 6, 2024

esteeschwarz / SPUND-LX

linguistics essays

corpus linguistics

Updated Jun 5, 2024
HTML

flairNLP / fundus

A very simple news crawler with a funny name

python nlp rss sitemap crawler scraper corpus text-extraction web-scraping news-crawler commoncrawl web-corpus news-scraping cc-news

Updated Jun 5, 2024
Python

INL / corpus-frontend

BlackLab Frontend, a feature-rich corpus search interface for BlackLab.

corpus

Updated Jun 6, 2024
TypeScript

INL / BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene

corpus

Updated Jun 5, 2024
Java

CanCLID / canto-filter

粵文語料篩選器 Cantonese text filter

nlp data corpus cantonese corpus-data cantonese-language

Updated Jun 4, 2024
Python

clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora

corpus tei-xml parliamentary-data

Updated Jun 4, 2024
XSLT

SaiedAlshahrani / leveraging-corpus-metadata

Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition

metadata translation wikipedia corpus arabic egyptian detection-systems template-based-translation

Updated Jun 4, 2024
Jupyter Notebook

sparkfish / shabby-pages

ShabbyPages is a state-of-the-art corpus of born-digital document images with both ground-truth and distorted versions, suitable for training models to reverse distortions and recover the original, denoised documents.

data-science computer-vision corpus dataset binarization denoising layout-detection born-digital

Updated Jun 6, 2024
Jupyter Notebook

Superar / Puntuguese

nlp natural-language-processing corpus corpus-linguistics portuguese portuguese-language humor-detection portuguese-brazilian humor-classification portuguese-european

Updated Jun 4, 2024
Python

corpus

Here are 851 public repositories matching this topic...

jdave23 / EAD-corpus

flucoma / learn-website

adbar / trafilatura

divvun / CorpusTools

scriptin / kanji-frequency

DEFI-COLaF / Datasets_text

PyThaiNLP / thaigov-v2-corpus

luciamariaalvarezcrespo / GalMisoCorpus2023

ina-foss / InaGVAD

gertjanbruggink / threat-landscape-training-corpus

franciellevargas / HausaHate

esteeschwarz / SPUND-LX

flairNLP / fundus

INL / corpus-frontend

INL / BlackLab

CanCLID / canto-filter

clarin-eric / ParlaMint

SaiedAlshahrani / leveraging-corpus-metadata

sparkfish / shabby-pages

Superar / Puntuguese
