A collection of encoded archival description XML documents for text and content analysis.
Updated Jun 6, 2024 - Shell
FluCoMa's Learn Platform
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Kanji usage frequency data collected from various sources
Thai news dataset from a Thai government website.
📑 Galician corpus for misogyny detection
Voice activity detection and speaker gender segmentation audiovisual corpus
This directory contains PDFs to train both humans & models in discussing cyber threats and threat landscapes.
HausaHate is a benchmark dataset for the Hausa hate speech detection task. It was extracted from West African Facebook pages and comprises 2,000 comments annotated with a binary class (offensive or non-offensive) and hate speech targets (race, gender, or none).
A very simple news crawler with a funny name
BlackLab Frontend, a feature-rich corpus search interface for BlackLab.
Linguistic search for large annotated text corpora, based on Apache Lucene
粵文語料篩選器 Cantonese text filter
ParlaMint: Comparable Parliamentary Corpora
Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition
ShabbyPages is a state-of-the-art corpus of born-digital document images with both ground-truth and distorted versions, suitable for training models to reverse distortions and recover the original, clean documents.