metawarc: a command-line tool for extracting metadata from files inside WARC (Web ARChive) archives

metawarc (pronounced me-ta-warc) is a command-line tool for processing WARC files. Its goal is to make CLI interaction with files inside WARC archives as easy as possible. It provides a simple metawarc command that extracts metadata from images, documents and other files inside WARC archives.

Contents

1 Main features
2 File formats supported
3 Installation
- 3.1 Any OS
- 3.2 Python version
4 Usage
- 4.1 Examples
5 Commands
- 5.1 Metadata command
- 5.2 Analyze command
- 5.3 Index command
- 5.4 Stats command
- 5.5 Export command
- 5.6 List command
- 5.7 Dump command

1 Main features

Built-in WARC support
Metadata extraction for many file formats
Low memory footprint
Documentation
Test coverage

2 File formats supported

MS Office OLE: .doc, .xls, .ppt
MS Office XML: .docx, .xlsx, .pptx
Adobe PDF: .pdf
Images: .png, .jpg, .tiff, .jpeg, .jp2

3 Installation

3.1 Any OS

A universal installation method (that works on Windows, Mac OS X, Linux, …, and always provides the latest version) is to use pip:

# Make sure we have an up-to-date version of pip and setuptools:
$ pip install --upgrade pip setuptools

$ pip install --upgrade metawarc

(If pip installation fails for some reason, you can try easy_install metawarc as a fallback.)

3.2 Python version

Python version 3.6 or greater is required.

4 Usage

Synopsis:

$ metawarc [command] [flags] inputfile

See also metawarc --help, and metawarc [command] --help for help on a specific command.

4.1 Examples

Extract metadata of all supported file types from 'digital.gov.ru.warc.gz' and output results to default filename 'metadata.jsonl':

$ metawarc metadata digital.gov.ru.warc.gz

Extract metadata for .doc and .docx file types from 'digital.gov.ru.warc.gz' and output results to default filename 'metadata.jsonl':

$ metawarc metadata --filetypes doc,docx digital.gov.ru.warc.gz

Extract metadata for .doc and .docx file types from 'digital.gov.ru.warc.gz' and output results to filename 'digital_meta.jsonl':

$ metawarc metadata --filetypes doc,docx --output digital_meta.jsonl digital.gov.ru.warc.gz
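
The JSON lines output can be post-processed with a few lines of Python. The following is a minimal sketch, not part of metawarc: it assumes nothing about the field names metawarc writes and only counts how often each top-level key appears in the file. The function name count_keys is hypothetical.

```python
import json
from collections import Counter


def count_keys(path):
    """Count how often each top-level key occurs in a JSON lines file."""
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            if isinstance(record, dict):
                counts.update(record.keys())
    return counts
```

For example, count_keys("metadata.jsonl").most_common(10) would list the ten most frequent keys, which is a quick way to see what the extracted metadata contains.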

5 Commands

5.1 Metadata command

Extracts metadata from files stored inside WARC files. Outputs the metadata of each file found as JSON lines.

Extract metadata for .doc and .docx file types from 'digital.gov.ru.warc.gz' and output results to filename 'digital_meta.jsonl':

$ metawarc metadata --filetypes doc,docx --output digital_meta.jsonl digital.gov.ru.warc.gz

5.2 Analyze command

Returns a list of MIME types with statistics for each type: the number of files and their total size. This command will be merged into or replaced by the 'stats' command, which uses an SQLite database to speed up data processing.

Analyzes 'digital.gov.ru.warc.gz' and prints the list of MIME types as a table to the console:

$ metawarc analyze digital.gov.ru.warc.gz
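
The kind of aggregation this command reports can be sketched in Python. This is an illustration under stated assumptions, not metawarc's actual implementation: mime_stats and iter_warc_responses are hypothetical names, the normalisation of the Content-Type value is a choice made for this sketch, and the second function assumes the third-party warcio package is installed.

```python
from collections import defaultdict


def mime_stats(records):
    """Aggregate (mime_type, size_in_bytes) pairs into per-type totals.

    Returns {mime_type: (file_count, total_size)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for mime, size in records:
        # Sketch-level choice: drop parameters such as "; charset=utf-8"
        # and lower-case the type; missing types are grouped as "unknown".
        key = (mime or "unknown").split(";")[0].strip().lower()
        totals[key][0] += 1
        totals[key][1] += size
    return {mime: (count, size) for mime, (count, size) in totals.items()}


def iter_warc_responses(path):
    """Yield (content_type, payload_size) for each HTTP response record.

    Requires the third-party warcio package (imported lazily here).
    """
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            mime = record.http_headers.get_header("Content-Type")
            # Reads the whole payload to measure it; fine for a sketch.
            size = len(record.content_stream().read())
            yield mime, size
```

With warcio available, mime_stats(iter_warc_responses("digital.gov.ru.warc.gz")) would produce a dictionary comparable in spirit to the table printed by the command.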

5.3 Index command

Generates the 'metawarc.db' SQLite database with the HTTP metadata of each record. Required by the 'stats' command to calculate statistics quickly.

Analyzes 'digital.gov.ru.warc.gz' and writes 'metawarc.db' with HTTP metadata.

$ metawarc index digital.gov.ru.warc.gz
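
Because 'metawarc.db' is an ordinary SQLite file, its layout can be inspected with Python's standard sqlite3 module. The sketch below is an assumption-free way to do that: it does not presume any table or column names and simply reports what it finds. The function name list_tables is hypothetical.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: [column names]} for an SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # Quote the identifier so unusual table names are handled.
            quoted = '"' + table.replace('"', '""') + '"'
            # PRAGMA table_info yields one row per column; index 1 is its name.
            columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
            schema[table] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()
```

Calling list_tables("metawarc.db") after running the index command shows which columns are available for the WHERE-clause queries accepted by the 'list' and 'dump' commands.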

5.4 Stats command

Same as the 'analyze' command but uses 'metawarc.db' to speed up data processing. Returns the total length and count of records for each MIME type or file extension.

Processes data in 'metawarc.db' and prints total length and count for each mime

$ metawarc stats -m mimes

Processes data in 'metawarc.db' and prints total length and count for each file extension

$ metawarc stats -m exts

5.5 Export command

Extracts HTTP headers, WARC headers or text content from a WARC file and saves them as an NDJSON (JSON lines) data file.

Exports HTTP headers from 'digital.gov.ru.warc.gz' and writes them to 'headers.jsonl'

$ metawarc export -t headers -o headers.jsonl digital.gov.ru.warc.gz

Exports WarcIO index from 'digital.gov.ru.warc.gz' and writes as 'data.jsonl' with fields listed in '-f' option.

$ metawarc export -t warcio -f offset,length,filename,http:status,http:content-type,warc-type,warc-target-uri -o data.jsonl digital.gov.ru.warc.gz

Exports text (HTML) content from 'digital.gov.ru.warc.gz' and writes as 'content.jsonl'

$ metawarc export -t content -o content.jsonl digital.gov.ru.warc.gz

5.6 List command

Prints a list of records with id, offset, length and URL using 'metawarc.db'. Accepts a list of MIME types, a list of file extensions, or a query given as an SQL WHERE clause.

Prints all records with mime type (content type) 'application/zip'

$ metawarc list -m 'application/zip'

Prints all records with file extensions 'xls' and 'xlsx'

$ metawarc list -e xls,xlsx

Prints all records with size greater than 10M and file extension 'pdf'

$ metawarc list -q 'content_length > 10000000 and ext = "pdf"'
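
To show how a WHERE clause narrows the record list, here is a small self-contained sketch using an in-memory SQLite database. The table name records, the column set and the sample rows are hypothetical and exist only for this illustration; only the column names content_length and ext are taken from the query above, and the real 'metawarc.db' schema may differ. The sketch uses single-quoted string literals, which is the standard SQL form.

```python
import sqlite3

# Hypothetical table and rows for illustration; the real 'metawarc.db'
# schema is not documented here and may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "url TEXT, content_type TEXT, ext TEXT, content_length INTEGER)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [
        ("http://example.org/big.pdf", "application/pdf", "pdf", 25_000_000),
        ("http://example.org/small.pdf", "application/pdf", "pdf", 40_000),
        ("http://example.org/data.zip", "application/zip", "zip", 90_000_000),
    ],
)


def select_urls(connection, where):
    """Return the URLs of rows matching a WHERE clause.

    The clause is pasted into the SQL text, so it must come from a
    trusted source (here: the person typing the command).
    """
    sql = f"SELECT url FROM records WHERE {where} ORDER BY url"
    return [row[0] for row in connection.execute(sql)]


big_pdfs = select_urls(conn, "content_length > 10000000 AND ext = 'pdf'")
zips = select_urls(conn, "content_type = 'application/zip'")
```

Here big_pdfs holds only the 25 MB PDF and zips holds only the ZIP archive: both conditions of the first clause must be true for a row to be selected.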

5.7 Dump command

Dumps record payloads as files using 'metawarc.db' as the WARC index. Accepts a list of MIME types, a list of file extensions, or a query given as an SQL WHERE clause. Adds a CSV file 'records.csv' to the output directory with basic data about each dumped record.

Dumps all records with mime type (content type) 'application/zip' to 'allzip' directory

$ metawarc dump -m 'application/zip' -o allzip

Dumps all records with file extensions 'xls' and 'xlsx' to 'sheets' directory

$ metawarc dump -e xls,xlsx -o sheets

Dumps all records with size greater than 10M and file extension 'pdf' to 'bigpdf' directory

$ metawarc dump -q 'content_length > 10000000 and ext = "pdf"' -o 'bigpdf'
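
After a dump, it can be useful to check what landed in the output directory. The sketch below is a hypothetical helper, not part of metawarc: it walks a directory and reports the number of files and total bytes per file extension. It makes no assumption about how metawarc names the dumped files, and it counts 'records.csv' like any other file.

```python
from collections import defaultdict
from pathlib import Path


def summarize_dir(directory):
    """Return {extension: (file_count, total_bytes)} for files under directory."""
    totals = defaultdict(lambda: [0, 0])
    for path in Path(directory).rglob("*"):
        if path.is_file():
            # Lower-case the suffix and drop the leading dot; files without
            # a suffix are grouped under "(none)".
            ext = path.suffix.lower().lstrip(".") or "(none)"
            totals[ext][0] += 1
            totals[ext][1] += path.stat().st_size
    return {ext: (count, size) for ext, (count, size) in totals.items()}
```

For instance, summarize_dir("sheets") after the second dump example would be expected to show entries for the spreadsheet extensions plus one "csv" entry for 'records.csv'.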
