An Awesome List for getting started with web archiving
Updated May 8, 2024
Wayback Machine API interface & a command-line tool
Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium with or without a head
Parse and create Web ARChive (WARC) files with Node.js
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
A dockerized, queued, high-fidelity web archiver based on Squidwarc
A list of things related to software, literature, and other content for 🕣 Memento
Various Jupyter notebooks about Common Crawl data
An archival thumbnail visualization server
Quick Cache and Archive search buttons
Decentralized web archiving
Digital Preservation of HTTP in documentary heritage.
Parser for WARC (aka WebArchive) files
Awesome list dedicated to digital and data preservation tools, sources, services and so on.
Seeder - Czech webarchive curating tool and public site
A social media open post web archiving tool
News Archiver, Data Aggregation for CNN and Fox News
Parse CDXJ (https://github.com/oduwsdl/ORS/wiki/CDXJ) files with Node.js
Makes saving pages in bulk to the Wayback Machine much easier