Playwright-powered modular news crawler.
If you need web security consulting to protect your site against scraping, contact me on berwick.fr!
Supported newspaper | CRAWLER_SOURCE | Requires account |
---|---|---|
Le Monde | lemonde | Yes (Premium recommended) |
Reuters | reuters | No |
News-Crawler browses articles from newspaper websites and stores them in a SQLite database with the following fields:
- Source
- URL
- Title
- Headline
- Article
- Author
- Images
- Publication date
- Language
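
As an illustration of how the fields above could map onto a SQLite table, here is a minimal sketch. The table name, column names, column types and the `save_article` helper are assumptions made for this example; they are not taken from the project's actual models.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not the project's real database models.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source TEXT NOT NULL,
    url TEXT NOT NULL UNIQUE,
    title TEXT,
    headline TEXT,
    article TEXT,
    author TEXT,
    images TEXT,            -- assumed: JSON-encoded list of image URLs
    publication_date TEXT,  -- assumed: ISO 8601 string
    language TEXT
);
"""

INSERT_SQL = (
    "INSERT OR IGNORE INTO articles "
    "(source, url, title, headline, article, author, images, "
    "publication_date, language) "
    "VALUES (:source, :url, :title, :headline, :article, :author, :images, "
    ":publication_date, :language)"
)


def save_article(conn, record):
    """Insert one article; return False if its URL was already stored."""
    cursor = conn.execute(INSERT_SQL, record)
    conn.commit()
    return cursor.rowcount == 1


# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

record = {
    "source": "reuters",
    "url": "https://example.com/some-article",
    "title": "Example title",
    "headline": "Example headline",
    "article": "Body text of the article.",
    "author": "Jane Doe",
    "images": "[]",
    "publication_date": "2022-01-01T00:00:00",
    "language": "en",
}
first_insert = save_article(conn, record)
```

The `UNIQUE` constraint on `url` combined with `INSERT OR IGNORE` is one simple way to avoid storing the same article twice when a crawl revisits a page.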
- Copy .env.example to .env and fill in your credentials:

      git clone https://github.com/flavienbwk/news-crawler && cd news-crawler
      cp .env.example .env

  Edit CRAWLER_SOURCE, CRAWLER_EMAIL and CRAWLER_PASSWORD to match your newspaper credentials.

- Run the container:

      docker-compose run crawler

Requires Python >= 3.7 and pip installed
- Install dependencies:

      git clone https://github.com/flavienbwk/news-crawler && cd news-crawler
      pip3 install -r requirements.txt

- Run the CLI:

      CRAWLER_SOURCE='...' CRAWLER_EMAIL='...' CRAWLER_PASSWORD='...' python3 ./scripts/crawler.py

Name | Type | Description |
---|---|---|
CRAWLER_SOURCE | str | Slug corresponding to the crawler to use (e.g. lemonde, reuters) |
CRAWLER_EMAIL | str | Newspaper email address |
CRAWLER_PASSWORD | str | Newspaper password |
START_LINK | str | After login, start scraping articles from this page |
RETRIEVE_RELATED_ARTICLE_LINKS | bool | Crawl links in currently scraped article pointing to other similar articles |
RETRIEVE_EACH_ARTICLE_LINKS | bool | Crawl all article links present in the currently scraped article |
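
The variables in this table could be read into a typed configuration object along the following lines. This is only a sketch: `CrawlerConfig`, `env_bool`, `load_config`, the accepted boolean spellings and the defaults are all assumptions for illustration, not the project's actual parsing logic.

```python
import os
from dataclasses import dataclass

# Assumed spellings accepted as "true"; the project may parse booleans differently.
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_bool(name, default=False, environ=os.environ):
    """Read a boolean flag from the environment, case-insensitively."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUE_VALUES


@dataclass
class CrawlerConfig:
    source: str
    email: str
    password: str
    start_link: str
    retrieve_related_article_links: bool
    retrieve_each_article_links: bool


def load_config(environ=os.environ):
    """Build a CrawlerConfig; CRAWLER_SOURCE is treated as mandatory here."""
    return CrawlerConfig(
        source=environ["CRAWLER_SOURCE"],
        email=environ.get("CRAWLER_EMAIL", ""),
        password=environ.get("CRAWLER_PASSWORD", ""),
        start_link=environ.get("START_LINK", ""),
        retrieve_related_article_links=env_bool(
            "RETRIEVE_RELATED_ARTICLE_LINKS", environ=environ
        ),
        retrieve_each_article_links=env_bool(
            "RETRIEVE_EACH_ARTICLE_LINKS", environ=environ
        ),
    )


# Example with an explicit mapping instead of the real process environment.
example_env = {
    "CRAWLER_SOURCE": "reuters",
    "START_LINK": "https://example.com/world",
    "RETRIEVE_RELATED_ARTICLE_LINKS": "True",
}
config = load_config(example_env)
```

Passing the mapping explicitly keeps the example independent of whatever is set in the current shell; calling `load_config()` with no argument reads `os.environ`.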
- Init : Playwright and flow initialization (browser context, logging)
- Logins : manage login strategy to website with periodic login check capabilities
- Crawlers : manage crawling strategy and yield models.Article and models.Media objects
- Spiders : retrieve data from webpage
- Persister : manage database persistence strategy
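
The way these components hand work to each other can be sketched as follows. Every class and method here (`Article`, `Login.ensure_logged_in`, `Spider.scrape`, `Crawler.crawl`, `Persister.persist`, `run`) is a simplified stand-in invented for this example; the real flow drives a Playwright browser and persists to SQLite, both of which are left out so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Article:
    """Stand-in for models.Article, reduced to two fields."""
    url: str
    title: str


class Login:
    """Stand-in login strategy: remembers whether a session exists."""

    def __init__(self) -> None:
        self.logged_in = False
        self.login_count = 0

    def ensure_logged_in(self) -> None:
        # A real implementation would check the page and re-login if needed.
        if not self.logged_in:
            self.logged_in = True
            self.login_count += 1


class Spider:
    """Stand-in spider: a real one would extract data from the web page."""

    def scrape(self, url: str) -> Article:
        return Article(url=url, title=f"Title of {url}")


class Crawler:
    """Decides which links to visit and yields scraped articles."""

    def __init__(self, login: Login, spider: Spider, links: List[str]) -> None:
        self.login = login
        self.spider = spider
        self.links = links

    def crawl(self) -> Iterator[Article]:
        for url in self.links:
            self.login.ensure_logged_in()  # periodic login check
            yield self.spider.scrape(url)


class Persister:
    """Stand-in persister: collects articles in memory instead of a database."""

    def __init__(self) -> None:
        self.saved: List[Article] = []

    def persist(self, article: Article) -> None:
        self.saved.append(article)


def run(crawler: Crawler, persister: Persister) -> int:
    """Drive the flow: each crawled article goes straight to the persister."""
    for article in crawler.crawl():
        persister.persist(article)
    return len(persister.saved)


login = Login()
persister = Persister()
crawler = Crawler(login, Spider(), ["https://example.com/a", "https://example.com/b"])
saved_count = run(crawler, persister)
```

Using a generator for `crawl` means articles are persisted one at a time as they are scraped, so an interrupted run keeps everything collected so far.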
We recommend copying ./news-crawler/crawlers/lemonde to ./news-crawler/crawlers/yourcrawler, then editing:

- Login.py
- Crawler.py
- Spider.py

Don't forget to add a reference to your crawler in ./news-crawler/crawlers/__init__.py.
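
One possible shape for such a reference is a registry that maps each CRAWLER_SOURCE slug to its login, crawler and spider classes. The sketch below is an assumption about how this could look, not the actual content of `__init__.py`: `CrawlerFlow`, `CRAWLERS`, `get_flow` and the placeholder classes are hypothetical names.

```python
from typing import Dict, NamedTuple


# Placeholder classes standing in for the ones defined in each crawler package.
class LemondeLogin: ...
class LemondeCrawler: ...
class LemondeSpider: ...
class YourLogin: ...
class YourCrawler: ...
class YourSpider: ...


class CrawlerFlow(NamedTuple):
    """Groups the three classes that make up one newspaper's flow."""
    login: type
    crawler: type
    spider: type


# Hypothetical registry: one entry per supported CRAWLER_SOURCE slug.
CRAWLERS: Dict[str, CrawlerFlow] = {
    "lemonde": CrawlerFlow(LemondeLogin, LemondeCrawler, LemondeSpider),
    "yourcrawler": CrawlerFlow(YourLogin, YourCrawler, YourSpider),
}


def get_flow(source: str) -> CrawlerFlow:
    """Look up the flow for a slug, with a readable error for unknown slugs."""
    try:
        return CRAWLERS[source]
    except KeyError:
        known = ", ".join(sorted(CRAWLERS))
        raise ValueError(
            f"Unknown CRAWLER_SOURCE {source!r}; expected one of: {known}"
        ) from None


flow = get_flow("yourcrawler")
```

With this layout, adding a newspaper only requires one new dictionary entry, and a mistyped slug fails early with a message listing the valid values.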
├── database # Directory where article database and login cookies are saved
├── docker-compose.yml
├── Dockerfile
├── logs # Directory where logs are saved
├── news-crawler # Source directory
│   ├── crawlers
│   │   ├── Crawler.py # Abstract class containing helper functions to re-implement a crawler
│   │   ├── __init__.py # Links CRAWLER_SOURCE to a crawling flow
│   │   ├── lemonde # Implementation of a News Crawler flow for Le Monde
│   │   │   ├── Crawler.py # Manages crawling strategy (which links to visit)
│   │   │   ├── Login.py # Manages website login strategy
│   │   │   └── Spider.py # Scrapes article data from provided link
│   │   ├── Login.py # Abstract class containing helper functions to re-implement a login strategy
│   │   ├── Spider.py # Abstract class containing helper functions to re-implement a scraping strategy
│   │   └── Types.py # Type aliases
│   ├── main.py # Main script to run News Crawler flow
│   ├── models # Database models
│   └── utils # Helper functions and classes
│       ├── crawl.py # Crawl helper functions
│       ├── Database.py # Database management class
│       ├── hash.py # Hash-related helper functions
│       ├── Logger.py # Logging class
│       └── Persister.py # Class to manage persistence strategy
├── README.md
└── requirements.txt # Python requirements