
🕷️ News crawler

A Playwright-powered, modular news crawler.

If you need web security consulting to protect your site from scraping, contact me at berwick.fr!


Supported newspaper   CRAWLER_SOURCE   Requires account
Le Monde              lemonde          Yes (Premium recommended)
Reuters               reuters          No

⚠️ DISCLAIMER: This project is for educational purposes only! Do NOT use it for any other purpose. It was developed as a fun side project to train my scraping skills.

Extraction

News-Crawler browses articles from newspaper websites and stores them in a SQLite database with the following fields (a query sketch follows the list):

  • Source
  • URL
  • Title
  • Headline
  • Article
  • Author
  • Images
  • Publication date
  • Language
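
For a quick sanity check of the output, here is a minimal sqlite3 sketch; the database path, table, and column names below are assumptions derived from the field list above (the real schema lives in ./news-crawler/models):

    import sqlite3

    # Inspect the first few crawled articles (hypothetical schema and path)
    conn = sqlite3.connect("database/news-crawler.db")  # assumed path
    for row in conn.execute(
        "SELECT source, url, title, publication_date FROM article LIMIT 5"
    ):
        print(row)
    conn.close()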

Getting started

Docker

  1. Copy and fill in your credentials in .env:

    git clone https://github.com/flavienbwk/news-crawler && cd news-crawler
    cp .env.example .env

    Edit CRAWLER_SOURCE, CRAWLER_EMAIL, and CRAWLER_PASSWORD to match your newspaper credentials.
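
    For example, a filled .env might look like this (the values below are placeholders):

    CRAWLER_SOURCE=lemonde
    CRAWLER_EMAIL=you@example.com
    CRAWLER_PASSWORD=your-password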

  2. Run container

    docker-compose run crawler

CLI

Requires Python >= 3.7 with pip installed

  1. Install dependencies

    git clone https://github.com/flavienbwk/news-crawler && cd news-crawler
    pip3 install -r requirements.txt

  2. Run CLI

    CRAWLER_SOURCE='...' CRAWLER_EMAIL='...' CRAWLER_PASSWORD='...' python3 ./scripts/crawler.py

Parameters

Name                             Type   Description
CRAWLER_SOURCE                   str    Slug of the crawler to use (e.g. lemonde, reuters)
CRAWLER_EMAIL                    str    Newspaper account email address
CRAWLER_PASSWORD                 str    Newspaper account password
START_LINK                       str    After login, start scraping articles from this page
RETRIEVE_RELATED_ARTICLE_LINKS   bool   Crawl links in the currently scraped article that point to related articles
RETRIEVE_EACH_ARTICLE_LINKS      bool   Crawl every article link present in the currently scraped article
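
As an illustration, a run that starts from a given page and follows related links might look like this (the START_LINK URL is a placeholder, and the exact boolean format the parser expects is an assumption):

    CRAWLER_SOURCE='lemonde' CRAWLER_EMAIL='you@example.com' CRAWLER_PASSWORD='...' \
    START_LINK='https://www.lemonde.fr/international/' \
    RETRIEVE_RELATED_ARTICLE_LINKS='true' \
    python3 ./scripts/crawler.py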

Development

Working principle flow

  1. Init: Playwright and flow initialization (browser context, logging)
  2. Login: manages the login strategy for the website, with periodic login checks
  3. Crawler: manages the crawling strategy and yields models.Article and models.Media objects
  4. Spider: retrieves data from the webpage
  5. Persister: manages the database persistence strategy
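
A minimal sketch of how these pieces compose. It uses Playwright's real sync API, but LemondeLogin, LemondeCrawler, and Persister are hypothetical stand-ins for the repository's classes, whose actual names and signatures live under ./news-crawler and may differ:

    import os
    from playwright.sync_api import sync_playwright

    # Hypothetical composition of the flow; the three classes below are
    # stand-ins for the repository's own Login/Crawler/Persister classes.
    with sync_playwright() as p:                      # 1. Init
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        login = LemondeLogin(page)                    # 2. Login
        login.authenticate(os.environ["CRAWLER_EMAIL"],
                           os.environ["CRAWLER_PASSWORD"])

        crawler = LemondeCrawler(page)                # 3. Crawler
        persister = Persister("database/news.db")     # 5. Persister
        for article, medias in crawler.crawl():       # 4. Spider yields models
            persister.save(article, medias)

        browser.close()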

Creating a new crawler

We recommend copying ./news-crawler/crawlers/lemonde to ./news-crawler/crawlers/yourcrawler, then editing:

  • Login.py
  • Crawler.py
  • Spider.py

Don't forget to add a reference to your crawler in ./news-crawler/crawlers/__init__.py, as sketched below.
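
A hypothetical sketch of that registration; the real structure of __init__.py (import paths, mapping name) may differ:

    # ./news-crawler/crawlers/__init__.py (hypothetical layout)
    from crawlers.lemonde.Crawler import Crawler as LemondeCrawler
    from crawlers.yourcrawler.Crawler import Crawler as YourCrawler

    # Maps each CRAWLER_SOURCE slug to its crawling flow
    CRAWLERS = {
        "lemonde": LemondeCrawler,
        "yourcrawler": YourCrawler,
    }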

Architecture

.
├── database                # Directory where article database and login cookies are saved
├── docker-compose.yml
├── Dockerfile
├── logs                    # Directory where logs are saved
├── news-crawler            # Source directory
│   ├── crawlers
│   │   ├── Crawler.py      # Abstract class with helper functions for implementing a crawler
│   │   ├── __init__.py     # Links CRAWLER_SOURCE to a crawling flow
│   │   ├── lemonde         # Implementation of a News Crawler flow for Le Monde
│   │   │   ├── Crawler.py  # Manages crawling strategy (which links to visit)
│   │   │   ├── Login.py    # Manages the website login strategy
│   │   │   └── Spider.py   # Scrapes article data from a provided link
│   │   ├── Login.py        # Abstract class with helper functions for implementing a login strategy
│   │   ├── Spider.py       # Abstract class with helper functions for implementing a scraping strategy
│   │   └── Types.py        # Shared typing definitions
│   ├── main.py             # Main script to run the News Crawler flow
│   ├── models              # Database models
│   └── utils               # Helper functions and classes
│       ├── crawl.py        # Crawl helper functions
│       ├── Database.py     # Database management class
│       ├── hash.py         # Hash-related helper functions
│       ├── Logger.py       # Logging class
│       └── Persister.py    # Class managing the persistence strategy
├── README.md
└── requirements.txt        # Python dependencies
