Y Combinator Directory Scraper

I built YC-Scraper to create a dataset of all the companies in the Y Combinator directory. You can search for companies by industry, region, company size, and more in this directory.

About Y Combinator

Y Combinator is a startup accelerator that has invested in over 4,000 companies that have a combined valuation of over $600B. The overall goal of Y Combinator is to help startups really take off.

Requirements

You must have Firefox and geckodriver installed. You can install geckodriver by running brew install geckodriver.

Python packages include:

Scrapy
Selenium
tqdm
Pandas (optional)

Usage

Clone this repository
Move to the yc-scraper directory
Run python yc_links_extractor.py. This will fetch the individual URLs for the spider to crawl.
Run scrapy runspider scrapy-project/ycombinator/spiders/yscraper.py -o output.jl. This generates a JSON lines file which you can read with Pandas:

import pandas as pd
df = pd.read_json('./output.jl', lines=True)

Dataset

Check out the dataset I published on Kaggle.com.

Attributes

Attribute	Description	Data Type
company_id	Company id provided by YC	int
company_name	Company name	string
short_description	One-line description of the company	string
long_description	Long description of the company	string
batch	Batch name provided by YC	string
status	Company status	string
tags	Industry tags	list
location	Company location	string
country	Company country	string
year_founded	Year the company was founded	int
num_founders	Number of founders	int
founders_names	Full names of the founders	list
team_size	Number of employees	int
website	Company website	string
cb_url	Company Crunchbase url	string
linkedin_url	Company LinkedIn url	string

Sample Data

Note: I excluded 'short_description', 'long_description', 'cb_url', and 'linkedin_url' in the sample data for brevity.

company_id	company_name	short_description	batch	status	tags	location	country	year_founded	num_founders	founders_names	team_size	website
240	Stripe	Economic infrastructure for the internet.	S09	Active	['Fintech', 'Banking as a Service', 'SaaS']	San Francisco	US		2	['John Collison', 'Patrick Collison']	7000	http://stripe.com
271	Airbnb	Book accommodations around the world.	W09	Public	['Travel', 'Marketplace']	San Francisco	US	2008	3	['Nathan Blecharczyk', 'Brian Chesky', 'Joe Gebbia']	6132	http://airbnb.com
325	Dropbox	Backup and share files in the cloud.	S07	Public	[]	San Francisco	US	2008	2	['Arash Ferdowsi', 'Drew Houston']	4000	http://dropbox.com
379	Reddit	The frontpage of the internet.	S05	Acquired	['Community', 'Social', 'Social Media', 'Social Network']	San Francisco	US		1	['Steve Huffman']	201	http://reddit.com
439	Coinbase	Buy, sell, and manage cryptocurrencies.	S12	Public	['Crypto / Web3']	San Francisco	US	2012	1	['Brian Armstrong']	6112	https://www.coinbase.com
531	DoorDash	Restaurant delivery.	S13	Public	['E-commerce', 'Marketplace']	San Francisco	US	2013	3	['Andy Fang', 'Stanley Tang', 'Tony Xu']	8600	http://doordash.com

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
data		data
scrapy-project		scrapy-project
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
yc_links_extractor.py		yc_links_extractor.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

scrapy-project

scrapy-project

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

requirements.txt

requirements.txt

yc_links_extractor.py

yc_links_extractor.py

Repository files navigation

Y Combinator Directory Scraper

About Y Combinator

Requirements

Usage

Dataset

Attributes

Sample Data

Meta

About

Languages

License

corralm/yc-scraper

Folders and files

Latest commit

History

Repository files navigation

Y Combinator Directory Scraper

About Y Combinator

Requirements

Usage

Dataset

Attributes

Sample Data

Meta

About

Topics

Resources

License

Stars

Watchers

Forks

Languages