🎙️ TED Talks web scraper

corralm/ted-scraper

[Deprecated] Use TEDscraper2 instead.

TEDscraper

Scrape TED talk data, including transcripts in over 100 languages, from TED.com.

Requirements

Python 3
Beautiful Soup 4
fake-useragent
lxml
Pandas
Requests

Usage

# move to TEDscraper directory
# import module (or use Jupyter Notebook)
from TEDscraper import TEDscraper

# instantiate the scraper & pass in optional arguments
scraper = TEDscraper(lang_code='en', urls='all', topics='all')

# scrape the data and save it to a dictionary
ted_dict = scraper.get_data()

# transform the dictionary to a sorted pandas DataFrame
df = scraper.to_dataframe(ted_dict)

# output DataFrame as CSV
df.to_csv('../data/ted_talks.csv', index=False)

For a list of other output formats, see the Pandas docs.
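As a rough sketch of two alternatives to CSV, the snippet below writes a DataFrame to JSON and to pickle. The small DataFrame is a made-up stand-in for the one returned by scraper.to_dataframe(), and the file names and temporary directory are assumptions chosen for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the DataFrame returned by scraper.to_dataframe()
df = pd.DataFrame({
    'talk_id': [1, 2],
    'title': ['Example talk A', 'Example talk B'],
    'views': [1000, 2500],
})

# Write to a temporary directory so the example leaves no files behind in the repo
out_dir = tempfile.mkdtemp()

# JSON, one record per talk; force_ascii=False keeps non-English text readable
json_path = os.path.join(out_dir, 'ted_talks.json')
df.to_json(json_path, orient='records', force_ascii=False)

# Pickle keeps Python objects (dict and list columns) intact on reload
pkl_path = os.path.join(out_dir, 'ted_talks.pkl')
df.to_pickle(pkl_path)
restored = pd.read_pickle(pkl_path)
```

Pickle may be the more convenient choice when the dictionary and list columns need to survive a save and reload unchanged, since CSV stores them as plain strings.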

Parameters

  • lang_code
    • English is the default language: lang_code='en'
    • You can pass in another language code using the lang_code param
    • TED translators don't always translate all features
      • Ex: Title and 'About Speaker' might be in English while the transcript is translated to French
  • urls
    • All URLs are scraped by default for the selected language: urls='all'
    • You may pass in a list of URLs. However, there are a few limitations:
      • TED must have the talks available in the language you specify
      • Only one language can be provided per scrape call
  • topics
    • All topics are scraped by default: topics='all'
    • You may pass in a list of topics to filter by them
  • force_fetch
    • Talks with known issues are skipped by default: force_fetch=False
    • Set it to True to attempt to scrape them
    • See talks with known issues
  • exclude_transcript
    • All features are scraped by default: exclude_transcript=False
    • Set it to True to exclude the transcript
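The parameters above can be combined in a single call. The sketch below collects them in a dictionary first; the topic names are hypothetical placeholders, and the commented-out lines assume the TEDscraper module is importable from the current directory.

```python
# Keyword arguments named in the parameter list above.
# The topic names are hypothetical; use topics as they appear on TED.com.
scraper_kwargs = {
    'lang_code': 'fr',                     # one language per scrape call
    'urls': 'all',                         # or a list of talk URLs
    'topics': ['science', 'technology'],   # or 'all'
    'force_fetch': False,                  # boolean, not the string 'False'
    'exclude_transcript': True,            # skip transcripts
}

# From the TEDscraper directory:
# from TEDscraper import TEDscraper
# scraper = TEDscraper(**scraper_kwargs)
# ted_dict = scraper.get_data()
```

Note that force_fetch and exclude_transcript take Python booleans rather than quoted strings.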

Attributes

Attribute      | Description                                          | Data Type
talk_id        | Talk identification number provided by TED           | int
title          | Title of the talk                                    | string
speaker_1      | First speaker in TED's speaker list                  | string
speakers       | Speakers in the talk                                 | dictionary
occupations    | *Occupations of the speakers                         | dictionary
about_speakers | *Blurb about each speaker                            | dictionary
views          | Count of views                                       | int
recorded_date  | Date the talk was recorded                           | string
published_date | Date the talk was published to TED.com               | string
event          | Event or medium in which the talk was given          | string
native_lang    | Language the talk was given in                       | string
available_lang | All available languages (lang_code) for a talk       | list
comments       | Count of comments                                    | int
duration       | Duration in seconds                                  | int
topics         | Related tags or topics for the talk                  | list
related_talks  | Related talks (key='talk_id', value='title')         | dictionary
url            | URL of the talk                                      | string
description    | Description of the talk                              | string
transcript     | Full transcript of the talk                          | string

*The dictionary key maps to the speaker in ‘speakers’.
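Because speakers and occupations share dictionary keys, they can be joined per talk. The following is a minimal sketch on one made-up row. It assumes the dictionary columns were read back from CSV as strings, that keys are integers, that each occupations value is a list, and that recorded_date is an ISO-style date string; none of these details are confirmed by the table above.

```python
import ast

import pandas as pd

# One made-up row shaped like the attribute table. The dictionary columns are
# strings here, as they would be after a round trip through CSV.
df = pd.DataFrame({
    'talk_id': [1],
    'speakers': ["{0: 'Speaker A', 1: 'Speaker B'}"],
    'occupations': ["{0: ['author'], 1: ['designer', 'educator']}"],
    'duration': [754],
    'recorded_date': ['2019-04-15'],
})

# Turn the string columns back into Python dictionaries
for col in ['speakers', 'occupations']:
    df[col] = df[col].apply(ast.literal_eval)


def pair_speakers(speakers, occupations):
    """Map each speaker name to the occupations stored under the same key."""
    return {name: occupations.get(key, []) for key, name in speakers.items()}


df['speaker_occupations'] = [
    pair_speakers(s, o) for s, o in zip(df['speakers'], df['occupations'])
]

# Convenience columns: duration in minutes and a parsed recording date
df['duration_min'] = df['duration'] / 60
df['recorded_date'] = pd.to_datetime(df['recorded_date'])
```

ast.literal_eval is used instead of eval because it only accepts Python literals, which is safer for data scraped from the web.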

Languages

TED talks have been subtitled in over 100 languages. Here are the top languages:

Code  | Language
en    | English
es    | Spanish
pt-br | Portuguese (Brazilian)
fr    | French
it    | Italian
zh-cn | Chinese (simplified)
zh-tw | Chinese (traditional)
ko    | Korean
ja    | Japanese
tr    | Turkish
ru    | Russian
he    | Hebrew

Here is a link to all language codes available as of May 2020.

You can see all the talks for each language at TED – Our Languages.
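Since only one language can be provided per scrape call, collecting several languages means running one scrape per code. A possible sketch follows; scrape_languages is a hypothetical helper, not part of TEDscraper, and it assumes the class passed in exposes the get_data() and to_dataframe() methods shown under Usage.

```python
def scrape_languages(scraper_cls, lang_codes, **kwargs):
    """Run one scrape per language code and return {lang_code: DataFrame}.

    scraper_cls is expected to be the TEDscraper class; extra keyword
    arguments (urls, topics, ...) are passed through unchanged.
    """
    frames = {}
    for code in lang_codes:
        scraper = scraper_cls(lang_code=code, **kwargs)
        ted_dict = scraper.get_data()
        frames[code] = scraper.to_dataframe(ted_dict)
    return frames


# From the TEDscraper directory:
# from TEDscraper import TEDscraper
# frames = scrape_languages(TEDscraper, ['en', 'es', 'pt-br'])
```

Keeping one DataFrame per language avoids mixing rows where some features, such as the title, may remain untranslated.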

Meta

Author: Miguel Corral Jr.
Email: corraljrmiguel@gmail.com
LinkedIn: https://www.linkedin.com/in/iMiguel
GitHub: https://github.com/corralm

Distributed under the MIT license. See LICENSE for more information.