Python ETL Pipeline that web-scrapes up-to-date NBA data from multiple sources, then statistically analyzes and visualizes into multiple team, player and league-wide reports.

petermartens98/NBA-Analytics-Pipeline

NBA-Analytics-Pipeline

Python ETL pipeline that web-scrapes up-to-date NBA data from multiple sources, statistically analyzes and visualizes it in multiple team, player, and league-wide reports, and builds predictive models and statistical correlations across various stats.

Related Projects

Link to NBA Flask Applications repo: https://github.com/petermartens98/NBA-Flask-Applications

Link to NBA Shooting Heatmaps repo: https://github.com/petermartens98/NBA-Shooting-Heatmaps

Link to NBA Performance vs Salary Regression Analysis repo: https://github.com/petermartens98/Salary-and-Performance-Regression-Analysis-for-2012-to-2018-NBA-Data

File Descriptions and Example Screenshots

NBA_Injuries_Webscraping.ipynb

Imports Utilized: Pandas, Selenium, BeautifulSoup

This is a Python function that scrapes the daily NBA injury report from the CBS Sports website and returns the data as a Pandas DataFrame. The function uses the BeautifulSoup and Selenium libraries to parse the HTML and interact with the website.

The function starts by setting some options for the Selenium webdriver, including running in headless mode (without opening a visible browser window). It then defines the URL to scrape and the location of the Chrome driver on the user's computer.

The function then creates a new webdriver instance, sets a page load timeout, and navigates to the specified URL. It retrieves the page source HTML and uses BeautifulSoup to find the sections of the page containing injury data for each team.

For each team, the function loops through the player injury data and creates a dictionary of the relevant fields (team, player name, position, injury, and status). It then appends this dictionary to a list of all player data for all teams.

Once all the data has been collected, the function quits the webdriver and returns the data as a sorted Pandas DataFrame, with one row per player injury.
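The per-team loop described above can be sketched as follows. This is a minimal illustration, not the notebook's code: it parses a static HTML string instead of a Selenium page source, and the markup, the class names (`team-section`, `team-name`), the column names, and the sort order are all assumptions made for the example. The real CBS Sports page is structured differently.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup standing in for the page source that Selenium would return.
SAMPLE_HTML = """
<div class="team-section">
  <h4 class="team-name">Boston</h4>
  <table>
    <tr><td>Player B</td><td>C</td><td>Knee</td><td>Out</td></tr>
    <tr><td>Player A</td><td>PG</td><td>Ankle</td><td>Day-to-day</td></tr>
  </table>
</div>
<div class="team-section">
  <h4 class="team-name">Atlanta</h4>
  <table>
    <tr><td>Player C</td><td>SF</td><td>Back</td><td>Out</td></tr>
  </table>
</div>
"""

COLUMNS = ["Team", "Player", "Position", "Injury", "Status"]


def parse_injuries(html: str) -> pd.DataFrame:
    """Build one row per injured player from team sections in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for section in soup.select("div.team-section"):
        team = section.select_one(".team-name").get_text(strip=True)
        for tr in section.select("tr"):
            cells = [td.get_text(strip=True) for td in tr.select("td")]
            if len(cells) != 4:
                continue  # skip header or malformed rows
            player, position, injury, status = cells
            rows.append({
                "Team": team,
                "Player": player,
                "Position": position,
                "Injury": injury,
                "Status": status,
            })
    df = pd.DataFrame(rows, columns=COLUMNS)
    # Sorting by team, then player, is an assumption about the sort order.
    return df.sort_values(["Team", "Player"]).reset_index(drop=True)


injuries = parse_injuries(SAMPLE_HTML)
print(injuries)
```

In a live run, the string passed to `parse_injuries` would be the webdriver's page source, and the selectors would need to match the actual page.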

NBA_Live_Scores_Webscraping.ipynb

Imports Utilized: Pandas, NumPy, Requests, BeautifulSoup, and Selenium

This code defines a function called "today_matchups" that uses web scraping to retrieve information about NBA games that are being played today from ESPN's website.

The function begins by creating a URL using the current date, which is obtained using the datetime module. A Chrome webdriver is then set up with Selenium and the page is loaded using the URL. The page source HTML is then parsed using BeautifulSoup.

The function then retrieves the date and day of the week using datetime, and finds all of the divs on the page that contain information about each game. It iterates over these divs to extract relevant data, such as the teams playing, the time of the game, the current score, and the betting odds. The data is stored in a dictionary for each game, and all of these dictionaries are appended to a list called "games_data".

Once all of the data has been extracted, the Chrome webdriver is closed to avoid leaking resources, and the data is returned as a Pandas DataFrame.

Overall, this function retrieves the latest information about NBA games being played today, and stores it in a DataFrame that can be used for further analysis or visualization.
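The date-handling step can be sketched with the standard library alone. The URL pattern below is an assumption about ESPN's scoreboard address, and `scoreboard_url` and `date_labels` are hypothetical helper names, not functions from the notebook.

```python
from datetime import date
from typing import Dict, Optional

# Assumed URL pattern for ESPN's daily NBA scoreboard.
ESPN_SCOREBOARD = "https://www.espn.com/nba/scoreboard/_/date/{yyyymmdd}"


def scoreboard_url(day: Optional[date] = None) -> str:
    """Return the scoreboard URL for the given day (defaults to today)."""
    day = day or date.today()
    return ESPN_SCOREBOARD.format(yyyymmdd=day.strftime("%Y%m%d"))


def date_labels(day: date) -> Dict[str, str]:
    """Return the ISO date and weekday name stored alongside each game."""
    return {"date": day.isoformat(), "weekday": day.strftime("%A")}


print(scoreboard_url(date(2023, 3, 1)))
print(date_labels(date(2023, 3, 1)))
```

Wrapping the Selenium session in `try`/`finally` with `driver.quit()` in the `finally` branch is one way to make sure the browser is released even if parsing fails.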

NBA_Players_Webscraping_to_SQLite.ipynb

Imports Utilized: SQLite3, Pandas

This code is a Python script that demonstrates webscraping data from a website, storing it in a Pandas DataFrame, and then inserting that data into an SQLite database. Specifically, the script scrapes NBA player data from ESPN.com for all teams in the league, stores it in a DataFrame, and then converts certain columns to numeric values (height to inches, weight to pounds, and salary to an integer). The script then creates an SQLite database with a table named "NBA_Players" and inserts the player data from the DataFrame into that table. Finally, the script commits the changes to the database and closes any open cursors.
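A rough sketch of the conversion and insert steps follows. It uses only the standard library (plain tuples instead of a DataFrame) to stay self-contained. The raw string formats (`6' 7"`, `220 lbs`, `$1,234,567`), the column names, and the helper functions are assumptions for illustration; only the table name `NBA_Players` comes from the description above.

```python
import re
import sqlite3
from typing import Optional


def height_to_inches(text: str) -> Optional[int]:
    """Convert a height such as 6' 7" to total inches."""
    match = re.match(r"\s*(\d+)'\s*(\d+)", text)
    if not match:
        return None
    feet, inches = map(int, match.groups())
    return feet * 12 + inches


def weight_to_pounds(text: str) -> Optional[int]:
    """Extract the numeric weight from a string such as '220 lbs'."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def salary_to_int(text: str) -> Optional[int]:
    """Strip '$' and ',' from a salary; return None when no digits are present."""
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None


# Made-up rows standing in for the scraped roster data.
raw_players = [
    ("Player A", "6' 7\"", "220 lbs", "$1,234,567"),
    ("Player B", "7' 0\"", "250 lbs", "--"),
]

conn = sqlite3.connect(":memory:")  # a file path would be used for a persistent database
conn.execute(
    "CREATE TABLE NBA_Players "
    "(name TEXT, height_in INTEGER, weight_lb INTEGER, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO NBA_Players VALUES (?, ?, ?, ?)",
    [
        (name, height_to_inches(h), weight_to_pounds(w), salary_to_int(s))
        for name, h, w, s in raw_players
    ],
)
conn.commit()
stored = conn.execute(
    "SELECT name, height_in, weight_lb, salary FROM NBA_Players ORDER BY name"
).fetchall()
conn.close()
print(stored)
```

With a DataFrame in hand, `DataFrame.to_sql("NBA_Players", conn)` is an alternative to the explicit `INSERT`.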

NBA_Score_Predictions_Pipeline_V1.ipynb

Imports Utilized: Requests, BeautifulSoup, Selenium, Pandas, Matplotlib, StatsModels, NumPy, Math, Statistics, Random

image

NBA_Team_Analytics_Pipeline_V2.ipynb

Data Webscraping

This code segment scrapes NBA boxscore data for the 2022-2023 regular season from stats.nba.com and basketball-reference.com using a request to a specific URL with parameters to filter the data. It retrieves the data in JSON format and converts it into a pandas DataFrame. The DataFrame is then extended with columns for field goals made and attempted, points, opponent team, and opponent points, along with more advanced statistics such as shot distance and shot type. It also adds columns for the team's conference, whether the game was played at home or away, and a formatted date and matchup string.
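Deriving the opponent and home/away columns can be sketched as below. The sketch assumes the matchup string follows the `"BOS vs. LAL"` (home) and `"BOS @ LAL"` (away) convention, and that a plus/minus value is available so opponent points can be computed as points minus plus/minus. `parse_matchup` and `opponent_points` are hypothetical helper names, not functions from the notebook.

```python
from typing import Dict, Union


def parse_matchup(matchup: str) -> Dict[str, Union[str, bool]]:
    """Split a matchup string into team, opponent, and a home flag."""
    if " vs. " in matchup:
        team, opponent = matchup.split(" vs. ")
        home = True
    elif " @ " in matchup:
        team, opponent = matchup.split(" @ ")
        home = False
    else:
        raise ValueError(f"Unrecognized matchup format: {matchup!r}")
    return {"team": team, "opponent": opponent, "home": home}


def opponent_points(points: int, plus_minus: int) -> int:
    """Team plus/minus is points scored minus points allowed."""
    return points - plus_minus


print(parse_matchup("BOS vs. LAL"))
print(parse_matchup("BOS @ LAL"))
print(opponent_points(110, -5))
```

Applied to a DataFrame, the same logic could run row by row with `DataFrame.apply` or be vectorized with string methods.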

Correlation Heatmap for Team Average DF

image

Points Scored vs Rebounds Gained Regression Analysis

image
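For readers unfamiliar with the statistics behind a plot like this, here is a minimal sketch of a single-predictor least-squares fit with its R² value. It is hand-rolled for illustration and is not the notebook's implementation; the numbers are made up, not NBA data.

```python
from typing import Sequence, Tuple


def simple_linear_regression(
    x: Sequence[float], y: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (slope, intercept, r_squared) for an ordinary least-squares line."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, 1.0 - ss_res / ss_tot


# Made-up rebounds (x) and points (y) per game, for illustration only.
rebounds = [40, 42, 44, 46]
points = [101, 105, 109, 113]
slope, intercept, r_squared = simple_linear_regression(rebounds, points)
print(slope, intercept, r_squared)
```

The slope estimates how many additional points accompany each additional rebound, and R² indicates how much of the variation in points the line explains.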

Points Scored vs Assists Gained (Wins vs Losses) Regression Analysis

image

Visualization Functions

def nba_fg_by_dist() - visualize fg% by differing distances for the whole NBA at a given time

or

def team_fg_by_dist(abbr) - visualize fg% by differing distances for a given team at a given time

image

def NBA_stat_boxplots(stat, sort_by='mean', asc=True) - compare teams with side-by-side boxplots for a given stat

image

def plus_minus_plot(team_abbr)

image

def scored_allowed_compare(team_a_abbr, team_b_abbr)

image

def rebounds_compares()

image

def line_plot_scores()

image

def trend_plot_scores()

image

def shot_pies() ~ Scoring Distribution

image

Team Average Regression (statx vs staty) Plot and Analysis Function

image

def multi_len_reg()

image

3D Scatter Plot Function

image

R2 Comparison from Y Function

image

def team_stat_hist_compare()

image

def team_stat_kde_compare()

image

Gaussian Game Simulations Function

image
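One way such a simulation could work is sketched below: each team's score is drawn from a normal distribution fitted to its past scores, and the share of simulated games won is reported. This is an assumption about the general approach, not the notebook's code. The function name, parameters, and sample scores are hypothetical.

```python
import random
from statistics import mean, stdev
from typing import Optional, Sequence


def simulate_win_probability(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_sims: int = 10_000,
    seed: Optional[int] = None,
) -> float:
    """Estimate how often team A outscores team B under a Gaussian score model."""
    rng = random.Random(seed)  # a local generator keeps runs reproducible
    mu_a, sd_a = mean(scores_a), stdev(scores_a)
    mu_b, sd_b = mean(scores_b), stdev(scores_b)
    wins_a = sum(
        rng.gauss(mu_a, sd_a) > rng.gauss(mu_b, sd_b) for _ in range(n_sims)
    )
    return wins_a / n_sims


# Made-up recent scores, for illustration only.
team_a_scores = [120, 122, 118, 121]
team_b_scores = [100, 98, 102, 101]
probability = simulate_win_probability(team_a_scores, team_b_scores, seed=0)
print(probability)
```

This model treats the two scores as independent and ignores matchup effects such as pace and defense, so it is best read as a rough baseline.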

NBA Team Report Visualization Output Example:

image image image

NBA_Team_Analytics_Pipeline_V2.ipynb

Data Webscraping

This code segment scrapes individual NBA player boxscore data for all players for the 2022-2023 regular season from stats.nba.com, espn.com and basketball-reference.com using a request to a specific URL with parameters to filter the data. It retrieves the data in JSON format and converts it into a pandas DataFrame. The DataFrame is then extended with columns for field goals made and attempted and points, along with more advanced statistics such as shot distance and shot type. It also adds columns for the team's conference, whether the game was played at home or away, and a formatted date and matchup string. The code also scrapes player height, weight, salary, college, bio, and playoff and All-Star history.

Visualization Functions

def team_players_stat_whisker

image

def team_players_stat_bar

image

def player_stat_plot

image

def player_pra_violins()

image

def player_shooter_pies()

image

def player_stat_reg_analysis_all_perf()

image

def player_reg_analysis_sal_avg()

image

def multi_lin_reg()

image

def scatter_3d()

image

def display_player_image()

image

def player_avg_leaders()

image

def player_total_leaders()

image

def player_stat_count_hist9()

image

def stat_hist()

image

NBA Player Report Visualizations Output Example:

image image
