DS100: Principles and Techniques of Data Science (Summer 2019)

Description

Combining data, computation, and inferential thinking, data science is redefining how people and organizations solve challenging problems and understand their world.

This intermediate-level class bridges Data8 and upper-division computer science and statistics courses, as well as methods courses in other fields. In this class, we explore key areas of data science including question formulation, data collection and cleaning, visualization, statistical inference, predictive modeling, and decision making.

Through a strong emphasis on data-centric computing, quantitative critical thinking, and exploratory data analysis, this class covers key principles and techniques of data science. These include languages for transforming, querying, and analyzing data; algorithms for machine learning methods including regression, classification, and clustering; principles behind creating informative data visualizations; statistical concepts of measurement error and prediction; and techniques for scalable data processing.

Prerequisites

- Foundations of Data Science: basic exposure to Python programming and working with tabular data, as well as visualization, statistics, and machine learning
- Computing: additional background in Python programming (e.g., for loops, lambdas, debugging, and complexity) that will enable DS100 to focus more on the concepts in data science and less on the details of programming in Python
- Math: linear operators, eigenvectors, derivatives, and integrals to enable statistical inference and derive new prediction algorithms

Learning Objectives

| Week 1-2 | Week 3-4 | Week 5-6 | Week 7-8 |
| --- | --- | --- | --- |
| Introduction to Data Science, Logistics, Study Design | SQL | Linear Regression | Big Data and Ray (Guest Lecturer: Robert Nishihara) |
| Data Tables with pandas | Dimensionality Reduction | Gradient Descent | Big Data and Spark, Decision Trees |
| Data Cleaning | PCA | Feature Engineering and Bias-Variance | Random Forests, Runtime Analysis, Modeling Overview |
| Visualization | Statistical Inference: Random Variables and Estimators | Cross-Validation and Regularization | Ethics & Conclusion |
| EDA & Working with Text | Statistical Inference: Risk and Loss Functions | Logistic Regression | |
| | | Classifier Evaluation and Fitting | |
| | | Decision Boundaries, Modeling Considerations | |
| | | Inference for Modeling | |

Units

4

Grade

A-

Syllabus

http://www.ds100.org/su19/syllabus

Projects

Project 1: Food Safety
- Reading simple csv files
- Working with data at different levels of granularity
- Identifying the type of data collected, missing values, anomalies, etc.
- Applying probability sampling techniques
- Exploring characteristics and distributions of individual variables
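
The steps listed for Project 1 can be illustrated with a small sketch. It is not the project's actual solution: the records, column names, and helper functions below are hypothetical, and it uses only the Python standard library rather than pandas so that it runs on its own. It reads a simple CSV, counts missing values in one column, and draws a simple random sample (one basic probability sampling technique).

```python
import csv
import io
import random

# Hypothetical inspection records; the empty score marks a missing value.
RAW_CSV = """business_id,score
b1,92
b2,
b3,78
b4,85
b5,99
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def count_missing(rows, column):
    """Count rows whose value in `column` is an empty string."""
    return sum(1 for row in rows if row[column] == "")

def simple_random_sample(rows, n, seed=0):
    """Draw n rows without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(rows, n)

rows = load_rows(RAW_CSV)
missing = count_missing(rows, "score")
sample = simple_random_sample(rows, 3)
```

Fixing the seed keeps the sample reproducible between runs, which helps when comparing summaries of the sample against the full data.
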
Project 2: Spam/Ham Classification (Feature Engineering, Logistic Regression, Cross Validation)
- Feature engineering with text data
- Using sklearn libraries to process data and fit models
- Validating the performance of your model and minimizing overfitting
- Moving forward: improving the spam filter to reach at least 88% accuracy on the test set
- Generating and analyzing precision-recall curves
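
As a rough illustration of the precision-recall item above, the following sketch computes precision and recall at each score threshold in plain Python. The labels, scores, and function names are made up for the example; the project itself uses sklearn, and this is not its code. One convention is assumed and marked in a comment: precision is reported as 1.0 when no message is predicted to be spam.

```python
def precision_recall(labels, scores, threshold):
    """Return (precision, recall) when score >= threshold predicts spam (1)."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    # Assumed convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def pr_curve(labels, scores):
    """Evaluate precision and recall at every distinct score, highest first."""
    thresholds = sorted(set(scores), reverse=True)
    return [(t, *precision_recall(labels, scores, t)) for t in thresholds]

# Hypothetical labels (1 = spam, 0 = ham) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
curve = pr_curve(labels, scores)
```

Lowering the threshold can only keep recall the same or raise it, while precision may move in either direction; plotting the `(recall, precision)` pairs from `curve` gives the precision-recall curve.
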
Project 3: Predicting Taxi Ride Duration
- The data science lifecycle: data selection and cleaning, EDA, feature engineering, and model selection
- Using sklearn to process data and fit linear regression models
- Embedding linear regression as a component in a more complex model

Homework

HW 1: Math Review and Plotting
- Python, NumPy
- Multivariable Calculus, Linear Algebra, and Probability
- Plotting
HW 2: Bike Sharing (Exploratory Data Analysis (EDA) and Visualization)
- Reading plaintext delimited data into pandas
- Wrangling data for analysis
- Using EDA to learn about your data
- Making informative plots
HW 3: Trump, Twitter, and Text
- Importing the Data
- Tweet Source Analysis
- Sentiment Analysis
HW 5: Predicting Housing Prices
- Simple feature engineering
- Using sklearn to build linear models
- Building a data pipeline using pandas
- Analyzing the error of the model
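
To make the last two ideas concrete (building a linear model and analyzing its error), here is a minimal sketch under stated assumptions. The homework uses sklearn and pandas; this version fits a one-feature least-squares line in plain Python so it is self-contained. The data values and the names `fit_line` and `rmse` are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def rmse(ys, predictions):
    """Root mean squared error between observed and predicted values."""
    squared = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    return math.sqrt(squared / len(ys))

# Hypothetical data: living area (hundreds of sq ft) and price (thousands of dollars).
area = [10, 15, 20, 25, 30]
price = [150, 210, 240, 310, 340]

intercept, slope = fit_line(area, price)
predictions = [intercept + slope * x for x in area]
error = rmse(price, predictions)
```

The RMSE is in the same units as the target, which makes it easier to judge whether the model's typical error is acceptable.
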
HW 6: Predicting Housing Prices (Continued)
- Identifying informative variables through EDA
- Feature engineering categorical variables
- Using sklearn to build more complex linear models
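
Feature engineering for categorical variables usually means turning each category into numeric indicator columns (one-hot encoding). The sketch below shows the idea in plain Python; the column values and the `one_hot_encode` helper are hypothetical and are not taken from the homework, which relies on pandas and sklearn.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 indicator row, one column per category."""
    categories = sorted(set(values))  # fixed column order
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # mark the column for this row's category
        rows.append(row)
    return categories, rows

# Hypothetical categorical column.
neighborhood = ["north", "south", "north", "east"]
categories, encoded = one_hot_encode(neighborhood)
```

When the encoded columns feed a linear model that also has an intercept, one column is often dropped, because the full set of indicators sums to 1 in every row and is therefore collinear with the intercept.
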


shangxin-wang/DS100
