This repository has been archived by the owner on Nov 3, 2019. It is now read-only.

shangxin-wang/DS100

DS100: Principles and Techniques of Data Science (Summer 2019)

Description

Combining data, computation, and inferential thinking, data science is redefining how people and organizations solve challenging problems and understand their world.

This intermediate-level class bridges Data8 and upper-division computer science and statistics courses, as well as methods courses in other fields. In this class, we explore key areas of data science including question formulation, data collection and cleaning, visualization, statistical inference, predictive modeling, and decision making.

Through a strong emphasis on data-centric computing, quantitative critical thinking, and exploratory data analysis, this class covers key principles and techniques of data science. These include languages for transforming, querying, and analyzing data; algorithms for machine learning methods including regression, classification, and clustering; principles behind creating informative data visualizations; statistical concepts of measurement error and prediction; and techniques for scalable data processing.

Prerequisites

  • Foundations of Data Science: basic exposure to Python programming and working with tabular data, as well as visualization, statistics, and machine learning

  • Computing: additional background in Python programming (e.g., for loops, lambdas, debugging, and complexity) that will enable DS100 to focus more on the concepts in data science and less on the details of programming in Python

  • Math: linear operators, eigenvectors, derivatives, and integrals to enable statistical inference and derive new prediction algorithms

Learning Objectives

| Week 1-2 | Week 3-4 | Week 5-6 | Week 7-8 |
| --- | --- | --- | --- |
| Introduction to Data Science, Logistics, Study Design | SQL | Linear Regression | Big Data and Ray (Guest Lecturer: Robert Nishihara) |
| Data Tables with pandas | Dimensionality Reduction | Gradient Descent | Big Data and Spark, Decision Trees |
| Data Cleaning | PCA | Feature Engineering and Bias-Variance | Random Forests, Runtime Analysis, Modeling Overview |
| Visualization | Statistical Inference: Random Variables and Estimators | Cross-Validation and Regularization | Ethics & Conclusion |
| EDA & Working with Text | Statistical Inference: Risk and Loss Functions | Logistic Regression | |
| | | Classifier Evaluation and Fitting | |
| | | Decision Boundaries, Modeling Considerations | |
| | | Inference for Modeling | |

Units

4

Grade

A-

Syllabus

http://www.ds100.org/su19/syllabus

Projects

  • Project 1: Food Safety

    • Reading simple CSV files
    • Working with data at different levels of granularity
    • Identifying the type of data collected, missing values, anomalies, etc.
    • Applying probability sampling techniques
    • Exploring characteristics and distributions of individual variables
  • Project 2: Spam/Ham Classification <Feature Engineering, Logistic Regression, Cross Validation>

    • Feature engineering with text data
    • Using sklearn libraries to process data and fit models
    • Validating the performance of your model and minimizing overfitting
    • Moving Forward: Make the spam filter more accurate and get at least 88% accuracy on the test set
    • Generating and analyzing precision-recall curves
  • Project 3: Predicting Taxi Ride Duration

    • The data science lifecycle: data selection and cleaning, EDA, feature engineering, and model selection
    • Using sklearn to process data and fit linear regression models
    • Embedding linear regression as a component in a more complex model
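
The precision-recall bullet under Project 2 can be illustrated with a minimal sketch. It is not the project's actual code: the function name `precision_recall_points` and the toy labels and scores are hypothetical, and it uses only the Python standard library rather than sklearn. For each observed score used as a threshold, it computes precision (the share of flagged emails that are really spam) and recall (the share of spam emails that were flagged).

```python
def precision_recall_points(labels, scores):
    """Return (threshold, precision, recall) for each distinct score threshold.

    labels: list of 0/1 ground-truth values (1 = spam).
    scores: list of predicted spam probabilities, same length as labels.
    """
    total_positive = sum(labels)
    points = []
    # Sweep thresholds from strictest to most permissive.
    for threshold in sorted(set(scores), reverse=True):
        predicted = [score >= threshold for score in scores]
        true_pos = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        pred_pos = sum(predicted)
        # pred_pos is at least 1 because the threshold is itself an observed score.
        precision = true_pos / pred_pos
        recall = true_pos / total_positive if total_positive else 0.0
        points.append((threshold, precision, recall))
    return points


# Hypothetical toy data, not taken from the project's email dataset.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

for threshold, precision, recall in precision_recall_points(toy_labels, toy_scores):
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold can only keep recall the same or raise it, while precision may move in either direction, which is the trade-off a precision-recall curve makes visible.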

Homework

  • HW 1: Math Review and Plotting

    • Python, NumPy
    • Multivariable Calculus, Linear Algebra, and Probability
    • Plotting
  • HW 2: Bike Sharing <Exploratory Data Analysis (EDA) and Visualization>

    • Reading plaintext delimited data into pandas
    • Wrangling data for analysis
    • Using EDA to learn about your data
    • Making informative plots
  • HW 3: Trump, Twitter, and Text

    • Importing the Data
    • Tweet Source Analysis
    • Sentiment Analysis
  • HW 5: Predicting Housing Prices

    • Simple feature engineering
    • Using sklearn to build linear models
    • Building a data pipeline using pandas
    • Analyzing the error of the model
  • HW 6: Predicting Housing Prices (Continued)

    • Identifying informative variables through EDA
    • Feature engineering categorical variables
    • Using sklearn to build more complex linear models
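
As a rough illustration of the steps listed under HW 5 and HW 6, the sketch below reads delimited text into pandas, one-hot encodes a categorical column, fits an sklearn linear model, and measures its training error. It is not the homework's code: the column names, the toy records, and the helpers `build_features` and `rmse` are all hypothetical, and the assignments' own dataset and pipeline may differ.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical housing records; column names and values are invented for illustration.
RAW_CSV = """sqft,neighborhood,price
1000,north,200000
1500,north,275000
1200,south,190000
2000,south,310000
1800,east,300000
900,east,165000
"""


def build_features(df):
    """Keep the numeric column and one-hot encode the categorical one."""
    return pd.get_dummies(
        df[["sqft", "neighborhood"]], columns=["neighborhood"], dtype=float
    )


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    return float(((actual - predicted) ** 2).mean() ** 0.5)


housing = pd.read_csv(io.StringIO(RAW_CSV))
X = build_features(housing)
y = housing["price"]

model = LinearRegression().fit(X, y)
training_rmse = rmse(y, model.predict(X))

print(list(X.columns))
print(f"training RMSE: {training_rmse:.0f}")
```

Training error alone tends to be optimistic, so in practice the same `rmse` helper would be applied to held-out data as well.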