What is Data Science?

Interdisciplinary field used to extract knowledge or insights from data. The application of data-centric, computational, and inferential thinking to understand the world & solve problems.

Is Data Science…

  • Statistics?
    • Yes, use data to infer properties of the world
  • Machine Learning?
    • Yes, use data to build algorithms that make predictions
  • Computer Science?
    • Yes, use computational thinking and abstraction to manage complexity
  • Science, Art, Engineering?
    • Yes, combines the scientific method, creative thinking, and the ability to solve complicated problems

Big Concepts in Data Science

  • Data preparation and representation
  • Efficient data processing
  • Question formulation and experimental design
  • Exploratory data analysis
  • Modelling, parameter estimation, and statistical inference
  • Various prediction methods: generalized linear models, decision trees, neural networks, clustering, PCA…
    • Overfitting, regularization, and cross validation
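
As a rough illustration of cross validation, the sketch below splits sample indices into k folds so that each sample is used for validation exactly once. The function name `k_fold_indices` and the sizes used are made up for this example; libraries such as scikit-learn provide their own versions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k-fold cross validation."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

A model would be fit on each `train` split and scored on the matching `validation` split; averaging the scores gives an estimate of how well the model generalizes, which helps detect overfitting.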

Principles of Statistics in Data Science

  • Experimental Design & Sampling
    • How do we collect data to accurately answer questions?
  • Probability & Uncertainty
    • How do we quantify what we don’t know?
  • Modelling
    • How do we distill the essential structure of complex phenomena?
  • Inference & Prediction
    • How do we use the known to reason about the unknown?

Principles of Computer Science in Data Science

  • Software Design & Debugging
    • How do we develop and maintain reliable & repeatable analysis?
  • Abstraction & Algorithm Design
    • How do we break big problems into small problems?
  • Computational Complexity
    • How do we trade off time and space to compute efficiently?
  • Parallelism & Locality
    • How do we divide computation across resources?

Domain Knowledge

  • What are the key questions/problems in the domain?
  • What is the context of the data?
    • What data is already available?
    • How and why was it collected?
    • What are the schema and limitations of the data?
    • How can more data be collected/obtained?
  • What is the underlying process that generates the data?
    • causal structure, dependencies, …

Data Scientists must be inquisitive and learn new domains quickly.

Data Science Lifecycle

High-level description of the data science workflow:

graph LR

id1[Obtain Data]
id2[Understand Data]
id3[Understand World]
id4[Ask Question]
id5[Reports & Data-products]

id1 --> id4
id2 --> id4
id2 --> id3
id2 --> id1
id3 --> id1
id3 --> id2
id3 --> id5
id2 --> id5
  • Frame questions & design experiments
  • Obtain and clean data
  • Summarize and visualize data
  • Inference and prediction

Info

End of lecture 1; the rest of the notes are taken from the slides.

What is Big Data?

  • Big in rows (size n)
  • Big in columns (dimensions p)
  • Hard to extract value from
  • Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing

Sources of Big Data

  • Data as the by-product of other activities
    • Click trail, clicks before a purchase
    • Moving your body (Apple Watch)
    • Data were always being generated. They just weren’t being captured.
      • Cheaper, smaller sensors help
  • Data as the primary goal of activities
    • Telescopes, Genetic sequences, 61 million person experiments
  • Web
    • 20 billion webpages, each 20KB = 400TB
  • Astronomy
    • Apache Point Telescope, 200GB/night
    • Large Synoptic Survey Telescope: 3 billion pixel camera
  • Life Sciences
    • High throughput sequencer: 1TB/day

Implication for Statistics

  • Little data, big data
    • Sampling still matters
  • Everything is significant (The Starry Night)
  • Inverting a matrix
    • (Stochastic) Gradient Descent (Ascent), BFGS, …
  • Causal Inference
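
One reading of the "inverting a matrix" bullet is that, at scale, closed-form solutions that need a matrix inverse are replaced by iterative methods such as gradient descent. The sketch below is a minimal one-parameter example on made-up data: it fits y ≈ w * x by gradient descent and compares the result with the closed-form answer (the one-dimensional analogue of the matrix solution). The learning rate and step count are arbitrary choices for this example.

```python
# Made-up data generated with w = 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Closed-form least-squares solution for a single parameter
closed_form_w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Gradient descent on the mean squared error mean((w*x - y)^2)
w = 0.0
learning_rate = 0.01
for step in range(1000):
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * gradient

print("closed form:", closed_form_w)
print("gradient descent:", round(w, 4))
```

Stochastic gradient descent follows the same update rule but estimates the gradient from a random subset of the data at each step, so it never needs to touch all rows at once.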

Implication for Computation

  • Conventional understanding of what is computationally tractable: polynomial-time algorithms (N^k)
  • Now it is (N^k)/m, where m is the number of computers.
  • For really big data: N*log(N)
    • Traversing a binary tree, sort, and search N*log(N)
    • Streaming application
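
To get a feel for these growth rates, the sketch below compares rough operation counts for a quadratic algorithm (k = 2) on one machine, the same algorithm split across m machines, and an N*log(N) algorithm. The helper `estimated_operations` and the values of N and m are illustrative assumptions, and the model ignores communication costs between machines.

```python
import math


def estimated_operations(n, k=2, m=1):
    """Rough operation count for an N^k algorithm split evenly across m machines."""
    return n ** k / m


n = 1_000_000_000  # one billion records (illustrative)

quadratic_one_machine = estimated_operations(n, k=2, m=1)
quadratic_thousand_machines = estimated_operations(n, k=2, m=1000)
n_log_n = n * math.log2(n)

print("N^2 on 1 machine:     ", quadratic_one_machine)
print("N^2 on 1000 machines: ", quadratic_thousand_machines)
print("N*log2(N):            ", n_log_n)
```

Even with a thousand machines, the quadratic count stays far above the N*log(N) count, which is why near-linear algorithms matter for really big data.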

Our Goals

  • Prepare students for advanced Data Science courses in data-management, machine learning, and statistics (by providing the necessary foundation and context)
  • Enable students to start careers as data scientists by providing experience in working with real data, tools, and techniques
  • Empower students to apply computational and inferential thinking to tackle real world problems

Note

End of lecture 1.

Introduction to Python

What is Python?

  • General purpose programming language
  • Interpreted programming language
    • Code is translated and executed one line at a time
  • Object-oriented language
    • Objects created from classes

Python 2 vs Python 3

Python 2 is an older version of Python.

Important

Python 3 is not backwards compatible with Python 2, and Python 2 cannot run most Python 3 code either. If you write a Python 3 program and run it using Python 2, there is a good chance it won’t work.
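
Two well-known differences illustrate the incompatibility. The code below is written for Python 3; the comments describe how Python 2 behaved.

```python
# print is a function in Python 3.
# Python 2 also accepted the statement form: print "Hello"
print("Hello")

# Dividing two integers with / gives a float in Python 3.
# In Python 2, 7 / 2 truncated to 3.
half = 7 / 2

# Floor division with // gives 3 in both versions.
floor_half = 7 // 2

print(half, floor_half)  # 3.5 3
```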

A Simple Python Program

# Print two messages
print("Welcome to Python!")
print("Python is fun.")

You can run this program using:

python file.py
# OR
python3 file.py

PIP - Package Manager

  • Package-management software written in Python
  • Used to install and manage package installations
  • pip connects to the Python Package Index (PyPI)
    • Online repository of public Python Libraries
  • We will be using pytest to test lab assignments, and you can install it using the following command
pip install -U pytest
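
A minimal sketch of what a pytest test file can look like. The file name `test_area.py` and the function `circle_area` are hypothetical and not taken from the lab assignments; by default pytest collects functions whose names start with `test_` from files named `test_*.py`.

```python
# test_area.py (hypothetical file name)

def circle_area(radius):
    """Return the area of a circle, using 3.14159 as an approximation of pi."""
    return radius * radius * 3.14159


def test_circle_area_of_unit_circle():
    assert circle_area(1.0) == 3.14159


def test_circle_area_of_zero_radius():
    assert circle_area(0) == 0
```

Running `pytest test_area.py` in the same directory should report two passing tests.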

Statement

A statement represents an action or a sequence of actions. For example, print("Welcome to Python!") is a statement that displays “Welcome to Python!” in the console.

Indentation

Indentation matters in Python. Note that the following program would fail due to an IndentationError: unexpected indent.

  print("Welcome to Python!")
print("Python is fun.")
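
For contrast, here is a short sketch of indentation used correctly, where it marks which lines belong to a block. The variable names and values are made up for illustration. The second half compiles the badly indented program from above as a string, so the error can be observed without crashing the script.

```python
temperature = 30  # illustrative value

if temperature > 25:
    # These lines are indented by 4 spaces, so they belong to the if block
    message = "It is warm."
    print(message)
print("Done.")  # Not indented: runs whether or not the condition holds

# Compiling source with a stray indent raises IndentationError
bad_source = '  print("Welcome to Python!")\nprint("Python is fun.")\n'
try:
    compile(bad_source, "<example>", "exec")
except IndentationError as error:
    caught = error.msg
    print("IndentationError:", caught)
```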

Comments

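A # character starts a comment that runs to the end of the current line. Three quotation marks ''' ''' open and close a multi-line string; when such a string is not assigned to anything, Python evaluates and discards it, so it is commonly used as a paragraph (multi-line) comment.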

# This is a comment
 
print("Comment test") # This is also a comment
 
'''
This is a 
multi-line
comment
'''
 
This is not a comment
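
A small sketch showing the difference between a # comment and a triple-quoted string. The function `greet` is made up for this example. When a triple-quoted string is the first statement in a function, Python stores it as the function's docstring, which shows it is a real string rather than a true comment.

```python
def greet(name):
    """Return a greeting for name.

    A triple-quoted string placed first in a function is a docstring.
    """
    # A comment: ignored by Python entirely
    return "Hello, " + name


'''
This bare triple-quoted string is evaluated and then discarded,
which is why it behaves like a multi-line comment.
'''

print(greet("Ada"))
print(greet.__doc__)
```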

Special Symbols

Character | Name                                     | Description
--------- | ---------------------------------------- | ---------------------------------
()        | Opening and closing parentheses          | Used with functions, arrays, etc.
#         | Hashtag/Pound sign                       | Starts a comment
" "       | Opening and closing quotation marks      | Enclosing a string
''' '''   | Opening and closing quotation marks (3x) | Enclosing a paragraph comment

Proper Indentation and Spacing (for this course)

  • Comments
    • Include your name, class section, instructor, date, and a brief description at the beginning of the program
  • Indentation
    • 4 spaces
  • Spacing
    • Use blank line to separate segments of the code

Programming Errors

  • Syntax errors
    • Error in code construction
  • Runtime errors
    • Something in the code causes the program to abort while it is running
  • Logic errors
    • Code produces incorrect result
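
The three kinds of error can be sketched in one short script. The examples are made up for illustration; the syntax error is compiled from a string so that the rest of the script can still run.

```python
# 1. Syntax error: the code is malformed, so Python rejects it before running it
try:
    compile('print("missing parenthesis"', "<example>", "exec")
    syntax_error_caught = False
except SyntaxError:
    syntax_error_caught = True

# 2. Runtime error: the code is well formed but fails while running
try:
    result = 1 / 0
    runtime_error_caught = False
except ZeroDivisionError:
    runtime_error_caught = True


# 3. Logic error: the code runs without complaint but gives the wrong answer
def average(a, b):
    return a + b / 2   # Bug: should be (a + b) / 2


wrong = average(2, 4)  # 4.0, although the intended average is 3.0

print(syntax_error_caught, runtime_error_caught, wrong)
```

Logic errors are usually the hardest to find because Python raises nothing; tests that compare results against known answers are the main defence.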

Variables, Expressions, Assignment, & Simultaneous Assignment

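# Compute the first area
radius = 1.0                 # Assignment
radius = radius + 0          # Expression: the right side is evaluated, then assigned
area = radius * radius * 3.14159
print("The area is", area, "for radius", radius)

radius = r = 0               # Chained assignment: radius and r are both 0

x, y = 1, 2                  # Simultaneous assignment
x, y = y, x                  # Swap x with y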
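
Simultaneous assignment works because Python evaluates everything on the right-hand side before assigning anything on the left. A short sketch using Fibonacci numbers, where both updates need the old values:

```python
a, b = 0, 1
for _ in range(5):
    # Both right-hand values use the old a and b; no temporary variable needed
    a, b = b, a + b

print(a, b)  # 5 8
```

Written as two separate statements (`a = b` then `b = a + b`), the second line would see the already updated `a` and produce a different sequence.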

Note

Wow, there are 119 slides covering the basics to Python. I already know the basics to Python. I don’t need to spend hours writing notes on this lol.