What is Data Science?
An interdisciplinary field that extracts knowledge or insights from data: the application of data-centric, computational, and inferential thinking to understand the world & solve problems.
Is Data Science…
- Statistics?
- Yes, use data to infer properties of the world
- Machine Learning?
- Yes, use data to build algorithms that make predictions
- Computer Science?
- Yes, use computational thinking and abstraction to manage complexity
- Science, Art, Engineering?
- Yes, combines the scientific method, creative thinking, and the ability to solve complicated problems
Big Concepts in Data Science
- Data preparation and representation
- Efficient data processing
- Question formulation and experimental design
- Exploratory data analysis
- Modelling, parameter estimation, and statistical inference
- Various prediction methods: generalized linear models, decision trees, neural networks, clustering, PCA…
- Overfitting, regularization, and cross validation
Principles of Statistics in Data Science
- Experimental Design & Sampling
- How do we collect data to accurately answer questions?
- Probability & Uncertainty
- How do we quantify what we don’t know?
- Modelling
- How do we distill the essential structure of complex phenomena?
- Inference & Prediction
- How do we use the known to reason about the unknown?
Principles of Computer Science in Data Science
- Software Design & Debugging
- How do we develop and maintain reliable & repeatable analysis?
- Abstraction & Algorithm Design
- How do we break big problems into small problems?
- Computational Complexity
- How do we trade off time and space to compute efficiently?
- Parallelism & Locality
- How do we divide computation across resources?
Domain Knowledge
- What are the key questions/problems in the domain?
- What is the context of the data?
- What data is already available?
- How and why was it collected?
- What are the schema and limitations of the data?
- How can more data be collected/obtained?
- What is the underlying process that generates the data?
- causal structure, dependencies, …
Data Scientists must be inquisitive and learn new domains quickly.
Data Science Lifecycle
High-level description of the data science workflow:
graph LR
id1[Obtain Data]
id2[Understand Data]
id3[Understand World]
id4[Ask Question]
id5[Reports & Data-products]
id1 --> id4
id2 --> id4
id2 --> id3
id2 --> id1
id3 --> id1
id3 --> id2
id3 --> id5
id3 --> id1
id2 --> id5
- Frame questions & design experiments
- Obtain and clean data
- Summarize and visualize data
- Inference and prediction
Info
End of lecture 1, the rest of the notes are taken from the slides.
What is Big Data?
- Big in rows (size n)
- Big in columns (dimensions p)
- Hard to extract value from
- Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing
Sources of Big Data
- Data as the by-product of other activities
- Click trail, clicks before a purchase
- Moving your body (Apple Watch)
- Data were always being generated. They just weren’t being captured.
- Cheaper, smaller sensors help
- Data as the primary goal of activities
- Telescopes, Genetic sequences, 61 million person experiments
- Web
- 20 billion webpages, each 20KB = 400TB
- Astronomy
- Apache Point Telescope, 200GB/night
- Large Synoptic Survey Telescope: 3 billion pixel camera
- Life Sciences
- High throughput sequencer: 1TB/day
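The web estimate above (20 billion pages at 20KB each) can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes decimal (SI) units, which the slides do not specify, and the variable names are made up for illustration.

```python
# Back-of-the-envelope check of the web size estimate.
# Assumption: decimal (SI) units, i.e. 1 KB = 10**3 bytes, 1 TB = 10**12 bytes.
KB = 10**3
TB = 10**12

num_pages = 20 * 10**9   # 20 billion webpages
page_size = 20 * KB      # 20 KB per page

total_bytes = num_pages * page_size
total_tb = total_bytes / TB

print(f"Estimated size of the web: {total_tb:.0f} TB")
```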
Implication for Statistics
- Little data, big data
- Sampling still matters
- Everything is significant (The Starry Night)
- Inverting a matrix
- (Stochastic) Gradient Descent (Ascent), BFGS, …
- Causal Inference
Implication for Computation
- Conventional understanding of what is computationally tractable: polynomial time algorithm (N^k)
- Now it is (N^k)/m, where m is the number of computers.
- For really big data: N*log(N)
- Traversing a binary tree, sort, and search N*log(N)
- Streaming application
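To get a feel for the growth rates listed above, the sketch below compares rough operation counts. The values of N and m are hypothetical, the logarithm is assumed to be base 2, and the N^k/m figure assumes the work splits perfectly across machines with no communication cost, which real clusters do not achieve.

```python
import math

def polynomial_ops(n, k):
    """Operation count for an N^k algorithm on a single machine."""
    return n ** k

def distributed_ops(n, k, m):
    """Per-machine count when N^k work is split evenly over m machines.

    Assumes perfect parallelism with no communication overhead.
    """
    return n ** k / m

def linearithmic_ops(n):
    """Operation count for an N*log(N) algorithm (log base 2 assumed)."""
    return n * math.log2(n)

n = 10**9   # hypothetical: one billion records
m = 1000    # hypothetical: a cluster of 1,000 machines

print(f"N^2        : {polynomial_ops(n, 2):.2e}")
print(f"N^2 / m    : {distributed_ops(n, 2, m):.2e}")
print(f"N * log2(N): {linearithmic_ops(n):.2e}")
```

Even with a thousand machines, the quadratic count stays far above the N*log(N) count, which is one way to read why the notes single out N*log(N) for really big data.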
Our Goals
- Prepare students for advanced Data Science courses in data-management, machine learning, and statistics (by providing the necessary foundation and context)
- Enable students to start careers as data scientists by providing experience in working with real data, tools, and techniques
- Empower students to apply computational and inferential thinking to tackle real world problems
Note
End of lecture 1.
Introduction to Python
What is Python?
- General purpose programming language
- Interpreted programming language
- Code is translated and executed one line at a time
- Object-oriented language
- Objects created from classes
Python 2 vs Python 3
Python 2 is an older version of Python.
Important
Python 3 is not backwards compatible with Python 2. If you write a Python 3 program and run it using Python 2, there is a good chance it won’t work.
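Two well-known differences illustrate the incompatibility. The snippet below is a small sketch written for Python 3; the comments note where Python 2 differs. The variable names are only for illustration.

```python
# Written for Python 3; the comments note where Python 2 differs.

# 1. print is a function in Python 3. The Python 2 statement form
#    `print "hello"` is a SyntaxError in Python 3.
print("hello")

# 2. `/` is true division in Python 3. In Python 2, `7 / 2` between two
#    integers performs floor division and gives 3.
true_quotient = 7 / 2     # 3.5 in Python 3
floor_quotient = 7 // 2   # 3 in both versions

print(true_quotient, floor_quotient)
```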
A Simple Python Program
# Print two messages
print("Welcome to Python!")
print("Python is fun.")
You can run this program using:
python file.py
# OR
python3 file.py
PIP - Package Manager
- Package-management software written in Python
- Used to install and manage package installations
- pip connects to the Python Package Index (PyPI)
- Online repository of public Python Libraries
- We will be using pytest to test lab assignments, and you can install it using the following command
pip install -U pytest
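For context, a pytest test is an ordinary function whose name starts with `test_` and that uses plain `assert` statements. The file below is a minimal sketch, not one of the actual lab tests: the file name `test_area.py` and the function `circle_area` are hypothetical.

```python
# test_area.py (hypothetical file name; pytest collects files named test_*.py)

def circle_area(radius):
    """Hypothetical function under test: area of a circle."""
    return radius * radius * 3.14159

def test_unit_circle():
    # pytest reports a failure if a plain assert does not hold
    assert abs(circle_area(1.0) - 3.14159) < 1e-9

def test_zero_radius():
    assert circle_area(0) == 0
```

Running `pytest test_area.py` in the same directory would collect and run both test functions.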
Statement
A statement represents an action or a sequence of actions. For example, print("Welcome to Python!") is a statement that displays “Welcome to Python!” in the console.
Indentation
Indentation matters in Python. The following program fails with an IndentationError: unexpected indent, because the second line is indented without starting a new block.
print("Welcome to Python!")
    print("Python is fun.")
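The failure described above can be reproduced without breaking the script that demonstrates it. In this sketch the faulty program is kept in a string and handed to the built-in `compile()`, which parses the code without executing it; the variable names are only for illustration.

```python
# The faulty program is stored as a string so this script itself stays valid.
source = 'print("Welcome to Python!")\n    print("Python is fun.")\n'

try:
    compile(source, "<example>", "exec")   # parse only, nothing is executed
    error_message = None
except IndentationError as err:
    error_message = err.msg

print(error_message)
```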
Comments
A # character starts a comment that runs to the end of the current line. Three quotation marks (''' ''') open and close a multi-line string, which is commonly used as a paragraph (multi-line) comment.
# This is a comment
print("Comment test") # This is also a comment
'''
This is a
multi-line
comment
'''
print("This is not a comment")
Special Symbols
Character | Name | Description |
---|---|---|
() | Opening and closing parentheses | Used with function calls, tuples, etc. |
# | Hashtag/Pound sign | Starts a comment |
" " | Opening and closing quotation marks | Enclosing a string |
''' ''' | Opening and closing quotation marks (3x) | Enclosing a paragraph comment |
Proper Indentation and Spacing (for this course)
- Comments
- Include your name, class section, instructor, date, and a brief description at the beginning of the program
- Indentation
- 4 spaces
- Spacing
- Use a blank line to separate segments of the code
Programming Errors
- Syntax errors
- Error in code construction
- Runtime errors
- Something in the code causes the program to abort
- Logic errors
- Code produces incorrect result
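The three kinds of error can be seen side by side in a short sketch. The examples are made up for illustration: the syntax error is triggered through the built-in `compile()` so that the script itself remains valid, and `average` is a hypothetical function with a deliberate bug.

```python
# 1. Syntax error: the code is malformed, so Python rejects it before running.
try:
    compile('print("missing parenthesis"', "<example>", "exec")
    syntax_ok = True
except SyntaxError:
    syntax_ok = False

# 2. Runtime error: the code is well formed but fails while executing.
try:
    result = 1 / 0
    runtime_ok = True
except ZeroDivisionError:
    runtime_ok = False

# 3. Logic error: the code runs without complaint but the answer is wrong.
def average(a, b):
    return a + b / 2      # bug: should be (a + b) / 2

print(syntax_ok, runtime_ok, average(2, 4))
```

Only the first two raise exceptions. The logic error has to be caught by checking the result, which is what tests are for.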
Variables, Expressions, Assignment, & Simultaneous Assignment
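# Compute the area of a circle
radius = 1.0               # assignment: bind a value to a variable
radius = radius + 0        # expression: the right-hand side is evaluated first
radius = r = 0             # chained assignment: radius and r are both 0
area = radius * radius * 3.14159
print("The area is", area, "for radius", radius)

# Simultaneous assignment
x, y = 1, 2                # example values so the swap below can run
x, y = y, x                # swap x with y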
Note
Wow, there are 119 slides covering the basics of Python. I already know the basics of Python. I don’t need to spend hours writing notes on this lol.