Data Science Standards
Questions from the Galvanize Data Science Intensive
Answers from my own education and research.
What is a Standard?
Standards are the core competencies of data scientists: the knowledge, skills, and habits needed to be hirable.
Standards by Topic
1. Python
- Explain the difference between mutable and immutable types and their relationship to dictionaries.
- Compare the strengths and weaknesses of lists vs. dictionaries.
- Choose the appropriate collection (dict, Counter, defaultdict) to simplify a problem (see the sketch after this list).
- Compare the strengths and weaknesses of lists vs. generators.
- Write pythonic code.
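A minimal sketch of choosing the right collection, as referenced above (the data is illustrative):

```python
from collections import Counter, defaultdict

words = ["a", "bb", "a", "ccc", "bb", "a"]

# Counter: counting hashable items.
print(Counter(words).most_common(2))   # [('a', 3), ('bb', 2)]

# defaultdict: grouping without key-existence checks.
groups = defaultdict(list)
for w in words:
    groups[len(w)].append(w)

# Plain dict (comprehension): a simple key -> value mapping.
lengths = {w: len(w) for w in set(words)}
```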
2. Version Control / Git
- Explain the basic function and purpose of version control.
- Version control lets you modify code while always keeping a recoverable history of earlier versions. It also allows multiple programmers to work on the same code simultaneously, improving efficiency.
- Use a basic Git workflow to track project changes over time, share code, and write useful commit messages.
3. OOP
- Given the code for a python class, instantiate a python object, call its methods, and list its attributes.
- Write the python code for a simple class.
```python
class ClassName:
    """A minimal example class."""

    _c_map = {"hearts": "red", "spades": "black"}  # class-level lookup table

    def __init__(self, parameter1, p2):
        self.var = parameter1
        self.var2 = p2

    def __repr__(self):
        """Return a text description."""
        return "{} of {}".format(self.var, self.var2)

    @property  # read-only field; can't be changed without adding @color.setter
    def color(self):
        return self._c_map[self.var2]

    def __call__(self, elem):
        """Perform the map operation on an element."""
        return self._impl(elem)

    def _impl(self, elem):
        """Default implementation; subclasses override this."""
        return elem

    def method_name(self, p3):
        self.var2 += self.var * p3
```
- Match key “magic” methods to their functionality. Full list/tutorial here: http://www.diveintopython3.net/special-method-names.html
- Design a program or algorithm in object-oriented fashion.
- Compare and contrast functional and object oriented programming.
- In OOP, the programmer defines a new data type (a class) with its own attributes (data) and methods (operations); the class is then used to create objects.
- In functional programming, the program is built from functions and does not organize data around a new class.
- Functional code is often faster and uses less memory, but it can be less organized and harder for humans to interpret when a lot is going on. A side-by-side sketch follows this list.
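A minimal sketch contrasting the two styles (the running-total example is illustrative, not from the original):

```python
from functools import reduce

# Object-oriented: state and behavior bundled into a class.
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x

acc = Accumulator()
for x in [1, 2, 3]:
    acc.add(x)

# Functional: plain functions, no mutable object state.
total = reduce(lambda running, x: running + x, [1, 2, 3], 0)

assert acc.total == total == 6
```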
4. SQL
- Connect to a SQL database via the command line (e.g., Postgres).
- Connect to a database from within a python program (see the sketch after this list).
- State the function of basic SQL commands.
- Write simple queries on a single table, including SELECT, FROM, WHERE, and CASE clauses and aggregates.
- Write complex queries, including JOINs and subqueries.
- Explain how indexing works in Postgres.
- Create and dump tables.
- Format a query to follow a standard style.
- Move data from SQL database to text file.
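A minimal sketch of connecting to Postgres from Python and dumping query results to a text file, assuming the psycopg2 driver; the database name, credentials, and table are hypothetical:

```python
import csv
import psycopg2  # assumes: pip install psycopg2-binary

# Hypothetical connection parameters.
conn = psycopg2.connect(dbname="example_db", user="postgres", host="localhost")
cur = conn.cursor()

# A simple single-table query with an aggregate and a CASE clause.
cur.execute("""
    SELECT user_id,
           COUNT(*) AS n_orders,
           SUM(CASE WHEN amount > 100 THEN 1 ELSE 0 END) AS n_large
    FROM orders
    GROUP BY user_id;
""")

# Move the result set to a text file.
with open("orders_summary.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # column names
    writer.writerows(cur.fetchall())

cur.close()
conn.close()
```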
5. Pandas
- Explain/use the relationship between DataFrame and Series
- Know how to set and reset indexes
- Use loc, iloc, ix, and iat appropriately
  - loc: label-based indexing.
  - iloc: integer-position-based indexing.
  - ix: both label- and integer-based (deprecated in modern pandas).
  - iat: fast scalar access by integer position, e.g. df.iat[x, y].
- Use index alignment and know when it applies
- Use Split-Apply-Combine methods (see the sketch after this list)
- Be able to read and write data with pandas
- Recognize problems that can probably be solved with Pandas (as opposed to writing vanilla Python functions).
- Use basic DateTimeIndex functionality
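A minimal split-apply-combine sketch (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "SF", "SF"],
    "price": [10.0, 12.0, 15.0, 11.0],
})

# Split on city, apply aggregations, combine into a new frame.
summary = df.groupby("city")["price"].agg(["mean", "count"])
print(summary)
```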
6. Plotting
- Describe the architecture of a matplotlib figure
- Plot in and outside of notebooks with matplotlib and seaborn
- Combine multiple datasets/categories in same plot
- Use subplots effectively (see the sketch after this list)
- Plot with Pandas
- Use and explain scatter_matrix output
- Use and explain a correlation heatmap
- Visualize pairwise relationships with seaborn
- Compare within-class distributions
- Use matplotlib techniques with seaborn
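A minimal sketch of a matplotlib figure with subplots, with seaborn styling layered on top (the data is synthetic; sns.set_theme assumes seaborn 0.11+):

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set_theme()  # seaborn styling on top of matplotlib

x = np.linspace(0, 10, 100)

# A Figure is the top-level container; each Axes is one plot.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(x, np.sin(x), label="sin")
axes[0].legend()
axes[1].scatter(x, np.cos(x), alpha=0.5)
axes[1].set_title("cos")
fig.tight_layout()
plt.show()
```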
7. Visualization
- Explain the difference between exploratory and explanatory visualizations.
- Exploratory visualizations are used to understand the nature of the dataset, while explanatory visualizations communicate the reasoning behind the actions performed on the data.
- Explain what a visualization is
- Don’t lie with data
- Visualize multidimensional relationships with data using position, size, color, alpha, facets.
- Create an explanatory visualization that makes a relationship in data explicit.
8. Workflow
- Perform basic file operations from the command line, while consulting man/help/Google if necessary.
- Get help using man (e.g., man grep)
- Perform “survival” edits using vi, emacs, nano, or pico
- Configure environment & aliases in .bashrc/.bash_profile/.profile
- Install data science stack
- Manage a process with job control
- Examine system performance and kill processes
- Work on a remote machine with ssh/scp
- State what an RE (regular expression) is and write a simple one
- State the features and use cases of grep/sed/awk/cut/paste to process/clean a text file
9. Probability
- Define what a random variable is.
- Explain difference between permutations and combinations.
- Recite and perform major probability laws from memory:
- Bayes Rule
- LOTP
- Chain Rule
- Recite and perform major random variable formulas from memory:
- E(X)
- Var(X)
- Cov(X,Y)
- Describe what a joint distribution is and be able to perform a simple calculation using joint distribution.
- Define each major probability distribution and give 1 clear example of each
- Explain independence of 2 r.v.’s and implications with respect to probability formulas, covariance formulas, etc.
- Compute the expectation of aX + bY, where X and Y are random variables, and explain that expectation is a linear operator
- Compute the variance of aX + bY (see the identities after this list)
- Discuss why correlation is not causation
- Describe correlation and its perils, with reference to Anscombe’s quartet
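The two identities above, written out (standard results, stated here for reference):

```latex
E[aX + bY] = a\,E[X] + b\,E[Y]
\qquad
\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)
```

If X and Y are independent, Cov(X, Y) = 0 and the cross term drops out.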
10. Sampling
- Compute MLE estimate for simple example (such as coin-flipping)
- Pseudocode bootstrapping for a given sample of size N (see the sketch after this list).
- Construct confidence interval for case where parametric construction does not work
- Discuss examples of times when you need bootstrapping.
- Define the Central Limit Theorem
- Compute standard error
- Compare and contrast the use cases of parametric and nonparametric estimation
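A minimal bootstrap sketch for a percentile confidence interval on the mean (the sample is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=200)  # synthetic, non-normal data

# Resample with replacement many times; collect the statistic each time.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# Percentile confidence interval.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```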
11. Hypothesis Testing
- Given a dataset, set up a null and alternative hypothesis, and calculate and interpret the p-value for the difference of means or proportions (a two-sample t-test sketch follows this list).
- Given a dataset, set up a null and alternative hypothesis, and calculate and interpret the p-value for Chi-square test of independence
- Describe a situation in which a one-tailed test would be appropriate (vs. a two-tailed test).
- Given a particular situation, correctly choose among the following options:
- z-test
- t-test
- 2 sample t-test (one-sided and two-sided)
- 2 sample z-test (one-sided and two-sided)
- Define p-value, Type I error, Type II error, significance level and discuss their significance in an example problem.
- Type I error (false positive): rejecting the null hypothesis when it is true.
- Type II error (false negative): failing to reject the null hypothesis when it is false.
- Account for the multiple comparisons problem via Bonferroni correction.
- Compute the difference of two independent random normal variables.
- Discuss when to use an A/B test to evaluate the efficacy of a treatment
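A minimal two-sample t-test sketch with scipy (the groups are synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=100)
treatment = rng.normal(loc=10.8, scale=2.0, size=100)

# H0: equal means; the alternative is two-sided by default.
t_stat, p_value = stats.ttest_ind(control, treatment)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0 at the 5% significance level.")
```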
12. Power
- Define Power and relate it to the Type II error.
- Compute power given a dataset and a problem.
- Explain how the following factors contribute to power:
- sample size
- effect size (difference between sample statistics and statistic formulated under the null)
- significance level
- Identify what can be done to increase power.
- Estimate the sample size required for a test (power analysis) for the one-sample mean or proportion case (see the sketch after this list)
- Solve by hand for the posterior distribution for a uniform prior based on coin flips.
- Solve Discrete Bayes problem with some data
- Explain the difference between Bayesian and frequentist inference with respect to fixed parameters and prior beliefs.
- Define power, and be able to draw the picture of two normal curves with different means, highlighting the region that represents power.
- Explain the trade-off between significance and power
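A minimal power-analysis sketch for the one-sample t-test, using statsmodels (the effect size and alpha are illustrative):

```python
from statsmodels.stats.power import TTestPower

analysis = TTestPower()

# Sample size needed to detect a medium effect (Cohen's d = 0.5)
# at alpha = 0.05 with 80% power.
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"~{n:.0f} observations")

# Conversely, the power achieved with 30 observations.
power = analysis.solve_power(effect_size=0.5, nobs=30, alpha=0.05)
print(f"power = {power:.2f}")
```

Increasing the sample size or the effect size raises power; lowering the significance level reduces it.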
13. Multi-Armed Bandit
- Explain the difference between a frequentist A/B test and a Bayesian A/B test.
- Define and explain prior, likelihood, and posterior.
- Explain what a conjugate prior is and how it applies to A/B testing.
- Analyze an A/B test with the Bayesian approach.
- Explain how multi-armed bandit addresses the tradeoff between exploitation and exploration, and the relationship to regret.
- Write pseudocode for the Multi-Armed Bandit algorithm.
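A minimal epsilon-greedy sketch, one common multi-armed bandit strategy (the payout probabilities are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = [0.05, 0.03, 0.08]   # hypothetical conversion rate per arm
counts = np.zeros(3)          # pulls per arm
values = np.zeros(3)          # running mean reward per arm
epsilon = 0.1

for _ in range(10_000):
    # Explore with probability epsilon; otherwise exploit the best arm so far.
    if rng.random() < epsilon:
        arm = int(rng.integers(3))
    else:
        arm = int(np.argmax(values))
    reward = float(rng.random() < true_p[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print("pulls per arm:", counts)  # most pulls should go to the best arm
```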
14. Linear Algebra in Python
- Perform basic linear algebra operations by hand: multiply matrices, subtract matrices, transpose matrices, and verify inverses.
- Perform linear algebra operations (multiply matrices, transpose matrices, and invert matrices) in numpy (see the sketch after this list).
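A minimal numpy sketch of the operations above:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])

print(A @ B)              # matrix multiplication
print(A - B)              # elementwise subtraction
print(A.T)                # transpose
A_inv = np.linalg.inv(A)  # inverse
print(np.allclose(A @ A_inv, np.eye(2)))  # verify: A @ A^-1 == I
```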
15. Exploratory Data Analysis (EDA)
- Define EDA in your own words.
- Identify the key questions of EDA.
- Perform EDA on a dataset.
16. Linear Regression
- State and troubleshoot the assumptions of the linear regression model. Describe, interpret, and visualize the model form of linear regression: Y = B0 + B1X1 + B2X2 + ...
- Relate Beta vector solution of Ordinary Least Squares to the cost function (residual sum of squares)
- Perform ordinary least squares (OLS) with statsmodels and interpret the output: beta coefficients, p-values, R^2, adjusted R^2, AIC, BIC (see the sketch after this list)
- Explain how to incorporate interactions and categorical variables into linear regression
- Explain how one can detect outliers
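A minimal statsmodels OLS sketch (synthetic data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

# statsmodels does not add an intercept automatically.
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())  # coefficients, p-values, R^2, adj. R^2, AIC, BIC
```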
17. Cross Validation & Regularized Linear Regression
- Perform (one-fold) cross-validation on a dataset (train/test splitting)
- Algorithmically, explain k-fold cross-validation
- Give the reasoning for using k-fold cross-validation
- Given one full model and one regularized model, name 2 appropriate ways to compare the two models. Name 1 inappropriate way.
- Generally, when we increase the flexibility or complexity of a model, what happens to bias? Variance? Training error? Test error?
- Compare and contrast Lasso and Ridge regression.
- What happens to Bias and Variance as we change the following factors: sample size, number of parameters, etc.
- What is the cost function for Ridge? for Lasso?
- Build a test error curve for Ridge regression, varying the alpha parameter, to determine the optimal level of regularization (see the sketch after this list)
- Build and interpret Learning curves for two learning algorithms, one that is overfit (high variance, low bias) and one that is underfit (low variance, high bias)
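A minimal validation-curve sketch for Ridge, varying alpha with cross-validated error (synthetic data):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

alphas = np.logspace(-3, 3, 13)
cv_mse = [
    -cross_val_score(Ridge(alpha=a), X, y,
                     scoring="neg_mean_squared_error", cv=5).mean()
    for a in alphas
]
best = alphas[int(np.argmin(cv_mse))]
print(f"best alpha ~ {best:g}")
```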
18. Logistic Regression
- Place logistic regression in the taxonomy of ML algorithms
- Fit and interpret a logistic regression model in scikit-learn
- Interpret the coefficients of logistic regression using odds ratios (see the sketch after this list)
- Explain ROC curves
- Explain the key differences and similarities between logistic and linear regression.
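A minimal sketch of fitting logistic regression in scikit-learn and reading coefficients as odds ratios (synthetic data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# exp(beta): multiplicative change in the odds per unit increase in a feature.
print("odds ratios:", np.exp(model.coef_))
```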
19. Gradient Descent
- Identify and justify use cases for and failure modes of gradient descent.
- Write pseudocode of the gradient descent and stochastic gradient descent algorithms (see the sketch after this list).
- Compare and contrast batch and stochastic gradient descent - the algorithms, costs, and benefits.
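A minimal batch gradient descent sketch for least squares (the learning rate is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

beta = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = 2 / len(y) * X.T @ (X @ beta - y)  # gradient of mean squared error
    beta -= lr * grad

print(beta)  # should approach [1.0, -2.0, 0.5]
```

For stochastic gradient descent, compute the gradient on a single random observation (or a small mini-batch) per update instead of the full dataset.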
20. Decision Trees
- Thoroughly explain the construction of a decision tree (classification or regression), including selecting an impurity measure (gini, entropy, variance)
- Recognize overfitting and explain pre/post pruning and why it helps.
- Pick the ‘best’ tree via cross-validation, for a given data set.
- Discuss pros and cons
21. k-Nearest Neighbors (kNN)
- Write pseudocode for the kNN algorithm from scratch (see the sketch after this list)
- State differences between kNN regression and classification
- Discuss Pros and Cons of kNN
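A minimal from-scratch kNN classifier sketch (brute-force distances; fine for small data):

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]
    # For kNN regression, return y_train[nearest].mean() instead.

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([4.5, 5.0]), k=3))  # 1
```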
22. Random Forest
- Thoroughly explain the construction of a random forest (classification or regression) algorithm
- Explain the relationship and difference between random forest and bagging.
- Explain why random forests are more accurate than a single decision tree.
- Explain how feature importances are computed for a random forest
- How is OOB error calculated and what is it an estimate of?
23. Boosted Trees
- Define boosting in your own words.
- Be able to interpret boosting output
- List advantages and disadvantages of boosting.
- Compare and contrast boosting with other ensemble methods
- Explain each of the tuning parameters and specifically how they affect the model
- Learn, tune, and score a model using scikit-learn’s boosting class
- Implement AdaBoost
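A minimal sketch that tunes and scores scikit-learn's AdaBoost implementation (the parameter grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Key tuning parameters: number of weak learners and learning rate.
grid = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid={"n_estimators": [50, 200], "learning_rate": [0.1, 1.0]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```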
24. Support Vector Machines (SVM)
- Compute a hyperplane as a decision boundary in SVC
- Explain what a support vector is in plain English
- Recognize that preprocessing, specifically making sure all predictors are on the same scale, is a necessary step
- Explain the role of the hyperparameter C in SVC
- Tune an SVM with an RBF kernel using both hyperparameters C and gamma (see the sketch after this list)
- Tune an SVM with a polynomial kernel using both hyperparameters C and degree
- Describe why generally speaking, an SVM with RBF kernel is more likely to perform well on “tall” data as opposed to “wide” data.
- For SVMs with RBF, state what happens to bias and variance as we increase the hyperparameter “C”. State what happens to bias and variance as we increase the hyperparameter “gamma”.
- State how the “one-vs-one” and “one-vs-rest” approaches for multi-class problems are implemented.
- Describe the kernel trick: computing inner products as if the data were mapped into a high-dimensional space, without explicitly performing the mapping.
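A minimal RBF-SVM tuning sketch, scaling first as noted above (the grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Put all predictors on the same scale before the SVM.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```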
25. Profit Curves
- Describe the issues with imbalanced classes.
- Explain the profit curve method for thresholding.
- Explain sampling methods (e.g., oversampling, undersampling) and give examples of sampling methods.
- Explain how they deal with imbalanced classes.
- Explain cost sensitive learning and how it deals with imbalanced classes.
26. Web Scraping
- Compare and contrast SQL and NoSQL.
- Complete basic operations with MongoDB.
- Explain the basic concepts of HTML.
- Write python code to pull out an element from a web page.
- Fetch data from an existing API
27. Naive Bayes
- Derive the Naive Bayes algorithm and discuss its assumptions (the core equation appears after this list).
- Contrast generative and discriminative models.
- Discuss the pros and cons of Naive Bayes.
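The core Naive Bayes equation (standard form, stated here for reference):

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The "naive" assumption is that the features are conditionally independent given the class; prediction picks the class that maximizes the right-hand side.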
28. NLP
- Identify and explain ways of featurizing text.
- List and explain distance metrics used in document classification.
- Featurize a text corpus in Python using nltk and scikit-learn.
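A minimal featurization sketch with scikit-learn's TF-IDF vectorizer (toy corpus; get_feature_names_out assumes scikit-learn 1.0+):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat chased the dog",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```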
29. Clustering
- List the characteristics of a dataset necessary to perform K-means
- Detail the k-means algorithm in steps, commenting on convergence or lack thereof.
- Use the elbow method to determine K and evaluate the choice (see the sketch after this list)
- Interpret Silhouette plot
- Interpret clusters by examining cluster centers, and exploring the data within each cluster (dataframe inspection, plotting, decision trees for cluster membership)
- Build and interpret a dendrogram using hierarchical clustering.
- Compare and contrast k-means and hierarchical clustering.
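A minimal elbow-method sketch with scikit-learn's KMeans (synthetic blobs):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Within-cluster sum of squares (inertia) for each candidate K;
# look for the "elbow" where the curve flattens.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```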
30. Churn Case Study
- List and explain the steps in CRISP-DM (Cross-Industry Standard Process for Data Mining)
- Perform standard EDA on the case study, including visualizations
- Discuss the ramifications of deleting missing values when the data are:
- MAR (missing at random)
- MCAR (missing completely at random)
- MNAR (missing not at random)
- Explain imputing missing values using at least 2 different methods, and list the pros and cons of each method
- Explain when dropping rows is okay and when dropping features is okay
- Be able to perform the feature engineering process
- Be able to identify target leakage, and explain why it happens
- State an appropriate business goal and evaluation metric
31. Dimensionality Reduction
- List reasons for reducing the dimensions.
- Describe how the principal components are constructed in PCA.
- Interpret the principal components of PCA.
- Determine how many principal components to keep.
- Describe the relationship between PCA and SVD.
- Compute and interpret PCA using sklearn.
- Memorize the eigenvalue equation: Av = λv (for PCA, Σv = λv, where Σ is the covariance matrix)
32. NMF
- Write down and explain the NMF equation: V ≈ WH, where all entries of W and H are nonnegative.
- Compare and contrast NMF, SVD, and PCA, and k-means
- Implement Alternating-Least-Squares algorithm for NMF
- Find and interpret latent topics in a corpus of documents with NMF
- Explain how to interpret the H matrix and the W matrix.
- Explain regularization in the context of NMF.
33. Recommender Systems
- Survey approaches to recommenders, their pros & cons, and when each is likely to be best.
- Describe the cold start problem and know how it affects different recommendation strategies
- Explain either the collaborative filtering algorithm or the matrix factorization recommender algorithm.
- Discuss recommender evaluation.
- Discuss performance concerns for recommenders.
34. Graphs
- Define a graph and discuss the implementation.
- List common applications of graph models.
- Discuss the searching algorithms and applications of them.
- Explain the various ways of measuring the importance of a node.
- Explain methods and applications of clustering on a graph.
- Use an appropriate package (e.g., networkx) to build a graph data structure in Python and execute common algorithms (shortest path, connected components, …); see the sketch after this list.
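A minimal networkx sketch of the graph operations above (toy graph; assumes networkx is installed):

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "e")])

print(nx.shortest_path(G, "a", "e"))      # shortest path between two nodes
print(list(nx.connected_components(G)))   # connected components
print(nx.degree_centrality(G))            # one measure of node importance
```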
35. Cloud Computing
- Scope and configure a data science environment on AWS.
- Protect AWS resources against unauthorized access.
- Manage AWS resources using awscli, ssh, scp, or boto3.
- Monitor and control costs incurred on AWS.
36. Parallel Computing
- Define and contrast processes vs. threads
- Define and contrast parallelism and concurrency.
- Recognize problems that require parallelism or concurrency
- Implement parallel and concurrent solutions (see the sketch after this list)
- Instrument approaches to see the benefit of threading/parallelism.
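A minimal parallelism sketch with the standard-library multiprocessing module (the worker function is illustrative):

```python
from multiprocessing import Pool

def square(x):
    # Stand-in for a CPU-bound task worth parallelizing.
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # four worker processes
        results = pool.map(square, range(10))
    print(results)
```

Threads share memory within one process (useful for I/O-bound concurrency); separate processes sidestep the GIL for CPU-bound parallelism.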
37. MapReduce
- Explain the types of problems that benefit from MapReduce
- Describe map-reduce, and how it relates to Hadoop
- Explain how to select the number of mappers and reducers
- Describe the role of keys in MapReduce
- Perform MapReduce in python using MRJob.
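A minimal MRJob word-count sketch (the canonical MapReduce example; assumes mrjob is installed):

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) pairs; the word is the key.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # All values for the same key arrive at the same reducer.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```

Run locally with, e.g., `python word_count.py input.txt` (the filenames are hypothetical).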
38. Time Series
- Recognize when time series analysis could be applied
- Define key time series concepts
- Determine structure of a time-series using graphical tools
- Compute a forecast using Box-Jenkins Methodology
- Evaluate models/forecasts using cross validation and statistical tests
- Engineer features to handle seasonal, calendar, and periodic components
- Explain taxonomy of exponential smoothing using ETS framework
39. Spark
- Configure a machine to use Spark effectively
- Describe differences and similarities between MapReduce and Spark
- Get data into Spark for processing.
- Describe lazy evaluation in the context of Spark.
- Spark records transformations without executing them, and only runs the accumulated plan when an action is called.
- Cache RDDs effectively to improve performance.
- Use Spark to compute basic statistics
- Know the difference between Spark data types: RDD, DataFrame, DAG
  - RDD (Resilient Distributed Dataset): data distributed across a cluster as partitions (chunks); can recover from the loss of individual partitions, nodes, or slow processes; partitions are immutable and their lineage is traceable for repeatability.
  - DataFrame: a distributed table of rows with named columns and a schema, built on top of RDDs and optimized by Spark's query planner.
  - DAG (Directed Acyclic Graph): the plan of transformations that Spark builds lazily and executes when an action is triggered.
- Use MLlib: https://spark.apache.org/docs/latest/ml-guide.html
40. SQL in Spark
- Identify what distinguishes a Spark DataFrame from an RDD
- Explain how to create a Spark DataFrame
- Query a DataFrame with SQL (see the sketch after this list)
- Transform a DataFrame with DataFrame methods
- Describe the challenges and requirements of saving schema’d datasets.
- Use user-defined functions
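A minimal PySpark sketch that builds a DataFrame, registers a temp view, and runs the same query both ways (assumes a local Spark installation):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("NYC", 10.0), ("NYC", 12.0), ("SF", 15.0)],
    ["city", "price"],
)

# DataFrame methods...
df.groupBy("city").avg("price").show()

# ...and the equivalent SQL on a temporary view.
df.createOrReplaceTempView("sales")
spark.sql("SELECT city, AVG(price) AS avg_price FROM sales GROUP BY city").show()
```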
41. Data Products
- Explain REST architecture/API
- Write a basic Flask API (see the sketch after this list)
- Describe web architecture at a high level
- Know the role of JavaScript in a web application
- Know how to use developer tools to inspect an application
- Write a basic Flask web application
- Be able to describe the difference between online and offline computation
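A minimal Flask API sketch (the route and payload shape are hypothetical; a real service would call a trained model):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Hypothetical payload: {"values": [1, 2, 3]}; a stub stands in for a model.
    features = request.get_json()
    return jsonify({"prediction": sum(features.get("values", []))})

if __name__ == "__main__":
    app.run(port=5000, debug=True)
```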
42. Fraud Case Study
- Build an MVP (minimum viable product) quickly
- Build a dashboard
- Build system to take in online data from a stream
- Build production-quality product
43. Whiteboarding
- Explain the meaning of Big-Oh.
- Big-Oh is a way to bound the worst-case growth of an algorithm's runtime (or memory) as the input size grows.
- Cousin to Big-Omega (lower bound) and Big-Theta (tight bound)
- f(x) = O(g(x)) means there exist constants c > 0 and x0 such that |f(x)| ≤ c·g(x) for all x ≥ x0
- MIT big oh lecture
- A gentle intro to computational complexity
- Computational complexity theory (Stanford)
- Beginner’s guide to design patterns
- http://www.oodesign.com/
- Analyze the runtime of code.
- Solve whiteboarding interview questions.
- Apply different techniques to addressing a whiteboarding interview problem
44. Business Analytics
- Explain funnel metrics and applications
- Identify red flags in a set of funnel metrics
- Identify and discuss appropriate use cases for cohort analysis
- Identify and explain the limits of data analysis
- Given an open ended question, identify the business goal, metrics, and relevant data science solution.
- Identify excessive or improper use of data analysis
- Explain how data science is used in industry
- Understand the range of business problems where A/B testing applies