Data Science Standards
Questions from the Galvanize Data Science Intensive
Answers from my own education and research.
What is a Standard?
Standards are the core competencies of data scientists: the knowledge, skills, and habits needed to be hirable.
Standards by Topic
1. Python
- Explain the difference between mutable and immutable types and their relationship to dictionaries.
- Compare the strengths and weaknesses of lists vs. dictionaries.
- Choose the appropriate collection (dict, Counter, defaultdict) to simplify a problem (see the sketch after this list).
- Compare the strengths and weaknesses of lists vs. generators.
- Write pythonic code.
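A minimal sketch of choosing the right collection, as referenced above (the data is illustrative):

```python
from collections import Counter, defaultdict

words = ["a", "bb", "a", "ccc", "bb", "a"]

# Counter: counting hashable items.
print(Counter(words).most_common(2))   # [('a', 3), ('bb', 2)]

# defaultdict: grouping without key-existence checks.
groups = defaultdict(list)
for w in words:
    groups[len(w)].append(w)

# Plain dict (comprehension): a simple key -> value mapping.
lengths = {w: len(w) for w in set(words)}
```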
2. Version Control / Git
- Explain the basic function and purpose of version control.
- Version control lets you modify code while always keeping a recoverable history of earlier versions. It also allows multiple programmers to work on the same code simultaneously, improving efficiency.
- Use a basic Git workflow to track project changes over time, share code, and write useful commit messages.
3. OOP
- Given the code for a python class, instantiate a python object, call its methods, and list its attributes.
- Write the python code for a simple class.
```python
class ClassName:
    """A minimal example class."""

    _c_map = {"hearts": "red", "spades": "black"}  # class-level lookup table

    def __init__(self, parameter1, p2):
        self.var = parameter1
        self.var2 = p2

    def __repr__(self):
        """Return a text description."""
        return "{} of {}".format(self.var, self.var2)

    @property  # read-only field; can't be changed without adding @color.setter
    def color(self):
        return self._c_map[self.var2]

    def __call__(self, elem):
        """Perform the map operation on an element."""
        return self._impl(elem)

    def _impl(self, elem):
        """Default implementation; subclasses override this."""
        return elem

    def method_name(self, p3):
        self.var2 += self.var * p3
```
- Match key “magic” methods to their functionality. Full list/tutorial here: http://www.diveintopython3.net/special-method-names.html
- Design a program or algorithm in object-oriented fashion.
- Compare and contrast functional and object oriented programming.
- In OOP, the programmer defines a new data type (a class) with its own attributes (data) and methods (operations); the class is then used to create objects.
- In functional programming, the program is built from functions and does not organize data around a new class.
- Functional code is often faster and uses less memory, but it can be less organized and harder for humans to interpret when a lot is going on. A side-by-side sketch follows this list.
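A minimal sketch contrasting the two styles (the running-total example is illustrative, not from the original):

```python
from functools import reduce

# Object-oriented: state and behavior bundled into a class.
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x

acc = Accumulator()
for x in [1, 2, 3]:
    acc.add(x)

# Functional: plain functions, no mutable object state.
total = reduce(lambda running, x: running + x, [1, 2, 3], 0)

assert acc.total == total == 6
```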
4. SQL
- Connect to a SQL database via the command line (e.g., Postgres).
- Connect to a database from within a python program (see the sketch after this list).
- State the function of basic SQL commands.
- Write simple queries on a single table, including SELECT, FROM, WHERE, and CASE clauses and aggregates.
- Write complex queries, including JOINs and subqueries.
- Explain how indexing works in Postgres.
- Create and dump tables.
- Format a query to follow a standard style.
- Move data from SQL database to text file.
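A minimal sketch of connecting to Postgres from Python and dumping query results to a text file, assuming the psycopg2 driver; the database name, credentials, and table are hypothetical:

```python
import csv
import psycopg2  # assumes: pip install psycopg2-binary

# Hypothetical connection parameters.
conn = psycopg2.connect(dbname="example_db", user="postgres", host="localhost")
cur = conn.cursor()

# A simple single-table query with an aggregate and a CASE clause.
cur.execute("""
    SELECT user_id,
           COUNT(*) AS n_orders,
           SUM(CASE WHEN amount > 100 THEN 1 ELSE 0 END) AS n_large
    FROM orders
    GROUP BY user_id;
""")

# Move the result set to a text file.
with open("orders_summary.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # column names
    writer.writerows(cur.fetchall())

cur.close()
conn.close()
```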
5. Pandas
- Explain/use the relationship between DataFrame and Series
- Know how to set and reset indexes
- Use loc, iloc, ix, and iat appropriately
  - loc: label-based indexing.
  - iloc: integer-position-based indexing.
  - ix: both label- and integer-based (deprecated in modern pandas).
  - iat: fast scalar access by integer position, e.g. df.iat[x, y].
- Use index alignment and know when it applies
- Use Split-Apply-Combine methods (see the sketch after this list)
- Be able to read and write data with pandas
- Recognize problems that can probably be solved with Pandas (as opposed to writing vanilla Python functions).
- Use basic DateTimeIndex functionality
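A minimal split-apply-combine sketch (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "SF", "SF"],
    "price": [10.0, 12.0, 15.0, 11.0],
})

# Split on city, apply aggregations, combine into a new frame.
summary = df.groupby("city")["price"].agg(["mean", "count"])
print(summary)
```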
6. Plotting
- Describe the architecture of a matplotlib figure
- Plot in and outside of notebooks with matplotlib and seaborn
- Combine multiple datasets/categories in same plot
- Use subplots effectively (see the sketch after this list)
- Plot with Pandas
- Use and explain scatter_matrix output
- Use and explain a correlation heatmap
- Visualize pairwise relationships with seaborn
- Compare within-class distributions
- Use matplotlib techniques with seaborn
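A minimal sketch of a matplotlib figure with subplots, with seaborn styling layered on top (the data is synthetic; sns.set_theme assumes seaborn 0.11+):

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set_theme()  # seaborn styling on top of matplotlib

x = np.linspace(0, 10, 100)

# A Figure is the top-level container; each Axes is one plot.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(x, np.sin(x), label="sin")
axes[0].legend()
axes[1].scatter(x, np.cos(x), alpha=0.5)
axes[1].set_title("cos")
fig.tight_layout()
plt.show()
```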
7. Visualization
- Explain the difference between exploratory and explanatory visualizations.
- Exploratory visualizations are used to understand the nature of the dataset, while explanatory visualizations communicate the reasoning behind the actions performed on the data.
- Explain what a visualization is
- Don’t lie with data
- Visualize multidimensional relationships with data using position, size, color, alpha, facets.
- Create an explanatory visualization that makes a relationship in data explicit.
8. Workflow
- Perform basic file operations from the command line, while consulting man/help/Google if necessary.
- Get help using man (e.g., man grep)
- Perform “survival” edits using vi, emacs, nano, or pico
- Configure environment & aliases in .bashrc/.bash_profile/.profile
- Install data science stack
- Manage a process with job control
- Examine system performance and kill processes
- Work on a remote machine with ssh/scp
- State what an RE (regular expression) is and write a simple one
- State the features and use cases of grep/sed/awk/cut/paste to process/clean a text file
9. Probability
- Define what a random variable is.
- Explain difference between permutations and combinations.
- Recite and perform major probability laws from memory:
- Bayes Rule
- LOTP
- Chain Rule
- Recite and perform major random variable formulas from memory:
- E(X)
- Var(X)
- Cov(X,Y)
- Describe what a joint distribution is and be able to perform a simple calculation using joint distribution.
- Define each major probability distribution and give 1 clear example of each
- Explain independence of 2 r.v.’s and implications with respect to probability formulas, covariance formulas, etc.
- Compute the expectation of aX + bY, where X and Y are random variables, and explain that expectation is a linear operator
- Compute the variance of aX + bY (see the identities after this list)
- Discuss why correlation is not causation
- Describe correlation and its perils, with reference to Anscombe’s quartet
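The two identities above, written out (standard results, stated here for reference):

```latex
E[aX + bY] = a\,E[X] + b\,E[Y]
\qquad
\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)
```

If X and Y are independent, Cov(X, Y) = 0 and the cross term drops out.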
10. Sampling
- Compute MLE estimate for simple example (such as coin-flipping)
- Pseudocode bootstrapping for a given sample of size N (see the sketch after this list).
- Construct confidence interval for case where parametric construction does not work
- Discuss examples of times when you need bootstrapping.
- Define the Central Limit Theorem
- Compute standard error
- Compare and contrast the use cases of parametric and nonparametric estimation
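A minimal bootstrap sketch for a percentile confidence interval on the mean (the sample is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=200)  # synthetic, non-normal data

# Resample with replacement many times; collect the statistic each time.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# Percentile confidence interval.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```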
11. Hypothesis Testing
- Given a dataset, set up a null and alternative hypothesis, and calculate and interpret the p-value for the difference of means or proportions (a two-sample t-test sketch follows this list).
- Given a dataset, set up a null and alternative hypothesis, and calculate and interpret the p-value for Chi-square test of independence
- Describe a situation in which a one-tailed test would be appropriate (vs. a two-tailed test).
- Given a particular situation, correctly choose among the following options:
- z-test
- t-test
- 2 sample t-test (one-sided and two-sided)
- 2 sample z-test (one-sided and two-sided)
- Define p-value, Type I error, Type II error, significance level and discuss their significance in an example problem.
- Type I error (false positive): rejecting the null hypothesis when it is true.
- Type II error (false negative): failing to reject the null hypothesis when it is false.
- Account for the multiple comparisons problem via Bonferroni correction.
- Compute the difference of two independent random normal variables.
- Discuss when to use an A/B test to evaluate the efficacy of a treatment
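A minimal two-sample t-test sketch with scipy (the groups are synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=100)
treatment = rng.normal(loc=10.8, scale=2.0, size=100)

# H0: equal means; the alternative is two-sided by default.
t_stat, p_value = stats.ttest_ind(control, treatment)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0 at the 5% significance level.")
```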
12. Power
- Define Power and relate it to the Type II error.
- Compute power given a dataset and a problem.
- Explain how the following factors contribute to power:
- sample size
- effect size (difference between sample statistics and statistic formulated under the null)
- significance level
- Identify what can be done to increase power.
- Estimate the sample size required for a test (power analysis) for the one-sample mean or proportion case (see the sketch after this list)
- Solve by hand for the posterior distribution for a uniform prior based on coin flips.
- Solve Discrete Bayes problem with some data
- Explain the difference between Bayesian and frequentist inference with respect to fixed parameters and prior beliefs.
- Define power, and be able to draw the picture of two normal curves with different means, highlighting the region that represents power.
- Explain the trade-off between significance and power
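A minimal power-analysis sketch for the one-sample t-test, using statsmodels (the effect size and alpha are illustrative):

```python
from statsmodels.stats.power import TTestPower

analysis = TTestPower()

# Sample size needed to detect a medium effect (Cohen's d = 0.5)
# at alpha = 0.05 with 80% power.
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"~{n:.0f} observations")

# Conversely, the power achieved with 30 observations.
power = analysis.solve_power(effect_size=0.5, nobs=30, alpha=0.05)
print(f"power = {power:.2f}")
```

Increasing the sample size or the effect size raises power; lowering the significance level reduces it.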
13. Multi-Armed Bandit
- Explain the difference between a frequentist A/B test and a Bayesian A/B test.
- Define and explain prior, likelihood, and posterior.
- Explain what a conjugate prior is and how it applies to A/B testing.
- Analyze an A/B test with the Bayesian approach.
- Explain how multi-armed bandit addresses the tradeoff between exploitation and exploration, and the relationship to regret.
- Write pseudocode for the Multi-Armed Bandit algorithm.
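A minimal epsilon-greedy sketch, one common multi-armed bandit strategy (the payout probabilities are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = [0.05, 0.03, 0.08]   # hypothetical conversion rate per arm
counts = np.zeros(3)          # pulls per arm
values = np.zeros(3)          # running mean reward per arm
epsilon = 0.1

for _ in range(10_000):
    # Explore with probability epsilon; otherwise exploit the best arm so far.
    if rng.random() < epsilon:
        arm = int(rng.integers(3))
    else:
        arm = int(np.argmax(values))
    reward = float(rng.random() < true_p[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print("pulls per arm:", counts)  # most pulls should go to the best arm
```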
14. Linear Algebra in Python
- Perform basic linear algebra operations by hand: multiply matrices, subtract matrices, transpose matrices, and verify inverses.
- Perform linear algebra operations (multiply matrices, transpose matrices, and invert matrices) in numpy (see the sketch after this list).
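A minimal numpy sketch of the operations above:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])

print(A @ B)              # matrix multiplication
print(A - B)              # elementwise subtraction
print(A.T)                # transpose
A_inv = np.linalg.inv(A)  # inverse
print(np.allclose(A @ A_inv, np.eye(2)))  # verify: A @ A^-1 == I
```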
15. Exploratory Data Analysis (EDA)
- Define EDA in your own words.
- Identify the key questions of EDA.
- Perform EDA on a dataset.
16. Linear Regression
- State and troubleshoot the assumptions of the linear regression model. Describe, interpret, and visualize the model form of linear regression: Y = B0 + B1X1 + B2X2 + ...
- Relate Beta vector solution of Ordinary Least Squares to the cost function (residual sum of squares)
- Perform ordinary least squares (OLS) with statsmodels and interpret the output: beta coefficients, p-values, R^2, adjusted R^2, AIC, BIC (see the sketch after this list)
- Explain how to incorporate interactions and categorical variables into linear regression
- Explain how one can detect outliers
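A minimal statsmodels OLS sketch (synthetic data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

# statsmodels does not add an intercept automatically.
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())  # coefficients, p-values, R^2, adj. R^2, AIC, BIC
```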
17. Cross Validation & Regularized Linear Regression
- Perform (one-fold) cross-validation on a dataset (train/test splitting)
- Algorithmically, explain k-fold cross-validation
- Give the reasoning for using k-fold cross-validation
- Given one full model and one regularized model, name 2 appropriate ways to compare the two models. Name 1 inappropriate way.
- Generally, when we increase the flexibility or complexity of a model, what happens to bias? Variance? Training error? Test error?
- Compare and contrast Lasso and Ridge regression.
- What happens to Bias and Variance as we change the following factors: sample size, number of parameters, etc.
- What is the cost function for Ridge? for Lasso?
- Build a test error curve for Ridge regression, varying the alpha parameter, to determine the optimal level of regularization (see the sketch after this list)
- Build and interpret Learning curves for two learning algorithms, one that is overfit (high variance, low bias) and one that is underfit (low variance, high bias)
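A minimal validation-curve sketch for Ridge, varying alpha with cross-validated error (synthetic data):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

alphas = np.logspace(-3, 3, 13)
cv_mse = [
    -cross_val_score(Ridge(alpha=a), X, y,
                     scoring="neg_mean_squared_error", cv=5).mean()
    for a in alphas
]
best = alphas[int(np.argmin(cv_mse))]
print(f"best alpha ~ {best:g}")
```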
18. Logistic Regression
- Place logistic regression in the taxonomy of ML algorithms
- Fit and interpret a logistic regression model in scikit-learn
- Interpret the coefficients of logistic regression using odds ratios (see the sketch after this list)
- Explain ROC curves
- Explain the key differences and similarities between logistic and linear regression.
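A minimal sketch of fitting logistic regression in scikit-learn and reading coefficients as odds ratios (synthetic data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# exp(beta): multiplicative change in the odds per unit increase in a feature.
print("odds ratios:", np.exp(model.coef_))
```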
19. Gradient Descent
- Identify and justify use cases for and failure modes of gradient descent.
- Write pseudocode of the gradient descent and stochastic gradient descent algorithms (see the sketch after this list).
- Compare and contrast batch and stochastic gradient descent - the algorithms, costs, and benefits.
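A minimal batch gradient descent sketch for least squares (the learning rate is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

beta = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = 2 / len(y) * X.T @ (X @ beta - y)  # gradient of mean squared error
    beta -= lr * grad

print(beta)  # should approach [1.0, -2.0, 0.5]
```

For stochastic gradient descent, compute the gradient on a single random observation (or a small mini-batch) per update instead of the full dataset.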
20. Decision Trees
- Thoroughly explain the construction of a decision tree (classification or regression), including selecting an impurity measure (gini, entropy, variance)
- Recognize overfitting and explain pre/post pruning and why it helps.
- Pick the ‘best’ tree via cross-validation, for a given data set.
- Discuss pros and cons
21. k-Nearest Neighbors (kNN)
- Write pseudocode for the kNN algorithm from scratch (see the sketch after this list)
- State differences between kNN regression and classification
- Discuss Pros and Cons of kNN
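A minimal from-scratch kNN classifier sketch (brute-force distances; fine for small data):

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]
    # For kNN regression, return y_train[nearest].mean() instead.

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([4.5, 5.0]), k=3))  # 1
```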
22. Random Forest
- Thoroughly explain the construction of a random forest (classification or regression) algorithm
- Explain the relationship and difference between random forest and bagging.
- Explain why random forests are more accurate than a single decision tree.
- Explain how feature importances are computed for a random forest
- How is OOB error calculated and what is it an estimate of?
23. Boosted Trees
- Define boosting in your own words.
- Be able to interpret boosting output
- List advantages and disadvantages of boosting.
- Compare and contrast boosting with other ensemble methods
- Explain each of the tuning parameters and specifically how they affect the model
- Learn, tune, and score a model using scikit-learn’s boosting class
- Implement AdaBoost
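A minimal sketch that tunes and scores scikit-learn's AdaBoost implementation (the parameter grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Key tuning parameters: number of weak learners and learning rate.
grid = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid={"n_estimators": [50, 200], "learning_rate": [0.1, 1.0]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```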
24. Support Vector Machines (SVM)
- Compute a hyperplane as a decision boundary in SVC
- Explain what a support vector is in plain English
- Recognize that preprocessing, specifically making sure all predictors are on the same scale, is a necessary step
- Explain the role of the hyperparameter C in SVC
- Tune an SVM with an RBF kernel using both hyperparameters C and gamma (see the sketch after this list)
- Tune an SVM with a polynomial kernel using both hyperparameters C and degree
- Describe why generally speaking, an SVM with RBF kernel is more likely to perform well on “tall” data as opposed to “wide” data.
- For SVMs with RBF, state what happens to bias and variance as we increase the hyperparameter “C”. State what happens to bias and variance as we increase the hyperparameter “gamma”.
- State how the “one-vs-one” and “one-vs-rest” approaches for multi-class problems are implemented.
- Describe the kernel trick: computing inner products as if the data were mapped into a high-dimensional space, without explicitly performing the mapping.
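A minimal RBF-SVM tuning sketch, scaling first as noted above (the grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Put all predictors on the same scale before the SVM.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```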
25. Profit Curves
- Describe the issues with imbalanced classes.
- Explain the profit curve method for thresholding.
- Explain sampling methods (e.g., oversampling, undersampling) and give examples of sampling methods.
- Explain how they deal with imbalanced classes.
- Explain cost sensitive learning and how it deals with imbalanced classes.
26. Web Scraping
- Compare and contrast SQL and NoSQL.
- Complete basic operations with MongoDB.
- Explain the basic concepts of HTML.
- Write python code to pull out an element from a web page.
- Fetch data from an existing API
27. Naive Bayes
- Derive the Naive Bayes algorithm and discuss its assumptions (the core equation appears after this list).
- Contrast generative and discriminative models.
- Discuss the pros and cons of Naive Bayes.
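The core Naive Bayes equation (standard form, stated here for reference):

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The "naive" assumption is that the features are conditionally independent given the class; prediction picks the class that maximizes the right-hand side.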
28. NLP
- Identify and explain ways of featurizing text.
- List and explain distance metrics used in document classification.
- Featurize a text corpus in Python using nltk and scikit-learn.
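A minimal featurization sketch with scikit-learn's TF-IDF vectorizer (toy corpus; get_feature_names_out assumes scikit-learn 1.0+):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat chased the dog",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```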
29. Clustering
- List the characteristics of a dataset necessary to perform K-means
- Detail the k-means algorithm in steps, commenting on convergence or lack thereof.
- Use the elbow method to determine K and evaluate the choice (see the sketch after this list)
- Interpret Silhouette plot
- Interpret clusters by examining cluster centers, and exploring the data within each cluster (dataframe inspection, plotting, decision trees for cluster membership)
- Build and interpret a dendrogram using hierarchical clustering.
- Compare and contrast k-means and hierarchical clustering.
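A minimal elbow-method sketch with scikit-learn's KMeans (synthetic blobs):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Within-cluster sum of squares (inertia) for each candidate K;
# look for the "elbow" where the curve flattens.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```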
30. Churn Case Study
- List and explain the steps in CRISP-DM (Cross-Industry Standard Process for Data Mining)
- Perform standard EDA on the case study, including visualizations
- Discuss the ramifications of deleting missing values when the data are:
- MAR (missing at random)
- MCAR (missing completely at random)
- MNAR (missing not at random)
- Explain imputing missing values using at least 2 different methods, and list the pros and cons of each method
- Explain when dropping rows is okay and when dropping features is okay
- Be able to perform the feature engineering process
- Be able to identify target leakage, and explain why it happens
- State an appropriate business goal and evaluation metric
31. Dimensionality Reduction
- List reasons for reducing the dimensions.
- Describe how the principal components are constructed in PCA.
- Interpret the principal components of PCA.
- Determine how many principal components to keep.
- Describe the relationship between PCA and SVD.
- Compute and interpret PCA using sklearn.
- Memorize the eigenvalue equation: Av = λv (for PCA, Σv = λv, where Σ is the covariance matrix)
32. NMF
- Write down and explain the NMF equation: V ≈ WH, where all entries of W and H are nonnegative.
- Compare and contrast NMF, SVD, and PCA, and k-means
- Implement Alternating-Least-Squares algorithm for NMF
- Find and interpret latent topics in a corpus of documents with NMF
- Explain how to interpret the H matrix and the W matrix.
- Explain regularization in the context of NMF.
33. Recommender Systems
- Survey approaches to recommenders, their pros & cons, and when each is likely to be best.
- Describe the cold start problem and know how it affects different recommendation strategies
- Explain either the collaborative filtering algorithm or the matrix factorization recommender algorithm.
- Discuss recommender evaluation.
- Discuss performance concerns for recommenders.
34. Graphs
- Define a graph and discuss the implementation.
- List common applications of graph models.
- Discuss the searching algorithms and applications of them.
- Explain the various ways of measuring the importance of a node.
- Explain methods and applications of clustering on a graph.
- Use an appropriate package (e.g., networkx) to build a graph data structure in Python and execute common algorithms (shortest path, connected components, …); see the sketch after this list.
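A minimal networkx sketch of the graph operations above (toy graph; assumes networkx is installed):

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "e")])

print(nx.shortest_path(G, "a", "e"))      # shortest path between two nodes
print(list(nx.connected_components(G)))   # connected components
print(nx.degree_centrality(G))            # one measure of node importance
```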
35. Cloud Computing
- Scope and configure a data science environment on AWS.
- Protect AWS resources against unauthorized access.
- Manage AWS resources using awscli, ssh, scp, or boto3.
- Monitor and control costs incurred on AWS.
36. Parallel Computing
- Define and contrast processes vs. threads
- Define and contrast parallelism and concurrency.
- Recognize problems that require parallelism or concurrency
- Implement parallel and concurrent solutions (see the sketch after this list)
- Instrument approaches to see the benefit of threading/parallelism.
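A minimal parallelism sketch with the standard-library multiprocessing module (the worker function is illustrative):

```python
from multiprocessing import Pool

def square(x):
    # Stand-in for a CPU-bound task worth parallelizing.
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # four worker processes
        results = pool.map(square, range(10))
    print(results)
```

Threads share memory within one process (useful for I/O-bound concurrency); separate processes sidestep the GIL for CPU-bound parallelism.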
37. MapReduce
- Explain the types of problems that benefit from MapReduce
- Describe map-reduce, and how it relates to Hadoop
- Explain how to select the number of mappers and reducers
- Describe the role of keys in MapReduce
- Perform MapReduce in python using MRJob.
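A minimal MRJob word-count sketch (the canonical MapReduce example; assumes mrjob is installed):

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) pairs; the word is the key.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # All values for the same key arrive at the same reducer.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```

Run locally with, e.g., `python word_count.py input.txt` (the filenames are hypothetical).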
38. Time Series
- Recognize when time series analysis could be applied
- Define key time series concepts
- Determine structure of a time-series using graphical tools
- Compute a forecast using Box-Jenkins Methodology
- Evaluate models/forecasts using cross validation and statistical tests
- Engineer features to handle seasonal, calendar, and periodic components
- Explain taxonomy of exponential smoothing using ETS framework
39. Spark
- Configure a machine to use Spark effectively
- Describe differences and similarities between MapReduce and Spark
- Get data into Spark for processing.
- Describe lazy evaluation in the context of Spark.
- Spark records transformations without executing them, and only runs the accumulated plan when an action is called.
- Cache RDDs effectively to improve performance.
- Use Spark to compute basic statistics
- Know the difference between Spark data types: RDD, DataFrame, DAG
  - RDD (Resilient Distributed Dataset): data distributed across a cluster as partitions (chunks); can recover from the loss of individual partitions, nodes, or slow processes; partitions are immutable and their lineage is traceable for repeatability.
  - DataFrame: a distributed table of rows with named columns and a schema, built on top of RDDs and optimized by Spark's query planner.
  - DAG (Directed Acyclic Graph): the plan of transformations that Spark builds lazily and executes when an action is triggered.
- Use MLlib: https://spark.apache.org/docs/latest/ml-guide.html
40. SQL in Spark
- Identify what distinguishes a Spark DataFrame from an RDD
- Explain how to create a Spark DataFrame
- Query a DataFrame with SQL (see the sketch after this list)
- Transform a DataFrame with DataFrame methods
- Describe the challenges and requirements of saving schema’d datasets.
- Use user-defined functions
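A minimal PySpark sketch that builds a DataFrame, registers a temp view, and runs the same query both ways (assumes a local Spark installation):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("NYC", 10.0), ("NYC", 12.0), ("SF", 15.0)],
    ["city", "price"],
)

# DataFrame methods...
df.groupBy("city").avg("price").show()

# ...and the equivalent SQL on a temporary view.
df.createOrReplaceTempView("sales")
spark.sql("SELECT city, AVG(price) AS avg_price FROM sales GROUP BY city").show()
```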
41. Data Products
- Explain REST architecture/API
- Write a basic Flask API (see the sketch after this list)
- Describe web architecture at a high level
- Know the role of JavaScript in a web application
- Know how to use developer tools to inspect an application
- Write a basic Flask web application
- Be able to describe the difference between online and offline computation
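A minimal Flask API sketch (the route and payload shape are hypothetical; a real service would call a trained model):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Hypothetical payload: {"values": [1, 2, 3]}; a stub stands in for a model.
    features = request.get_json()
    return jsonify({"prediction": sum(features.get("values", []))})

if __name__ == "__main__":
    app.run(port=5000, debug=True)
```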
42. Fraud Case Study
- Build an MVP (minimum viable product) quickly
- Build a dashboard
- Build system to take in online data from a stream
- Build production-quality product
43. Whiteboarding
- Explain the meaning of Big-Oh.
- Big-Oh is a way to bound the worst-case growth of an algorithm's runtime (or memory) as the input size grows.
- Cousin to Big-Omega (lower bound) and Big-Theta (tight bound)
- f(x) = O(g(x)) means there exist constants c > 0 and x0 such that |f(x)| ≤ c·g(x) for all x ≥ x0
- MIT big oh lecture
- A gentle intro to computational complexity
- Computational complexity theory (Stanford)
- Beginner’s guide to design patterns
- http://www.oodesign.com/
- Analyze the runtime of code.
- Solve whiteboarding interview questions.
- Apply different techniques to addressing a whiteboarding interview problem
44. Business Analytics
- Explain funnel metrics and applications
- Identify red flags in a set of funnel metrics
- Identify and discuss appropriate use cases for cohort analysis
- Identify and explain the limits of data analysis
- Given an open ended question, identify the business goal, metrics, and relevant data science solution.
- Identify excessive or improper use of data analysis
- Explain how data science is used in industry
- Understand the range of business problems where A/B testing applies