Dan Vatterott

Data Scientist

Aggregating Sparse and Dense Vectors in Pyspark

Many (if not all) of pyspark’s machine learning algorithms require that the input data be concatenated into a single column (using the VectorAssembler). This is all well and good, but applying non-machine-learning operations (e.g., any aggregations) to data in this format can be a real pain. Here, I describe how to aggregate (average, in this case) data held in sparse and dense vectors.
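
If you haven’t seen that step before, here’s a minimal sketch of how the single features column usually gets built; the input column names and raw_df below are made up for illustration.

from pyspark.ml.feature import VectorAssembler

# combine several numeric columns into one vector column called 'features'
assembler = VectorAssembler(inputCols=['feat_1', 'feat_2', 'feat_3'], outputCol='features')
assembled_df = assembler.transform(raw_df)  # raw_df is a hypothetical dataframe with those columns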

I start by importing the necessary libraries and creating a spark dataframe, which includes a column of sparse vectors. Note that I am using ml.linalg SparseVector and not the SparseVector from mllib. This makes a big difference!

from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.ml.linalg import SparseVector, DenseVector
# note that we're using the Sparse and Dense Vectors from ml.linalg. There are other Sparse/Dense vectors in spark.

df = sc.parallelize([
  (1, SparseVector(10, {1: 1.0, 2: 1.0, 3: 2.0, 4: 1.0, 5: 3.0})),
  (2, SparseVector(10, {9: 100.0})),
  (3, SparseVector(10, {1: 1.0})),
]).toDF(["row_num", "features"])

df.show()
row_num features
1 (10,[1,2,3,4,5],[1.0, 1.0, 2.0, 1.0, 3.0])
2 (10,[9],[100.0])
3 (10,[1],[1.0])

Next, I write a udf that changes the sparse vector into a dense vector and then changes the dense vector into a python list. The python list is turned into a spark array when it comes out of the udf.

def sparse_to_array(v):
  v = DenseVector(v)
  new_array = list([float(x) for x in v])
  return new_array

sparse_to_array_udf = F.udf(sparse_to_array, T.ArrayType(T.FloatType()))

df = df.withColumn('features_array', sparse_to_array_udf('features'))
df.show()
row_num features features_array
1 (10,[1,2,3,4,5],[1.0, 1.0, 2.0, 1.0, 3.0]) [0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0]
2 (10,[9],[100.0]) [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0]
3 (10,[1],[1.0]) [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Now that the data is in a pyspark array, we can apply the desired pyspark aggregation to each item in the array.

df_agg = df.agg(F.array(*[F.avg(F.col('features_array')[i]) for i in range(10)]).alias("averages"))
df_agg.show()
averages
[0.0, 0.66667, 0.33333, 0.66667, 0.33333, 1.0, 0.0, 0.0, 0.0, 33.33333]

Now, let’s run through the same exercise with dense vectors. We start by creating a spark dataframe with a column of dense vectors.

df = sc.parallelize([
  (1, DenseVector([0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0])),
  (2, DenseVector([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0])),
  (3, DenseVector([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])),
]).toDF(["row_num", "features"])

df.show()
row_num features
1 [0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0]
2 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0]
3 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Next, we create another pyspark udf which changes the dense vector into a pyspark array.

def dense_to_array(v):
  new_array = list([float(x) for x in v])
  return new_array

dense_to_array_udf = F.udf(dense_to_array, T.ArrayType(T.FloatType()))

df = df.withColumn('features_array', dense_to_array_udf('features'))
df.show()
row_num features features_array
1 [0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0]
2 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0]
3 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Finally, we can apply our standard pyspark aggregators to each item in the pyspark array.

df_agg = df.agg(F.array(*[F.avg(F.col('features_array')[i]) for i in range(10)]).alias("averages"))
df_agg.show()
averages
[0.0, 0.66667, 0.33333, 0.66667, 0.33333, 1.0, 0.0, 0.0, 0.0, 33.33333]
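
As an aside, if you’re on a newer Spark version (2.4+), the Summarizer class in pyspark.ml.stat should be able to average a vector column directly, without the array conversion; here’s a rough sketch (not run against the dataframes above).

from pyspark.ml.stat import Summarizer

# compute the element-wise mean of the vector column in one step
df.select(Summarizer.mean(df.features).alias('averages')).show(truncate=False)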

There we go! Hope you find this info helpful!

Integrating Apache Airflow and Databricks

Cron is great for automation, but when tasks begin to rely on each other (task C can only run after both tasks A and B finish) cron does not do the trick.

Apache Airflow is open source software (from airbnb) designed to handle the relationships between tasks. I recently set up an airflow server which coordinates automated jobs on databricks (great software for coordinating spark clusters). Connecting databricks and airflow ended up being a little trickier than it should have been, so I am writing this blog post as a resource for anyone else who attempts to do the same in the future.

For the most part I followed this tutorial from A-R-G-O when setting up airflow. Databricks also has a decent tutorial on setting up airflow. The difficulty here is that the airflow software for talking to databricks clusters (DatabricksSubmitRunOperator) was not introduced into airflow until version 1.9 and the A-R-G-O tutorial uses airflow 1.8.

Airflow 1.9 uses Celery version >= 4.0 (I ended up using Celery version 4.1.1). Airflow 1.8 requires Celery < 4.0. In fact, the A-R-G-O tutorial notes that using Celery >= 4.0 will result in the error:

airflow worker: Received and deleted unknown message. Wrong destination?!?

I can attest that this is true! If you use airflow 1.9 with Celery < 4.0, everything might appear to work, but airflow will randomly stop scheduling jobs after a while (check the airflow-scheduler logs if you run into this). You need to use Celery >= 4.0! Preventing the Wrong destination error is easy, but the fix is hard to find (hence this post).

After much ado, here’s the fix! If you follow the A-R-G-O tutorial, install airflow 1.9, celery >=4.0 AND set broker_url in airflow.cfg as follows:

broker_url = pyamqp://guest:guest@localhost:5672//

Note that compared to the A-R-G-O tutorial, I am just adding “py” in front of amqp. Easy!
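
For reference, here’s roughly what a DAG using DatabricksSubmitRunOperator looks like once everything is wired up (the operator lives in airflow’s contrib package in version 1.9). The cluster spec, notebook path, and schedule below are placeholders, so treat this as a sketch rather than my production job.

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator

dag = DAG('databricks_example', start_date=datetime(2018, 1, 1), schedule_interval='@weekly')

run_notebook = DatabricksSubmitRunOperator(
    task_id='run_notebook',
    dag=dag,
    json={
        'new_cluster': {
            'spark_version': '4.0.x-scala2.11',  # placeholder cluster spec
            'node_type_id': 'r3.xlarge',
            'num_workers': 2,
        },
        'notebook_task': {'notebook_path': '/Users/example_user/example_notebook'},  # placeholder path
    },
)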

Random Weekly Reminders

I constantly use google calendar to schedule reminder emails, but I want some of my reminders to be stochastic!

Google calendar wants all its events to occur on a regular basis (e.g., every Sunday), but I might want a weekly reminder email that occurs on a random day each week.

I wrote a quick set of python scripts which handle this situation.

The script find_days.py chooses a random day each week (over a month) on which a reminder email should be sent. These dates are piped to a text file (dates.txt). The script send_email.py reads this text file and sends a reminder email to me if the current date matches one of the dates in dates.txt.
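
Here’s a rough sketch of the idea behind find_days.py; the exact date logic and output path here are assumptions on my part, not the actual script.

import random
import datetime as dt

# pick one random day in each of the next four weeks and write them to dates.txt
start = dt.date.today()
with open('dates.txt', 'w') as f:
    for week in range(4):
        week_start = start + dt.timedelta(weeks=week)
        chosen_day = week_start + dt.timedelta(days=random.randint(0, 6))
        f.write(chosen_day.strftime('%Y-%m-%d') + '\n')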

I use cron to automatically run these scripts on a regular basis. Cron runs find_days.py on the first of each month and runs send_email.py every day. I copied my cron script as cron_job.txt.

I use mailutils and postfix to send the reminder emails from the machine. Check out this tutorial for how to set up a send only mail server. The trickiest part of this process was repeatedly telling gmail that my emails were not spam.

Now I receive my weekly reminder on an unknown date so I can act spontaneous!

Regression of a Proportion in Python

I frequently predict proportions (e.g., the proportion of a year during which a customer is active). This is a regression task because the dependent variable is a float, but the dependent variable is bound between 0 and 1. Googling around, I had a hard time finding a good way to model this situation, so I’ve written up what I think is the most straightforward solution here.

I am guessing there’s a better way to do this with MCMC, so please comment below if you know a better way.

Let’s get started by importing some libraries for making random data.

from sklearn.datasets import make_regression
import numpy as np

Create random regression data.

rng = np.random.RandomState(0)  # fix random state
X, y, coef = make_regression(n_samples=10000,
                             n_features=100,
                             n_informative=40,
                             effective_rank= 15,
                             random_state=0,
                             noise=4.0,
                             bias=100.0,
                             coef=True)

Shrink down the dependent variable so it’s bound between 0 and 1.

y_min = min(y)
y = [i-y_min for i in y]  # min value will be 0
y_max = max(y)
y = [i/y_max for i in y]  # max value will be 1

Make a quick plot to confirm that the data is bound between 0 and 1.

from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('whitegrid')

plt.hist(y);

All the data here is fake which worries me, but beggars can’t be choosers and this is just a quick example.

Below, I apply a plain GLM to the data. This is what you would expect if you treated this as a plain regression problem.

import statsmodels.api as sm

linear_glm = sm.GLM(y, X)
linear_result = linear_glm.fit()
# print(linear_result.summary2())  # too much output for a blog post

Here are the actual values plotted (x-axis) against the predicted values (y-axis). The model does a decent job, but check out the values on the y-axis - the linear model predicts negative values!

plt.plot(y, linear_result.predict(X), 'o', alpha=0.2);

Obviously the linear model above isn’t correctly modeling this data since it’s guessing values that are impossible.

I followed this tutorial which recommends using a GLM with a logit link and the binomial family. Checking out the statsmodels module reference, we can see the default link for the binomial family is logit.

Below I apply a GLM with a logit link and the binomial family to the data.

binom_glm = sm.GLM(y, X, family=sm.families.Binomial())
binom_results = binom_glm.fit()
#print(binom_results.summary2())  # too much output for a blog post

Here’s the actual data (x-axis) plotted against the predicted data. You can see the fit is much better!

plt.plot(y, binom_results.predict(X), 'o', alpha=0.2);
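
As a quick sanity check, we can also confirm that the binomial model’s predictions all stay inside the unit interval, unlike the plain GLM above; a small snippet along these lines should do it.

preds = binom_results.predict(X)
print(preds.min(), preds.max())  # both values should fall within [0, 1]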

%load_ext watermark
%watermark -v -m -p numpy,matplotlib,sklearn,seaborn,statsmodels
CPython 3.6.3
IPython 6.1.0

numpy 1.13.3
matplotlib 2.0.2
sklearn 0.19.1
seaborn 0.8.0
statsmodels 0.8.0

compiler   : GCC 7.2.0
system     : Linux
release    : 4.13.0-38-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit

Exploring ROC Curves

I’ve always found ROC curves a little confusing. Particularly when it comes to ROC curves with imbalanced classes. This blog post is an exploration into receiver operating characteristic (i.e. ROC) curves and how they react to imbalanced classes.

I start by loading the necessary libraries.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
%matplotlib inline

Seed the random number generator so that everything here is reproducible.

np.random.seed(seed=1)

I write a few functions that will create fake data, plot fake data, and plot ROC curves.

I describe each function in turn below:

  • grab_probability draws a sample of “probabilities” from a uniform distribution bound between 0 and 1.
  • create_fake_binary_data creates a vector of 0s and 1s. The mean of the vector is controlled by the positive input.
  • probability_hist plots a normalized histogram (each bar depicts the proportion of data in it) bound between 0 and 1.
  • plot_roc_curve does not need an explanation.
def grab_probability(sample_size):
    """Draw probabilties"""
    return np.random.random(size=(sample_size,))

def create_fake_binary_data(positive, sample_size):
    """Create a vector of binary data with the mean specified in positive"""
    negative = 1-positive
    y = np.ones(sample_size)
    y[:int(negative*sample_size)] = 0
    np.random.shuffle(y)
    return y

def probability_hist(probs):
    """Create histogram of probabilities"""
    fig = plt.Figure()
    weights = np.ones_like(probs)/float(len(probs))
    plt.hist(probs, weights=weights)
    plt.xlim(0, 1)
    plt.ylim(0, 1);

def plot_roc_curve(fpr, tpr, roc_auc, lw=2):
    """Plot roc curve"""
    lw = lw
    fig = plt.Figure()
    plt.plot(fpr, tpr, color='darkorange',
             lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right");

I have found one of the best ways to learn about an algorithm is to give it fake data. That way, I know the data, and can examine exactly what the algorithm does with the data. I then change the data and examine how the algorithm reacts to this change.

The first dataset I create is random data with balanced classes.

I create probability with the grab_probability function. This is a vector of numbers between 0 and 1. These data are meant to simulate the probabilities that would be produced by a model that is no better than chance.

I also create the vector y which is random ones and zeroes. I will call the ones the positive class and the zeroes the negative class.

The plot below is a histogram of probability. The y-axis is the proportion of samples in each bin. The x-axis is probability levels. You can see the probabilities appear to be from a uniform distribution.

sample_size = 1000
positive = 0.5

y = create_fake_binary_data(positive, sample_size)
probability = grab_probability(sample_size)

probability_hist(probability)

There’s no association between y and the probability, so I don’t expect the area under the curve to be different than chance (i.e., have an area under the curve of about 0.5). I plot the ROC curve to confirm this below.

fpr, tpr, thresholds = roc_curve(y, probability)
roc_auc = auc(fpr, tpr)

plot_roc_curve(fpr, tpr, roc_auc)

Let’s talk about the axes here. The y-axis is the proportion of true positives (i.e., TPR - True Positive Rate). This is how often the model correctly identifies members of the positive class. The x-axis is the proportion of false positives (FPR - False Positive Rate). This is how often the model incorrectly assigns examples to the positive class.

One might wonder how the TPR and FPR can change. Doesn’t a model always produce the same guesses? The TPR and FPR can change because we can choose how liberal or conservative the model should be with assigning examples to the positive class. The lower left-hand corner of the plot above is when the model is maximally conservative (and assigns no examples to the positive class). The upper right-hand corner is when the model is maximally liberal and assigns every example to the positive class.

I used to assume that when a model is neutral in assigning examples to the positive class, that point would lie halfway between the end points, but this is not the case. The threshold creates points along the curve, but doesn’t dictate where these points lie. If this is confusing, continue to think about it as we march through the plots below.
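
To make the threshold idea concrete, here’s a quick sketch (reusing the y and probability arrays from above) of how a single threshold produces one (FPR, TPR) point on the curve.

# compute the TPR and FPR by hand at one threshold; this is a single point on the ROC curve
thresh = 0.5
preds = probability >= thresh
tpr_at_thresh = np.sum(preds & (y == 1)) / np.sum(y == 1)
fpr_at_thresh = np.sum(preds & (y == 0)) / np.sum(y == 0)
print(tpr_at_thresh, fpr_at_thresh)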

The ROC curve is the balance between true and false positives as a threshold varies. To help visualize this balance, I create a function which plots the two classes as a stacked histogram, cumulative density functions, and the relative balance between the two classes.

def probability_histogram_class(probability, y):
    plt.subplot(221)
    counts, bins, _ = plt.hist([probability[y==0], probability[y==1]], stacked=True)
    plt.xlim(np.min(bins),np.max(bins))
    plt.xticks([])

    plt.subplot(222)
    plt.hist(probability[y==1], cumulative=True, normed=True, color='tab:orange')
    plt.xlim(np.min(bins),np.max(bins))
    plt.xticks([])
    plt.ylim(0,1)

    plt.subplot(224)
    plt.hist(probability[y==0], cumulative=True, normed=True, color='tab:blue')
    plt.xlim(np.min(bins),np.max(bins))
    plt.xticks()
    plt.ylim(0,1)

    plt.subplot(223)
    proportion = counts[0]/[max(0.0001, x) for x in counts[1]]
    plt.plot(bins[:-1], 1-proportion)
    plt.xlim(np.min(bins),np.max(bins))
    plt.ylim(0,1);

The idea behind this plot is we can visualize the model’s threshold moving from LEFT to RIGHT through the plots. As the threshold decreases, the model will guess the positive class more often. This means more and more of each class will be included when calculating the numerator of TPR and FPR.

The top left plot is a stacked histogram. Orange depicts members of the positive class and blue depicts members of the negative class. On the x-axis (of all four plots) is probability.

If we continue thinking about the threshold as decreasing as the plots moves from left to right, we can think of the top right plot (a reversed CDF of the positive class) as depicting the proportion of the positive class assigned to the positive class as the threshold varies (setting the TPR). We can think of the bottom right plot (a reversed CDF of the negative class) as depicting the proportion of the negative class assigned to the positive class as the threshold varies (setting the FPR).

In the bottom left plot, I plot the proportion of the positive class that falls in each bin of the histogram in the top plot. Because the proportions of the positive and negative classes are equal as the threshold varies (as depicted in this bottom left plot), we assign positive and negative examples to the positive class at equal rates, so the ROC curve stays along the identity line and the area under the curve is 0.5.

probability = grab_probability(sample_size)

probability_histogram_class(probability, y)

Next, I do the same process as above but with fake probabilities that are predictive of the label. The function biased_probability produces probabilities that tend to be greater for the positive class and lesser for the negative class.

def biased_probability(y):
    """Return probabilities biased towards correct answer"""
    probability = np.random.random(size=(len(y),))
    probability[y==1] = np.random.random(size=(int(sum(y)),)) + 0.25
    probability[y==0] = np.random.random(size=(int(sum(y==0)),)) - 0.25
    probability = np.array([max(0, min(1, i)) for i in probability])
    return probability

I create this data for a balanced class problem again. Using the same y vector, I adjust the probabilities so that they are predictive of the values in this y vector. Below, you can see the probability data as a histogram. The data no longer appear to be drawn from a uniform distribution. Instead, there are modes near 0 and 1.

probability = biased_probability(y)

probability_hist(probability)

Now, we get a nice ROC curve that leaves the identity line. Not surprising since I designed the probabilities to be predictive. Notice how quickly the model achieves a TPR of 1. Remember this when looking at the plots below.

fpr, tpr, _ = roc_curve(y, probability)
roc_auc = auc(fpr, tpr)

plot_roc_curve(fpr, tpr, roc_auc)

In the upper left plot below, we can clearly see that the positive class occurs more often than the negative class on the right side of the plot.

Now remember that the lower left hand side of the roc plot is when we are most conservative. This corresponds to the right hand side of these plots where the model is confident that these examples are from the positive class.

If we look at the CDFs on the right side, we can see the positive class (in orange) has many examples on the right side of these plots while the negative class (in blue) has none. This is why the TPR immediately jumps to about 0.5 in the ROC curve above. We also see the positive class has no examples on the left side of these plots while the negative class has many. This is why the TPR saturates at 1 well before the FPR does.

In other words, because the model is quite certain that some examples are from the positive class, the ROC curve quickly jumps up the y-axis. Because the model is quite certain which examples are from the negative class, the ROC curve saturates on the y-axis well before the end of the x-axis.

probability = biased_probability(y)

probability_histogram_class(probability, y)

After those two examples, I think we have a good handle on the ROC curve in the balanced class situation. Now let’s make some fake data when the classes are unbalanced. The probabilities will be completely random.

sample_size = 1000
positive = 0.7

y = create_fake_binary_data(positive, sample_size)
probability = grab_probability(sample_size)

print('Average Test Value: %0.2f' % np.mean(y))
print('Average Probability: %0.2f' % np.mean(probability))

probability_hist(probability)
Average Test Value: 0.70
Average Probability: 0.49

Again, this is fake data, so the probabilities do not reflect the fact that the classes are imbalanced.

Below, we can see that the ROC curve agrees that the data are completely random.

fpr, tpr, _ = roc_curve(y, probability)
roc_auc = auc(fpr, tpr)

plot_roc_curve(fpr, tpr, roc_auc)

probability_histogram_class(probability, y)

Now, let’s create biased probabilities and see if the ROC curve differs from chance.

from sklearn.utils import shuffle

probability = biased_probability(y)

fpr, tpr, _ = roc_curve(y, probability)
roc_auc = auc(fpr, tpr)

plot_roc_curve(fpr, tpr, roc_auc)

It does as we expect.

probability_histogram_class(probability, y)

Importantly, the probabilities now reflect the biased classes.

print(np.mean(probability))
0.602536255717

Using these same probabilities, let’s remove the relationship between the probabilities and the output variable by shuffling the data.

y = shuffle(y)
fpr, tpr, _ = roc_curve(y, probability)
roc_auc = auc(fpr, tpr)

plot_roc_curve(fpr, tpr, roc_auc)

Beautiful! The ROC curve stays on the identity line. We can see that this is because, while the positive class is predicted more often, it is evenly distributed across the different thresholds.

probability_histogram_class(probability, y)

Importantly, this demonstrates that even with imbalanced classes, if a model is at chance, then the ROC curve will reflect this chance performance. I do a similar demonstration with fake data here.

%load_ext watermark
%watermark -v -m -p numpy,matplotlib,sklearn
CPython 3.6.3
IPython 6.1.0

numpy 1.13.3
matplotlib 2.0.2
sklearn 0.19.1

compiler   : GCC 7.2.0
system     : Linux
release    : 4.13.0-36-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit

'Is Not in' With Pyspark

In SQL it’s easy to find people in one list who are not in a second list (i.e., the “not in” command), but there is no similar command in pyspark. Well, at least not a command that doesn’t involve collecting the second list onto the master instance.

Here is a tidbit of code which replicates SQL’s “not in” command, while keeping your data with the workers (it will require a shuffle).

I start by creating some small dataframes.

import pyspark
from pyspark.sql import functions as F
a = sc.parallelize([[1, 'a'], [2, 'b'], [3, 'c']]).toDF(['id', 'valueA'])
b = sc.parallelize([[1, 'a'], [4, 'd'], [5, 'e']]).toDF(['id', 'valueB'])

Take a quick look at dataframe a.

a.show()
id valueA
1 a
2 b
3 c

And dataframe b.

b.show()
id valueB
1 a
4 d
5 e

I create a new column in a that is all ones. I could have used an existing column, but this way I know the column is never null.

a = a.withColumn('inA', F.lit(1))
a.show()
id valueA inA
1 a 1
2 b 1
3 c 1

I join a and b with a left join. This way all values in b which are not in a have null values in the column “inA”.

b.join(a, 'id', 'left').show()
id valueB valueA inA
5 e null null
1 a a 1
4 d null null

By filtering this joined dataframe down to the rows where “inA” is null (creating dataframe c), I remove all rows of b whose id also appears in a.

c = b.join(a, 'id', 'left').filter(F.col('inA').isNull())
c.show()
id valueB valueA inA
5 e null null
4 d null null

EDIT I recently gave the pyspark documentation a more thorough reading and realized that pyspark’s join command has a left_anti option. The left_anti option produces the same functionality as described above, but in a single join command (no need to create a dummy column and filter).

For example, the following code will produce rows in b where the id value is not present in a.

c = b.join(a, 'id', 'left_anti')
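
If I’ve read the docs right, this should return the same two rows that the filter approach produced above (only b’s columns survive an anti join); row order may differ.

c.show()
id valueB
5 e
4 d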

Psychology to Data Science: Part 2

This is the second post in a series of posts about moving from a PhD in Psychology/Cognitive Psychology/Cognitive Neuroscience to data science. The first post answers many of the best and most common questions I’ve heard about my transition. This post focuses on the technical skills that are often necessary for landing a data science job.

Each header in this post represents a different technical area. Following the header I describe what I would know before walking into an interview.

SQL

SQL is not often used in academia, but it’s probably the most important skill in data science (how do you think you’ll get your data??). It’s used every day by data scientists at every company, and while it’s 100% necessary to know, it’s stupidly boring to learn. But, once you get the hang of it, it’s a fun language because it requires a lot of creativity. To learn SQL, I would start by doing the mode analytics tutorials, then the sql zoo problems. Installing postgres on your personal computer and fetching data in Python with psycopg2 or sql-alchemy is a good idea. After completing all this, move on to query optimization (where the creativity comes into play) - check out the explain function and order of execution. Shameless self-promotion: I made a SQL presentation on what SQL problems to know for job interviews.

Python/R

Some places use R. Some places use Python. It sucks, but these languages are not interchangeable (an R team will not hire someone who only knows Python). Whatever language you choose, you should know it well because this is a tool you will use every day. I use Python, so what follows is specific to Python.

I learned Python with codeacademy and liked it. If you’re already familiar with Python I would practice “white board” style questions. Feeling comfortable with the beginner questions on a site like leetcode or hackerrank would be a good idea. Writing answers while thinking about code optimization is a plus.

Jeff Knupp’s blog has great tidbits about developing in python; it’s pure gold.

Another good way to learn is to work on your digital profile. If you haven’t already, I would start a blog (I talk more about this is Post 1).

Statistics/ML

When starting here, the Andrew Ng coursera course is a great intro. While it’s impossible to learn all of it, I love to use elements of statistical learning and its sibling book introduction to statistical learning as a reference. I’ve heard good things about Python Machine Learning but haven’t checked it out myself.

As a psychology major, I felt relatively well prepared in this regard. Experience with linear-mixed effects, hypothesis-testing, regression, etc. serves Psychology PhDs well. This doesn’t mean you can forget Stats 101 though. Once, I found myself uncomfortably surprised by a very basic probability question.

Here’s a quick list of Statistics/ML algorithms I often use: GLMs and their regularization methods are a must (L1 and L2 regularization probably come up in 75% of phone screens). Hyper-parameter search. Cross-validation! Tree-based models (e.g., random forests, boosted decision trees). I often use XGBoost and have found its intro post helpful.

I think you’re better off deeply (pun not intended) learning the basics (e.g., linear and logistic regression) than learning a smattering of newer, fancier methods (e.g., deep learning). This means thinking about linear regression from first principles (what are the assumptions and given these assumptions can you derive the best-fit parameters of a linear regression?). I can’t tell you how many hours I’ve spent studying Andrew Ng’s first supervised learning lecture for this. It’s good to freshen up on linear algebra and there isn’t a better way to do this than the 3Blue1Brown videos; they’re amazing. This might seem too introductory/theoretical, but it’s necessary and often comes up in interviews.

Be prepared to talk about the bias-variance tradeoff. Everything in ML comes back to the bias-variance tradeoff so it’s a great interview question. I know some people like to ask candidates about feature selection. I think this question is basically a rephrasing of the bias-variance tradeoff.

Git/Code Etiquette

Make a github account if you haven’t already. Get used to commits, pushing, and branching. This won’t take long to get the hang of, but, again, it’s something you will use every day.

As much as possible I would watch code etiquette. I know this seems anal, but it matters to some people (myself included), and having pep8-quality code can’t hurt. There are a number of python modules that will help here. Jeff Knupp also has a great post about linting/automating code etiquette.

Unit-tests are a good thing to practice/be familiar with. Like usual, Jeff Knupp has a great post on the topic.

I want to mention that getting a data science job is a little like getting a grant. Each time you apply, there is a low chance of getting the job/grant (luckily, there are many more jobs than grants). When creating your application/grant, it’s important to find ways to get people excited about your application/grant (e.g., showing off your statistical chops). This is where code etiquette comes into play. The last thing you want is to diminish someone’s excitement about you because you didn’t include a doc string. Is code etiquette going to remove you from contention for a job? Probably not. But it could diminish someone’s excitement.

Final Thoughts

One set of skills that I haven’t touched on is cluster computing (e.g., Hadoop, Spark). Unfortunately, I don’t think there is much you can do here. I’ve heard good things about the book Learning Spark, but books can only get you so far. If you apply for a job that wants Spark, I would install Spark on your local computer and play around, but it’s hard to learn cluster computing when you’re not on a cluster. Spark is more or less fancy SQL (aside from the ML aspects), so learning SQL is a good way to prepare for a Spark mindset. I didn’t include cluster computing above, because many teams seem okay with employees learning this on the job.

Not that there’s a lack of content here, but here’s a good list of must know topics that I used when transitioning from academia to data science.

Psychology to Data Science: Part 1

A number of people have asked about moving from a PhD in Psychology/Cognitive Psychology/Cognitive Neuroscience to data science. This blog post is part of a 2-part series where I record my answers to the best and most common questions I’ve heard. Part 2 can be found here.

Before I get started, I want to thank Rick Wolf for providing comments on an earlier version of this post.

This first post is a series of general questions I’ve received. The second post will focus on technical skills required to get a job in data science.

Each header in this post represents a question. Below the header/question I record my response.

Anyone starting this process should know they are starting a marathon. Not a sprint. Making the leap from academia to data science is more than possible, but it takes time and dedication.

Do you think that being a Psychology PhD is a disadvantage?

I think it can be a disadvantage in the job application process. Most people don’t understand how quantitative Psychology is, so psychology grads have to overcome these stereotypes. This doesn’t mean having a Psychology PhD is a disadvantage when it comes to BEING a data scientist. Having a Psychology PhD can be a huge advantage because Psychology PhDs have experience measuring behavior which is 90% of data science. Every company wants to know what their customers are doing and how to change their customers’ behavior. This is literally what Psychology PhDs do, so Psychology PhDs might have the most pertinent experience of any science PhD.

When it is the right time to apply for a boot camp?

(I did the Insight Data Science bootcamp)
Apply when you’re good enough to get a phone screen but not good enough to get a job. Don’t count on a boot camp to give you all the skills. Instead, think of boot camps as polishing your skills.

Here is the game plan I would use:
Send out 3-4 job applications and see if you get any hits. If not, think about how you can improve your resume (see post #2), and go about those improvements. After a few iterations of this, you will start getting invitations to do phone screens. At this stage, a boot camp will be useful.
The boot camps are of varying quality. Ask around to get an idea for which boot camps are better or worse. Also, look into how each boot camp gets paid. If you pay tuition, the boot camp will care less about whether you get a job. If the boot camp gets paid through recruiting fees or collecting tuition from your paychecks, it is more invested in your job.

Should I start a blog?

Yes, I consider this a must (and so do others). It’s a good opportunity to practice data science, and, more importantly, it’s a good opportunity to show off your skills.

Most people (including myself) host their page on github and generate the html with a static site generator. I use octopress, which works great. Most people seem to use pelican. I would recommend pelican because it’s built in Python. I haven’t used it, but a quick google search led me to this tutorial on building a github site with pelican.

I wish I’d sent more of my posts to friends/colleagues. Peer review is always good for a variety of reasons. I’d be more than happy to review posts for anyone reading this blog.

How should I frame what I’ve done in academia on my CV/resume?

First, no one in industry cares about publications. People might notice if the journal is Science/Nature but most will not. Spend a few hours thinking about how to describe your academic accomplishments as technical skills. For example, as a Postdoc, I was on a Neurophysiology project that required writing code to collect, ingest, and transform electrophysiology data. In academia, none of this code mattered. In industry, it’s the only thing that matters. What I built was a data-pipeline, and this is a product many companies desire.

We all have examples like this, but they’re not obvious because academics don’t know what companies want. Think of your data-pipelines, your interactive experiments, your scripted analytics.

Transforming academic work into skills that companies desire will take a bit of creativity (I am happy to help with this), but remember that your goal here is to express how the technical skills you used in academia will apply to what you will do as a data scientist.

Many people (including myself) love to say they can learn fast. While this is an important skill it’s hard to measure and it calls attention to what you do not know. In general, avoid it.

Did you focus on one specific industry?

I think a better question than what industry is what size of team/company you want to work on. At a big company you will have a more specific job with more specific requirements (and probably more depth of knowledge). At a smaller company, you will be expected to have a broader skill set. This matters in terms of what you want in a job and what skills you have. Having industry specific knowledge is awesome, but most academics have never worked in an industry so by definition they don’t have industry specific knowledge. Unfortunately, we just have to punt on this aspect of the job application.

Anything to be wary of?

No matter what your job is, having a good boss is important. If you get a funny feeling about a potential boss in the interview process, don’t take the job.

Some companies are trying to hire data scientists but don’t want to change their company. By this I mean they want their data scientists to work in excel. Excel is a great tool, but it’s not a tool I would want to use every day. If you feel the same way, then keep an eye out for this.

Using Cron to Automate Jobs on Ubuntu

I recently spent an entire afternoon debugging a solution for automatically launching a weekly emr job.

Hopefully, I can save someone the same pain by writing this blog post.

I decided to use Cron to launch the weekly jobs. Actually launching a weekly job on Cron was not difficult. Check out the Ubuntu Cron manual for a good description on how to use Cron.

What took me forever was realizing that Cron jobs have an extremely limited path. Because of this, specifying the complete path to executed files and their executors is necessary.

Below I describe how I used an ec2 instance (Ubuntu 16.04) to automatically launch this weekly job.

First, here is what my Cron job list looks like (call “crontab -e” in the terminal).

SHELL=/bin/bash
05 01 * * 2 $HOME/automated_jobs/production_cluster.sh

The important things to note here are that I am creating the SHELL variable, and that $HOME is replaced by the actual path to my home directory. The schedule fields (05 01 * * 2) fire the job at 1:05 AM every Tuesday.

Next is the shell script called by Cron.

#!/bin/bash
source $HOME/.bash_profile

python $HOME/automated_jobs/launch_production_cluster.py

Again, $HOME is replaced with the actual path to my home directory.

I had to make this shell script and the python script called within it executable (call “chmod +x” in the terminal). The reason I used this shell script rather than launching the python script directly from Cron is that I wanted access to the environment variables in my bash_profile. To get access to them, I had to source bash_profile.

Finally, below is the python file that executes the weekly job that I wanted. I didn’t include the code that actually launches our emr cluster because that wasn’t the hard part here, but just contact me if you would like to see it.

#!$HOME/anaconda2/bin/python
import os
import sys
import datetime as dt
from subprocess import check_output

# setup logging
old_stdout = sys.stdout
log_file = open("production_cluster_%s.log" % dt.datetime.today().strftime('%Y_%m_%d'), "w")
sys.stdout = log_file

print 'created log file'

# organize local files and s3 files

print 'organized files'

# call emr cluster

print 'launched production job'

# close log file
sys.stdout = old_stdout
log_file.close()

While the code is not included here, I use aws cli to launch my emr cluster, and I had to specify the path to aws (call “which aws” in the terminal) when making this call.
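
To illustrate (the path below is just a placeholder; use whatever “which aws” prints on your machine), the call ends up looking something like this.

# hypothetical sketch: call the aws cli by its full path so Cron's limited PATH doesn't matter
from subprocess import check_output

AWS = '/home/ubuntu/.local/bin/aws'  # placeholder; find yours with `which aws`
output = check_output([AWS, 'emr', 'list-clusters', '--active'])
print output  # python 2 print, matching the script above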

You might have noticed the logging I am doing in this script. I found logging both within this python script and piping the output of this script to additional logs helpful when debugging.

The Ubuntu Cron manual I linked above makes it perfectly clear that my Cron path issues are common, but I wanted to post my solution in case other people needed a little guidance.

Are We in a TV Golden Age?

I recently found myself in an argument with my wife regarding whether TV was better now than previously. I believed that TV was better now than 20 years ago. My wife contended that there was simply more TV content being produced, and that this led to more good shows, but that shows are not inherently any better.

This struck me as a great opportunity to do some quick data science. For this post, I scraped the names (from wikipedia) and ratings (from TMDb) of all American TV shows. I did the same for major American movies, so that I could have a comparison group (maybe all content is better or worse). The ratings are given by TMDb’s users and are scores between 1 and 10 (where 10 is a great show/movie and 1 is a lousy show/movie).

All the code for this post can be found on my github.

I decided to operationalize my “golden age of TV” hypothesis as the average TV show is better now than previously. This would be expressed as a positive slope (beta coefficient) when building a linear regression that outputs the rating of a show given the date on which the show first aired. My wife predicted a slope near zero or negative (shows are no better or worse than previously).

Below, I plot the ratings of TV shows and movies across time. Each show is a dot in the scatter plot. Show rating (the average rating given by TMDb) is on the y-axis. The date of the show’s first airing is on the x-axis. When I encountered shows with the same name, I just tacked a number onto the end. For instance, show “x” would become show “x_1.” The size of each point in the scatter plot is the show’s “popularity”, which is a bit of a black box, but it’s given by TMDb’s API. TMDb does not give a full description of how they calculate popularity, but they do say it’s a function of how many times an item is viewed on TMDb, how many times an item is rated, and how many times the item has been added to a watch or favorite list. I decided to depict it here just to give the figures a little more detail. The larger the dot, the more popular the show.

Here’s a plot of all TV shows across time.

To test the “golden age of TV” hypothesis, I coded up a linear regression in javascript (below). I put the regression’s output as a comment at the end of the code. Before stating whether the hypothesis was rejected or not, I should note that I removed shows with fewer than 10 votes because these shows had erratic ratings.

As you can see, there is no evidence that TV is better now than previously. In fact, if anything, this dataset says that TV is worse (but more on this later).

function linearRegression(y,x){

    var lr = {};
    var n = y.length;
    var sum_x = 0;
    var sum_y = 0;
    var sum_xy = 0;
    var sum_xx = 0;
    var sum_yy = 0;

    for (var i = 0; i < y.length; i++) {

        sum_x += x[i];
        sum_y += y[i];
        sum_xy += (x[i]*y[i]);
        sum_xx += (x[i]*x[i]);
        sum_yy += (y[i]*y[i]);
    }

    lr['slope'] = (n * sum_xy - sum_x * sum_y) / (n*sum_xx - sum_x * sum_x);
    lr['intercept'] = (sum_y - lr.slope * sum_x)/n;
    lr['r2'] = Math.pow((n*sum_xy - sum_x*sum_y)/Math.sqrt((n*sum_xx-sum_x*sum_x)*(n*sum_yy-sum_y*sum_y)),2);

    return lr;

};

var yval = data
    .filter(function(d) { return d.vote_count > 10 })
    .map(function (d) { return parseFloat(d.vote_average); });
var xval = data
    .filter(function(d) { return d.vote_count > 10 })
    .map(function (d) { return d.first_air_date.getTime() / 1000; });
var lr = linearRegression(yval,xval);
// Object { slope: -3.754543948800799e-10, intercept: 7.0808230581192815, r2: 0.038528573017115 }

I wanted to include movies as a comparison to TV. Here’s a plot of all movies across time.

It’s important to note that I removed all movies with fewer than 1000 votes. This is completely 100% unfair, BUT I am very proud of my figures here and things get a little laggy when including too many movies in the plot. Nonetheless, movies seem to be getting worse over time! More dramatically than TV shows!

var yval = data
    .filter(function(d) { return d.vote_count > 1000 })
    .map(function (d) { return parseFloat(d.vote_average); });
var xval = data
    .filter(function(d) { return d.vote_count > 1000 })
    .map(function (d) { return d.first_air_date.getTime() / 1000; });
var lr = linearRegression(yval,xval);
// Object { slope: -8.11645196776367e-10, intercept: 7.659366705415847, r2: 0.16185069580043676 }

Okay, so this was a fun little analysis, but I have to come out and say that I wasn’t too happy with my dataset and the conclusions we can draw from this analysis are only as good as the dataset.

The first limitation is that recent content is much more likely to receive a rating than older content, which could systematically bias the ratings of older content (e.g., only good shows from before 2000 receive ratings). It’s easy to imagine how this would lead us to believe that all older content is better than it actually was.

Also, TMDb seems to have IMDB-type tastes, by which I mean it’s dominated by young males. For instance, while I don’t like the show “Keeping Up with the Kardashians,” it’s definitely not the worst show ever. Also, “Girls” is an amazing show which gets no respect here. The quality of a show is in the eye of the beholder, which in this case seems to be boys.

I would have used Rotten Tomatoes’ API, but they don’t provide access to TV ratings.

Even with all these caveats in mind, it’s hard to defend my “golden age of TV” hypothesis. Instead, it seems like there is just more content being produced, which leads to more good shows (yay!), but the average show is no better or worse than previously.