Dan Vatterott

Data Scientist

Introducing Predeval

Predeval is software designed to help you identify changes in a model's output.

For instance, you might be tasked with building a model to predict churn. When you deploy this model in production, you have to wait to learn which users churned in order to know how your model performed …

Creating a Survival Function in PySpark

Traditionally, survival functions have been used in medical research to visualize the proportion of people who remain alive following a treatment. I often use them to understand the length of time between users creating and cancelling their subscription accounts.

Here, I describe how to create a survival function using PySpark …
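As a rough illustration of the idea, here is a minimal sketch in plain Python rather than PySpark. It assumes every account in the sample has already been cancelled (no censoring), and the function name and example durations are made up for illustration.

```python
def survival_function(durations, max_day):
    """Return S(t) for t = 0..max_day.

    S(t) is the share of accounts whose lifetime (in days) is longer than t.
    Assumes every account has already cancelled, so there is no censoring.
    """
    n = len(durations)
    return [sum(1 for d in durations if d > t) / n for t in range(max_day + 1)]


# Hypothetical days between account creation and cancellation.
durations = [1, 3, 3, 7, 10]
curve = survival_function(durations, max_day=10)

for day, share in enumerate(curve):
    print(f"day {day:2d}: {share:.0%} of accounts still active")
```

The curve starts at 100% and can only fall, which is what makes it a convenient picture of how quickly accounts cancel.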

Custom Email Alerts in Airflow

Apache Airflow is great for coordinating automated jobs, and it provides a simple interface for sending email alerts when these jobs fail. Typically, you can request these emails by setting email_on_failure to True in your operators.
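A common way to apply this setting to every operator in a DAG is through a default_args dictionary. The sketch below shows only that dictionary so it runs without Airflow installed; the recipient address is a hypothetical placeholder.

```python
# Sketch only: default arguments that Airflow applies to every operator in a DAG.
default_args = {
    "owner": "airflow",
    "email": ["alerts@example.com"],  # hypothetical recipient address
    "email_on_failure": True,         # send an email when a task fails
    "email_on_retry": False,          # stay quiet on retries
}

# In a real DAG file this dict would be passed along as
# DAG(..., default_args=default_args).
print(default_args["email_on_failure"])
```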

These email alerts work great, but I wanted to include additional links in them …

Aggregating Sparse and Dense Vectors in PySpark

Many (if not all) of PySpark's machine learning algorithms require that the input data be concatenated into a single column (using VectorAssembler). This is all well and good, but applying non-machine-learning operations (e.g., aggregations) to data in this format can be a real pain. Here …

Integrating Apache Airflow and Databricks

Cron is great for automation, but when tasks begin to rely on each other (task C can only run after both tasks A and B finish) cron does not do the trick.

Apache Airflow is open source software (from Airbnb) designed to handle the relationships between tasks. I recently set up …

Random Weekly Reminders

I constantly use Google Calendar to schedule reminder emails, but I want some of my reminders to be stochastic!

Google Calendar wants all its events to occur on a regular schedule (e.g., every Sunday), but I might want a weekly reminder email that occurs on a random day each …
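The core of the idea, picking a random day within a given week, can be sketched in a few lines of Python. This is only an illustration: the function name and the example date are assumptions, and sending the actual email is left out.

```python
import random
from datetime import date, timedelta


def random_day_in_week(week_start, rng=random):
    """Pick a uniformly random date among the 7 days starting at week_start."""
    return week_start + timedelta(days=rng.randrange(7))


monday = date(2024, 1, 1)  # arbitrary example week (this date is a Monday)
reminder_day = random_day_in_week(monday, random.Random(0))  # seeded for repeatability
print(f"Send this week's reminder on {reminder_day:%A, %Y-%m-%d}")
```

Running this once per week (for example from a scheduled job) gives a reminder that still arrives weekly but on an unpredictable day.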

Regression of a Proportion in Python

I frequently predict proportions (e.g., proportion of year during which a customer is active). This is a regression task because the dependent variable is a float, but the dependent variable is bounded between 0 and 1. Googling around, I had a hard time finding a good way …
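One common way to respect the 0 to 1 bound (not necessarily the approach the post settles on) is to work on the logit scale: fit an ordinary regression to the logit of the proportion, then map predictions back with the inverse logit so they always land in the valid range. A minimal sketch of the two transforms, with the clipping constant eps chosen arbitrarily:

```python
import math


def logit(p, eps=1e-6):
    """Map a proportion in [0, 1] to the real line, clipping away exact 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def inv_logit(x):
    """Map any real number back to a value between 0 and 1."""
    return 1 / (1 + math.exp(-x))


for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3), round(inv_logit(logit(p)), 3))
```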

Exploring ROC Curves

I've always found ROC curves a little confusing, particularly when it comes to ROC curves with imbalanced classes. This blog post is an exploration of receiver operating characteristic (ROC) curves and how they react to imbalanced classes.

I start by loading the necessary libraries.

 import numpy as np …
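As a small standalone illustration of how an ROC curve is built, the sketch below computes the (false positive rate, true positive rate) pair at each score threshold using only the standard library. The labels and scores are made-up toy data, not taken from the post.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if p and y == 1)
        fp = sum(1 for y, p in zip(labels, predicted) if p and y == 0)
        points.append((fp / neg, tp / pos))
    return points


labels = [1, 1, 0, 1, 0, 0]              # toy ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # toy model scores
points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Note that the false positive rate divides by the number of negatives, so when negatives vastly outnumber positives, many false positives are needed before the curve moves far to the right.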

'Is Not In' with PySpark

In SQL it's easy to find people in one list who are not in a second list (i.e., the "not in" command), but there is no similar command in PySpark. Well, at least not a command that doesn't involve collecting the second list onto the master instance.

EDIT
Check …