I’m a checklist addict. While our team has a document on acquiring different accounts, reviewing our workflows, reviewing code repos, etc., it doesn’t provide any prioritization.
I created the following checklist (markdown) to give a condensed view of what a manager and new hire should accomplish by the end of the new hire’s first week.
I wasn’t able to find existing onboarding checklists so I hope you find this useful!
I was recently asked to post a series of parquet collections as tables so analysts could query them in SQL. This should be straightforward, but it took me a while to figure out. Hopefully, you find this post before spending too much time on such an easy task.
You should use the CREATE TABLE statement. This is pretty straightforward. By creating a permanent table (rather than a temporary table), you can use a database name. Also, by using a table (rather than a view), you can load the data from an S3 location.
Next, you specify the table’s schema. Again, this is pretty straightforward. Columns used to partition the data should be declared here.
Next, you can specify how the data is stored (below, I use Parquet) and how the data is partitioned (below, there are two partitioning columns).
Finally, you specify the data’s location.
The part that really threw me for a loop is that I wasn’t done yet! You need one more command so that Spark can discover the partitions: MSCK REPAIR TABLE. Also, note that this command needs to be re-run whenever new partitions are added.
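Putting those pieces together, a sketch of the full statement might look like this (the database, table, column, and bucket names below are all hypothetical):

```sql
-- hypothetical names; adjust to your own database, schema, and bucket
CREATE TABLE my_db.events (
    user_id STRING,
    value   DOUBLE,
    year    INT,    -- partitioning columns are declared in the schema too
    month   INT
)
USING PARQUET
PARTITIONED BY (year, month)
LOCATION 's3://my-bucket/path/to/events/';

-- required so Spark discovers the existing partitions;
-- re-run whenever new partitions are added
MSCK REPAIR TABLE my_db.events;
```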
Imbalanced classes are a common problem. Scikit-learn provides an easy fix: “balancing” class weights. This makes models (e.g., logistic regression) more likely to predict the less common classes.
The PySpark ML API doesn’t have this same functionality, so in this blog post, I describe how to balance class weights yourself.
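For reference, scikit-learn’s “balanced” heuristic sets each class’s weight to n_samples / (n_classes * count(class)). Here’s a minimal plain-Python sketch of that computation:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute scikit-learn-style 'balanced' class weights:
    weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# the rare class gets a proportionally larger weight
weights = balanced_class_weights([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
print(weights)  # {0: 0.625, 1: 2.5}
```

These per-class weights are exactly what we will attach to each row in the PySpark training set below.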
Generate some random data and put it in a Spark DataFrame. Note that the input variables are not predictive, so the model will behave randomly. This is okay, since I am not interested in model accuracy.
PySpark needs a weight assigned to each instance (i.e., row) in the training set, so I create a mapping that applies a weight to each training instance.
To create the CDF I need to use a window function to order the data. I can then use percent_rank to retrieve the percentile associated with each value.
The only trick here is that I round the column of interest to make sure I don’t pull too much data back to the driver (not a concern here, but always good to think about).
After rounding, I group by the variable of interest, again, to limit the amount of data returned.
A CDF should report the percent of data less than or equal to the specified value, but the data returned above is the percent of data strictly less than the specified value. We need to fix this by shifting the values up one position.
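To make the less-than vs. less-than-or-equal distinction concrete, here is the same shift sketched in plain Python (not PySpark), assuming distinct, sorted values:

```python
def percent_rank(values):
    """Mimic Spark's percent_rank over sorted, distinct values:
    each value gets its position divided by (n - 1)."""
    n = len(values)
    return [i / (n - 1) for i in range(n)]


values = [10, 20, 30, 40, 50]
less_than = percent_rank(values)            # [0.0, 0.25, 0.5, 0.75, 1.0]

# shift up one position to get "less than or equal":
# each value inherits the next value's percent_rank; the max gets 1.0
less_than_or_equal = less_than[1:] + [1.0]  # [0.25, 0.5, 0.75, 1.0, 1.0]
```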
When one-hot encoding columns in PySpark, column cardinality can become a problem. The size of the data often leads to an enormous number of unique values. If a minority of the values are common and the majority of the values are rare, you might want to represent the rare values as a single group. Note that this might not be appropriate for your problem; here’s some nice text describing the costs and benefits of this approach. In the following blog post I describe how to implement this solution.
I begin by importing the necessary libraries and creating a spark session.
Next, create the custom transformer. This class inherits from the Transformer, HasInputCol, and HasOutputCol classes. I also add an additional parameter, n, which controls the maximum cardinality allowed in the transformed column. Because I have the additional parameter, I need methods for getting and setting it (getN and setN). Finally, there’s _transform, which limits the cardinality of the desired column (set by the inputCol parameter). This transformation method simply takes the desired column and changes all values greater than n to n. It outputs a column named by the outputCol parameter.
```python
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param import Param
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql import functions as F


class LimitCardinality(Transformer, HasInputCol, HasOutputCol):
    """Limit cardinality of a column."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, n=None):
        """Initialize."""
        super(LimitCardinality, self).__init__()
        self.n = Param(self, "n", "Cardinality upper limit.")
        self._setDefault(n=25)
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None, n=None):
        """Set params."""
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setN(self, value):
        """Set cardinality limit."""
        return self._set(n=value)

    def getN(self):
        """Get cardinality limit."""
        return self.getOrDefault(self.n)

    def _transform(self, dataframe):
        """Replace all values greater than n with n."""
        out_col = self.getOutputCol()
        in_col = dataframe[self.getInputCol()]
        return dataframe.withColumn(
            out_col,
            F.when(in_col > self.getN(), self.getN()).otherwise(in_col),
        )
```
Now that we have the transformer, I will create some data and apply the transformer to it. I want categorical data, so I will randomly draw letters of the alphabet. The only trick is that I’ve made some letters much more common than others.
Now to apply the new class LimitCardinality after StringIndexer, which maps each category to a number (starting with the most common category, which becomes 0). LimitCardinality then caps StringIndexer’s output at n. OneHotEncoderEstimator one-hot encodes LimitCardinality’s output. I wrap StringIndexer, LimitCardinality, and OneHotEncoderEstimator into a single pipeline so that I can fit/transform the dataset in one step.
Note that LimitCardinality needs additional code in order to be saved to disk.
A quick improvement to LimitCardinality would be to choose a column’s cardinality so that X% of rows retain their category values and 100-X% receive the default value (rather than arbitrarily selecting a cardinality limit). I implement this below. Note that LimitCardinalityModel is identical to the original LimitCardinality. The new LimitCardinality has a _fit method rather than a _transform method, and this method determines the column’s cardinality.
In the _fit method I find how many category values are required to describe the requested proportion of the data.
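The core of that _fit logic, counting how many of the most common categories are needed to cover X% of rows, can be sketched in plain Python (the function name and threshold are illustrative):

```python
from collections import Counter


def cardinality_for_coverage(values, coverage=0.75):
    """Return how many of the most common categories are needed so that
    at least `coverage` proportion of rows keep their own value."""
    counts = Counter(values).most_common()  # sorted most to least common
    n = len(values)
    covered = 0
    for k, (category, count) in enumerate(counts, start=1):
        covered += count
        if covered / n >= coverage:
            return k
    return len(counts)


# 'a' alone covers 6/10 rows; 'a' plus 'b' covers 9/10
print(cardinality_for_coverage(list("aaaaaabbbc"), coverage=0.75))  # 2
```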
In a previous post, I described how to download and clean data for understanding how likely a baseball player is to hit into an error given that they hit the ball into play.
This analysis will statistically demonstrate that some players are more likely to hit into errors than others.
Errors are uncommon, so players hit into errors very infrequently. Estimating the likelihood of an infrequent event is hard and requires lots of data. To acquire as much data as possible, I wrote a bash script that will download data for all players between 1970 and 2018.
This data enables me to use data from multiple years for each player, giving me more data when estimating how likely a particular player is to hit into an error.
```bash
%%bash
for i in {1970..2018}; do
    echo "YEAR: $i"
    ../scripts/get_data.sh ${i}
done

find processed_data/* -type f -name 'errors_bip.out' |\
    xargs awk '{print $0", "FILENAME}' |\
    sed s1processed_data/11g1 |\
    sed s1/errors_bip.out11g1 > \
    processed_data/all_errors_bip.out
```
The data has 5 columns: playerid, playername, errors hit into, balls hit into play (BIP), and year. The file does not have a header.
I have almost 39,000 year-player combinations... a good amount of data to play with.
While exploring the data, I noticed that players hit into errors less frequently now than they used to. Let’s see how the probability that a player hits into an error has changed across the years.
Interestingly, the proportion of errors per BIP has been dropping over time. I am not sure if this is a conscious effort by MLB scorekeepers, a change in how hitters hit, or improved fielding (but I suspect it’s the scorekeepers). It looks like this drop in errors per BIP leveled off around 2015. Zooming in.
Again, errors are very rare, so I want to know how many “trials” (BIP) I need for a reasonable estimate of how likely each player is to hit into an error.
I’d like the majority of players to have at least 5 errors. I can estimate how many BIP that would require.
```python
5. / ERROR_RATE
# 423.65843413978496
```
Looks like I should require at least 425 BIP for each player. I round this to 500.
```python
GROUPED_DF = GROUPED_DF[GROUPED_DF["bip"] > 500]
GROUPED_DF = GROUPED_DF.reset_index(drop=True)
GROUPED_DF.head()
```
| | playerid | player_name | errors | bip |
|---|---|---|---|---|
| 0 | abrej003 | Jose Abreu | 20 | 1864 |
| 1 | adamm002 | Matt Adams | 6 | 834 |
| 2 | adrie001 | Ehire Adrianza | 2 | 533 |
| 3 | aguij001 | Jesus Aguilar | 2 | 551 |
| 4 | ahmen001 | Nick Ahmed | 12 | 1101 |
```python
GROUPED_DF.describe()
```

| | errors | bip |
|---|---|---|
| count | 354.000000 | 354.000000 |
| mean | 12.991525 | 1129.059322 |
| std | 6.447648 | 428.485467 |
| min | 1.000000 | 503.000000 |
| 25% | 8.000000 | 747.250000 |
| 50% | 12.000000 | 1112.000000 |
| 75% | 17.000000 | 1475.750000 |
| max | 37.000000 | 2102.000000 |
I’ve identified 354 players who have enough BIP for me to estimate how frequently they hit into errors.
Below, I plot how the likelihood of hitting into errors is distributed.
The question is whether someone who has hit into errors in 2% of their BIP is more likely to hit into an error than someone who has hit into errors in 0.5% of their BIP (or is this all just random variation).
To try and estimate this, I treat each BIP as a Bernoulli trial. Hitting into an error is a “success”. I use a Binomial distribution to model the number of “successes”. I would like to know if different players are more or less likely to hit into errors. To do this, I model each player as having their own Binomial distribution and ask whether p (the probability of success) differs across players.
To answer this question, I could use a chi-square contingency test, but this would only tell me whether players differ at all, not which players differ.
The traditional way to identify which players differ is to do pairwise comparisons, but this would result in TONS of comparisons, making false positives all but certain.
Another option is to harness Bayesian statistics and build a Hierarchical Beta-Binomial model. The intuition is that each player’s probability of hitting into an error is drawn from a Beta distribution. I want to know whether these Beta distributions are different. I then assume I can best estimate a player’s Beta distribution by using that particular player’s data AND data from all players together.
The model is built so that as I accrue data about a particular player, I will trust that data more and more, relying less and less on data from all players. This is called partial pooling. Here’s a useful explanation.
I largely based my analysis on this tutorial. Reference the tutorial for an explanation of how I choose my priors. I ended up using a greater lambda value (because the model sampled better) in the Exponential prior, and while this did lead to more extreme estimates of error likelihood, it didn’t change the basic story.
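In symbols, the hierarchy looks roughly like the following (this is my reading of the tutorial’s model, so treat the exact parameterization as an assumption):

```latex
\phi \sim \mathrm{Uniform}(0, 1)
    % global likelihood of hitting into an error
\kappa \sim \mathrm{Exponential}(\lambda)
    % concentration: how tightly players cluster around \phi
p_i \sim \mathrm{Beta}\big(\phi\,\kappa,\ (1 - \phi)\,\kappa\big)
    % player i's error likelihood (the "rates" below)
y_i \sim \mathrm{Binomial}(n_i,\ p_i)
    % player i's errors in n_i balls hit into play
```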
The most challenging parameter to fit is kappa, which governs the variance in the likelihood to hit into an error. I take a look at it to make sure things look as expected.
```python
pm.summary(trace, varnames=["kappa"])
```

| | mean | sd | mc_error | hpd_2.5 | hpd_97.5 | n_eff | Rhat |
|---|---|---|---|---|---|---|---|
| kappa | 927.587178 | 141.027597 | 4.373954 | 657.066554 | 1201.922608 | 980.288914 | 1.000013 |
```python
pm.traceplot(trace, varnames=['kappa']);
```
I can also look at phi, the estimated global likelihood to hit into an error.
```python
pm.traceplot(trace, varnames=['phi']);
```
Finally, I can look at how all players vary in their likelihood to hit into an error.
```python
pm.traceplot(trace, varnames=['rates']);
```
Obviously, the above plot is a lot to look at, so let’s order players by how likely the model believes they are to hit into an error.
It looks to me like players who hit more ground balls are more likely to hit into an error than players who predominantly hit fly balls and line drives. This makes sense, since infielders make more errors than outfielders.
Using the posterior distribution of estimated likelihoods to hit into an error, I can assign a probability to whether Carlos Correa is more likely to hit into an error than Daniel Murphy.
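With posterior samples in hand, that probability is just the fraction of samples in which one player’s rate exceeds the other’s. Here’s a toy sketch with made-up samples (the real comparison would index trace['rates'] by each player’s column):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-ins for two players' posterior samples of error likelihood
correa_samples = rng.normal(loc=0.015, scale=0.002, size=4000)
murphy_samples = rng.normal(loc=0.012, scale=0.002, size=4000)

# P(Correa is more likely to hit into an error than Murphy):
# the fraction of posterior samples where his rate is higher
prob = np.mean(correa_samples > murphy_samples)
print(prob)
```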
Finally, I can look exclusively at how the posterior distributions of the ten most likely and ten least likely players to hit into an error compare.
```python
sns.kdeplot(trace['rates', 1000:][:, [x[2] for x in sorted_means_se[-10:]]].flatten(),
            shade=True, label="10 Least Likely");
sns.kdeplot(trace['rates', 1000:][:, [x[2] for x in sorted_means_se[:10]]].flatten(),
            shade=True, label="10 Most Likely");
```
All in all, this analysis makes it obvious that some players are more likely to hit into errors than other players. This is probably driven by how often players hit ground balls.
You could summarize this post as “you will never regret good code practices” or “no project is too small for good code practices”.
You might think these recommendations are not worth the time when a project seems small, but projects often grow over time. If you use good practices from the start, you will reduce the technical debt your project accrues over time.
Here’s my list of coding Data Science lessons learned the hard way.
You will never regret using git.
You might think, “this project/query is just 15 minutes of code and I will never think about it again”. While this might be true, it often is not. If your project/query is useful, people will ask for it again with slight tweaks. With each ask, the project grows a little. By using git, you persist the project at all change points, acknowledge that the project will change over time, and prepare for multiple contributors.
Even if you never use these features, I’ve found that simply using git encourages other good practices.
Also, remember git can rescue you when things go wrong!
You will never regret good documentation.
Again, you might think, “this project is so tiny and simple, how could I ever forget how it works?” You will forget. Or another contributor will appreciate documentation.
The numpy documentation framework is great when working in Python. Its integration with Sphinx can save you a lot of time when creating non-code documentation.
I recently started documenting not only what the code is doing, but the business rule dictating what the code should do. Having both lets contributors know not only know the how of the code but also the why.
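Here’s a small sketch of what that looks like in a numpy-style docstring; the function and the business rule are both made up for illustration:

```python
def filter_active_users(users):
    """Return users considered active.

    Business rule
    -------------
    Marketing counts a user as "active" if they logged in within the
    last 30 days; this drives who receives the monthly newsletter.
    (Hypothetical rule, for illustration only.)

    Parameters
    ----------
    users : list of dict
        Each dict needs a ``days_since_login`` key.

    Returns
    -------
    list of dict
        Users with ``days_since_login`` <= 30.
    """
    return [u for u in users if u["days_since_login"] <= 30]


print(filter_active_users([{"days_since_login": 5}, {"days_since_login": 45}]))
# [{'days_since_login': 5}]
```

The “Business rule” section is the why; the rest is the how.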
You will never regret building unit-tests.
Again, this might feel like overkill in small projects, but even small projects have assumptions that should be tested. This is especially true when you add new features after stepping away from a project. By including unit-tests, you assure yourself that existing features did not break, making those pushes to production less scary.
Take the time to build infrastructure for gathering/generating sample/fake data.
I’ve found myself hesitant to build unit-tests because it’s hard to acquire/generate useful sample/fake data. Do not let this be a blocker to good code practices! Take the time to build infrastructure that makes good code practices easy.
This could mean taking the time to write code for building fake data. This could mean taking the time to acquire useful sample data. Maybe it’s both! Take the time to do it. You will not regret making it easy to write tests.
You will always find a Makefile useful.
Once you’ve built infrastructure for acquiring fake or sample data, you will need a way to bring this data into your current project. I’ve found Makefiles useful for this sort of thing. You can define a command that will download some sample data from s3 (or wherever) and save it to your repo (but don’t track these files on git!).
This way all contributors will have common testing data, stored outside of git, and can acquire it with a single, easy-to-remember command.
Makefiles are also great for installing or saving a project’s dependencies.
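A minimal sketch covering both uses (the s3 bucket, paths, and target names are hypothetical):

```make
# Note: recipe lines must be indented with a real tab character.

# fetch shared testing data (not tracked in git)
data/sample.parquet:
	mkdir -p data
	aws s3 cp s3://my-bucket/testing-data/sample.parquet $@

.PHONY: install test
install:
	pip install -r requirements.txt

test: data/sample.parquet
	pytest tests/
```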
Know your project’s dependencies.
Code ecosystems change over time. When you revisit a project after a break, the last thing you want is to guess what code dependencies have broken.
It doesn’t matter whether you save your project’s dependencies as an Anaconda environment, a requirements file, a virtualenv, or a docker image. Just make sure to save them. Any future contributors (including yourself!!) will thank you.
Most of these individual points seem obvious. The overarching point is that no project is too small for good code practices. Sure, you might think, “oh, this is just a single query”, but you will run that query again, or another member of your team will! While you shouldn’t build a repo for each query, building a repo for different sets of queries is not a bad idea.
I recently found myself wondering if some baseball players are more likely to hit into errors than others. In theory, the answer should be “no” since fielders produce errors regardless of who is hitting. Nonetheless, it’s also possible that some hitters “force” errors by hitting the ball harder or running to first base faster.
In order to evaluate this possibility, I found play-by-play data on retrosheet.org. This data contains a row for each event (e.g., a hit, a stolen base, etc.) in a baseball game. I’ve posted this analysis on github and will walk through it here.
The user is expected to input what year’s data they want. I write the code’s output for the year 2018 as comments. The code starts by downloading and unzipping the data.
The unzipped data contain play-by-play data in files with the .EVN or .EVA extensions. Each team’s home stadium has its own file. I combine all the play-by-play events (.EVN and .EVA files) into a single file. I then remove all non-batting events (e.g., balk, stolen base, etc.).
I also combine all the roster files (.ROS) into a single file.
```bash
# export play-by-play to a single file
mkdir processed_data/${YEAR}/
find raw_data/${YEAR}/ -regex '.*EV[A|N]' |\
    xargs grep play > \
    ./processed_data/${YEAR}/playbyplay.out

# get all plate appearances from data (and hitter);
# remove all non plate appearance rows
cat ./processed_data/${YEAR}/playbyplay.out |\
    awk -F',' '{print $4","$7}' |\
    grep -Ev ',[A-Z]{3}[0-9]{2}' |\
    grep -Ev ',(NP|BK|CS|DI|OA|PB|WP|PO|POCS|SB|FLE)' > \
    ./processed_data/${YEAR}/batters.out

# one giant roster file
find raw_data/${YEAR}/ -name '*ROS' |\
    xargs awk -F',' '{print $1" "$2" "$3}' > \
    ./processed_data/${YEAR}/players.out
```
In the next few code blocks I print some data just to see what I am working with. For instance, I print out the players with the most plate appearances. I was able to confirm these numbers with baseball-reference. This operation requires me to group by player and count the rows. I join this file with the roster file to get players’ full names.
```bash
echo "---------PLAYERS WITH MOST PLATE APPEARANCES----------"
cat ./processed_data/${YEAR}/batters.out |\
    awk -F, '{a[$1]++;} END {for (i in a) print i, a[i];}' |\
    sort -k2 -nr |\
    head > \
    ./processed_data/${YEAR}/most_pa.out

join <(sort -k 1 ./processed_data/${YEAR}/players.out) \
     <(sort -k 1 ./processed_data/${YEAR}/most_pa.out) |\
    uniq |\
    sort -k 4 -nr |\
    head |\
    awk '{print $3", "$2", "$4}'

#---------PLAYERS WITH MOST PLATE APPEARANCES----------
#Francisco, Lindor, 745
#Trea, Turner, 740
#Manny, Machado, 709
#Cesar, Hernandez, 708
#Whit, Merrifield, 707
#Freddie, Freeman, 707
#Giancarlo, Stanton, 706
#Nick, Markakis, 705
#Alex, Bregman, 705
#Marcus, Semien, 703
```
Here are the players with the most hits. Notice that I filter out all non-hits in the grep, then group by player.
```bash
echo "---------PLAYERS WITH MOST HITS----------"
cat ./processed_data/${YEAR}/batters.out |\
    grep -E ',(S|D|T|HR)' |\
    awk -F, '{a[$1]++;} END {for (i in a) print i, a[i];}' |\
    sort -k2 -nr |\
    head

#---------PLAYERS WITH MOST HITS----------
#merrw001 192
#freef001 191
#martj006 188
#machm001 188
#yelic001 187
#markn001 185
#castn001 185
#lindf001 183
#peraj003 182
#blacc001 182
```
Here are the players with the most at-bats.
```bash
echo "---------PLAYERS WITH MOST AT BATS----------"
cat ./processed_data/${YEAR}/batters.out |\
    grep -Ev 'SF|SH' |\
    grep -E ',(S|D|T|HR|K|[0-9]|E|DGR|FC)' |\
    awk -F, '{a[$1]++;} END {for (i in a) print i, a[i];}' > \
    ./processed_data/${YEAR}/abs.out

cat ./processed_data/${YEAR}/abs.out | sort -k2 -nr | head

#---------PLAYERS WITH MOST AT BATS----------
#turnt001 664
#lindf001 661
#albio001 639
#semim001 632
#peraj003 632
#merrw001 632
#machm001 632
#blacc001 626
#markn001 623
#castn001 620
```
And, finally, here are the players who hit into the most errors.
```bash
echo "---------PLAYERS WHO HIT INTO THE MOST ERRORS----------"
cat ./processed_data/${YEAR}/batters.out |\
    grep -Ev 'SF|SH' |\
    grep ',E[0-9]' |\
    awk -F, '{a[$1]++;} END {for (i in a) print i, a[i];}' > \
    ./processed_data/${YEAR}/errors.out

cat ./processed_data/${YEAR}/errors.out | sort -k2 -nr | head

#---------PLAYERS WHO HIT INTO THE MOST ERRORS----------
#gurry001 13
#casts001 13
#baezj001 12
#goldp001 11
#desmi001 11
#castn001 10
#bogax001 10
#andum001 10
#turnt001 9
#rojam002 9
```
Because players with more at-bats hit into more errors, I need to take the number of at-bats into account. I also filter out all players with fewer than 250 at-bats; I figure we only want players with lots of opportunities to create errors.
```bash
echo "---------PLAYERS WITH MOST ERRORS PER AT BAT----------"
join -e "0" -a1 -a2 <(sort -k 1 ./processed_data/${YEAR}/abs.out) -o 0 1.2 2.2 \
     <(sort -k 1 ./processed_data/${YEAR}/errors.out) |\
    uniq |\
    awk -v OFS=', ' '$2 > 250 {print $1, $3, $2, $3/$2}' > \
    ./processed_data/${YEAR}/errors_abs.out

cat ./processed_data/${YEAR}/errors_abs.out | sort -k 4 -nr | head

#---------PLAYERS WITH MOST ERRORS PER AT BAT----------
#pereh001, 8, 316, 0.0253165
#gurry001, 13, 537, 0.0242086
#andre001, 9, 395, 0.0227848
#casts001, 13, 593, 0.0219224
#desmi001, 11, 555, 0.0198198
#baezj001, 12, 606, 0.019802
#garca003, 7, 356, 0.0196629
#bogax001, 10, 512, 0.0195312
#goldp001, 11, 593, 0.0185497
#iglej001, 8, 432, 0.0185185
```
At-bats are great, but even better is to remove strikeouts and just look at occurrences when a player hit the ball into play. I remove all players with fewer than 450 balls hit into play, which limits us to just 37 players, but those players have enough reps to make the estimates more reliable.
```bash
echo "---------PLAYERS WITH MOST ERRORS PER BALL IN PLAY----------"
cat ./processed_data/${YEAR}/batters.out |\
    grep -Ev 'SF|SH' |\
    grep -E ',(S|D|T|HR|[0-9]|E|DGR|FC)' |\
    awk -F, '{a[$1]++;} END {for (i in a) print i, a[i];}' > \
    ./processed_data/${YEAR}/bip.out

join -e "0" -a1 -a2 <(sort -k 1 ./processed_data/${YEAR}/bip.out) -o 0 1.2 2.2 \
     <(sort -k 1 ./processed_data/${YEAR}/errors.out) |\
    uniq |\
    awk -v OFS=', ' '$2 > 450 {print $1, $3, $2, $3/$2}' > \
    ./processed_data/${YEAR}/errors_bip.out

cat ./processed_data/${YEAR}/errors_bip.out | sort -k 4 -nr | head

#---------PLAYERS WITH MOST ERRORS PER BALL IN PLAY----------
#casts001, 13, 469, 0.0277186
#gurry001, 13, 474, 0.0274262
#castn001, 10, 469, 0.021322
#andum001, 10, 476, 0.0210084
#andeb006, 9, 461, 0.0195228
#turnt001, 9, 532, 0.0169173
#simma001, 8, 510, 0.0156863
#lemad001, 7, 451, 0.0155211
#sancc001, 7, 462, 0.0151515
#freef001, 7, 486, 0.0144033
```
Now we have some data. In future posts I will explore how we can use statistics to evaluate whether some players are more likely to hit into errors than others.
Check out the companion post that statistically explores this question.
I’ve touched on this in past posts, but wanted to write a post specifically describing the power of what I call complex aggregations in PySpark.
The idea is that you have a data request which initially seems to require multiple different queries, but using ‘complex aggregations’ you can create the requested data in a single query (and a single shuffle).
Let’s say you have a dataset like the following. You have one column (id) which is a unique key for each user, another column (group) which expresses the group each user belongs to, and finally a column (value) which expresses the value of each user. I apologize for the contrived example.
Let’s say someone wants the average value of groups a, b, and c, AND the average value of users in group a or b, the average value of users in group b or c, AND the average value of users in group a or c. Adds a wrinkle, right? The ‘or’ clauses prevent us from using a simple groupby, and we don’t want to write four different queries.
Using complex aggregations, we can access all these different conditions in a single query.
Predeval is software designed to help you identify changes in a model’s output.
For instance, you might be tasked with building a model to predict churn. When you deploy this model in production, you have to wait to learn which users churned in order to know how your model performed. While Predeval will not free you from this wait, it can provide early signals as to whether the model is producing reasonable (i.e., expected) predictions. Unexpected predictions might reflect a poorly performing model. They might also reflect a change in your input data. Either way, something has changed and you will want to investigate further.
Using predeval, you can detect changes in model output ASAP. You can then use python’s libraries to build a surrounding alerting system that will signal a need to investigate. This system should give you additional confidence that your model is performing reasonably. Here’s a post where I configure an alerting system using python, mailutils, and postfix (although the alerting system is not built around predeval).
Predeval operates by forming expectations about what your model’s outputs will look like. For example, you might give predeval the model’s output from a validation dataset. Predeval will then compare new outputs to the outputs produced by the validation dataset, and will report whether it detects a difference.
Predeval works with models producing both categorical and continuous outputs.
Here’s an example of predeval with a model producing categorical outputs. Predeval will (by default) check whether all expected output categories are present, and whether the output categories occur at their expected frequencies (using a Chi-square test of independence of variables in a contingency table).
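The underlying frequency check can be sketched directly with scipy (the class labels and counts below are invented for illustration; this is the statistic predeval’s docs describe, not predeval’s own code):

```python
from scipy.stats import chi2_contingency

# category counts from a validation set vs. a new batch of predictions
expected_counts = [500, 300, 200]  # hypothetical classes: churn, maybe, stay
observed_counts = [250, 150, 100]  # same proportions, half the volume

stat, p_value, dof, _ = chi2_contingency([expected_counts, observed_counts])
print(p_value)  # a high p-value: no evidence the category frequencies changed
```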
Here’s an example of predeval with a model producing continuous outputs. Predeval will (by default) check whether the new outputs have a minimum lower than expected, a maximum greater than expected, a different mean, a different standard deviation, and whether the new outputs are distributed as expected (using a Kolmogorov-Smirnov test).
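And a sketch of that distributional check with scipy’s two-sample Kolmogorov-Smirnov test (the data are invented for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
validation_output = rng.normal(loc=0.0, scale=1.0, size=1000)
new_output = rng.normal(loc=0.5, scale=1.0, size=1000)  # mean has drifted

stat, p_value = ks_2samp(validation_output, new_output)
print(p_value < 0.05)  # True: the new outputs are distributed differently
```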