Dan Vatterott

Data Scientist

Data Onboarding Checklist

I'm a checklist addict. While our team has a document that covers acquiring different accounts, reviewing our workflows, reviewing code repos, etc., that document doesn't prioritize any of these tasks.

I created the following checklist (markdown) to give a condensed view of what a manager and new hire should accomplish by the end …

Posting Collections as Hive Tables

I was recently asked to post a series of parquet collections as tables so analysts could query them in SQL. This should be straightforward, but it took me a while to figure out. Hopefully, you find this post before spending too much time on such an easy task.

You should …

Creating a CDF in PySpark

Cumulative distribution functions (CDFs) are a useful tool for understanding your data. This tutorial will demonstrate how to create a CDF in PySpark.

I start by creating normally distributed, fake data.

 import numpy as np
 from pyspark.sql import SparkSession
 from pyspark import SparkContext
 from pyspark.sql import functions as F
 from pyspark …
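For readers who want to see what an empirical CDF computes before reading the full tutorial, here is a minimal sketch in plain NumPy. It is not the PySpark approach from the post (the excerpt above is truncated before that code appears). The function name `empirical_cdf`, the seed, and the sample size are all assumptions chosen for illustration.

```python
import numpy as np


def empirical_cdf(values):
    """Return the sorted values and the cumulative share of data at each one."""
    sorted_values = np.sort(np.asarray(values, dtype=float))
    n = sorted_values.size
    # The i-th smallest observation (1-indexed) is paired with i / n,
    # so the final observation always maps to 1.0.
    cumulative_share = np.arange(1, n + 1) / n
    return sorted_values, cumulative_share


# Normally distributed fake data; the seed is arbitrary and only
# makes the example repeatable.
rng = np.random.default_rng(seed=0)
fake_data = rng.normal(loc=0.0, scale=1.0, size=1000)

xs, cdf = empirical_cdf(fake_data)
```

Plotting `cdf` against `xs` gives the familiar S-shaped curve for normally distributed data. On data too large for one machine, the same idea would need a distributed ranking step, which is presumably what the PySpark tutorial covers.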

Data Science Lessons Learned the Hard Way: Coding

You could summarize this post as "you will never regret good code practices" or "no project is too small for good code practices".

You might think these recommendations are not worth the time when a project seems small, but projects often grow over time. If you use good practices from …

Complex Aggregations in PySpark

I've touched on this in past posts, but wanted to write a post specifically describing the power of what I call complex aggregations in PySpark.

The idea is that you have a data request that initially seems to require multiple different queries, but using 'complex aggregations' you can create …