Start up Docker and navigate to your Jupyter notebook!

Create a SparkSession.

SparkSession Docs

This is the entry point that will read in DataFrames, register tables, execute commands on tables, cache tables, etc.

I think of it as my Spark kernel (like the Python kernel you get when you type 'python' in the terminal).
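A minimal sketch of what that looks like in PySpark; the master setting and app name are just placeholders for a local Docker setup, so adjust them for your environment:

```python
from pyspark.sql import SparkSession

# Build (or reuse) the SparkSession -- the entry point for reading DataFrames,
# registering tables, and running SQL.
spark = (
    SparkSession.builder
    .master("local[*]")            # run locally, using all available cores
    .appName("spark-walkthrough")  # this name shows up in the Spark UI
    .getOrCreate()
)

print(spark.version)               # quick sanity check that the session is alive
```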

Navigate to http://localhost:4040/jobs/ in your browser.

Spend lots of time here!!

Monitoring Docs

This is the first place I look when optimizing a job.

We can see jobs, stages (which occur within jobs), cached tables, individual executors, and the lineage of DataFrames.
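If the UI looks empty at first, a small sketch like this (assuming the `spark` session from above; the data is made up purely to give the UI something to display) produces a job, a stage, and a cached table to poke at:

```python
df = spark.range(0, 1_000_000)   # a simple single-column DataFrame of ids

df.cache()                       # appears under the Storage tab once materialized
df.count()                       # an action -> triggers a job (check the Jobs tab)

df.explain()                     # prints the physical plan / lineage of the DataFrame
```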

Read data into Spark.

DataFrameReader Docs

Instructions for how Spark should read the data.

Remember that each of your executors will receive a portion of the data (Spark splits the input into partitions and distributes them across executors).
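A sketch of one common case, reading a CSV; the path and options here are placeholders, so swap in your own file, format, and settings:

```python
df = (
    spark.read
    .option("header", "true")        # first row holds column names
    .option("inferSchema", "true")   # let Spark guess types (costs an extra pass over the data)
    .csv("data/example.csv")         # hypothetical path
)

df.printSchema()   # confirm the columns and types Spark picked up
df.show(5)         # peek at the first few rows
```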

Quick note on data structures.

RDDs

DataFrames
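Here's a tiny sketch contrasting the two, again assuming the `spark` session from above (the sample data and column names are made up):

```python
# RDD: low-level distributed collection of arbitrary Python objects.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("c", 3)])
print(rdd.map(lambda kv: (kv[0], kv[1] * 10)).collect())

# DataFrame: distributed table with named columns and an optimizer behind it.
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "number"])
df.select("letter", (df.number * 10).alias("number_x10")).show()
```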

So far so good??