What is a shuffle?

Distributed computing works best when executors work independently.

Shuffles are operations where the executors CANNOT work independently.

A shuffle is when your executors must exchange data with each other, with the driver coordinating.
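
To make the idea concrete, here is a minimal pure-Python sketch of what a shuffle does. It is not Spark code: the "executors" are plain lists, and `target_partition` and `shuffle` are hypothetical helper names that stand in for Spark's hash partitioning. The point is that records sharing a key must end up on the same executor, so some of them have to move.

```python
import zlib


def target_partition(key: str, num_partitions: int) -> int:
    """Pick the partition that owns a key (deterministic stand-in for a hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(partitions):
    """Redistribute (key, value) records so that equal keys share one partition.

    Returns the new partitions and the number of records that changed partition.
    """
    n = len(partitions)
    out = [[] for _ in range(n)]
    moved = 0
    for source, records in enumerate(partitions):
        for key, value in records:
            dest = target_partition(key, n)
            if dest != source:
                moved += 1  # this record would cross the network in a real cluster
            out[dest].append((key, value))
    return out, moved


# Three simulated executors, each holding one partition of the data.
executors = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
    [("a", 6), ("c", 7)],
]

shuffled, moved = shuffle(executors)
print(f"{moved} of 7 records changed executor")
```

Because key "a" starts on two different executors, at least one record has to move no matter which partition ends up owning it. That movement is the work the executors cannot do independently.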

Common shuffle triggers: joins, groupBys, orderBys, window functions, repartitions, aggregations.

Revisiting the execution plan.

Notice the 2 hash aggregations and the hash exchange.
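
The pattern of two hash aggregations around one hash exchange can be sketched in plain Python. This assumes the plan is for a grouped sum, which the notes do not state; the function names `hash_aggregate` and `hash_exchange` are hypothetical and only mirror the plan's operators. The first aggregation runs inside each partition, the exchange moves the partial results to the partition that owns each key, and the second aggregation combines them.

```python
import zlib
from collections import defaultdict


def hash_aggregate(records):
    """Sum values per key within a single partition."""
    sums = defaultdict(int)
    for key, value in records:
        sums[key] += value
    return dict(sums)


def hash_exchange(partials, num_partitions):
    """Route each (key, partial_sum) to the partition that owns the key."""
    out = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, value in partial.items():
            dest = zlib.crc32(key.encode("utf-8")) % num_partitions
            out[dest].append((key, value))
    return out


partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
    [("a", 6), ("c", 7)],
]

# Step 1: first hash aggregation, local to each partition (no data movement).
partials = [hash_aggregate(p) for p in partitions]

# Step 2: hash exchange, the actual shuffle.
exchanged = hash_exchange(partials, len(partitions))

# Step 3: second hash aggregation, combining the partial sums per key.
result = {}
for part in exchanged:
    result.update(hash_aggregate(part))

print(result)
```

Aggregating before the exchange means only 6 partial sums cross the exchange instead of the 7 raw records. On real data with many rows per key, the saving is usually far larger.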

Shuffle operations also incur I/O costs: shuffle data is written to disk and then read over the network.