I’ve touched on this in past posts, but wanted to write a post specifically describing the power of what I call complex aggregations in PySpark.
The idea is that you have a data request which initially seems to require multiple different queries, but using ‘complex aggregations’ you can create the requested data using a single query (and a single shuffle).
Let’s say you have a dataset like the following. You have one column (id) which is a unique key for each user, another column (group) which expresses the group that each user belongs to, and finally (value) which expresses the value of each user. I apologize for the contrived example.
Let’s say someone wants the average value of groups a, b, and c, AND the average value of users in group a OR b, the average value of users in group b OR c, AND the average value of users in group a OR c. Adds a wrinkle, right? The ‘or’ clauses prevent us from using a simple groupby, and we don’t want to have to write 4 different queries.
Using complex aggregations, we can access all these different conditions in a single query.
The key here is using `when` to filter different data in and out of different aggregations.
This approach can be quite concise when used with python list comprehensions. I’ll rewrite the query above, but using a list comprehension.
Voila! Hope you find this little trick helpful! Let me know if you have any questions or comments.