In SQL it’s easy to find people in one list who are not in a second list (i.e., the “not in” command), but there is no similar command in PySpark. Well, at least not a command that doesn’t involve collecting the second list onto the master instance.
Check the note at the bottom regarding “anti joins”. Using an anti join is much cleaner than the code described here.
Here is a tidbit of code which replicates SQL’s “not in” command, while keeping your data with the workers (it will require a shuffle).
I recently gave the PySpark documentation a more thorough reading and realized that PySpark’s join command has a left_anti option. The left_anti option produces the same functionality as described above, but in a single join command (no need to create a dummy column and filter).
For example, the following code will produce rows in b where the id value is not present in a.