
'Is Not in' With Pyspark

In SQL it’s easy to find people in one list who are not in a second list (i.e., the “not in” command), but there is no equivalent command in PySpark. Well, at least not one that avoids collecting the second list onto the driver.

Here is a tidbit of code which replicates SQL’s “not in” command, while keeping your data with the workers (it will require a shuffle).

I start by creating some small dataframes.

import pyspark
from pyspark.sql import functions as F

# two small dataframes that share an 'id' column
a = sc.parallelize([[1, 'a'], [2, 'b'], [3, 'c']]).toDF(['id', 'valueA'])
b = sc.parallelize([[1, 'a'], [4, 'd'], [5, 'e']]).toDF(['id', 'valueB'])

Take a quick look at dataframe a.

a.show()
+---+------+
| id|valueA|
+---+------+
|  1|     a|
|  2|     b|
|  3|     c|
+---+------+

And dataframe b.

b.show()
+---+------+
| id|valueB|
+---+------+
|  1|     a|
|  4|     d|
|  5|     e|
+---+------+
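
For contrast, this is what the collect-based workaround mentioned in the intro looks like; a minimal sketch (the ids_in_a variable is my own), fine for tiny tables, but it funnels a’s ids through a single machine:

# collect a's ids onto the driver -- the step the intro warns about
ids_in_a = [row['id'] for row in a.select('id').distinct().collect()]
# keep only the rows of b whose id is not in that local list
b.filter(~F.col('id').isin(ids_in_a)).show()

The rest of the walkthrough keeps the comparison distributed instead.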

I create a new column in a that is all ones. I could have used an existing column, but this way I know the column is never null. That matters because the filter below treats a null in this column as “not in a”, so an existing column containing nulls could mislabel matched rows.

a = a.withColumn('inA', F.lit(1))
a.show()
+---+------+---+
| id|valueA|inA|
+---+------+---+
|  1|     a|  1|
|  2|     b|  1|
|  3|     c|  1|
+---+------+---+

I join b to a with a left join. This way, every row of b whose id is not in a ends up with a null in the “inA” column.

b.join(a, 'id', 'left').show()
+---+------+------+----+
| id|valueB|valueA| inA|
+---+------+------+----+
|  5|     e|  null|null|
|  1|     a|     a|   1|
|  4|     d|  null|null|
+---+------+------+----+

By keeping only the rows of this joined dataframe where “inA” is null, I remove every row of b whose id also appears in a. The result, c, is b “not in” a.

c = b.join(a, 'id', 'left').filter(F.col('inA').isNull())
c.show()
+---+------+------+----+
| id|valueB|valueA| inA|
+---+------+------+----+
|  5|     e|  null|null|
|  4|     d|  null|null|
+---+------+------+----+
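
As an aside, recent Spark versions (2.x and later) build this pattern into the dataframe API as a “left anti” join; a one-liner sketch, not part of the original walkthrough:

# a left anti join keeps only the rows of b with no matching id in a
b.join(a, 'id', 'left_anti').show()

A left anti join also returns only b’s columns, so it spares you the helper “inA” column; it should print just the id and valueB values for ids 4 and 5, the same rows as c above.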
