When one-hot encoding columns in PySpark, column cardinality can become a problem. The size of the data often leads to an enormous number of unique values. If a minority of the values are common and the majority are rare, you might want to represent the rare values as a single group. Note that this might not be appropriate for your problem; there are good discussions elsewhere of the costs and benefits of this approach. In this blog post I describe how to implement the solution.
I begin by importing the necessary libraries and creating a Spark session.
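A minimal setup looks something like this (a sketch assuming Spark 2.3/2.4, where `OneHotEncoderEstimator` lives in `pyspark.ml.feature`; the app name is an arbitrary choice):

```python
import random
import string

import pyspark.sql.functions as F
from pyspark import keyword_only
from pyspark.ml import Estimator, Model, Pipeline, Transformer
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql import SparkSession

# create (or reuse) a spark session
spark = SparkSession.builder.appName("limit_cardinality").getOrCreate()
```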
Next I create the custom transformer. This class inherits from the `Transformer`, `HasInputCol`, and `HasOutputCol` classes. I also add a parameter `n`, which controls the maximum cardinality allowed in the transformed column. Because of this additional parameter, I need methods for getting and setting it (`setN` and `getN`). Finally, there's `_transform`, which limits the cardinality of the desired column (set by the `inputCol` parameter). This transformation simply takes the desired column and changes all values greater than `n` to `n`. It outputs a column named by the `outputCol` parameter.
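Here is a sketch of the transformer, assuming the imports from the setup block above; the `Param` boilerplate follows the standard `pyspark.ml` pattern:

```python
class LimitCardinality(Transformer, HasInputCol, HasOutputCol):
    """Cap an indexed categorical column at a maximum cardinality n."""

    n = Param(Params._dummy(), "n",
              "Maximum cardinality allowed in the output column.",
              typeConverter=TypeConverters.toInt)

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, n=None):
        super(LimitCardinality, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None, n=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setN(self, value):
        return self._set(n=value)

    def getN(self):
        return self.getOrDefault(self.n)

    def _transform(self, dataset):
        # collapse every index greater than n into the single group n
        return dataset.withColumn(
            self.getOutputCol(),
            F.when(F.col(self.getInputCol()) > self.getN(), self.getN())
             .otherwise(F.col(self.getInputCol())),
        )
```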
Now that we have the transformer, I will create some data and apply the transformer to it. I want categorical data, so I will randomly draw letters of the alphabet. The only trick is that I've made some letters much more common than others.
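One way to do this (the 80/20 split between common and rare letters, the seed, and the row count are all arbitrary choices for illustration):

```python
random.seed(42)  # arbitrary seed for reproducibility

# three "common" letters and the rest of the alphabet as "rare" letters
common_letters = list("abc")
rare_letters = list(string.ascii_lowercase[3:])

# roughly 80% of draws come from the common letters
letters = [
    random.choice(common_letters) if random.random() < 0.8
    else random.choice(rare_letters)
    for _ in range(1000)
]

df = spark.createDataFrame([(l,) for l in letters], ["letter"])
```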
Take a look at the data.
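Counting rows per letter makes the skew obvious:

```python
# the common letters should dominate the counts
df.groupBy("letter").count().orderBy(F.desc("count")).show()
```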
Now to apply the new class `LimitCardinality` after `StringIndexer`, which maps each category to a number, starting with the most common category. This means the most common letter will be 0. `LimitCardinality` then caps `StringIndexer`'s output at `n`. `OneHotEncoderEstimator` one-hot encodes `LimitCardinality`'s output. I wrap `StringIndexer`, `LimitCardinality`, and `OneHotEncoderEstimator` into a single pipeline so that I can fit and transform the dataset in one step.
Note that `LimitCardinality` needs additional code in order to be saved to disk.
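One common approach (assuming Spark 2.3+) is to mix in the default param-based persistence helpers from `pyspark.ml.util`; the subclass name here is hypothetical:

```python
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

# mixing in the default param-based reader/writer is one way to make
# the transformer usable with pipeline save()/load()
class PersistableLimitCardinality(LimitCardinality,
                                  DefaultParamsReadable,
                                  DefaultParamsWritable):
    pass
```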
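Putting the pipeline together might look like this (the column names and `n=5` are illustrative choices):

```python
indexer = StringIndexer(inputCol="letter", outputCol="letter_ind")
limiter = LimitCardinality(inputCol="letter_ind", outputCol="letter_capped", n=5)
encoder = OneHotEncoderEstimator(inputCols=["letter_capped"],
                                 outputCols=["letter_vec"])

# fit and transform the dataset in one step
pipeline = Pipeline(stages=[indexer, limiter, encoder])
fitted = pipeline.fit(df)
fitted.transform(df).show(5)
```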
A quick improvement to `LimitCardinality` would be to choose a column's cardinality so that X% of rows retain their category values and the remaining (100-X)% receive the default value, rather than arbitrarily selecting a cardinality limit. I implement this below. Note that `LimitCardinalityModel` is identical to the original `LimitCardinality`. The new `LimitCardinality` has a `_fit` method rather than `_transform`, and this method determines the column's cardinality.
In the `_fit` method I find the smallest number of categories needed to cover the requested proportion of the data.
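A sketch of the estimator/model pair, again assuming the imports above; the parameter name `proportion` is my choice, and the cumulative-count logic in `_fit` follows the description above:

```python
class LimitCardinalityModel(Model, HasInputCol, HasOutputCol):
    """Fitted model; behaves like the original LimitCardinality transformer."""

    n = Param(Params._dummy(), "n",
              "Maximum cardinality allowed in the output column.",
              typeConverter=TypeConverters.toInt)

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, n=None):
        super(LimitCardinalityModel, self).__init__()
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def _transform(self, dataset):
        n = self.getOrDefault(self.n)
        # collapse every index greater than n into the single group n
        return dataset.withColumn(
            self.getOutputCol(),
            F.when(F.col(self.getInputCol()) > n, n)
             .otherwise(F.col(self.getInputCol())),
        )


class LimitCardinality(Estimator, HasInputCol, HasOutputCol):
    """Pick n so that `proportion` of rows keep their own category."""

    proportion = Param(Params._dummy(), "proportion",
                       "Fraction of rows that should retain their category.",
                       typeConverter=TypeConverters.toFloat)

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, proportion=None):
        super(LimitCardinality, self).__init__()
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def _fit(self, dataset):
        target = self.getOrDefault(self.proportion)
        total = dataset.count()

        # category counts, most common first; because StringIndexer assigns
        # indices by descending frequency, keeping the top n categories is
        # the same as keeping indices 0 .. n-1
        counts = (dataset.groupBy(self.getInputCol()).count()
                  .orderBy(F.desc("count")).collect())

        covered, n = 0, 0
        for row in counts:
            if covered / total >= target:
                break
            covered += row["count"]
            n += 1

        return LimitCardinalityModel(inputCol=self.getInputCol(),
                                     outputCol=self.getOutputCol(), n=n)
```

The new estimator drops into the earlier pipeline unchanged, for example as `LimitCardinality(inputCol="letter_ind", outputCol="letter_capped", proportion=0.8)`.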
There are other options for dealing with high-cardinality columns, such as clustering or mean (target) encoding schemes.
I hope you find this useful; reach out if you have any questions.