# Balancing Model Weights in PySpark

Imbalanced classes are a common problem. Scikit-learn provides an easy fix: “balanced” class weights, which make models such as logistic regression more likely to predict the less common classes.

The PySpark ML API doesn’t have this same functionality, so in this blog post, I describe how to balance class weights yourself.

First, generate some random data and load it into a Spark DataFrame. Note that the input variables are not predictive, so the model will behave randomly. This is okay, since I am not interested in model accuracy.

Here’s how Scikit-learn computes class weights when “balanced” weights are requested.

Here’s how we can compute “balanced” weights with data from a PySpark DataFrame.

PySpark needs a weight assigned to each instance (i.e., row) in the training set, so I map each label to its class weight in a new column.

I assemble all the input features into a vector.

And train a logistic regression. Without the instance weights, the model predicts all instances as the frequent class.

+---------------+
|avg(prediction)|
+---------------+
|            1.0|
+---------------+


With the weights, the model assigns half the instances to each class (even the less common one).

+---------------+
|avg(prediction)|
+---------------+
|         0.5089|
+---------------+