Dan Vatterott

Data Scientist

Revisiting NBA Career Predictions From Rookie Performance...again

Now that the NBA season is done, we have complete data from this year’s NBA rookies. In the past I have tried to predict NBA rookies’ future performance using regression models. In this post I am again trying to predict rookies’ future performance, but now using a classification approach. When using a classification approach, I predict whether player X will be a “great,” “average,” or “poor” player rather than predicting exactly how productive player X will be.

Much of this post re-uses code from the previous posts, so I skim over some of the repeated code.

As usual, I will post all code as a jupyter notebook on my github.

#import some libraries and tell ipython we want inline figures rather than interactive figures. 
import matplotlib.pyplot as plt, pandas as pd, numpy as np, matplotlib as mpl

from __future__ import print_function

%matplotlib inline
plt.style.use('ggplot') #im addicted to ggplot. so pretty.

Load the data. Reminder - this data is available on my github.

rookie_df = pd.read_pickle('nba_bballref_rookie_stats_2016_Apr_16.pkl') #here's the rookie year data

rook_games = rookie_df['Career Games']>50 #only attempting to predict players that have played at least 50 games
rook_year = rookie_df['Year']>1980 #only attempting to predict players from after 1980

#remove rookies from before 1980 and who have played less than 50 games. I also remove some features that seem irrelevant or unfair
rookie_df_games = rookie_df[rook_games & rook_year] #only players with more than 50 games. 
rookie_df_drop = rookie_df_games.drop(['Year','Career Games','Name'],1)

Load more data, and normalize the data for the PCA transformation.

from sklearn.preprocessing import StandardScaler

df = pd.read_pickle('nba_bballref_career_stats_2016_Apr_15.pkl')
df = df[df['G']>50]
df_drop = df.drop(['Year','Name','G','GS','MP','FG','FGA','FG%','3P','2P','FT','TRB','PTS','ORtg','DRtg','PER','TS%','3PAr','FTr','ORB%','DRB%','TRB%','AST%','STL%','BLK%','TOV%','USG%','OWS','DWS','WS','WS/48','OBPM','DBPM','BPM','VORP'],1)
X = df_drop.as_matrix() #take data out of dataframe
ScaleModel = StandardScaler().fit(X) #make sure each feature has 0 mean and unit variance. 
X = ScaleModel.transform(X)

In the past I used k-means to group players according to their performance (see my post on grouping players for more info). Here, I use a gaussian mixture model (GMM) to group the players. I use the GMM model because it assigns each player a “soft” label rather than a “hard” label. By soft label I mean that a player simultaneously belongs to several groups. For instance, Russell Westbrook belongs to both my “point guard” group and my “scorers” group. K-means uses hard labels where each player can only belong to one group. I think the GMM model provides a more accurate representation of players, so I’ve decided to use it in this post. Maybe in a future post I will spend more time describing it.

For anyone wondering, the GMM groupings looked pretty similar to the k-means groupings.
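If you want to see the hard vs. soft label distinction for yourself, here’s a quick toy sketch (random 2D data, not the NBA data; note that newer scikit-learn versions rename GMM to GaussianMixture):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

np.random.seed(0)
toy_data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [4, 4]]) #two blobs of points

hard_labels = KMeans(n_clusters=2).fit_predict(toy_data) #k-means: each point gets exactly one group
soft_labels = GaussianMixture(n_components=2).fit(toy_data).predict_proba(toy_data) #GMM: each point gets a probability for every group

print(hard_labels[:3]) #three single labels
print(soft_labels[:3].round(3)) #three rows of probabilities that each sum to 1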

from sklearn.mixture import GMM
from sklearn.decomposition import PCA

reduced_model = PCA(n_components=5, whiten=True).fit(X)
reduced_data = reduced_model.transform(X) #transform data into the 5 PCA components space

g = GMM(n_components=6).fit(reduced_data) #6 clusters. like the k-means model
new_labels = g.predict(reduced_data)

predictions = g.predict_proba(reduced_data) #generate values describing "how much" each player belongs to each group 
for x in np.unique(new_labels):
    Label = 'Category%d' % x
    df[Label] = predictions[:,x]

In the past I have attempted to predict win shares per 48 minutes. I am using win shares as the dependent variable again, but this time I want to categorize players.

Below I create a histogram of players’ win shares per 48.

I split players into 4 groups which I will refer to as “poor,” “below average,” “above average,” and “great”: poor players are the bottom 10% in win shares per 48, below average players are the 10th-50th percentiles, above average players are the 50th-90th percentiles, and great players are the top 10%. This assignment scheme is relatively arbitrary; the model performs similarly with different assignment schemes.

plt.hist(df['WS/48']);
df['perf_cat'] = 0
df.loc[df['WS/48'] < np.percentile(df['WS/48'],10),'perf_cat'] = 1 #category 1 players are bottom 10%
df.loc[(df['WS/48'] < np.percentile(df['WS/48'],50)) & (df['WS/48'] >= np.percentile(df['WS/48'],10)),'perf_cat'] = 2
df.loc[(df['WS/48'] < np.percentile(df['WS/48'],90)) & (df['WS/48'] >= np.percentile(df['WS/48'],50)),'perf_cat'] = 3
df.loc[df['WS/48'] >= np.percentile(df['WS/48'],90),'perf_cat'] = 4 #category 4 players are top 10%
perc_in_cat = [np.mean(df['perf_cat']==x) for x in np.unique(df['perf_cat'])];
perc_in_cat #print % of players in each category as a sanity check
[0.096314496314496317,
 0.40196560196560199,
 0.39950859950859952,
 0.10221130221130222]

My goal is to use rookie year performance to classify players into these 4 categories. I have a big matrix with lots of data about rookie year performance, but the reason that I grouped players using the GMM is that I suspect players in the different groups have different “paths” to success. I am including the groupings in my classification model and computing interaction terms. The interaction terms will allow rookie performance to produce different predictions for the different groups.
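If the interaction terms are unfamiliar, here is a tiny toy example (unrelated to the rookie data) of what PolynomialFeatures with interaction_only=True produces: a constant, the original features, and every pairwise product.

from sklearn import preprocessing
import numpy as np

toy = np.array([[2.0, 3.0, 5.0]]) #one observation with three features
poly_toy = preprocessing.PolynomialFeatures(2, interaction_only=True)
print(poly_toy.fit_transform(toy)) #[[ 1. 2. 3. 5. 6. 10. 15.]]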

By including interaction terms, I include quite a few predictor features. I’ve printed the number of predictor features and the number of predicted players below.

from sklearn import preprocessing

df_drop = df[df['Year']>1980]
for x in np.unique(new_labels):
    Label = 'Category%d' % x
    rookie_df_drop[Label] = df_drop[Label] #give rookies the groupings produced by the GMM model

X = rookie_df_drop.as_matrix() #take data out of dataframe   

poly = preprocessing.PolynomialFeatures(2,interaction_only=True) #create interaction terms.
X = poly.fit_transform(X)

Career_data = df[df['Year']>1980]
Y = Career_data['perf_cat'] #get predictor data
print(np.shape(X))
print(np.shape(Y))
(1703, 1432)
(1703,)

Now that I have all the features, it’s time to try and predict which players will be poor, below average, above average, and great. To create these predictions, I will use a logistic regression model.

Because I have so many predictors, correlation between predictor features and over-fitting the data are major concerns. I use regularization and cross-validation to combat these issues.

Specifically, I am using L2 regularization and 5-fold cross-validation. Within the cross-validation, I am trying to estimate how much regularization is appropriate.
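For reference, here is a rough sketch of how one could let LogisticRegressionCV search over a grid of regularization strengths rather than handing it a single value (the grid below is an arbitrary choice, and this search would be slow on a matrix this size):

from sklearn import linear_model
import numpy as np

Cs_grid = np.logspace(-4, 0, 10) #log-spaced candidate values; smaller C means stronger regularization
logreg_search = linear_model.LogisticRegressionCV(Cs=Cs_grid, cv=5, penalty='l2', n_jobs=-1,
                                                  class_weight='balanced', max_iter=15000,
                                                  multi_class='multinomial')
logreg_search.fit(X, Y) #X and Y are the feature matrix and labels created above
print(logreg_search.C_) #the regularization strength the cross-validation selected for each class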

Some important notes: I am using “balanced” class weights, which tells the model that it is worse to incorrectly predict the poor and great players than the below average and above average players. I do this because I don’t want the model to completely ignore the less frequent classifications. Second, I use the multinomial multi_class option because it limits the number of models I have to fit.

from sklearn import linear_model
from sklearn.metrics import accuracy_score

logreg = linear_model.LogisticRegressionCV(Cs=[0.0008], cv=5, penalty='l2',n_jobs=-1, class_weight='balanced',
                                           max_iter=15000, multi_class='multinomial')

est = logreg.fit(X, Y)
score = accuracy_score(Y,est.predict(X)) #calculate the % correct 
print(score)
0.738109219025

Okay, the model did pretty well, but let’s look at where the errors are coming from. To visualize the model’s accuracy, I am using a confusion matrix. In a confusion matrix, every item on the diagonal is a correctly classified item. Every item off the diagonal is incorrectly classified. The color bar’s axis is the percent correct, so the dark blue squares represent cells with more items.

It seems the model is best at predicting poor players and great players. It makes more errors when trying to predict the more average players.

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y, est.predict(X))

def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap,vmin=0.0, vmax=1.0)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(np.unique(df['perf_cat'])))
    plt.xticks(tick_marks, np.unique(df['perf_cat']))
    plt.yticks(tick_marks, np.unique(df['perf_cat']))
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')

Let’s look at what the model predicts for this year’s rookies. Below I modified two functions that I wrote for a previous post. The first function finds a particular year’s draft picks. The second function produces predictions for each draft pick.

def gather_draftData(Year):

    import urllib2
    from bs4 import BeautifulSoup
    import pandas as pd
    import numpy as np

    draft_len = 30

    def convert_float(val):
        try:
            return float(val)
        except ValueError:
            return np.nan

    url = 'http://www.basketball-reference.com/draft/NBA_'+str(Year)+'.html'
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html,"lxml")

    draft_num = [soup.findAll('tbody')[0].findAll('tr')[i].findAll('td')[0].text for i in range(draft_len)]
    draft_nam = [soup.findAll('tbody')[0].findAll('tr')[i].findAll('td')[3].text for i in range(draft_len)]

    draft_df = pd.DataFrame([draft_num,draft_nam]).T
    draft_df.columns = ['Number','Name']
    draft_df.index = range(np.size(draft_df,0)) #make sure the draft dataframe has a 0..n index
    return draft_df

def player_prediction__regressionModel(PlayerName):

    clust_df = pd.read_pickle('nba_bballref_career_stats_2016_Apr_15.pkl')
    clust_df = clust_df[clust_df['Name']==PlayerName]
    clust_df = clust_df.drop(['Year','Name','G','GS','MP','FG','FGA','FG%','3P','2P','FT','TRB','PTS','ORtg','DRtg','PER','TS%','3PAr','FTr','ORB%','DRB%','TRB%','AST%','STL%','BLK%','TOV%','USG%','OWS','DWS','WS','WS/48','OBPM','DBPM','BPM','VORP'],1)
    new_vect = ScaleModel.transform(clust_df.as_matrix().reshape(1,-1))
    reduced_data = reduced_model.transform(new_vect)
    predictions = g.predict_proba(reduced_data)
    for x in np.unique(new_labels):
        Label = 'Category%d' % x
        clust_df[Label] = predictions[:,x]

    Predrookie_df = pd.read_pickle('nba_bballref_rookie_stats_2016_Apr_16.pkl')
    Predrookie_df = Predrookie_df[Predrookie_df['Name']==PlayerName]
    Predrookie_df = Predrookie_df.drop(['Year','Career Games','Name'],1)
    for x in np.unique(new_labels):
        Label = 'Category%d' % x
        Predrookie_df[Label] = clust_df[Label] #give rookies the groupings produced by the GMM model
    predX = Predrookie_df.as_matrix() #take data out of dataframe
    predX = poly.fit_transform(predX)
    predictions2 = est.predict_proba(predX)
    return {'Name':PlayerName,'Group':predictions,'Prediction':predictions2[0]}

Below I create a plot depicting the model’s predictions. On the y-axis are the four classifications. On the x-axis are the players from the 2015 draft. Each cell in the plot is the probability of a player belonging to one of the classifications. Again, dark blue means a classification is more likely. Good news for us T-Wolves fans! The model loves KAT.

draft_df = gather_draftData(2015)

draft_df['Name'][14] =  'Kelly Oubre Jr.' #annoying name inconsistencies 

plt.subplots(figsize=(14,6));

draft_df = draft_df.drop(25, 0) #spurs' 1st round pick has not played yet

predictions = []
for name in draft_df['Name']:
    draft_num = draft_df[draft_df['Name']==name]['Number']
    predict_dict = player_prediction__regressionModel(name)
    predictions.append(predict_dict['Prediction'])

plt.imshow(np.array(predictions).T, interpolation='nearest', cmap=plt.cm.Blues,vmin=0.0, vmax=1.0)
plt.title('Predicting Future Performance of 2015-16 Rookies')
plt.colorbar(shrink=0.25)
tick_marks = np.arange(len(np.unique(df['perf_cat'])))
plt.xticks(range(0,29),draft_df['Name'],rotation=90)
plt.yticks(range(0,4), ['Poor','Below Average','Above Average','Great'])
plt.tight_layout()
plt.ylabel('Prediction')
plt.xlabel('Draft Position');

Creating Videos of NBA Action With Sportsvu Data

All basketball teams have a camera system called SportVU installed in their arenas. These camera systems track players and the ball throughout a basketball game.

The data produced by SportVU camera systems used to be freely available on NBA.com, but it was recently removed (I have no idea why). Luckily, the data for about 600 games are available on neilmj’s github. In this post, I show how to create a video recreation of a given basketball play using the SportVU data.

This code is also available as a jupyter notebook on my github.

#import some libraries
import matplotlib.pyplot as plt, pandas as pd, numpy as np, matplotlib as mpl
from __future__ import print_function

mpl.rcParams['font.family'] = ['Bitstream Vera Sans']

The data is provided as a json. Here’s how to import the python json library and load the data. I’m a T-Wolves fan, so the game I chose is a wolves game.

import json #import json library
json_data = open('/home/dan-laptop/github/BasketballData/2016.NBA.Raw.SportVU.Game.Logs/0021500594.json') #import the data from wherever you saved it.
data = json.load(json_data) #load the data

Let’s take a quick look at the data. It’s a dictionary with three keys: gamedate, gameid, and events. Gamedate and gameid are the date of this game and its specific id number, respectively. Events is the structure with data we’re interested in.

data.keys()
[u'gamedate', u'gameid', u'events']

Let’s take a look at the first event. Each event has an associated eventId number, which we will use later. There’s also data for each player on the visiting and home teams, which we will use later too. Finally, and most importantly, there are the “moments.” There are 25 moments for each second of the “event” (the data is sampled at 25 Hz).

data['events'][0].keys()
[u'eventId', u'visitor', u'moments', u'home']

Here’s the first moment of the first event. The first number is the quarter. The second number is the time of the event in milliseconds (a unix timestamp). The third number is the number of seconds left in the quarter (the 1st quarter hasn’t started yet, so 12 * 60 = 720). The fourth number is the number of seconds left on the shot clock. I am not sure what the fifth number (None) represents.

The final element is an 11x5 matrix. The first row describes the ball. The first two columns are the teamID and the playerID of the ball (-1 for both because the ball does not belong to a team and is not a player). The 3rd and 4th columns are the xy coordinates of the ball. The final column is the height of the ball (z coordinate).

The next 10 rows describe the 10 players on the court. The first 5 players belong to the home team and the last 5 players belong to the visiting team. Each player has his teamID, playerID, xy&z coordinates (although I don’t think players’ z coordinates ever change).

data['events'][0]['moments'][0]
[1,
 1452903036782,
 720.0,
 24.0,
 None,
 [[-1, -1, 44.16456, 26.34142, 5.74423],
  [1610612760, 201142, 45.46259, 32.01456, 0.0],
  [1610612760, 201566, 10.39347, 24.77219, 0.0],
  [1610612760, 201586, 25.86087, 25.55881, 0.0],
  [1610612760, 203460, 47.28525, 17.76225, 0.0],
  [1610612760, 203500, 43.68634, 26.63098, 0.0],
  [1610612750, 708, 55.6401, 25.55583, 0.0],
  [1610612750, 2419, 47.95942, 31.66328, 0.0],
  [1610612750, 201937, 67.28725, 25.10267, 0.0],
  [1610612750, 203952, 47.28525, 17.76225, 0.0],
  [1610612750, 1626157, 49.46814, 24.24193, 0.0]]]
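To make that structure a little more concrete, here is a small sketch that unpacks the first moment into named pieces (the variable names are just my own labels for the fields described above):

quarter, timestamp_ms, game_clock, shot_clock, unknown, positions = data['events'][0]['moments'][0]

ball = positions[0] #[teamID, playerID, x, y, z] -- both IDs are -1 for the ball
players = positions[1:] #10 rows of [teamID, playerID, x, y, z]; the first 5 are the home team

print(quarter, game_clock, shot_clock) #1 720.0 24.0
print(ball[2:5]) #the ball's x, y and z coordinates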

Alright, so we have the SportVU data, but it’s not clear what each event is. Luckily, the NBA also provides play by play (pbp) data. I write a function for acquiring play by play game data. This function collects (and trims) the play by play data for a given SportVU data set.

def acquire_gameData(data):
    import requests
    header_data = { #I pulled this header from the py goldsberry library
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'en-US,en;q=0.8',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'\
        ' AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 '\
        'Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9'\
        ',image/webp,*/*;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive'
    }
    game_url = 'http://stats.nba.com/stats/playbyplayv2?EndPeriod=0&EndRange=0&GameID='+data['gameid']+\
                '&RangeType=0&StartPeriod=0&StartRange=0' #address for querying the data
    response = requests.get(game_url,headers = header_data) #go get the data
    headers = response.json()['resultSets'][0]['headers'] #get headers of data
    gameData = response.json()['resultSets'][0]['rowSet'] #get actual data from json object
    df = pd.DataFrame(gameData, columns=headers) #turn the data into a pandas dataframe
    df = df[[df.columns[1], df.columns[2],df.columns[7],df.columns[9],df.columns[18]]] #there's a ton of data here, so I trim it down
    df['TEAM'] = df['PLAYER1_TEAM_ABBREVIATION']
    df = df.drop('PLAYER1_TEAM_ABBREVIATION', 1)
    return df

Below I show what the play by play data looks like. There’s a column for event number (eventnum). These event numbers match up with the event numbers from the SportVU data, so we will use this later for seeking out specific plays in the SportVU data. There’s a column for the event type (eventmsgtype). This column has a number describing what occurred in the play. I list these number codes in the comments below.

There’s also short text descriptions of the plays in the home description and visitor description columns. Finally, I use the team column to represent the primary team involved in a play.

I stole the idea of using play by play data from Raji Shah.

df = acquire_gameData(data)
df.head()
#EVENTMSGTYPE
#1 - Make 
#2 - Miss 
#3 - Free Throw 
#4 - Rebound 
#5 - out of bounds / Turnover / Steal 
#6 - Personal Foul 
#7 - Violation 
#8 - Substitution 
#9 - Timeout 
#10 - Jumpball 
#12 - Start Q1? 
#13 - Start Q2?
EVENTNUM EVENTMSGTYPE HOMEDESCRIPTION VISITORDESCRIPTION TEAM
0 0 12 None None None
1 1 10 Jump Ball Adams vs. Towns: Tip to Ibaka None OKC
2 2 5 Westbrook Out of Bounds Lost Ball Turnover (P1... None OKC
3 3 2 None MISS Wiggins 16' Jump Shot MIN
4 4 4 Westbrook REBOUND (Off:0 Def:1) None OKC

When viewing the videos, it’s nice to know which players are on the court. I like to depict this by labeling each player with their number. Here I create a dictionary that contains each player’s id number (these are assigned by nba.com) as the key and their jersey number as the associated value.

player_fields = data['events'][0]['home']['players'][0].keys()
home_players = pd.DataFrame(data=[i for i in data['events'][0]['home']['players']], columns=player_fields)
away_players = pd.DataFrame(data=[i for i in data['events'][0]['visitor']['players']], columns=player_fields)
players = pd.merge(home_players, away_players, how='outer')
jerseydict = dict(zip(players.playerid.values, players.jersey.values))

Alright, almost there! Below I write some functions for creating the actual video! First, there’s a short function for placing an image of the basketball court beneath our depiction of players moving around. This image is from gmf05’s github, but I will provide it on mine too.

Much of this code is either straight from gmf05’s github or slightly modified.

# Animation function / loop
def draw_court(axis):
    import matplotlib.image as mpimg
    img = mpimg.imread('./nba_court_T.png') #read image. I got this image from gmf05's github.
    plt.imshow(img,extent=axis, zorder=0) #show the image. 

def animate(n): #matplotlib's animation function loops through a function n times that draws a different frame on each iteration
    for i,ii in enumerate(player_xy[n]): #loop through all the players
        player_circ[i].center = (ii[1], ii[2]) #change each players xy position
        player_text[i].set_text(str(jerseydict[ii[0]])) #draw the text for each player. 
        player_text[i].set_x(ii[1]) #set the text x position
        player_text[i].set_y(ii[2]) #set text y position
    ball_circ.center = (ball_xy[n,0],ball_xy[n,1]) #change ball xy position
    ball_circ.radius = 1.1 #i could change the size of the ball according to its height, but chose to keep this constant
    return tuple(player_text) + tuple(player_circ) + (ball_circ,)

def init(): #this is what matplotlib's animation will create before drawing the first frame. 
    for i in range(10): #set up players
        player_text[i].set_text('')
        ax.add_patch(player_circ[i])
    ax.add_patch(ball_circ) #create ball
    ax.axis('off') #turn off axis
    dx = 5
    plt.xlim([0-dx,100+dx]) #set axis
    plt.ylim([0-dx,50+dx])
    return tuple(player_text) + tuple(player_circ) + (ball_circ,)

The event that I want to depict is event 41. In this event, Karl Anthony Towns misses a shot, grabs his own rebound, and puts it back in.

df[37:38]
EVENTNUM EVENTMSGTYPE HOMEDESCRIPTION VISITORDESCRIPTION TEAM
37 41 1 None Towns 1' Layup (2 PTS) MIN

We need to find where event 41 is in the sportsvu data structure, so I created a function for finding the location of a particular event. I then create a matrix with position data for the ball and a matrix with position data for each player for event 41.

#the order of events does not match up, so we have to use the eventIds. This loop finds the correct event for a given id#.
search_id = 41
def find_moment(search_id):
    for i,events in enumerate(data['events']):
        if events['eventId'] == str(search_id):
            finder = i
            break
    return finder

event_num = find_moment(search_id)
ball_xy = np.array([x[5][0][2:5] for x in data['events'][event_num]['moments']]) #create matrix of ball data
player_xy = np.array([np.array(x[5][1:])[:,1:4] for x in data['events'][event_num]['moments']]) #create matrix of player data

Okay. We’re actually there! Now we get to create the video. We have to create figure and axes objects for the animation to draw on. Then I place a picture of the basketball court on this plot. Finally, I create the circle and text objects that will move around throughout the video (depicting the ball and players). The locations of these objects are then updated in the animation loop.

import matplotlib.animation as animation

fig = plt.figure(figsize=(15,7.5)) #create figure object
ax = plt.gca() #create axis object

draw_court([0,100,0,50]) #draw the court
player_text = range(10) #create player text vector
player_circ = range(10) #create player circle vector
ball_circ = plt.Circle((0,0), 1.1, color=[1, 0.4, 0]) #create circle object for ball
for i in range(10): #create circle object and text object for each player
    col=['w','k'] if i<5 else ['k','w'] #color scheme
    player_circ[i] = plt.Circle((0,0), 2.2, facecolor=col[0],edgecolor='k') #player circle
    player_text[i] = ax.text(0,0,'',color=col[1],ha='center',va='center') #player jersey # (text)

ani = animation.FuncAnimation(fig, animate, frames=np.arange(0,np.size(ball_xy,0)), init_func=init, blit=True, interval=5, repeat=False,\
                             save_count=0) #function for making video
ani.save('Event_%d.mp4' % (search_id),dpi=100,fps=25) #function for saving video
plt.close('all') #close the plot

I’ve been told this video does not work for all users. I’ve also posted it on youtube.

NBA Shot Charts: Updated

For some reason I recently got it in my head that I wanted to go back and create more NBA shot charts. My previous shot charts used colored circles to depict the frequency and effectiveness of shots at different locations. This is an extremely efficient method of representing shooting profiles, but I thought it would be fun to create shot charts that represent a player’s shooting profile continuously across the court rather than in discrete hexagons.

By depicting the shooting data continuously, I lose the ability to represent one dimension - I can no longer use the size of circles to depict shot frequency at a location. Nonetheless, I thought it would be fun to create these charts.

I explain how to create them below. I’ve also included the ability to compare a player’s shooting performance to the league average.

In my previous shot charts, I queried nba.com’s API when creating a player’s shot chart, but querying nba.com’s API for every shot taken in 2015-16 takes a little while (for computing the league average), so I’ve uploaded this data to my github and load the league data from a file rather than querying the nba.com API.

This code is also available as a jupyter notebook on my github.

#import some libraries and tell ipython we want inline figures rather than interactive figures.
%matplotlib inline
import matplotlib.pyplot as plt, pandas as pd, numpy as np, matplotlib as mpl

Here, I create a function for querying shooting data from NBA.com’s API. This is the same function I used in my previous post regarding shot charts.

You can find a player’s ID number by going to the player’s nba.com page and looking at the page address. There is a python library that you can use for querying player IDs (and other data from the nba.com API), but I’ve found this library to be a little shaky.

def aqcuire_shootingData(PlayerID,Season):
    import requests
    header_data = { #I pulled this header from the py goldsberry library
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'en-US,en;q=0.8',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'\
        ' AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 '\
        'Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9'\
        ',image/webp,*/*;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive'
    }
    shot_chart_url = 'http://stats.nba.com/stats/shotchartdetail?CFID=33&CFPARAMS='+Season+'&ContextFilter='\
                    '&ContextMeasure=FGA&DateFrom=&DateTo=&GameID=&GameSegment=&LastNGames=0&LeagueID='\
                    '00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PaceAdjust='\
                    'N&PerMode=PerGame&Period=0&PlayerID='+PlayerID+'&PlusMinus=N&Position=&Rank='\
                    'N&RookieYear=&Season='+Season+'&SeasonSegment=&SeasonType=Regular+Season&TeamID='\
                    '0&VsConference=&VsDivision=&mode=Advanced&showDetails=0&showShots=1&showZones=0'
    response = requests.get(shot_chart_url,headers = header_data)
    headers = response.json()['resultSets'][0]['headers']
    shots = response.json()['resultSets'][0]['rowSet']
    shot_df = pd.DataFrame(shots, columns=headers)
    return shot_df

Create a function for drawing the nba court. This function was taken directly from Savvas Tjortjoglou’s post on shot charts.

def draw_court(ax=None, color='black', lw=2, outer_lines=False):
    from matplotlib.patches import Circle, Rectangle, Arc
    if ax is None:
        ax = plt.gca()
    hoop = Circle((0, 0), radius=7.5, linewidth=lw, color=color, fill=False)
    backboard = Rectangle((-30, -7.5), 60, -1, linewidth=lw, color=color)
    outer_box = Rectangle((-80, -47.5), 160, 190, linewidth=lw, color=color,
                          fill=False)
    inner_box = Rectangle((-60, -47.5), 120, 190, linewidth=lw, color=color,
                          fill=False)
    top_free_throw = Arc((0, 142.5), 120, 120, theta1=0, theta2=180,
                         linewidth=lw, color=color, fill=False)
    bottom_free_throw = Arc((0, 142.5), 120, 120, theta1=180, theta2=0,
                            linewidth=lw, color=color, linestyle='dashed')
    restricted = Arc((0, 0), 80, 80, theta1=0, theta2=180, linewidth=lw,
                     color=color)
    corner_three_a = Rectangle((-220, -47.5), 0, 140, linewidth=lw,
                               color=color)
    corner_three_b = Rectangle((220, -47.5), 0, 140, linewidth=lw, color=color)
    three_arc = Arc((0, 0), 475, 475, theta1=22, theta2=158, linewidth=lw,
                    color=color)
    center_outer_arc = Arc((0, 422.5), 120, 120, theta1=180, theta2=0,
                           linewidth=lw, color=color)
    center_inner_arc = Arc((0, 422.5), 40, 40, theta1=180, theta2=0,
                           linewidth=lw, color=color)
    court_elements = [hoop, backboard, outer_box, inner_box, top_free_throw,
                      bottom_free_throw, restricted, corner_three_a,
                      corner_three_b, three_arc, center_outer_arc,
                      center_inner_arc]
    if outer_lines:
        outer_lines = Rectangle((-250, -47.5), 500, 470, linewidth=lw,
                                color=color, fill=False)
        court_elements.append(outer_lines)

    for element in court_elements:
        ax.add_patch(element)

    ax.set_xticklabels([])
    ax.set_yticklabels([])
    ax.set_xticks([])
    ax.set_yticks([])
    return ax

Write a function for acquiring each player’s picture. This isn’t essential, but it makes things look nicer. This function takes a playerID number and the amount to zoom in on an image as the inputs. It by default places the image at the location 500,500.

def acquire_playerPic(PlayerID, zoom, offset=(500,500)):
    from matplotlib import  offsetbox as osb
    import urllib
    pic = urllib.urlretrieve("http://stats.nba.com/media/players/230x185/"+PlayerID+".png",PlayerID+".png")
    player_pic = plt.imread(pic[0])
    img = osb.OffsetImage(player_pic, zoom)
    #img.set_offset(offset)
    img = osb.AnnotationBbox(img, offset,xycoords='data',pad=0.0, box_alignment=(1,0), frameon=False)
    return img

Here is where things get a little complicated. Below I write a function that divides the shooting data into a 25x25 matrix. Each shot taken within the xy coordinates encompassed by a given bin counts towards the shot count in that bin. In this way, the method I am using here is very similar to my previous hexbins (circles). So the difference just comes down to how I present the data rather than how I preprocess it.

This function takes a dataframe with a vector of shot locations in the X plane, a vector with shot locations in the Y plane, a vector with shot type (2 pointer or 3 pointer), and a vector with ones for made shots and zeros for missed shots. The function by default bins the data into a 25x25 matrix, but the number of bins is editable. The 25x25 bins are then expanded to encompass a 500x500 space.

The output is a dictionary containing matrices for shots made, attempted, and points scored in each bin location. The dictionary also has the player’s ID number.
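As an aside, the same binning can be done without creating and closing matplotlib figures (the function below uses plt.hist2d and then closes each figure); here is a minimal sketch using numpy’s histogram2d, assuming a dataframe like the one returned by aqcuire_shootingData above:

import numpy as np

def bin_shots(df, bins=25):
    #bin shot attempts and makes into a bins x bins grid without touching matplotlib
    rng = [[-250, 250], [-25, 400]] #same bin ranges as the function below
    attempts, xedges, yedges = np.histogram2d(df['LOC_X'], df['LOC_Y'], bins=bins, range=rng)
    made_df = df[df['SHOT_MADE_FLAG'] == 1]
    made, _, _ = np.histogram2d(made_df['LOC_X'], made_df['LOC_Y'], bins=bins, range=rng)
    return attempts, made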

def shooting_matrices(df,bins=25):
    from math import floor

    df['SHOT_TYPE2'] = [int(x[0][0]) for x in df['SHOT_TYPE']] #create a vector with whether the shot is a 2 or 3 pointer
    points_matrix = np.zeros((bins,bins)) #create a matrix to fill with shooting data.

    shot_attempts, xtest, ytest, p = plt.hist2d(df[df['LOC_Y']<425.1]['LOC_X'], #use hist2d to bin the data. These are attempts
                                                df[df['LOC_Y']<425.1]['LOC_Y'],
                                                bins=bins,range=[[-250,250],[-25,400]]); #i limit the range of the bins because I don't care about super far away shots and I want the bins standardized across players
    plt.close()

    shot_made, xtest2, ytest2, p = plt.hist2d(df[(df['LOC_Y']<425.1) & (df['SHOT_MADE_FLAG']==1)]['LOC_X'], #again use hist 2d to bin made shots
                                    df[(df['LOC_Y']<425.1) & (df['SHOT_MADE_FLAG']==1)]['LOC_Y'],
                                    bins=bins,range=[[-250,250],[-25,400]]);
    plt.close()
    differy = np.diff(ytest)[0] #width of each bin in y
    differx = np.diff(xtest)[0] #width of each bin in x
    for i,(x,y) in enumerate(zip(df['LOC_X'],df['LOC_Y'])):
        if x >= 250 or x <= -250 or y <= -25.1 or y >= 400: continue
        points_matrix[int(floor(np.divide(x+250,differx))),int(floor(np.divide(y+25,differy)))] += np.float(df['SHOT_MADE_FLAG'][i]*df['SHOT_TYPE2'][i])
        #loop through all the shots and tally the points made in each bin location.

    shot_attempts = np.repeat(shot_attempts,500/bins,axis=0) #repeat the shot attempts matrix so that it fills all xy points
    shot_attempts = np.repeat(shot_attempts,500/bins,axis=1)
    shot_made = np.repeat(shot_made,500/bins,axis=0) #repeat shot made so that it fills all xy points (rather than just bin locations)
    shot_made = np.repeat(shot_made,500/bins,axis=1)
    points_matrix = np.repeat(points_matrix,500/bins,axis=0) #again repeat with points
    points_matrix = np.repeat(points_matrix,500/bins,axis=1)
    return {'attempted':shot_attempts,'made':shot_made,'points':points_matrix,'id':str(np.unique(df['PLAYER_ID'])[0])}

Below I load the league average data. I also have the code that I used to originally download the data and to preprocess it.

import pickle

#df = aqcuire_shootingData('0','2015-16') #here is how I acquired data about every shot taken in 2015-16
#df2 = pd.read_pickle('nba_shots_201516_2016_Apr_27.pkl') #here is how you can read all the league shot data
#league_shotDict = shooting_matrices(df2) #turn the shot data into the shooting matrix
#pickle.dump(league_shotDict, open('league_shotDictionary_2016.pkl', 'wb' )) #save the data

#I should make it so this is the plot size by default, but people can change it if they want. this would be slower.
league_shotDict = pickle.load(open('league_shotDictionary_2016.pkl', 'rb' )) #read in the a precreated shot chart for the entire league

I really like playing with the different color maps, so here is a new color map I created for these shot charts.

cmap = plt.cm.CMRmap_r #start with the CMR map in reverse.

maxer = 0.6 #max value to take in the CMR map

the_range = np.arange(0,maxer+0.1,maxer/4) #divide the map into 4 values
the_range2 = [0.0,0.25,0.5,0.75,1.0] #or use these values

mapper = [cmap(x) for x in the_range] #grab color values for this dictionary
cdict = {'red':[],'green':[],'blue':[]} #fill the values into a color dictionary
for item,place in zip(mapper,the_range2):
    cdict['red'].append((place,item[0], item[0]))
    cdict['green'].append((place,item[1],item[1]))
    cdict['blue'].append((place,item[2],item[2]))

mymap = mpl.colors.LinearSegmentedColormap('my_colormap', cdict, 1024) #linearly interpolate between color values

Below, I write a function for creating the NBA shot charts. The function takes a dictionary with matrices for shots attempted, made, and points scored. The matrices should be 500x500. By default, the shot chart depicts the number of shots taken across locations, but it can also depict the number of shots made, field goal percentage, and points scored across locations.

The function uses a gaussian kernel with a standard deviation of 5 to smooth the data (make it look pretty). Again, this is editable. By default the function plots a player’s raw data, but it will plot how a player compares to the league average if the input includes a matrix of league average data.

def create_shotChart(shotDict,fig_type='attempted',smooth=5,league_shotDict=[],mymap=mymap):
    from scipy.ndimage.filters import gaussian_filter

    if fig_type == 'fg': #how to treat the data if depicting fg percentage
        interest_measure = shotDict['made']/shotDict['attempted']
        #interest_measure[np.isnan(interest_measure)] = np.nanmean(interest_measure)
        #interest_measure = np.nan_to_num(interest_measure) #replace places where divide by 0 with a 0
    else:
        interest_measure = shotDict[fig_type] #else take the data from dictionary.

    if league_shotDict: #if we have league data, we have to select the relevant league data.
        if fig_type == 'fg':
            league = league_shotDict['made']/league_shotDict['attempted']
            league = np.nan_to_num(league)
            interest_measure[np.isfinite(interest_measure)] += -league[np.isfinite(interest_measure)] #compare league data and individual player's data
            interest_measure = np.nan_to_num(interest_measure) #replace places where divide by 0 with a 0
            maxer = 0 + 1.5*np.std(interest_measure) #min and max values for color map
            minner = 0- 1.5*np.std(interest_measure)
        else:
            player_percent = interest_measure/np.sum([x[::20] for x in shotDict[fig_type][::20]]) #standardize data before comparing
            league_percent = league_shotDict[fig_type]/np.sum([x[::20] for x in league_shotDict[fig_type][::20]]) #standardize league data
            interest_measure = player_percent-league_percent #compare league and individual data
            maxer = np.mean(interest_measure) + 1.5*np.std(interest_measure) #compute max and min values for color map
            minner = np.mean(interest_measure) - 1.5*np.std(interest_measure)

        cmap = 'bwr' #use bwr color map if comparing to league average
        label = ['<Avg','Avg', '>Avg'] #color map legend label

    else:
        cmap = mymap #else use my color map
        interest_measure = np.nan_to_num(interest_measure) #replace places where divide by 0 with a 0
        maxer = np.mean(interest_measure) + 1.5*np.std(interest_measure) #compute max for colormap
        minner = 0
        label = ['Less','','More'] #color map legend label

    ppr_smooth = gaussian_filter(interest_measure,smooth) #smooth the data

    fig = plt.figure(figsize=(12,7),frameon=False)#(12,7)
    ax = fig.add_axes([0.1, 0.1, 0.8, 0.8]) #where to place the plot within the figure
    draw_court(outer_lines=False) #draw court
    ax.set_xlim(-250,250)
    ax.set_ylim(400, -25)

    ax2 = fig.add_axes(ax.get_position(), frameon=False)

    colrange = mpl.colors.Normalize(vmin=minner, vmax=maxer, clip=False) #standardize color range
    ax2.imshow(ppr_smooth.T,cmap=cmap,norm=colrange,alpha=0.7,aspect='auto') #plot data
    ax2.set_xticklabels([])
    ax2.set_yticklabels([])
    ax2.set_xticks([])
    ax2.set_xlim(0, 500)
    ax2.set_ylim(500, 0)
    ax2.set_yticks([]);

    ax3 = fig.add_axes([0.92, 0.1, 0.02, 0.8]) #place colormap legend
    cb = mpl.colorbar.ColorbarBase(ax3,cmap=cmap, orientation='vertical')
    if fig_type == 'fg': #colormap label
        cb.set_label('Field Goal Percentage')
    else:
        cb.set_label('Shots '+fig_type)

    cb.set_ticks([0,0.5,1.0])
    ax3.set_yticklabels(label,rotation=45);

    zoom = np.float(12)/(12.0*2) #place player pic
    img = acquire_playerPic(shotDict['id'], zoom) #use the player id stored in the shot dictionary
    ax2.add_artist(img)

    plt.show()
    return ax

Alright, that’s that. Now let’s create some plots. I am a T-Wolves fan, so I will plot data from Karl Anthony Towns.

First, here is the default plot - attempts.

df = aqcuire_shootingData('1626157','2015-16')
player_shotDict = shooting_matrices(df)
create_shotChart(player_shotDict);

Here’s KAT’s shots made

df = aqcuire_shootingData('1626157','2015-16')
player_shotDict = shooting_matrices(df)
create_shotChart(player_shotDict,fig_type='made');

Here’s field goal percentage. I don’t like this one too much. It’s hard to use similar scales for attempts and field goal percentage even though I’m using standard deviations rather than absolute scales.

df = aqcuire_shootingData('1626157','2015-16')
player_shotDict = shooting_matrices(df)
create_shotChart(player_shotDict, fig_type='fg');

Here’s points across the court.

df = aqcuire_shootingData('1626157','2015-16')
player_shotDict = shooting_matrices(df)
create_shotChart(player_shotDict, fig_type='points');

Here’s how KAT’s attempts compare to the league average. You can see the T-Wolves’ midrange-heavy offense.

df = aqcuire_shootingData('1626157','2015-16')
player_shotDict = shooting_matrices(df)
create_shotChart(player_shotDict, league_shotDict=league_shotDict);

How KAT’s shots made compare to the league average.

df = aqcuire_shootingData('1626157','2015-16')
player_shotDict = shooting_matrices(df)
create_shotChart(player_shotDict, fig_type='made',league_shotDict=league_shotDict);

How KAT’s field goal percentage compares to league average. Again, the scale on these is not too good.

df = aqcuire_shootingData('1626157','2015-16')
player_shotDict = shooting_matrices(df)
create_shotChart(player_shotDict, fig_type='fg',league_shotDict=league_shotDict);

And here is how KAT’s points compare to league average.

df = aqcuire_shootingData('1626157','2015-16')
player_shotDict = shooting_matrices(df)
create_shotChart(player_shotDict, fig_type='points',league_shotDict=league_shotDict);

An Introduction to Neural Networks: Part 2

In a previous post, I described how to do backpropagation with a 1-layer neural network. I’ve written this post assuming some familiarity with the previous post.

When first created, 1-layer neural networks brought about quite a bit of excitement, but this excitement quickly dissipated when researchers realized that 1-layer neural networks could only solve a limited set of problems.

Researchers knew that adding an extra layer to the neural networks enabled neural networks to solve much more complex problems, but they didn’t know how to train these more complex networks.

In the previous post, I described “backpropagation,” but this wasn’t the portion of backpropagation that really changed the history of neural networks. What really changed neural networks is backpropagation with an extra layer. This extra layer enabled researchers to train more complex networks. The extra layer(s) are called the hidden layer(s). In this post, I will describe backpropagation with a hidden layer.

To describe backpropagation with a hidden layer, I will demonstrate how neural networks can solve the XOR problem.

In this example of the XOR problem there are four items. Each item is defined by two values. If these two values are the same, then the item belongs to one group (blue here). If the two values are different, then the item belongs to another group (red here).

Below, I have depicted the XOR problem. The goal is to find a model that can distinguish between the blue and red groups based on an item’s values.

This code is also available as a jupyter notebook on my github.

import numpy as np #import important libraries.
from matplotlib import pyplot as plt
import pandas as pd
%matplotlib inline

plt.plot([0,1],[0,1],'bo')
plt.plot([0,1],[1,0],'ro')
plt.ylabel('Value 2')
plt.xlabel('Value 1')
plt.axis([-0.5,1.5,-0.5,1.5]);

Again, each item has two values. An item’s first value is represented on the x-axis. An item’s second value is represented on the y-axis. The red items belong to one category and the blue items belong to another.

This is a non-linear problem because no linear function can segregate the groups. For instance, a horizontal line could segregate the upper and lower items and a vertical line could segregate the left and right items, but no single linear function can segregate the red and blue items.

We need a non-linear function to separate the groups, and neural networks can emulate a non-linear function that segregates them.

While this problem may seem relatively simple, it gave the initial neural networks quite a hard time. In fact, this is the problem that depleted much of the original enthusiasm for neural networks.
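If you want to convince yourself that a linear model really can’t do it, here is a quick sketch using scikit-learn’s logistic regression (a linear decision boundary); no matter how the line is drawn, at most 3 of the 4 items can be classified correctly:

import numpy as np
from sklearn.linear_model import LogisticRegression

items = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]) #the four XOR items
groups = np.array([0, 1, 1, 0]) #same values -> 0 (blue), different values -> 1 (red)

linear_clf = LogisticRegression().fit(items, groups)
print(linear_clf.predict(items)) #a linear boundary cannot reproduce [0 1 1 0]
print(linear_clf.score(items, groups)) #accuracy tops out well below 1.0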

Neural networks can easily solve this problem, but they require an extra layer. Below I depict a network with an extra layer (a 2-layer network). To depict the network, I use a repository available on my github.

from visualise_neural_network import NeuralNetwork

network = NeuralNetwork() #create neural network object
network.add_layer(2,['Input 1','Input 2']) #input layer with names
network.add_layer(2,['Hidden 1','Hidden 2']) #hidden layer with names
network.add_layer(1,['Output']) #output layer with name
network.draw()

Notice that this network now has 5 total neurons. The two units at the bottom are the input layer. The activity of the input units is the value of the inputs (same as the inputs in my previous post). The two units in the middle are the hidden layer. The activity of the hidden units is calculated in the same manner as the output unit’s activity from my previous post. The unit at the top is the output layer. The activity of this unit is found in the same manner as in my previous post, but the activity of the hidden units replaces the input units.

Thus, when the neural network makes its guess, the only difference is we have to compute an extra layer’s activity.

The goal of this network is for the output unit to have an activity of 0 when presented with an item from the blue group (inputs are same) and to have an activity of 1 when presented with an item from the red group (inputs are different).

One additional aspect of neural networks that I haven’t discussed is that each non-input unit can have a bias. You can think about bias as a propensity for the unit to become active or not to become active. For instance, a unit with a positive bias is more likely to be active than a unit with no bias.

I will implement bias as an extra line feeding into each unit. The weight of this line is the bias, and the bias line is always active, meaning this bias is always present.

Below, I seed this 2-layer neural network with a random set of weights.

np.random.seed(seed=10) #seed random number generator for reproducibility

Weights_2 = np.random.rand(1,3)-0.5*2 #connections between hidden and output
Weights_1 = np.random.rand(2,3)-0.5*2 #connections between input and hidden

Weight_Dict = {'Weights_1':Weights_1,'Weights_2':Weights_2} #place weights in a dictionary

Train_Set = [[1.0,1.0],[0.0,0.0],[0.0,1.0],[1.0,0.0]] #train set

network = NeuralNetwork()
network.add_layer(2,['Input 1','Input 2'],
                  [[round(x,2) for x in Weight_Dict['Weights_1'][0][:2]],
                   [round(x,2) for x in Weight_Dict['Weights_1'][1][:2]]])
#add input layer with names and weights leaving the input neurons
network.add_layer(2,[round(Weight_Dict['Weights_1'][0][2],2),round(Weight_Dict['Weights_1'][1][2],2)],
                  [round(x,2) for x in Weight_Dict['Weights_2'][0][:2]])
#add hidden layer with names (each units' bias) and weights leaving the hidden units
network.add_layer(1,[round(Weight_Dict['Weights_2'][0][2],2)])
#add output layer with name (the output unit's bias)
network.draw()

Above we have our network. I will refer to the two hidden units as $H_1$ and $H_2$. The depiction of $Weight_{Input_1 \rightarrow H_2}$ and $Weight_{Input_2 \rightarrow H_1}$ (the two lines that cross) can be confusing: -0.8 belongs to $Weight_{Input_1 \rightarrow H_2}$ and -0.5 belongs to $Weight_{Input_2 \rightarrow H_1}$.

Let’s go through one example of our network receiving an input and making a guess. Let’s say the input is [0 1]. This means $Input_1 = 0$ and $Input_2 = 1$. The correct answer in this case is 1.

First, we have to calculate $H_1$’s input. Remember we can write a unit’s input as

$$net = \sum_i Weight_i \cdot Input_i$$

With a bias we can rewrite it as

$$net = Bias \cdot 1 + \sum_i Weight_i \cdot Input_i$$

Specifically for $H_1$:

$$net_{H_1} = Bias_{H_1} \cdot 1 + Weight_{Input_1 \rightarrow H_1} \cdot Input_1 + Weight_{Input_2 \rightarrow H_1} \cdot Input_2$$

Remember the first term in the equation above is the bias term. Let’s see what this looks like in code.

Input = np.array([0,1])
net_Hidden = np.dot(np.append(Input,1.0),Weights_1.T) #append the bias input
print net_Hidden
[-1.27669634 -1.07035845]

Note that by using np.dot, I can calculate both hidden units’ inputs in a single line of code.

Next, we have to find the activity of units in the hidden layer.

I will translate the input into activity with a logistic function, as I did in the previous post.

$$Activity = \frac{1}{1 + e^{-net}}$$

Let’s see what this looks like in code.

def logistic(x): #each neuron has a logistic activation function
    return 1.0/(1+np.exp(-x))

Hidden_Units = logistic(net_Hidden)
print Hidden_Units
[ 0.2181131   0.25533492]

So far so good, the logistic function has transformed the negative inputs into values near 0.

Now we have to compute the output unit’s activity.

$$net_{Output} = Bias_{Output} \cdot 1 + Weight_{H_1 \rightarrow Output} \cdot H_1 + Weight_{H_2 \rightarrow Output} \cdot H_2$$

Plugging in the numbers:

$$net_{Output} = -0.37 \cdot 1 + (-0.23 \cdot 0.22) + (-0.98 \cdot 0.26) \approx -0.67$$

Now the code for computing $net_{Output}$ and the Output unit’s activity.

net_Output = np.dot(np.append(Hidden_Units,1.0),Weights_2.T)
print 'net_Output'
print net_Output
Output = logistic(net_Output)
print 'Output'
print Output
net_Output
[-0.66626595]
Output
[ 0.33933346]

Okay, that’s the network’s guess for one input… nowhere near the correct answer (1). Let’s look at what the network predicts for the other input patterns. Below I create a feedforward, 2-layer neural network and plot the neural net’s guesses for the four input patterns.

def layer_InputOutput(Inputs,Weights): #find a layers input and activity
    Inputs_with_bias = np.append(Inputs,1.0) #input 1 for each unit's bias
    return logistic(np.dot(Inputs_with_bias,Weights.T))

def neural_net(Input,Weights_1,Weights_2,Training=False): #this function creates and runs the neural net    

    target = 1 #set target value
    if np.array(Input[0])==np.array([Input[1]]): target = 0 #change target value if needed

    #forward pass
    Hidden_Units = layer_InputOutput(Input,Weights_1) #find hidden unit activity
    Output = layer_InputOutput(Hidden_Units,Weights_2) #find Output layer actiity

    return {'output':Output,'target':target,'input':Input} #record trial output

Train_Set = [[1.0,1.0],[0.0,1.0],[1.0,0.0],[0.0,0.0]] #the four input patterns
tempdict = {'output':[],'target':[],'input':[]} #data dictionary
temp = [neural_net(Input,Weights_1,Weights_2) for Input in Train_Set] #get the data
[tempdict[key].append([temp[x][key] for x in range(len(temp))]) for key in tempdict] #combine all the output dictionaries

plotter = np.ones((2,2))*np.reshape(np.array(tempdict['output']),(2,2))
plt.pcolor(plotter,vmin=0,vmax=1,cmap=plt.cm.bwr)
plt.colorbar(ticks=[0,0.25,0.5,0.75,1]);
plt.xlabel('Input 1')
plt.ylabel('Input 2')
plt.xticks([0.5,1.5], ['0','1'])
plt.yticks([0.5,1.5], ['0','1']);

In the plot above, I have Input 1 on the x-axis and Input 2 on the y-axis. So if the Input is [0,0], the network produces the activity depicted in the lower left square. If the Input is [1,0], the network produces the activity depicted in the lower right square. If the network produces an output of 0, then the square will be blue. If the network produces an output of 1, then the square will be red. As you can see, the network produces all outputs between 0.25 and 0.5… nowhere near the correct answers.

So how do we update the weights in order to reduce the error between our guess and the correct answer?

First, we will do backpropagation between the output and hidden layers. This is exactly the same as backpropagation in the previous post.

In the previous post I described how our goal was to decrease error by changing the weights between units. This is the equation we used to describe changes in error with changes in the weights. The equation below expresses changes in error with changes to the weights between $H_1$ and the Output unit.

$$\frac{\partial Error}{\partial Weight_{H_1 \rightarrow Output}} = \frac{\partial Error}{\partial Output} \cdot \frac{\partial Output}{\partial net_{Output}} \cdot \frac{\partial net_{Output}}{\partial Weight_{H_1 \rightarrow Output}}$$

Now multiply this weight adjustment by the learning rate.

$$\Delta Weight_{H_1 \rightarrow Output} = \alpha \cdot (target - Output) \cdot Output(1 - Output) \cdot H_1$$

Finally, we apply the weight adjustment to $Weight_{H_1 \rightarrow Output}$.

$$Weight_{H_1 \rightarrow Output} = Weight_{H_1 \rightarrow Output} + \Delta Weight_{H_1 \rightarrow Output}$$

Now let’s do the same thing, but for both the weights $Weight_{H_1 \rightarrow Output}$ and $Weight_{H_2 \rightarrow Output}$ (and the output unit’s bias) in the code.

alpha = 0.5 #learning rate
target = 1 #target output

error = target - Output #amount of error
delta_out = np.atleast_2d(error*(Output*(1-Output))) #first two terms of error by weight derivative

Hidden_Units = np.append(Hidden_Units,1.0) #add an input of 1 for the bias
print Weights_2 + alpha*np.outer(delta_out,Hidden_Units) #apply weight change
[[-0.21252673 -0.96033892 -0.29229558]]

The hidden layer changes things when we do backpropagation. Above, we computed the new weights using the output unit’s error. Now, we want to find how adjusting a weight changes the error, but this weight connects an input to the hidden layer rather than connecting to the output layer. This means we have to propagate the error backwards to the hidden layer.

We will describe backpropagation for the line connecting $Input_1$ and $H_1$ as

$$\frac{\partial Error}{\partial Weight_{Input_1 \rightarrow H_1}} = \frac{\partial Error}{\partial H_1} \cdot \frac{\partial H_1}{\partial net_{H_1}} \cdot \frac{\partial net_{H_1}}{\partial Weight_{Input_1 \rightarrow H_1}}$$

Pretty similar. We just replaced Output with $H_1$. The interpretation (starting with the final term and moving left) is that changing $Weight_{Input_1 \rightarrow H_1}$ changes $H_1$’s input. Changing $H_1$’s input changes $H_1$’s activity. Changing $H_1$’s activity changes the error. This last assertion (the first term) is where things get complicated. Let’s take a closer look at this first term.

$$\frac{\partial Error}{\partial H_1} = \frac{\partial Error}{\partial net_{Output}} \cdot \frac{\partial net_{Output}}{\partial H_1}$$

Changing $H_1$’s activity changes the input to the Output unit. Changing the output unit’s input changes the error. Hmmm, still not quite there yet. Let’s look at how changes to the output unit’s input change the error.

$$\frac{\partial Error}{\partial net_{Output}} = \frac{\partial Error}{\partial Output} \cdot \frac{\partial Output}{\partial net_{Output}}$$

You can probably see where this is going. Changing the output unit’s input changes the output unit’s activity. Changing the output unit’s activity changes error. There we go.

Okay, this got a bit heavy, but here comes some good news. Compare the two terms of the equation above to the first two terms of our original backpropagation equation. They’re the same! Now let’s look at $\frac{\partial net_{Output}}{\partial H_1}$ (the second term from the first equation after our new backpropagation equation).

$$\frac{\partial net_{Output}}{\partial H_1} = Weight_{H_1 \rightarrow Output}$$

Again, I am glossing over how to derive these partial derivatives. For a more complete explanation, I recommend Chapter 8 of Rumelhart and McClelland’s PDP book. Nonetheless, this means we can take the delta_out term we computed above, multiply it by $Weight_{H_1 \rightarrow Output}$, and we have the first term of our backpropagation equation! We want $Weight_{H_1 \rightarrow Output}$ to be the weight used in the forward pass, not the updated weight.

The second two terms from our backpropagation equation are the same as in our original backpropagation equation.

$$\frac{\partial H_1}{\partial net_{H_1}} = H_1(1 - H_1)$$ - this is specific to logistic activation functions.

and

$$\frac{\partial net_{H_1}}{\partial Weight_{Input_1 \rightarrow H_1}} = Input_1$$

Let’s try and write this out.

$$\Delta Weight_{Input_1 \rightarrow H_1} = \alpha \cdot (target - Output) \cdot Output(1 - Output) \cdot Weight_{H_1 \rightarrow Output} \cdot H_1(1 - H_1) \cdot Input_1$$

It’s not short, but it’s doable. Let’s plug in the numbers.

$$\Delta Weight_{Input_1 \rightarrow H_1} = 0.5 \cdot 0.66 \cdot 0.22 \cdot (-0.23) \cdot 0.17 \cdot 0 = 0$$

(Since $Input_1$ is 0 for this input pattern, this particular weight does not change on this trial.) Not too bad. Now let’s see the code.

delta_hidden = delta_out.dot(Weights_2)*(Hidden_Units*(1-Hidden_Units)) #find delta portion of weight update

delta_hidden = np.delete(delta_hidden,2) #remove the bias input
print Weights_1 + alpha*np.outer(delta_hidden,np.append(Input,1.0)) #append bias input and multiply input by delta portion
[[-0.25119612 -0.50149299 -0.77809147]
 [-0.80193714 -0.23946929 -0.84467792]]

Alright! Let’s implement all of this into a single model and train the model on the XOR problem. Below I create a neural network that includes both a forward pass and an optional backpropagation pass.

def neural_net(Input,Weights_1,Weights_2,Training=False): #this function creates and runs the neural net    

    target = 1 #set target value
    if np.array(Input[0])==np.array([Input[1]]): target = 0 #change target value if needed

    #forward pass
    Hidden_Units = layer_InputOutput(Input,Weights_1) #find hidden unit activity
    Output = layer_InputOutput(Hidden_Units,Weights_2) #find Output layer actiity

    if Training == True:
        alpha = 0.5 #learning rate

        Weights_2 = np.atleast_2d(Weights_2) #make sure this weight vector is 2d.

        error = target - Output #error
        delta_out = np.atleast_2d(error*(Output*(1-Output))) #delta between output and hidden

        Hidden_Units = np.append(Hidden_Units,1.0) #append an input for the bias
        delta_hidden = delta_out.dot(np.atleast_2d(Weights_2))*(Hidden_Units*(1-Hidden_Units)) #delta between hidden and input

        Weights_2 += alpha*np.outer(delta_out,Hidden_Units) #update weights

        delta_hidden = np.delete(delta_hidden,2) #remove bias activity
        Weights_1 += alpha*np.outer(delta_hidden,np.append(Input,1.0))  #update weights

    if Training == False:
        return {'output':Output,'target':target,'input':Input} #record trial output
    elif Training == True:
        return {'Weights_1':Weights_1,'Weights_2':Weights_2,'target':target,'output':Output,'error':error}

Okay, that's the network. Below, I train the network until its answers are very close to the correct answer.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
from random import choice
np.random.seed(seed=10) #seed random number generator for reproducibility

Weights_2 = np.random.rand(1,3)-0.5*2 #connections between hidden and output
Weights_1 = np.random.rand(2,3)-0.5*2 #connections between input and hidden

Weight_Dict = {'Weights_1':Weights_1,'Weights_2':Weights_2}

Train_Set = [[1.0,1.0],[0.0,0.0],[0.0,1.0],[1.0,0.0]] #train set

Error = []
while True: #train the neural net
    Train_Dict = neural_net(choice(Train_Set),Weight_Dict['Weights_1'],Weight_Dict['Weights_2'],Training=True)

    Error.append(abs(Train_Dict['error']))
    if len(Error) > 6 and np.mean(Error[-10:]) < 0.025: break #tell the code to stop iterating when recent mean error is small

Let's see how the error changed across training.

1
2
3
4
Error_vec = np.array(Error)[:,0]
plt.plot(Error_vec)
plt.ylabel('Error')
plt.xlabel('Iteration #');

Really cool. The network starts with volatile error - sometimes being nearly correct and sometimes being completely incorrect. Then, after about 5000 iterations, the network starts down the slow path of perfecting an answer scheme. Below, I create a plot depicting the network's activity for the different input patterns.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Weights_1 = Weight_Dict['Weights_1']
Weights_2 = Weight_Dict['Weights_2']

Train_Set = [[1.0,1.0],[0.0,1.0],[1.0,0.0],[0.0,0.0]] #train set

tempdict = {'output':[],'target':[],'input':[]} #data dictionary
temp = [neural_net(Input,Weights_1,Weights_2) for Input in Train_Set] #get the data
[tempdict[key].append([temp[x][key] for x in range(len(temp))]) for key in tempdict] #combine all the output dictionaries

plotter = np.ones((2,2))*np.reshape(np.array(tempdict['output']),(2,2))
plt.pcolor(plotter,vmin=0,vmax=1,cmap=plt.cm.bwr)
plt.colorbar(ticks=[0,0.25,0.5,0.75,1]);
plt.xlabel('Input 1')
plt.ylabel('Input 2')
plt.xticks([0.5,1.5], ['0','1'])
plt.yticks([0.5,1.5], ['0','1']);

Again, the Input 1 value is on the x-axis and the Input 2 value is on the y-axis. As you can see, the network guesses 1 when the inputs are different and it guesses 0 when the inputs are the same. Perfect! Below I depict the network with these correct weights.

1
2
3
4
5
6
7
8
9
10
Weight_Dict = {'Weights_1':Weights_1,'Weights_2':Weights_2}

network = NeuralNetwork()
network.add_layer(2,['Input 1','Input 2'],
                  [[round(x,2) for x in Weight_Dict['Weights_1'][0][:2]],
                   [round(x,2) for x in Weight_Dict['Weights_1'][1][:2]]])
network.add_layer(2,[round(Weight_Dict['Weights_1'][0][2],2),round(Weight_Dict['Weights_1'][1][2],2)],
                  [round(x,2) for x in Weight_Dict['Weights_2'][:2][0]])
network.add_layer(1,[round(Weight_Dict['Weights_2'][0][2],2)])
network.draw()

The network finds a pretty cool solution. Both hidden units are relatively active, but one hidden unit sends a strong positive signal and the other sends a strong negative signal. The output unit has a negative bias, so if neither input is on, it will have an activity around 0. If both Input units are on, then the hidden unit that sends a positive signal will be inhibited, and the output unit will have activity near 0. Otherwise, the hidden unit with a positive signal gives the output unit an activity near 1.

This is all well and good, but if you try to train this network with random weights you might find that it sometimes produces an incorrect set of weights. This happens when the network runs into a local minimum. A local minimum is a point where any change in the weights would increase the error, so the network is left with a sub-optimal set of weights.

Below I hand-pick a set of weights that produces a local minimum.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
Weights_2 = np.array([-4.5,5.3,-0.8]) #connections between hidden and output
Weights_1 = np.array([[-2.0,9.2,2.0],
                     [4.3,8.8,-0.1]])#connections between input and hidden

Weight_Dict = {'Weights_1':Weights_1,'Weights_2':Weights_2}

network = NeuralNetwork()
network.add_layer(2,['Input 1','Input 2'],
                  [[round(x,2) for x in Weight_Dict['Weights_1'][0][:2]],
                   [round(x,2) for x in Weight_Dict['Weights_1'][1][:2]]])
network.add_layer(2,[round(Weight_Dict['Weights_1'][0][2],2),round(Weight_Dict['Weights_1'][1][2],2)],
                  [round(x,2) for x in Weight_Dict['Weights_2'][:2]])
network.add_layer(1,[round(Weight_Dict['Weights_2'][2],2)])
network.draw()

Using these weights as the starting point, let's see what the network does with training.

1
2
3
4
5
6
7
8
9
10
11
12
13
Train_Set = [[1.0,1.0],[0.0,0.0],[0.0,1.0],[1.0,0.0]] #train set

Error = []
while True:
    Train_Dict = neural_net(choice(Train_Set),Weight_Dict['Weights_1'],Weight_Dict['Weights_2'],Training=True)

    Error.append(abs(Train_Dict['error']))
    if len(Error) > 6 and np.mean(Error[-10:]) < 0.025: break

Error_vec = np.array(Error)[:]
plt.plot(Error_vec)
plt.ylabel('Error')
plt.xlabel('Iteration #');

As you can see, the network never reduces the error. Let's see how the network responds to the different input patterns.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Weights_1 = Weight_Dict['Weights_1']
Weights_2 = Weight_Dict['Weights_2']

Train_Set = [[1.0,1.0],[0.0,1.0],[1.0,0.0],[0.0,0.0]] #train set

tempdict = {'output':[],'target':[],'input':[]} #data dictionary
temp = [neural_net(Input,Weights_1,Weights_2) for Input in Train_Set] #get the data
[tempdict[key].append([temp[x][key] for x in range(len(temp))]) for key in tempdict] #combine all the output dictionaries

plotter = np.ones((2,2))*np.reshape(np.array(tempdict['output']),(2,2))
plt.pcolor(plotter,vmin=0,vmax=1,cmap=plt.cm.bwr)
plt.colorbar(ticks=[0,0.25,0.5,0.75,1]);
plt.xlabel('Input 1')
plt.ylabel('Input 2')
plt.xticks([0.5,1.5], ['0','1'])
plt.yticks([0.5,1.5], ['0','1']);

Looks like the network produces the correct answer in some cases but not others. The network is particularly confused when Input 2 is 0. Below I depict the weights after “training.” As you can see, they have not changed much from where they started before training.

1
2
3
4
5
6
7
8
9
10
11
12
13
Weights_1 = Weight_Dict['Weights_1']
Weights_2 = Weight_Dict['Weights_2']

Weight_Dict = {'Weights_1':Weights_1,'Weights_2':Weights_2}

network = NeuralNetwork()
network.add_layer(2,['Input 1','Input 2'],
                  [[round(x,2) for x in Weight_Dict['Weights_1'][0][:2]],
                   [round(x,2) for x in Weight_Dict['Weights_1'][1][:2]]])
network.add_layer(2,[round(Weight_Dict['Weights_1'][0][2],2),round(Weight_Dict['Weights_1'][1][2],2)],
                  [round(x,2) for x in Weight_Dict['Weights_2'][:2]])
network.add_layer(1,[round(Weight_Dict['Weights_2'][2],2)])
network.draw()

This network was unable to push itself out of the local minimum. While local minima are a problem, there are a couple of things we can do to avoid them. First, we should always train a network multiple times with different random weights in order to test for local minima. If the network continually finds local minima, then we can increase the learning rate. By increasing the learning rate, the network can escape local minima in some cases. This should be done with care though, as too big of a learning rate can also prevent finding the global minimum.
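To make the multiple-restart idea concrete, here is a rough sketch reusing the neural_net function and Train_Set from above (the iteration cap and the number of restarts are arbitrary choices): train from several random initializations and keep the weights from the run that ends with the lowest error.

from random import choice
import numpy as np

best_error, best_weights = np.inf, None
for restart in range(10): #train from 10 different random initializations
    Weights_1 = np.random.rand(2,3)-0.5 #random connections between input and hidden
    Weights_2 = np.random.rand(1,3)-0.5 #random connections between hidden and output

    Error = []
    for _ in range(20000): #cap the iterations so a stuck run cannot loop forever
        Train_Dict = neural_net(choice(Train_Set),Weights_1,Weights_2,Training=True)
        Error.append(abs(Train_Dict['error']))
        if len(Error) > 10 and np.mean(Error[-10:]) < 0.025: break

    final_error = float(np.mean(Error[-10:])) #error at the end of this run
    if final_error < best_error: #keep the run that ended with the smallest error
        best_error, best_weights = final_error, (Weights_1, Weights_2)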

Alright, that's it. Obviously the neural network behind AlphaGo is much more complex than this one, but I would guess that while AlphaGo is much larger, the basic computations underlying it are similar.

Hopefully these posts have given you an idea for how neural networks function and why they’re so cool!

An Introduction to Neural Networks: Part 1

We use our most advanced technologies as metaphors for the brain: The industrial revolution inspired descriptions of the brain as mechanical. The telephone inspired descriptions of the brain as a telephone switchboard. The computer inspired descriptions of the brain as a computer. Recently, we have reached a point where our most advanced technologies - such as AI (e.g., AlphaGo) - and our current understanding of the brain inform each other in an awesome synergy. Neural networks exemplify this synergy. Neural networks offer a relatively advanced description of the brain and are the software underlying some of our most advanced technology. As our understanding of the brain increases, neural networks become more sophisticated. As our understanding of neural networks increases, our understanding of the brain becomes more sophisticated.

With the recent success of neural networks, I thought it would be useful to write a few posts describing the basics of neural networks.

First, what are neural networks? Neural networks are a family of machine learning algorithms that can learn data's underlying structure. Neural networks are composed of many neurons that perform simple computations. By performing many simple computations, neural networks can answer even the most complicated problems.

Let's get started.

As usual, I will post this code as a jupyter notebook on my github.

1
2
3
4
import numpy as np #import important libraries.
from matplotlib import pyplot as plt
import pandas as pd
%matplotlib inline

When talking about neural networks, it’s nice to visualize the network with a figure. For drawing the neural networks, I forked a repository from miloharper and made some changes so that this repository could be imported into python and so that I could label the network. Here is my forked repository.

1
2
3
4
5
6
7
from visualise_neural_network import NeuralNetwork

network = NeuralNetwork() #create neural network object
network.add_layer(2,['Input A','Input B'],['Weight A','Weight B']) #create the input layer which has two neurons.
#Each input neuron has a single line extending to the next layer up
network.add_layer(1,['Output']) #create output layer - a single output neuron
network.draw() #draw the network

Above is our neural network. It has two input neurons and a single output neuron. In this example, I’ll give the network an input of [0 1]. This means Input A will receive an input value of 0 and Input B will have an input value of 1.

The input is the input unit’s activity. This activity is sent to the Output unit, but the activity changes when traveling to the Output unit. The weights between the input and output units change the activity. A large positive weight between the input and output units causes the input unit to send a large positive (excitatory) signal. A large negative weight between the input and output units causes the input unit to send a large negative (inhibitory) signal. A weight near zero means the input unit does not influence the output unit.
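As a tiny illustration with made-up weight values, the signal an input unit sends is just its activity multiplied by the connection weight:

input_activity = 1.0
for weight in [2.0, -2.0, 0.0]:
    print(input_activity*weight) #2.0 (excitatory), -2.0 (inhibitory), 0.0 (no influence)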

In order to know the Output unit's activity, we need to know its input. I will refer to the output unit's input as $net_{Output}$. Here is how we can calculate $net_{Output}$:

$net_{Output} = Input_A \cdot Weight_A + Input_B \cdot Weight_B$

A more general way of writing this is

$net = \sum_{i} Input_i \cdot Weight_i$

Let's pretend the inputs are [0 1] and the Weights are [0.25 0.5]. Here is the input to the output neuron:

$net_{Output} = 0 \cdot 0.25 + 1 \cdot 0.5 = 0.5$

Thus, the input to the output neuron is 0.5. A quick way of programming this is through the function numpy.dot, which finds the dot product of two vectors (or matrices). This might sound a little scary, but in this case it's just multiplying the items by each other and then summing everything up - like we did above.

1
2
3
4
5
Inputs = np.array([0, 1])
Weights = np.array([0.25, 0.5])

net_Output = np.dot(Inputs,Weights)
print net_Output
0.5

All this is good, but we haven't actually calculated the output unit's activity - we have only calculated its input. What makes neural networks able to solve complex problems is that they include a non-linearity when translating the input into activity. In this case, we will translate the input into activity by putting the input through a logistic function.

1
2
def logistic(x): #each neuron has a logistic activation function
    return 1.0/(1+np.exp(-x))

Let's take a look at a logistic function.

1
2
3
4
x = np.arange(-5,5,0.1) #create vector of numbers between -5 and 5
plt.plot(x,logistic(x))
plt.ylabel('Activation')
plt.xlabel('Input');

As you can see above, the logistic used here transforms negative values into values near 0 and positive values into values near 1. Thus, when a unit receives a negative input it has activity near zero, and when a unit receives a positive input it has activity near 1. The most important aspect of this activation function is that it's non-linear - it's not a straight line.
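To make the squashing concrete, here is a quick check using the logistic function defined above (the input values are arbitrary examples):

for val in [-3, 0, 3]:
    print(logistic(val)) #roughly 0.047, 0.5, and 0.953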

Now let's see the activity of our output neuron. Remember, the net input is 0.5.

1
2
3
4
5
6
Output_neuron = logistic(net_Output)
print Output_neuron
plt.plot(x,logistic(x));
plt.ylabel('Activation')
plt.xlabel('Input')
plt.plot(net_Output,Output_neuron,'ro');
0.622459331202

The activity of our output neuron is depicted as the red dot.

So far I’ve described how to find a unit’s activity, but I haven’t described how to find the weights of connections between units. In the example above, I chose the weights to be 0.25 and 0.5, but I can’t arbitrarily decide weights unless I already know the solution to the problem. If I want the network to find a solution for me, I need the network to find the weights itself.

In order to find the weights of connections between neurons, I will use an algorithm called backpropagation. In backpropagation, we have the neural network guess the answer to a problem and adjust the weights so that this guess gets closer and closer to the correct answer. Backpropagation is the method by which we reduce the distance between guesses and the correct answer. After many iterations of guesses by the neural network and weight adjustments through backpropagation, the network can learn an answer to a problem.

Let's say we want our neural network to give an answer of 0 when the left input unit is active and an answer of 1 when the right unit is active. In this case the inputs I will use are [1,0] and [0,1]. The corresponding correct answers will be [0] and [1].

Let's see how close our network is to the correct answer. I am using the weights from above ([0.25, 0.5]).

1
2
3
4
5
6
7
8
9
10
11
Inputs = [[1,0],[0,1]]
Answers = [0,1,]

Guesses = [logistic(np.dot(x,Weights)) for x in Inputs] #loop through inputs and find logistic(sum(input*weights))
plt.plot(Guesses,'bo')
plt.plot(Answers,'ro')
plt.axis([-0.5,1.5,-0.5,1.5])
plt.ylabel('Activation')
plt.xlabel('Input #')
plt.legend(['Guesses','Answers']);
print Guesses
[0.56217650088579807, 0.62245933120185459]

The guesses are in blue and the answers are in red. As you can tell, the guesses and the answers look almost nothing alike. Our network likes to guess around 0.6 while the correct answer is 0 in the first example and 1 in the second.

Let's look at how backpropagation reduces the distance between our guesses and the correct answers.

First, we want to know how the amount of error changes with an adjustment to a given weight. We can write this as

$\frac{\partial Error}{\partial Weight}$

This change in error with changes in the weights has a number of different sub components.

  • Changes in error with changes in the output unit’s activity: $\frac{\partial Error}{\partial Output}$
  • Changes in the output unit’s activity with changes in this unit’s input: $\frac{\partial Output}{\partial net_{Output}}$
  • Changes in the output unit’s input with changes in the weight: $\frac{\partial net_{Output}}{\partial Weight}$

Through the chain rule we know

$\frac{\partial Error}{\partial Weight} = \frac{\partial Error}{\partial Output} \cdot \frac{\partial Output}{\partial net_{Output}} \cdot \frac{\partial net_{Output}}{\partial Weight}$

This might look scary, but with a little thought it should make sense: (starting with the final term and moving left) when we change the weight of a connection to a unit, we change the input to that unit. When we change the input to a unit, we change its activity (written Output above). When we change a unit's activity, we change the amount of error.

Let’s break this down using our example. During this portion, I am going to gloss over some details about how exactly to derive the partial derivatives. Wikipedia has a more complete derivation.

In the first example, the input is [1,0] and the correct answer is [0]. Our network’s guess in this example was about 0.56.

Working through each term with these numbers: the change in error with changes in the output unit's activity is $-(target - Output)$; the change in the output unit's activity with changes in its input is $Output(1 - Output)$ - please note that this is specific to our example with a logistic activation function; and the change in the output unit's input with changes in the weight is simply the activity of the corresponding input unit.

To summarize:

$\frac{\partial Error}{\partial Weight} = -(target - Output) \cdot Output(1 - Output) \cdot Input$

This is the direction we want to move in, but taking large steps in this direction can prevent us from finding the optimal weights. For this reason, we reduce our step size. We will reduce our step size with a parameter called the learning rate ($\alpha$). $\alpha$ is bound between 0 and 1.

Here is how we can write our change in weights:

$\Delta Weight = -\alpha \cdot \frac{\partial Error}{\partial Weight}$

This is known as the delta rule.

We will set $\alpha$ to be 0.5. Here is how we will calculate the new weights.
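Here is a small sketch of that calculation in code, assuming the guess of roughly 0.562 from above for the input [1, 0] (the variable names here are just for illustration):

import numpy as np

target, Output = 0.0, 0.562 #first example: input [1, 0], correct answer 0
alpha = 0.5 #learning rate
Weights = np.array([0.25, 0.5])
Input = np.array([1.0, 0.0])

delta = -(target-Output)*Output*(1-Output) #the quantity the delta_Output function below computes
new_Weights = Weights - alpha*delta*Input #delta rule: only the active input's weight changes
print(new_Weights) #roughly [0.18, 0.5]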

Thus, the weight from the first input is shrinking, which will move the output towards 0. Below I write the code to implement our backpropagation.

1
2
3
4
5
6
7
alpha = 0.5

def delta_Output(target,Output):
    return -(target-Output)*Output*(1-Output) #find the amount of error and derivative of activation function

def update_weights(alpha,delta,unit_input):
    return alpha*np.outer(delta,unit_input) #multiply delta output by all the inputs and then multiply these by the learning rate

Above I use the outer product of our delta function and the input in order to spread the weight changes to all lines connecting to the output unit.
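As a tiny illustration of that outer product (using a made-up delta value), only the weight attached to the active input receives a change:

print(np.outer([0.138], [1.0, 0.0])) #[[ 0.138  0.   ]] - the inactive input's weight is untouched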

Okay, hopefully you made it through that. I promise that's as bad as it gets. Now that we've gotten through the nasty stuff, let's use backpropagation to find an answer to our problem.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
def network_guess(Input,Weights):
    return logistic(np.dot(Input,Weights.T)) #input by weights then through a logistic

def back_prop(Input,Output,target,Weights):
    delta = delta_Output(target,Output) #find delta portion
    delta_weight = update_weights(alpha,delta,Input) #find amount to update weights
    Weights = np.atleast_2d(Weights) #convert weights to array
    Weights += -delta_weight #update weights
    return Weights

from random import choice, seed
seed(1) #seed random number generator so that these results can be replicated

Weights = np.array([0.25, 0.5])

Error = []
while True:

    Trial_Type = choice([0,1]) #generate random number to choose between the two inputs

    Input = np.atleast_2d(Inputs[Trial_Type]) #choose input and convert to array
    Answer = Answers[Trial_Type] #get the correct answer

    Output = network_guess(Input,Weights) #compute the networks guess
    Weights = back_prop(Input,Output,Answer,Weights) #change the weights based on the error

    Error.append(abs(Output-Answer)) #record error

    if len(Error) > 6 and np.mean(Error[-5:]) < 0.05: break #tell the code to stop iterating when mean error is < 0.05 in the last 5 guesses

It seems our code has found an answer, so let's see how the amount of error changed as the code progressed.

1
2
3
4
Error_vec = np.array(Error)[:,0]
plt.plot(Error_vec)
plt.ylabel('Error')
plt.xlabel('Iteration #');

It looks like the while loop executed about 1000 iterations before converging. As you can see, the error decreases - quickly at first, then slowly as the weights zero in on the correct answer. Let's see how our guesses compare to the correct answers.

1
2
3
4
5
6
7
8
9
10
11
Inputs = [[1,0],[0,1]]
Answers = [0,1,]

Guesses = [logistic(np.dot(x,Weights.T)) for x in Inputs] #loop through inputs and find logistic(sum(input*weights))
plt.plot(Guesses,'bo')
plt.plot(Answers,'ro')
plt.axis([-0.5,1.5,-0.5,1.5])
plt.ylabel('Activation')
plt.xlabel('Input #')
plt.legend(['Guesses','Answers']);
print Guesses
[array([ 0.05420561]), array([ 0.95020512])]

Not bad! Our guesses are much closer to the correct answers than before we started running the backpropagation procedure! Now, you might say, “HEY! But you haven’t reached the correct answers.” That’s true, but note that achieving the values of 0 and 1 with a logistic function is only possible at $-\infty$ and $+\infty$, respectively. Because of this, we treat 0.05 as 0 and 0.95 as 1.

Okay, all this is great, but that was a really simple problem, and I said that neural networks could solve interesting problems!

Well… this post is already longer than I anticipated. I will follow-up this post with another post explaining how we can expand neural networks to solve more interesting problems.

Revisiting NBA Career Predictions From Rookie Performance

In this post I wanted to do a quick follow up to a previous post about predicting career nba performance from rookie year data.

After my previous post, I started to get a little worried about my career prediction model. Specifically, I started to wonder about whether my model was underfitting or overfitting the data. Underfitting occurs when the model has too much “bias” and cannot accommodate the data’s shape. Overfitting occurs when the model is too flexible and can account for all variance in a data set - even variance due to noise. In this post, I will quickly re-create my player prediction model, and investigate whether underfitting and overfitting are a problem.

Because this post largely repeats a previous one, I haven’t written quite as much about the code. If you would like to read more about the code, see my previous posts.

As usual, I will post all code as a jupyter notebook on my github.

1
2
3
4
5
6
7
8
9
#import some libraries and tell ipython we want inline figures rather than interactive figures.
import matplotlib.pyplot as plt, pandas as pd, numpy as np, matplotlib as mpl

from __future__ import print_function

%matplotlib inline
pd.options.display.mpl_style = 'default' #load matplotlib for plotting
plt.style.use('ggplot') #im addicted to ggplot. so pretty.
mpl.rcParams['font.family'] = ['Bitstream Vera Sans']

Load the data. Reminder - this data is still available on my github.

1
2
3
4
5
6
7
8
rookie_df = pd.read_pickle('nba_bballref_rookie_stats_2016_Mar_15.pkl') #here's the rookie year data

rook_games = rookie_df['Career Games']>50
rook_year = rookie_df['Year']>1980

#remove rookies from before 1980 and who have played less than 50 games. I also remove some features that seem irrelevant or unfair
rookie_df_games = rookie_df[rook_games & rook_year] #only players with more than 50 games.
rookie_df_drop = rookie_df_games.drop(['Year','Career Games','Name'],1)

Load more data, and normalize it for the PCA transformation.

1
2
3
4
5
6
7
8
from sklearn.preprocessing import StandardScaler

df = pd.read_pickle('nba_bballref_career_stats_2016_Mar_15.pkl')
df = df[df['G']>50]
df_drop = df.drop(['Year','Name','G','GS','MP','FG','FGA','FG%','3P','2P','FT','TRB','PTS','ORtg','DRtg','PER','TS%','3PAr','FTr','ORB%','DRB%','TRB%','AST%','STL%','BLK%','TOV%','USG%','OWS','DWS','WS','WS/48','OBPM','DBPM','BPM','VORP'],1)
X = df_drop.as_matrix() #take data out of dataframe
ScaleModel = StandardScaler().fit(X)
X = ScaleModel.transform(X)

Use k-means to group players according to their performance. See my post on grouping players for more info.

1
2
3
4
5
6
7
8
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

reduced_model = PCA(n_components=5, whiten=True).fit(X)

reduced_data = reduced_model.transform(X) #transform data into the 5 PCA components space
final_fit = KMeans(n_clusters=6).fit(reduced_data) #fit 6 clusters
df['kmeans_label'] = final_fit.labels_ #label each data point with its clusters

Run a separate regression on each group of players. I calculate mean absolute error (an alternative to mean squared error) for each model. I used mean absolute error because it’s on the same scale as the data, and easier to interpret. I will use this later to evaluate just how accurate these models are. Quick reminder - I am trying to predict career WS/48 with MANY predictor variables from rookie year performance, such as rebounding and scoring statistics.
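As a quick reminder of what this metric computes, here is a minimal sketch of mean absolute error on made-up numbers:

import numpy as np

actual = np.array([0.10, 0.05, 0.15]) #hypothetical career WS/48 values
predicted = np.array([0.12, 0.02, 0.16]) #hypothetical model predictions
print(np.mean(np.abs(actual-predicted))) #0.02 - the average size of the miss, in WS/48 units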

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
import statsmodels.api as sm
from sklearn.metrics import mean_absolute_error #import function for calculating mean absolute error.

X = rookie_df.as_matrix() #take data out of dataframe

cluster_labels = df[df['Year']>1980]['kmeans_label']
rookie_df_drop['kmeans_label'] = cluster_labels #label each data point with its clusters

plt.figure(figsize=(8,6));

estHold = [[],[],[],[],[],[]]

score = []

for i,group in enumerate(np.unique(final_fit.labels_)):

    Grouper = df['kmeans_label']==group #do one regression at a time
    Yearer = df['Year']>1980

    Group1 = df[Grouper & Yearer]
    Y = Group1['WS/48'] #get predictor data

    Group1_rookie = rookie_df_drop[rookie_df_drop['kmeans_label']==group]
    Group1_rookie = Group1_rookie.drop(['kmeans_label'],1) #get predicted data

    X = Group1_rookie.as_matrix() #take data out of dataframe    

    X = sm.add_constant(X)  # Adds a constant term to the predictor
    est = sm.OLS(Y,X) #fit with linear regression model
    est = est.fit()
    estHold[i] = est
    score.append(mean_absolute_error(Y,est.predict(X))) #calculate the mean absolute error
    #print est.summary()

    plt.subplot(3,2,i+1) #plot each regression's prediction against actual data
    plt.plot(est.predict(X),Y,'o',color=plt.rcParams['axes.color_cycle'][i])
    plt.plot(np.arange(-0.1,0.25,0.01),np.arange(-0.1,0.25,0.01),'-')
    plt.title('Group %d'%(i+1))
    plt.text(0.15,-0.05,'$r^2$=%.2f'%est.rsquared)
    plt.xticks([0.0,0.12,0.25])
    plt.yticks([0.0,0.12,0.25]);

More quick reminders - predicted performances are on the Y-axis, actual performances are on the X-axis, and the red line is the identity line. Thus far, everything has been exactly the same as my previous post (although my group labels are different).

I want to investigate whether the model is overfitting the data. If the model is overfitting the data, then the error should go up when training and testing with different datasets (because the model was fitting itself to noise, and noise changes when the datasets change). To investigate whether the model overfits the data, I will evaluate whether the model “generalizes” via cross-validation.

The reason I’m worried about overfitting is that I used a LOT of predictors in these models, and this number of predictors might have allowed the model to fit noise in the predictors.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
from sklearn.linear_model import LinearRegression #I am using sklearns linear regression because it plays well with their cross validation function
from sklearn import cross_validation #import the cross validation function

X = rookie_df.as_matrix() #take data out of dataframe

cluster_labels = df[df['Year']>1980]['kmeans_label']
rookie_df_drop['kmeans_label'] = cluster_labels #label each data point with its clusters

for i,group in enumerate(np.unique(final_fit.labels_)):

    Grouper = df['kmeans_label']==group #do one regression at a time
    Yearer = df['Year']>1980

    Group1 = df[Grouper & Yearer]
    Y = Group1['WS/48'] #get predictor data

    Group1_rookie = rookie_df_drop[rookie_df_drop['kmeans_label']==group]
    Group1_rookie = Group1_rookie.drop(['kmeans_label'],1) #get predicted data

    X = Group1_rookie.as_matrix() #take data out of dataframe    
    X = sm.add_constant(X)  # Adds a constant term to the predictor

    est = LinearRegression() #fit with linear regression model

    this_scores = cross_validation.cross_val_score(est, X, Y,cv=10, scoring='mean_absolute_error') #find mean absolute error across different datasets via cross validation
    print('Group '+str(i))
    print('Initial Mean Absolute Error: '+str(score[i])[0:6])
    print('Cross Validation MAE: '+str(np.median(np.abs(this_scores)))[0:6]) #find the median MAE across validations
Group 0
Initial Mean Absolute Error: 0.0161
Cross Validation MAE: 0.0520
Group 1
Initial Mean Absolute Error: 0.0251
Cross Validation MAE: 0.0767
Group 2
Initial Mean Absolute Error: 0.0202
Cross Validation MAE: 0.0369
Group 3
Initial Mean Absolute Error: 0.0200
Cross Validation MAE: 0.0263
Group 4
Initial Mean Absolute Error: 0.0206
Cross Validation MAE: 0.0254
Group 5
Initial Mean Absolute Error: 0.0244
Cross Validation MAE: 0.0665

Above I print out each model’s initial mean absolute error and the median mean absolute error across its cross-validation folds.

The models definitely have more error when cross validated. The change in error is worse in some groups than others. For instance, error dramatically increases in Group 1. Keep in mind that the scoring measure here is mean absolute error, so error is in the same scale as WS/48. An average error of 0.04 in WS/48 is sizable, leaving me worried that the models overfit the data.

Unfortunately, Group 1 is the “scorers” group, so the group with the most interesting players is where the model fails most…

Next, I will look into whether my models underfit the data. I am worried that my models underfit the data because I used linear regression, which has very little flexibility. To investigate this, I will plot the residuals of each model. Residuals are the error between my model’s prediction and the actual performance.

Linear regression assumes that residuals are uncorrelated and evenly distributed around 0. If this is not the case, then the linear regression is underfitting the data.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#plot the residuals. there's obviously a problem with under/over prediction

plt.figure(figsize=(8,6));

for i,group in enumerate(np.unique(final_fit.labels_)):

    Grouper = df['kmeans_label']==group #do one regression at a time
    Yearer = df['Year']>1980

    Group1 = df[Grouper & Yearer]
    Y = Group1['WS/48'] #get predictor data
    resid = estHold[i].resid #extract residuals

    plt.subplot(3,2,i+1) #plot each regression's prediction against actual data
    plt.plot(Y,resid,'o',color=plt.rcParams['axes.color_cycle'][i])
    plt.title('Group %d'%(i+1))
    plt.xticks([0.0,0.12,0.25])
    plt.yticks([-0.1,0.0,0.1]);

Residuals are on the Y-axis and career performances are on the X-axis. Negative residuals are over-predictions (the player is worse than my model predicts) and positive residuals are under-predictions (the player is better than my model predicts). I don’t test this, but the residuals appear VERY correlated. That is, the model tends to overestimate bad players (players with WS/48 less than 0.0) and underestimate good players. Just to clarify, non-correlated residuals would have no apparent slope.

This means the model is making systematic errors and not fitting the actual shape of the data. I’m not going to say the model is damned, but this is an obvious sign that the model needs more flexibility.
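For anyone who wants a formal check, here is a rough sketch (assuming scipy is available) that correlates each group's residuals with career WS/48 - a strong correlation would confirm the systematic over- and under-prediction:

from scipy.stats import pearsonr

for i,est in enumerate(estHold):
    Y = df[(df['kmeans_label']==i) & (df['Year']>1980)]['WS/48'] #career WS/48 for this group
    r,p = pearsonr(Y,est.resid) #correlation between performance and residual
    print('Group %d: r = %.2f, p = %.3f'%(i+1,r,p))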

No model is perfect, but this model definitely needs more work. I’ve been playing with more flexible models and will post these models here if they do a better job predicting player performance.

Predicting Career Performance From Rookie Performance

As a huge t-wolves fan, I’ve been curious all year about what we can infer from Karl-Anthony Towns’ great rookie season. To answer this question, I’ve created a simple linear regression model that uses rookie year performance to predict career performance.

Many have attempted to predict NBA players’ success via regression style approaches. Notable models I know of include Layne Vashro’s model which uses combine and college performance to predict career performance. Layne Vashro’s model is a quasi-poisson GLM. I tried a similar approach, but had the most success when using ws/48 and OLS. I will discuss this a little more at the end of the post.

A jupyter notebook of this post can be found on my github.

1
2
3
4
5
6
7
8
9
#import some libraries and tell ipython we want inline figures rather than interactive figures.
import matplotlib.pyplot as plt, pandas as pd, numpy as np, matplotlib as mpl

from __future__ import print_function

%matplotlib inline
pd.options.display.mpl_style = 'default' #load matplotlib for plotting
plt.style.use('ggplot') #im addicted to ggplot. so pretty.
mpl.rcParams['font.family'] = ['Bitstream Vera Sans']

I collected all the data for this project from basketball-reference.com. I posted the functions for collecting the data on my github. The data is also posted there. Beware, the data collection scripts take awhile to run.

This data includes per 36 stats and advanced statistics such as usage percentage. I simply took all the per 36 and advanced statistics from a player’s page on basketball-reference.com.

1
2
df = pd.read_pickle('nba_bballref_career_stats_2016_Mar_15.pkl') #here's the career data.
rookie_df = pd.read_pickle('nba_bballref_rookie_stats_2016_Mar_15.pkl') #here's the rookie year data

The variable I am trying to predict is average WS/48 over a player’s career. There’s no perfect box-score statistic when it comes to quantifying a player’s performance, but ws/48 seems relatively solid.

1
2
3
4
5
6
7
8
Games = df['G']>50 #only using players who played in more than 50 games.
Year = df['Year']>1980 #only using players after 1980 when they started keeping many important records such as games started

Y = df[Games & Year]['WS/48'] #predicted variable

plt.hist(Y);
plt.ylabel('Bin Count')
plt.xlabel('WS/48');

The predicted variable looks pretty gaussian, so I can use ordinary least squares. This will be nice because while OLS is not flexible, it’s highly interpretable. At the end of the post I’ll mention some more complex models that I will try.
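Eyeballing the histogram is good enough for me here, but a more formal check is possible. Below is a minimal sketch using scipy's D'Agostino-Pearson normality test (choosing this particular test is my own assumption):

from scipy.stats import normaltest

stat, p = normaltest(Y) #D'Agostino and Pearson's omnibus test of normality
print('p = %.3f'%p) #a very small p-value would argue against treating WS/48 as gaussian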

1
2
3
4
5
6
rook_games = rookie_df['Career Games']>50
rook_year = rookie_df['Year']>1980

#remove rookies from before 1980 and who have played less than 50 games. I also remove some features that seem irrelevant or unfair
rookie_df_games = rookie_df[rook_games & rook_year] #only players with more than 50 games.
rookie_df_drop = rookie_df_games.drop(['Year','Career Games','Name'],1)

Above, I remove some predictors from the rookie data. Let’s run the regression!

1
2
3
4
5
6
7
8
import statsmodels.api as sm

X_rookie = rookie_df_drop.as_matrix() #take data out of dataframe
X_rookie = sm.add_constant(X_rookie)  # Adds a constant term to the predictor

estAll = sm.OLS(Y,X_rookie) #create ordinary least squares model
estAll = estAll.fit() #fit the model
print(estAll.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  WS/48   R-squared:                       0.476
Model:                            OLS   Adj. R-squared:                  0.461
Method:                 Least Squares   F-statistic:                     31.72
Date:                Sun, 20 Mar 2016   Prob (F-statistic):          2.56e-194
Time:                        15:29:43   Log-Likelihood:                 3303.9
No. Observations:                1690   AIC:                            -6512.
Df Residuals:                    1642   BIC:                            -6251.
Df Model:                          47                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.2509      0.078      3.223      0.001         0.098     0.404
x1            -0.0031      0.001     -6.114      0.000        -0.004    -0.002
x2            -0.0004   9.06e-05     -4.449      0.000        -0.001    -0.000
x3            -0.0003   8.12e-05     -3.525      0.000        -0.000    -0.000
x4          1.522e-05   4.73e-06      3.218      0.001      5.94e-06  2.45e-05
x5             0.0030      0.031      0.096      0.923        -0.057     0.063
x6             0.0109      0.019      0.585      0.559        -0.026     0.047
x7            -0.0312      0.094     -0.331      0.741        -0.216     0.154
x8             0.0161      0.027      0.594      0.553        -0.037     0.069
x9            -0.0054      0.018     -0.292      0.770        -0.041     0.031
x10            0.0012      0.007      0.169      0.866        -0.013     0.015
x11            0.0136      0.023      0.592      0.554        -0.031     0.059
x12           -0.0099      0.018     -0.538      0.591        -0.046     0.026
x13            0.0076      0.054      0.141      0.888        -0.098     0.113
x14            0.0094      0.012      0.783      0.433        -0.014     0.033
x15            0.0029      0.002      1.361      0.174        -0.001     0.007
x16            0.0078      0.009      0.861      0.390        -0.010     0.026
x17           -0.0107      0.019     -0.573      0.567        -0.047     0.026
x18           -0.0062      0.018     -0.342      0.732        -0.042     0.029
x19            0.0095      0.017      0.552      0.581        -0.024     0.043
x20            0.0111      0.004      2.853      0.004         0.003     0.019
x21            0.0109      0.018      0.617      0.537        -0.024     0.046
x22           -0.0139      0.006     -2.165      0.030        -0.026    -0.001
x23            0.0024      0.005      0.475      0.635        -0.008     0.012
x24            0.0022      0.001      1.644      0.100        -0.000     0.005
x25           -0.0125      0.012     -1.027      0.305        -0.036     0.011
x26           -0.0006      0.000     -1.782      0.075        -0.001  5.74e-05
x27           -0.0011      0.001     -1.749      0.080        -0.002     0.000
x28            0.0012      0.003      0.487      0.626        -0.004     0.006
x29            0.1824      0.089      2.059      0.040         0.009     0.356
x30           -0.0288      0.025     -1.153      0.249        -0.078     0.020
x31           -0.0128      0.011     -1.206      0.228        -0.034     0.008
x32           -0.0046      0.008     -0.603      0.547        -0.020     0.010
x33           -0.0071      0.005     -1.460      0.145        -0.017     0.002
x34            0.0131      0.012      1.124      0.261        -0.010     0.036
x35           -0.0023      0.001     -2.580      0.010        -0.004    -0.001
x36           -0.0077      0.013     -0.605      0.545        -0.033     0.017
x37            0.0069      0.004      1.916      0.055        -0.000     0.014
x38           -0.0015      0.001     -2.568      0.010        -0.003    -0.000
x39           -0.0002      0.002     -0.110      0.912        -0.005     0.004
x40           -0.0109      0.017     -0.632      0.528        -0.045     0.023
x41           -0.0142      0.017     -0.821      0.412        -0.048     0.020
x42            0.0217      0.017      1.257      0.209        -0.012     0.056
x43            0.0123      0.102      0.121      0.904        -0.188     0.213
x44            0.0441      0.018      2.503      0.012         0.010     0.079
x45            0.0406      0.018      2.308      0.021         0.006     0.075
x46           -0.0410      0.018     -2.338      0.020        -0.075    -0.007
x47            0.0035      0.003      1.304      0.192        -0.002     0.009
==============================================================================
Omnibus:                       42.820   Durbin-Watson:                   1.966
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               54.973
Skew:                           0.300   Prob(JB):                     1.16e-12
Kurtosis:                       3.649   Cond. No.                     1.88e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.88e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

There’s a lot to look at in the regression output (especially with this many features). For an explanation of all the different parts of the regression take a look at this post. Below is a quick plot of predicted ws/48 against actual ws/48.

1
2
3
4
plt.plot(estAll.predict(X_rookie),Y,'o')
plt.plot(np.arange(0,0.25,0.01),np.arange(0,0.25,0.01),'b-')
plt.ylabel('Career WS/48')
plt.xlabel('Predicted WS/48');

The blue line above is NOT the best-fit line. It’s the identity line. I plot it to help visualize where the model fails. The model seems to primarily fail in the extremes - it tends to overestimate the worst players.

All in all, this model does a remarkably good job given its simplicity (linear regression), but it also leaves a lot of variance unexplained.

One reason this model might miss some variance is there’s more than one way to be a productive basketball player. For instance, Dwight Howard and Steph Curry find very different ways to contribute. One linear regression model is unlikely to successfully predict both players.

In a previous post, I grouped players according to their on-court performance. These player groupings might help predict career performance.

Below, I will use the same player grouping I developed in my previous post, and examine how these groupings impact my ability to predict career performance.

1
2
3
4
5
6
7
8
from sklearn.preprocessing import StandardScaler

df = pd.read_pickle('nba_bballref_career_stats_2016_Mar_15.pkl')
df = df[df['G']>50]
df_drop = df.drop(['Year','Name','G','GS','MP','FG','FGA','FG%','3P','2P','FT','TRB','PTS','ORtg','DRtg','PER','TS%','3PAr','FTr','ORB%','DRB%','TRB%','AST%','STL%','BLK%','TOV%','USG%','OWS','DWS','WS','WS/48','OBPM','DBPM','BPM','VORP'],1)
X = df_drop.as_matrix() #take data out of dataframe
ScaleModel = StandardScaler().fit(X)
X = ScaleModel.transform(X)
1
2
3
4
5
6
7
8
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

reduced_model = PCA(n_components=5, whiten=True).fit(X)

reduced_data = reduced_model.transform(X) #transform data into the 5 PCA components space
final_fit = KMeans(n_clusters=6).fit(reduced_data) #fit 6 clusters
df['kmeans_label'] = final_fit.labels_ #label each data point with its clusters

See my other post for more details about this clustering procedure.

Let’s see how WS/48 varies across the groups.

1
2
WS_48 = [df[df['kmeans_label']==x]['WS/48'] for x in np.unique(df['kmeans_label'])] #create a vector of ws/48. One for each cluster
plt.boxplot(WS_48);

Some groups perform better than others, but there’s lots of overlap between the groups. Importantly, each group has a fair amount of variability. Each group spans at least 0.15 WS/48. This gives the regression enough room to successfully predict performance in each group.

Now, let’s get a bit of a refresher on what the groups are. Again, my previous post has a good description of these groups.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
TS = [np.mean(df[df['kmeans_label']==x]['TS%'])*100 for x in np.unique(df['kmeans_label'])] #create vectors of each stat for each cluster
ThreeAr = [np.mean(df[df['kmeans_label']==x]['3PAr'])*100 for x in np.unique(df['kmeans_label'])]
FTr = [np.mean(df[df['kmeans_label']==x]['FTr'])*100 for x in np.unique(df['kmeans_label'])]
RBD = [np.mean(df[df['kmeans_label']==x]['TRB%']) for x in np.unique(df['kmeans_label'])]
AST = [np.mean(df[df['kmeans_label']==x]['AST%']) for x in np.unique(df['kmeans_label'])]
STL = [np.mean(df[df['kmeans_label']==x]['STL%']) for x in np.unique(df['kmeans_label'])]
TOV = [np.mean(df[df['kmeans_label']==x]['TOV%']) for x in np.unique(df['kmeans_label'])]
USG = [np.mean(df[df['kmeans_label']==x]['USG%']) for x in np.unique(df['kmeans_label'])]

Data = np.array([TS,ThreeAr,FTr,RBD,AST,STL,TOV,USG])
ind = np.arange(1,9)

plt.figure(figsize=(16,8))
plt.plot(ind,Data,'o-',linewidth=2)
plt.xticks(ind,('True Shooting', '3 point Attempt', 'Free Throw Rate', 'Rebound', 'Assist','Steal','TOV','Usage'),rotation=45)
plt.legend(('Group 1','Group 2','Group 3','Group 4','Group 5','Group 6'))
plt.ylabel('Percent');

I’ve plotted the groups across a number of useful categories. For information about these categories see basketball reference’s glossary.

Here’s a quick rehash of the groupings. See my previous post for more detail.

  • **Group 1:** These are the distributors who shoot a fair number of threes, don't rebound at all, dish out assists, gather steals, and ...turn the ball over.
  • **Group 2:** These are the scorers who get to the free throw line, dish out assists, and carry a high usage.
  • **Group 3:** These are the bench players who don't score...or do much in general.
  • **Group 4:** These are the 3 point shooters who shoot tons of 3 pointers, almost no free throws, and don't rebound well.
  • **Group 5:** These are the mid-range shooters who shoot well, but don't shoot threes or draw free throws
  • **Group 6:** These are the defensive big men who shoot no threes, rebound lots, and carry a low usage.

On to the regression.

1
2
3
4
5
6
7
8
rookie_df = pd.read_pickle('nba_bballref_rookie_stats_2016_Mar_15.pkl')
rookie_df = rookie_df.drop(['Year','Career Games','Name'],1)

X = rookie_df.as_matrix() #take data out of dataframe
ScaleRookie = StandardScaler().fit(X) #scale data
X = ScaleRookie.transform(X) #transform data to scale

reduced_model_rookie = PCA(n_components=10).fit(X) #create pca model of first 10 components.

You might have noticed the giant condition number in the regression above. This indicates significant multicollinearity of the features, which isn’t surprising since I have many features that reflect the same abilities.

The multicollinearity doesn’t prevent the regression model from making accurate predictions, but it does make the beta weight estimates erratic. With erratic beta weights, it’s hard to tell whether the different clusters use different models when predicting career ws/48.

In the following regression, I put the predicting features through a PCA and keep only the first 10 PCA components. Using only the first 10 PCA components keeps the condition number below 20, indicating that multicollinearity is not a problem. I then examine whether the different groups exhibit different patterns of beta weights (whether different models predict success for the different groups).
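As a quick sanity check on that condition number, one could compute it directly on the PCA-reduced design matrix (a rough sketch reusing the objects defined above):

X_check = reduced_model_rookie.transform(ScaleRookie.transform(rookie_df.as_matrix()))
X_check = sm.add_constant(X_check) #constant term plus the 10 PCA components
print(np.linalg.cond(X_check)) #condition number of the reduced predictor matrix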

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
cluster_labels = df[df['Year']>1980]['kmeans_label'] #limit labels to players after 1980
rookie_df_drop['kmeans_label'] = cluster_labels #label each data point with its clusters

estHold = [[],[],[],[],[],[]]

for i,group in enumerate(np.unique(final_fit.labels_)):

    Grouper = df['kmeans_label']==group #do regression one group at a time
    Yearer = df['Year']>1980

    Group1 = df[Grouper & Yearer]
    Y = Group1['WS/48'] #get predicted data

    Group1_rookie = rookie_df_drop[rookie_df_drop['kmeans_label']==group] #get predictor data of group
    Group1_rookie = Group1_rookie.drop(['kmeans_label'],1)

    X = Group1_rookie.as_matrix() #take data out of dataframe
    X = ScaleRookie.transform(X) #scale data

    X = reduced_model_rookie.transform(X) #transform data into the 10 PCA components space

    X = sm.add_constant(X)  # Adds a constant term to the predictor
    est = sm.OLS(Y,X) #create regression model
    est = est.fit()
    #print(est.summary())
    estHold[i] = est
1
2
3
4
5
6
7
8
9
plt.figure(figsize=(12,6)) #plot the beta weights
width=0.12
for i,est in enumerate(estHold):
    plt.bar(np.arange(11)+width*i,est.params,color=plt.rcParams['axes.color_cycle'][i],width=width,yerr=(est.conf_int()[1]-est.conf_int()[0])/2)

plt.xlim(right=11)
plt.xlabel('Principle Components')
plt.legend(('Group 1','Group 2','Group 3','Group 4','Group 5','Group 6'))
plt.ylabel('Beta Weights');

Above I plot the beta weights for each principal component across the groupings. This plot is a lot to look at, but I wanted to depict how the beta values changed across the groups. They are not drastically different, but they’re also not identical. Error bars depict 95% confidence intervals.

Below I fit a regression to each group, but with all the features. Again, multicollinearity will be a problem, but this will not decrease the regression’s accuracy, which is all I really care about.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
X = rookie_df.as_matrix() #take data out of dataframe

cluster_labels = df[df['Year']>1980]['kmeans_label']
rookie_df_drop['kmeans_label'] = cluster_labels #label each data point with its clusters

plt.figure(figsize=(8,6));

estHold = [[],[],[],[],[],[]]

for i,group in enumerate(np.unique(final_fit.labels_)):

    Grouper = df['kmeans_label']==group #do one regression at a time
    Yearer = df['Year']>1980

    Group1 = df[Grouper & Yearer]
    Y = Group1['WS/48'] #get predictor data

    Group1_rookie = rookie_df_drop[rookie_df_drop['kmeans_label']==group]
    Group1_rookie = Group1_rookie.drop(['kmeans_label'],1) #get predicted data

    X = Group1_rookie.as_matrix() #take data out of dataframe    

    X = sm.add_constant(X)  # Adds a constant term to the predictor
    est = sm.OLS(Y,X) #fit with linear regression model
    est = est.fit()
    estHold[i] = est
    #print est.summary()

    plt.subplot(3,2,i+1) #plot each regression's prediction against actual data
    plt.plot(est.predict(X),Y,'o',color=plt.rcParams['axes.color_cycle'][i])
    plt.plot(np.arange(-0.1,0.25,0.01),np.arange(-0.1,0.25,0.01),'-')
    plt.title('Group %d'%(i+1))
    plt.text(0.15,-0.05,'$r^2$=%.2f'%est.rsquared)
    plt.xticks([0.0,0.12,0.25])
    plt.yticks([0.0,0.12,0.25]);

The plots above depict each regression’s predictions against actual ws/48. I provide each model’s r^2 in the plot too.

Some regressions are better than others. For instance, the regression model does a pretty awesome job predicting the bench warmers…I wonder if this is because they have shorter careers… The regression model does not do a good job predicting the 3-point shooters.

Now onto the fun stuff though.

Below, I create a function for predicting a player’s career WS/48. The function finds what cluster a player would belong to, and what that cluster’s regression model predicts for the player’s career (with 95% confidence intervals).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
def player_prediction__regressionModel(PlayerName):
    from statsmodels.sandbox.regression.predstd import wls_prediction_std

    clust_df = pd.read_pickle('nba_bballref_career_stats_2016_Mar_05.pkl')
    clust_df = clust_df[clust_df['Name']==PlayerName]
    clust_df = clust_df.drop(['Name','G','GS','MP','FG','FGA','FG%','3P','2P','FT','TRB','PTS','ORtg','DRtg','PER','TS%','3PAr','FTr','ORB%','DRB%','TRB%','AST%','STL%','BLK%','TOV%','USG%','OWS','DWS','WS','WS/48','OBPM','DBPM','BPM','VORP'],1)
    new_vect = ScaleModel.transform(clust_df.as_matrix()[0])
    reduced_data = reduced_model.transform(new_vect)
    Group = final_fit.predict(reduced_data)
    clust_df['kmeans_label'] = Group[0]

    Predrookie_df = pd.read_pickle('nba_bballref_rookie_stats_2016_Mar_15.pkl')
    Predrookie_df = Predrookie_df[Predrookie_df['Name']==PlayerName]
    Predrookie_df = Predrookie_df.drop(['Year','Career Games','Name'],1)
    predX = Predrookie_df.as_matrix() #take data out of dataframe
    predX = sm.add_constant(predX,has_constant='add')  # Adds a constant term to the predictor
    prstd_ols, iv_l_ols, iv_u_ols = wls_prediction_std(estHold[Group[0]],predX,alpha=0.05)
    return {'Name':PlayerName,'Group':Group[0]+1,'Prediction':estHold[Group[0]].predict(predX),'Upper_CI':iv_u_ols,'Lower_CI':iv_l_ols}

Here I create a function that creates a list of all the first round draft picks from a given year.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def gather_draftData(Year):

    import urllib2
    from bs4 import BeautifulSoup
    import pandas as pd
    import numpy as np

    draft_len = 30

    def convert_float(val):
        try:
            return float(val)
        except ValueError:
            return np.nan

    url = 'http://www.basketball-reference.com/draft/NBA_'+str(Year)+'.html'
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html,"lxml")

    draft_num = [soup.findAll('tbody')[0].findAll('tr')[i].findAll('td')[0].text for i in range(draft_len)]
    draft_nam = [soup.findAll('tbody')[0].findAll('tr')[i].findAll('td')[3].text for i in range(draft_len)]

    draft_df = pd.DataFrame([draft_num,draft_nam]).T
    draft_df.columns = ['Number','Name']
    draft_df.index = range(np.size(draft_df,0))
    return draft_df

Below I create predictions for each first-round draft pick from 2015. The Spurs’ first round pick, Nikola Milutinov, has yet to play, so I do not create a prediction for him.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
import matplotlib.patches as mpatches

draft_df = gather_draftData(2015)

draft_df['Name'][14] =  'Kelly Oubre Jr.' #annoying name inconsistencies

plt.subplots(figsize=(14,6));
plt.xticks(range(1,31),draft_df['Name'],rotation=90)

draft_df = draft_df.drop(17, 0) #Sam Dekker has received little playing time making his prediction highly erratic
draft_df = draft_df.drop(25, 0) #spurs' 1st round pick has not played yet

for name in draft_df['Name']:

    draft_num = draft_df[draft_df['Name']==name]['Number']

    predict_dict = player_prediction__regressionModel(name)
    yerr = (predict_dict['Upper_CI']-predict_dict['Lower_CI'])/2

    plt.errorbar(draft_num,predict_dict['Prediction'],fmt='o',label=name,
                color=plt.rcParams['axes.color_cycle'][predict_dict['Group']-1],yerr=yerr);

plt.xlim(left=0,right=31)
patch = [mpatches.Patch(color=plt.rcParams['axes.color_cycle'][i], label='Group %d'%(i+1)) for i in range(6)]
plt.legend(handles=patch,ncol=3)
plt.ylabel('Predicted WS/48')
plt.xlabel('Draft Position');

The plot above is ordered by draft pick. The error bars depict 95% confidence intervals…which are a little wider than I would like. It’s interesting to look at what clusters these players fit into. Lots of 3-pt shooters! It could be that rookies play a limited role in the offense - just shooting 3s.

As a t-wolves fan, I am relatively happy about the high prediction for Karl-Anthony Towns. His predicted ws/48 is between Marc Gasol and Elton Brand. Again, the CIs are quite wide, so the model says there’s a 95% chance he is somewhere between the best player ever (LeBron James) and a player that averages less than 0.1 ws/48.

Karl-Anthony Towns would have the highest predicted ws/48 if it were not for Kevon Looney, who the model loves. Kevon Looney has not seen much playing time though, which likely makes his prediction more erratic. Keep in mind I did not use draft position as a predictor in the model.

Sam Dekker has a pretty huge error bar, likely because of his limited playing time this year.

While I fed a ton of features into this model, it’s still just a linear regression. The simplicity of the model might prevent me from making more accurate predictions.

I’ve already started playing with some more complex models. If those work out well, I will post them here. I ended up sticking with a plain linear regression because my vast number of features is a little unwieldy in more complex models. If you’re interested (and the models produce better results) check back in the future.

For now, these models explain between 40 and 70% of the variance in career ws/48 from only a player’s rookie year. Even predicting 30% of variance is pretty remarkable, so I don’t want to trash on this part of the model. Explaining 65% of the variance is pretty awesome. The model gives us a pretty accurate idea of how these “bench players” will perform. For instance, the future does not look bright for players like Emmanuel Mudiay and Tyus Jones. Not to say these players are doomed. The model assumes that players will retain their grouping for the entire career. Emmanuel Mudiay and Tyus Jones might start performing more like distributors as their career progresses. This could result in a better career.

One nice part about this model is it tells us where the predictions are less confident. For instance, it is nice to know that we’re relatively confident when predicting bench players, but not when we’re predicting 3-point shooters.

For those curious, I output each group’s regression summary below.

1
[print(i.summary()) for i in estHold];
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  WS/48   R-squared:                       0.648
Model:                            OLS   Adj. R-squared:                  0.575
Method:                 Least Squares   F-statistic:                     8.939
Date:                Sun, 20 Mar 2016   Prob (F-statistic):           2.33e-24
Time:                        10:40:28   Log-Likelihood:                 493.16
No. Observations:                 212   AIC:                            -912.3
Df Residuals:                     175   BIC:                            -788.1
Df Model:                          36                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -0.1072      0.064     -1.682      0.094        -0.233     0.019
x1             0.0012      0.001      0.925      0.356        -0.001     0.004
x2            -0.0005      0.000     -2.355      0.020        -0.001 -7.53e-05
x3            -0.0005      0.000     -1.899      0.059        -0.001  2.03e-05
x4          3.753e-05   1.27e-05      2.959      0.004      1.25e-05  6.26e-05
x5            -0.1152      0.088     -1.315      0.190        -0.288     0.058
x6             0.0240      0.053      0.456      0.649        -0.080     0.128
x7            -0.4318      0.372     -1.159      0.248        -1.167     0.303
x8             0.0089      0.085      0.105      0.917        -0.159     0.177
x9            -0.0479      0.054     -0.893      0.373        -0.154     0.058
x10           -0.0055      0.021     -0.265      0.792        -0.046     0.035
x11           -0.0011      0.076     -0.015      0.988        -0.152     0.149
x12           -0.0301      0.053     -0.569      0.570        -0.134     0.074
x13            0.7814      0.270      2.895      0.004         0.249     1.314
x14           -0.0323      0.028     -1.159      0.248        -0.087     0.023
x15           -0.0108      0.007     -1.451      0.149        -0.025     0.004
x16           -0.0202      0.030     -0.676      0.500        -0.079     0.039
x17           -0.0461      0.039     -1.172      0.243        -0.124     0.032
x18           -0.0178      0.040     -0.443      0.659        -0.097     0.062
x19            0.0450      0.038      1.178      0.240        -0.030     0.121
x20            0.0354      0.014      2.527      0.012         0.008     0.063
x21           -0.0418      0.044     -0.947      0.345        -0.129     0.045
x22           -0.0224      0.015     -1.448      0.150        -0.053     0.008
x23           -0.0158      0.008     -2.039      0.043        -0.031    -0.001
x24            0.0058      0.001      4.261      0.000         0.003     0.009
x25            0.0577      0.027      2.112      0.036         0.004     0.112
x26           -0.1913      0.267     -0.718      0.474        -0.717     0.335
x27           -0.0050      0.093     -0.054      0.957        -0.189     0.179
x28           -0.0133      0.039     -0.344      0.731        -0.090     0.063
x29           -0.0071      0.015     -0.480      0.632        -0.036     0.022
x30           -0.0190      0.010     -1.973      0.050        -0.038  5.68e-06
x31            0.0221      0.023      0.951      0.343        -0.024     0.068
x32           -0.0083      0.003     -2.490      0.014        -0.015    -0.002
x33            0.0386      0.031      1.259      0.210        -0.022     0.099
x34            0.0153      0.008      1.819      0.071        -0.001     0.032
x35        -1.734e-05      0.001     -0.014      0.989        -0.002     0.002
x36            0.0033      0.004      0.895      0.372        -0.004     0.011
==============================================================================
Omnibus:                        2.457   Durbin-Watson:                   2.144
Prob(Omnibus):                  0.293   Jarque-Bera (JB):                2.475
Skew:                           0.007   Prob(JB):                        0.290
Kurtosis:                       3.529   Cond. No.                     1.78e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.78e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  WS/48   R-squared:                       0.443
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     4.307
Date:                Sun, 20 Mar 2016   Prob (F-statistic):           1.67e-11
Time:                        10:40:28   Log-Likelihood:                 447.99
No. Observations:                 232   AIC:                            -822.0
Df Residuals:                     195   BIC:                            -694.4
Df Model:                          36                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -0.0532      0.090     -0.594      0.553        -0.230     0.124
x1            -0.0020      0.002     -1.186      0.237        -0.005     0.001
x2            -0.0006      0.000     -1.957      0.052        -0.001  4.47e-06
x3            -0.0007      0.000     -2.559      0.011        -0.001    -0.000
x4          5.589e-05   1.39e-05      4.012      0.000      2.84e-05  8.34e-05
x5             0.0386      0.093      0.414      0.679        -0.145     0.222
x6            -0.0721      0.051     -1.407      0.161        -0.173     0.029
x7            -0.6259      0.571     -1.097      0.274        -1.751     0.499
x8            -0.0653      0.079     -0.822      0.412        -0.222     0.091
x9             0.0756      0.051      1.485      0.139        -0.025     0.176
x10           -0.0046      0.031     -0.149      0.881        -0.066     0.057
x11           -0.0365      0.066     -0.554      0.580        -0.166     0.093
x12            0.0679      0.051      1.332      0.185        -0.033     0.169
x13            0.0319      0.183      0.174      0.862        -0.329     0.393
x14            0.0106      0.040      0.262      0.793        -0.069     0.090
x15           -0.0232      0.017     -1.357      0.176        -0.057     0.011
x16           -0.1121      0.039     -2.869      0.005        -0.189    -0.035
x17           -0.0675      0.060     -1.134      0.258        -0.185     0.050
x18           -0.0314      0.059     -0.536      0.593        -0.147     0.084
x19            0.0266      0.055      0.487      0.627        -0.081     0.134
x20            0.0259      0.009      2.827      0.005         0.008     0.044
x21           -0.0155      0.050     -0.307      0.759        -0.115     0.084
x22            0.1170      0.051      2.281      0.024         0.016     0.218
x23           -0.0157      0.014     -1.102      0.272        -0.044     0.012
x24            0.0021      0.003      0.732      0.465        -0.003     0.008
x25           -0.0012      0.038     -0.032      0.974        -0.077     0.075
x26            0.8379      0.524      1.599      0.111        -0.196     1.871
x27           -0.0511      0.113     -0.454      0.651        -0.273     0.171
x28            0.0944      0.111      0.852      0.395        -0.124     0.313
x29           -0.0018      0.029     -0.061      0.951        -0.059     0.055
x30           -0.0167      0.017     -0.969      0.334        -0.051     0.017
x31            0.0377      0.044      0.854      0.394        -0.049     0.125
x32           -0.0052      0.002     -2.281      0.024        -0.010    -0.001
x33            0.0132      0.037      0.360      0.719        -0.059     0.086
x34           -0.0650      0.028     -2.356      0.019        -0.119    -0.011
x35           -0.0012      0.002     -0.668      0.505        -0.005     0.002
x36            0.0087      0.008      1.107      0.270        -0.007     0.024
==============================================================================
Omnibus:                        2.161   Durbin-Watson:                   2.000
Prob(Omnibus):                  0.339   Jarque-Bera (JB):                1.942
Skew:                           0.222   Prob(JB):                        0.379
Kurtosis:                       3.067   Cond. No.                     3.94e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.94e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  WS/48   R-squared:                       0.358
Model:                            OLS   Adj. R-squared:                  0.270
Method:                 Least Squares   F-statistic:                     4.050
Date:                Sun, 20 Mar 2016   Prob (F-statistic):           1.93e-11
Time:                        10:40:28   Log-Likelihood:                 645.12
No. Observations:                 298   AIC:                            -1216.
Df Residuals:                     261   BIC:                            -1079.
Df Model:                          36                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.0306      0.040      0.763      0.446        -0.048     0.110
x1            -0.0013      0.001     -1.278      0.202        -0.003     0.001
x2            -0.0003      0.000     -1.889      0.060        -0.001  1.39e-05
x3            -0.0002      0.000     -1.196      0.233        -0.001     0.000
x4          2.388e-05   8.83e-06      2.705      0.007       6.5e-06  4.13e-05
x5            -0.0643      0.089     -0.724      0.470        -0.239     0.111
x6             0.0131      0.046      0.286      0.775        -0.077     0.103
x7            -0.4703      0.455     -1.034      0.302        -1.366     0.426
x8             0.0194      0.089      0.219      0.827        -0.155     0.194
x9            -0.0330      0.052     -0.638      0.524        -0.135     0.069
x10           -0.0221      0.013     -1.754      0.081        -0.047     0.003
x11            0.0161      0.074      0.216      0.829        -0.130     0.162
x12           -0.0228      0.047     -0.489      0.625        -0.115     0.069
x13            0.2619      0.423      0.620      0.536        -0.570     1.094
x14           -0.0303      0.027     -1.136      0.257        -0.083     0.022
x15           -0.0023      0.003     -0.895      0.372        -0.007     0.003
x16            0.0005      0.023      0.021      0.983        -0.045     0.046
x17            0.0206      0.040      0.513      0.608        -0.059     0.100
x18            0.0507      0.040      1.271      0.205        -0.028     0.129
x19           -0.0349      0.037     -0.942      0.347        -0.108     0.038
x20            0.0210      0.017      1.252      0.212        -0.012     0.054
x21            0.0400      0.041      0.964      0.336        -0.042     0.122
x22           -0.0239      0.009     -2.530      0.012        -0.042    -0.005
x23           -0.0140      0.008     -1.683      0.094        -0.030     0.002
x24            0.0045      0.001      4.594      0.000         0.003     0.006
x25            0.0264      0.026      1.004      0.316        -0.025     0.078
x26            0.2730      0.169      1.615      0.107        -0.060     0.606
x27           -0.0208      0.187     -0.111      0.912        -0.389     0.348
x28           -0.0007      0.015     -0.051      0.959        -0.029     0.028
x29            0.0168      0.018      0.917      0.360        -0.019     0.053
x30            0.0059      0.011      0.524      0.601        -0.016     0.028
x31           -0.0196      0.028     -0.711      0.478        -0.074     0.035
x32           -0.0035      0.004     -0.899      0.370        -0.011     0.004
x33           -0.0246      0.029     -0.858      0.392        -0.081     0.032
x34            0.0145      0.005      2.903      0.004         0.005     0.024
x35           -0.0017      0.001     -1.442      0.150        -0.004     0.001
x36            0.0069      0.005      1.514      0.131        -0.002     0.016
==============================================================================
Omnibus:                        5.509   Durbin-Watson:                   1.845
Prob(Omnibus):                  0.064   Jarque-Bera (JB):                5.309
Skew:                           0.272   Prob(JB):                       0.0703
Kurtosis:                       3.362   Cond. No.                     3.70e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.7e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  WS/48   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.248
Method:                 Least Squares   F-statistic:                     5.452
Date:                Sun, 20 Mar 2016   Prob (F-statistic):           4.41e-19
Time:                        10:40:28   Log-Likelihood:                 1030.4
No. Observations:                 486   AIC:                            -1987.
Df Residuals:                     449   BIC:                            -1832.
Df Model:                          36                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.1082      0.033      3.280      0.001         0.043     0.173
x1            -0.0018      0.001     -2.317      0.021        -0.003    -0.000
x2            -0.0005      0.000     -3.541      0.000        -0.001    -0.000
x3          4.431e-05      0.000      0.359      0.720        -0.000     0.000
x4           1.71e-05   6.08e-06      2.813      0.005      5.15e-06   2.9e-05
x5             0.0257      0.044      0.580      0.562        -0.061     0.113
x6             0.0133      0.029      0.464      0.643        -0.043     0.070
x7            -0.5271      0.357     -1.476      0.141        -1.229     0.175
x8             0.0415      0.038      1.090      0.277        -0.033     0.116
x9            -0.0117      0.029     -0.409      0.682        -0.068     0.044
x10            0.0031      0.018      0.171      0.865        -0.032     0.038
x11            0.0253      0.031      0.819      0.413        -0.035     0.086
x12           -0.0196      0.028     -0.687      0.492        -0.076     0.036
x13            0.0360      0.067      0.535      0.593        -0.096     0.168
x14            0.0096      0.021      0.461      0.645        -0.031     0.050
x15            0.0101      0.009      1.165      0.245        -0.007     0.027
x16            0.0227      0.015      1.556      0.120        -0.006     0.051
x17            0.0413      0.034      1.198      0.232        -0.026     0.109
x18            0.0195      0.031      0.623      0.533        -0.042     0.081
x19           -0.0267      0.029     -0.906      0.366        -0.085     0.031
x20            0.0199      0.008      2.652      0.008         0.005     0.035
x21           -0.0442      0.033     -1.325      0.186        -0.110     0.021
x22            0.0232      0.025      0.946      0.345        -0.025     0.072
x23            0.0085      0.009      0.976      0.330        -0.009     0.026
x24            0.0025      0.001      1.782      0.075        -0.000     0.005
x25           -0.0200      0.019     -1.042      0.298        -0.058     0.018
x26            0.4937      0.331      1.491      0.137        -0.157     1.144
x27           -0.1406      0.074     -1.907      0.057        -0.286     0.004
x28           -0.0638      0.049     -1.304      0.193        -0.160     0.032
x29           -0.0252      0.015     -1.690      0.092        -0.055     0.004
x30           -0.0217      0.008     -2.668      0.008        -0.038    -0.006
x31            0.0483      0.020      2.387      0.017         0.009     0.088
x32           -0.0036      0.002     -2.159      0.031        -0.007    -0.000
x33            0.0388      0.023      1.681      0.094        -0.007     0.084
x34           -0.0105      0.011     -0.923      0.357        -0.033     0.012
x35           -0.0028      0.001     -1.966      0.050        -0.006 -1.59e-06
x36           -0.0017      0.003     -0.513      0.608        -0.008     0.005
==============================================================================
Omnibus:                        5.317   Durbin-Watson:                   2.030
Prob(Omnibus):                  0.070   Jarque-Bera (JB):                5.115
Skew:                           0.226   Prob(JB):                       0.0775
Kurtosis:                       3.221   Cond. No.                     4.51e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.51e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  WS/48   R-squared:                       0.455
Model:                            OLS   Adj. R-squared:                  0.378
Method:                 Least Squares   F-statistic:                     5.852
Date:                Sun, 20 Mar 2016   Prob (F-statistic):           4.77e-18
Time:                        10:40:28   Log-Likelihood:                 631.81
No. Observations:                 289   AIC:                            -1190.
Df Residuals:                     252   BIC:                            -1054.
Df Model:                          36                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.1755      0.096      1.827      0.069        -0.014     0.365
x1            -0.0031      0.001     -2.357      0.019        -0.006    -0.001
x2            -0.0005      0.000     -2.424      0.016        -0.001 -8.68e-05
x3            -0.0003      0.000     -2.154      0.032        -0.001  -2.9e-05
x4          2.374e-05   8.35e-06      2.842      0.005      7.29e-06  4.02e-05
x5             0.0391      0.070      0.556      0.579        -0.099     0.177
x6             0.0672      0.040      1.662      0.098        -0.012     0.147
x7             0.9503      0.458      2.075      0.039         0.048     1.852
x8            -0.0013      0.061     -0.021      0.983        -0.122     0.119
x9            -0.0270      0.041     -0.659      0.510        -0.108     0.054
x10           -0.0072      0.017     -0.426      0.671        -0.041     0.026
x11            0.0604      0.056      1.083      0.280        -0.049     0.170
x12           -0.0723      0.041     -1.782      0.076        -0.152     0.008
x13           -1.2499      0.392     -3.186      0.002        -2.022    -0.477
x14            0.0502      0.028      1.776      0.077        -0.005     0.106
x15            0.0048      0.011      0.456      0.649        -0.016     0.026
x16           -0.0637      0.042     -1.530      0.127        -0.146     0.018
x17            0.0042      0.038      0.112      0.911        -0.070     0.078
x18            0.0318      0.038      0.830      0.408        -0.044     0.107
x19           -0.0220      0.037     -0.602      0.548        -0.094     0.050
x20        -4.535e-05      0.009     -0.005      0.996        -0.018     0.018
x21           -0.0176      0.040     -0.440      0.660        -0.097     0.061
x22           -0.0244      0.021     -1.182      0.238        -0.065     0.016
x23            0.0135      0.012      1.128      0.260        -0.010     0.037
x24            0.0024      0.002      1.355      0.177        -0.001     0.006
x25           -0.0418      0.026     -1.583      0.115        -0.094     0.010
x26            0.3619      0.328      1.105      0.270        -0.283     1.007
x27            0.0090      0.186      0.049      0.961        -0.358     0.376
x28           -0.0613      0.057     -1.068      0.286        -0.174     0.052
x29            0.0124      0.016      0.779      0.436        -0.019     0.044
x30            0.0042      0.011      0.379      0.705        -0.018     0.026
x31           -0.0108      0.026     -0.412      0.681        -0.062     0.041
x32            0.0014      0.002      0.588      0.557        -0.003     0.006
x33            0.0195      0.029      0.672      0.502        -0.038     0.077
x34            0.0168      0.011      1.554      0.121        -0.004     0.038
x35           -0.0026      0.002     -1.227      0.221        -0.007     0.002
x36           -0.0072      0.004     -1.958      0.051        -0.014  4.02e-05
==============================================================================
Omnibus:                        4.277   Durbin-Watson:                   1.995
Prob(Omnibus):                  0.118   Jarque-Bera (JB):                4.056
Skew:                           0.226   Prob(JB):                        0.132
Kurtosis:                       3.364   Cond. No.                     4.24e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.24e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  WS/48   R-squared:                       0.476
Model:                            OLS   Adj. R-squared:                  0.337
Method:                 Least Squares   F-statistic:                     3.431
Date:                Sun, 20 Mar 2016   Prob (F-statistic):           1.19e-07
Time:                        10:40:28   Log-Likelihood:                 330.36
No. Observations:                 173   AIC:                            -586.7
Df Residuals:                     136   BIC:                            -470.1
Df Model:                          36                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.1822      0.262      0.696      0.488        -0.335     0.700
x1            -0.0011      0.002     -0.491      0.624        -0.005     0.003
x2             0.0001      0.000      0.310      0.757        -0.001     0.001
x3          6.743e-05      0.000      0.220      0.827        -0.001     0.001
x4          5.819e-06   1.63e-05      0.357      0.722     -2.65e-05  3.81e-05
x5             0.0618      0.122      0.507      0.613        -0.179     0.303
x6             0.0937      0.074      1.272      0.206        -0.052     0.240
x7             0.8422      0.919      0.917      0.361        -0.975     2.659
x8            -0.1109      0.111     -1.001      0.319        -0.330     0.108
x9            -0.1334      0.075     -1.767      0.079        -0.283     0.016
x10           -0.0357      0.024     -1.500      0.136        -0.083     0.011
x11           -0.1373      0.103     -1.335      0.184        -0.341     0.066
x12           -0.1002      0.075     -1.329      0.186        -0.249     0.049
x13           -0.2963      0.616     -0.481      0.631        -1.515     0.922
x14           -0.0278      0.047     -0.588      0.557        -0.121     0.066
x15           -0.0099      0.015     -0.661      0.510        -0.040     0.020
x16            0.1532      0.106      1.444      0.151        -0.057     0.363
x17           -0.1569      0.072     -2.168      0.032        -0.300    -0.014
x18           -0.1633      0.068     -2.385      0.018        -0.299    -0.028
x19            0.1550      0.066      2.356      0.020         0.025     0.285
x20           -0.0114      0.017     -0.688      0.492        -0.044     0.021
x21           -0.0130      0.076     -0.170      0.865        -0.164     0.138
x22           -0.0202      0.024     -0.857      0.393        -0.067     0.026
x23           -0.0203      0.028     -0.737      0.462        -0.075     0.034
x24           -0.0023      0.004     -0.608      0.544        -0.010     0.005
x25            0.0546      0.048      1.141      0.256        -0.040     0.149
x26           -1.0180      0.714     -1.426      0.156        -2.430     0.394
x27            0.3371      0.203      1.664      0.098        -0.064     0.738
x28            0.1286      0.140      0.916      0.361        -0.149     0.406
x29           -0.0561      0.035     -1.607      0.110        -0.125     0.013
x30           -0.0535      0.020     -2.645      0.009        -0.093    -0.013
x31            0.1169      0.051      2.305      0.023         0.017     0.217
x32            0.0039      0.004      1.030      0.305        -0.004     0.011
x33            0.0179      0.055      0.324      0.746        -0.091     0.127
x34            0.0081      0.013      0.632      0.529        -0.017     0.033
x35            0.0013      0.006      0.229      0.819        -0.010     0.013
x36           -0.0068      0.007     -1.045      0.298        -0.020     0.006
==============================================================================
Omnibus:                        2.969   Durbin-Watson:                   2.098
Prob(Omnibus):                  0.227   Jarque-Bera (JB):                2.526
Skew:                           0.236   Prob(JB):                        0.283
Kurtosis:                       3.357   Cond. No.                     6.96e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.96e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Repeated Measures ANOVA in Python (Kinda)

If you’re just finding this post, please check out Erik Marsja’s post describing the same functionality in well-maintained python software that wasn’t available when I originally wrote this post.

I love doing data analyses with pandas, numpy, scipy, etc., but I often need to run repeated measures ANOVAs, which are not implemented in any major python libraries. Python Psychologist shows how to do repeated measures ANOVAs yourself in python, but I find using a widely distributed implementation comforting…

In this post I show how to execute repeated measures ANOVAs using the rpy2 library, which allows us to move data between python and R and to execute R commands from python. I use rpy2 to load the car library and run the ANOVA.

I will show how to run a one-way repeated measures ANOVA and a two-way repeated measures ANOVA.

#first import the libraries I always use.
import numpy as np, scipy.stats, pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import pylab as pl
%matplotlib inline
pd.options.display.mpl_style = 'default'
plt.style.use('ggplot')
mpl.rcParams['font.family'] = ['Bitstream Vera Sans']

Below I use the random library to generate some fake data. I seed the random number generator with 1 so that this analysis can be replicated.

I will generate 3 conditions, which represent 3 levels of a single variable.

The data are generated from a Gaussian distribution. The second condition has a higher mean than the other two conditions.

import random

random.seed(1) #seed random number generator
cond_1 = [random.gauss(600,30) for x in range(30)] #condition 1 has a mean of 600 and standard deviation of 30
cond_2 = [random.gauss(650,30) for x in range(30)] #u=650 and sd=30
cond_3 = [random.gauss(600,30) for x in range(30)] #u=600 and sd=30

plt.bar(np.arange(1,4),[np.mean(cond_1),np.mean(cond_2),np.mean(cond_3)],align='center') #plot data
plt.xticks([1,2,3]);

Next, I load rpy2 for ipython. I am doing these analyses with ipython in a jupyter notebook (highly recommended).

%load_ext rpy2.ipython

Here’s how to run the ANOVA. Note that this is a one-way anova with 3 levels of the factor.

#pop the data into R
%Rpush cond_1 cond_2 cond_3

#create a vector of condition labels
%R Factor <- c('Cond1','Cond2','Cond3')
#put the condition labels into a dataframe
%R idata <- data.frame(Factor)

#combine data into single matrix
%R Bind <- cbind(cond_1,cond_2,cond_3)
#generate linear model
%R model <- lm(Bind~1)

#load the car library. note this library must be installed.
%R library(car)
#run anova
%R analysis <- Anova(model,idata=idata,idesign=~Factor,type="III")
#create anova summary table
%R anova_sum = summary(analysis)

#move the data from R to python
%Rpull anova_sum
print(anova_sum)
Type III Repeated Measures MANOVA Tests:

------------------------------------------

Term: (Intercept)

 Response transformation matrix:
       (Intercept)
cond_1           1
cond_2           1
cond_3           1

Sum of squares and products for the hypothesis:
            (Intercept)
(Intercept)   102473990

Sum of squares and products for error:
            (Intercept)
(Intercept)     78712.7

Multivariate Tests: (Intercept)
                 Df test stat approx F num Df den Df     Pr(>F)    
Pillai            1    0.9992 37754.33      1     29 < 2.22e-16 ***
Wilks             1    0.0008 37754.33      1     29 < 2.22e-16 ***
Hotelling-Lawley  1 1301.8736 37754.33      1     29 < 2.22e-16 ***
Roy               1 1301.8736 37754.33      1     29 < 2.22e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

------------------------------------------

Term: Factor

 Response transformation matrix:
       Factor1 Factor2
cond_1       1       0
cond_2       0       1
cond_3      -1      -1

Sum of squares and products for the hypothesis:
          Factor1   Factor2
Factor1  3679.584  19750.87
Factor2 19750.870 106016.58

Sum of squares and products for error:
         Factor1  Factor2
Factor1 40463.19 27139.59
Factor2 27139.59 51733.12

Multivariate Tests: Factor
                 Df test stat approx F num Df den Df    Pr(>F)    
Pillai            1 0.7152596 35.16759      2     28 2.303e-08 ***
Wilks             1 0.2847404 35.16759      2     28 2.303e-08 ***
Hotelling-Lawley  1 2.5119704 35.16759      2     28 2.303e-08 ***
Roy               1 2.5119704 35.16759      2     28 2.303e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Univariate Type III Repeated-Measures ANOVA Assuming Sphericity

                  SS num Df Error SS den Df         F    Pr(>F)    
(Intercept) 34157997      1    26238     29 37754.334 < 2.2e-16 ***
Factor         59964      2    43371     58    40.094 1.163e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Mauchly Tests for Sphericity

       Test statistic p-value
Factor        0.96168 0.57866


Greenhouse-Geisser and Huynh-Feldt Corrections
 for Departure from Sphericity

        GG eps Pr(>F[GG])    
Factor 0.96309  2.595e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        HF eps   Pr(>F[HF])
Factor 1.03025 1.163294e-11

The ANOVA table isn’t pretty, but it works. As you can see, the ANOVA was wildly significant.
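For comparison, here’s a minimal sketch of a pure-python route that is available in newer statsmodels (assuming statsmodels >= 0.8 and the cond_1, cond_2, cond_3 lists from above): reshape the data into long format and hand it to AnovaRM.

#a minimal sketch of a pure-python alternative (assumes statsmodels >= 0.8)
import pandas as pd
from statsmodels.stats.anova import AnovaRM

long_df = pd.DataFrame({'subject': list(range(30))*3,
                        'condition': ['cond_1']*30 + ['cond_2']*30 + ['cond_3']*30,
                        'score': cond_1 + cond_2 + cond_3})
res = AnovaRM(long_df, depvar='score', subject='subject', within=['condition']).fit()
print(res) #prints F, degrees of freedom, and p-value for the condition effect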

Next, I generate data for a two-way (2x3) repeated measures ANOVA. Condition A is the same data as above. Condition B has a different pattern (2 is lower than 1 and 3), which should produce an interaction.

random.seed(1)

cond_1a = [random.gauss(600,30) for x in range(30)] #u=600,sd=30
cond_2a = [random.gauss(650,30) for x in range(30)] #u=650,sd=30
cond_3a = [random.gauss(600,30) for x in range(30)] #u=600,sd=30

cond_1b = [random.gauss(600,30) for x in range(30)] #u=600,sd=30
cond_2b = [random.gauss(550,30) for x in range(30)] #u=550,sd=30
cond_3b = [random.gauss(650,30) for x in range(30)] #u=650,sd=30

width = 0.25
plt.bar(np.arange(1,4)-width,[np.mean(cond_1a),np.mean(cond_2a),np.mean(cond_3a)],width)
plt.bar(np.arange(1,4),[np.mean(cond_1b),np.mean(cond_2b),np.mean(cond_3b)],width,color=plt.rcParams['axes.color_cycle'][0])
plt.legend(['A','B'],loc=4)
plt.xticks([1,2,3]);

%Rpush cond_1a cond_1b cond_2a cond_2b cond_3a cond_3b

%R Factor1 <- c('A','A','A','B','B','B')
%R Factor2 <- c('Cond1','Cond2','Cond3','Cond1','Cond2','Cond3')
%R idata <- data.frame(Factor1, Factor2)

#make sure the vectors appear in the same order as they appear in the dataframe
%R Bind <- cbind(cond_1a, cond_2a, cond_3a, cond_1b, cond_2b, cond_3b)
%R model <- lm(Bind~1)

%R library(car)
%R analysis <- Anova(model, idata=idata, idesign=~Factor1*Factor2, type="III")
%R anova_sum = summary(analysis)
%Rpull anova_sum

print(anova_sum)
Type III Repeated Measures MANOVA Tests:

------------------------------------------

Term: (Intercept)

 Response transformation matrix:
        (Intercept)
cond_1a           1
cond_2a           1
cond_3a           1
cond_1b           1
cond_2b           1
cond_3b           1

Sum of squares and products for the hypothesis:
            (Intercept)
(Intercept)   401981075

Sum of squares and products for error:
            (Intercept)
(Intercept)    185650.5

Multivariate Tests: (Intercept)
                 Df test stat approx F num Df den Df     Pr(>F)    
Pillai            1    0.9995 62792.47      1     29 < 2.22e-16 ***
Wilks             1    0.0005 62792.47      1     29 < 2.22e-16 ***
Hotelling-Lawley  1 2165.2575 62792.47      1     29 < 2.22e-16 ***
Roy               1 2165.2575 62792.47      1     29 < 2.22e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

------------------------------------------

Term: Factor1

 Response transformation matrix:
        Factor11
cond_1a        1
cond_2a        1
cond_3a        1
cond_1b       -1
cond_2b       -1
cond_3b       -1

Sum of squares and products for the hypothesis:
         Factor11
Factor11 38581.51

Sum of squares and products for error:
         Factor11
Factor11 142762.3

Multivariate Tests: Factor1
                 Df test stat approx F num Df den Df    Pr(>F)   
Pillai            1 0.2127533 7.837247      1     29 0.0090091 **
Wilks             1 0.7872467 7.837247      1     29 0.0090091 **
Hotelling-Lawley  1 0.2702499 7.837247      1     29 0.0090091 **
Roy               1 0.2702499 7.837247      1     29 0.0090091 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

------------------------------------------

Term: Factor2

 Response transformation matrix:
        Factor21 Factor22
cond_1a        1        0
cond_2a        0        1
cond_3a       -1       -1
cond_1b        1        0
cond_2b        0        1
cond_3b       -1       -1

Sum of squares and products for the hypothesis:
         Factor21 Factor22
Factor21 91480.01 77568.78
Factor22 77568.78 65773.02

Sum of squares and products for error:
         Factor21 Factor22
Factor21 90374.60 56539.06
Factor22 56539.06 87589.85

Multivariate Tests: Factor2
                 Df test stat approx F num Df den Df    Pr(>F)    
Pillai            1 0.5235423 15.38351      2     28 3.107e-05 ***
Wilks             1 0.4764577 15.38351      2     28 3.107e-05 ***
Hotelling-Lawley  1 1.0988223 15.38351      2     28 3.107e-05 ***
Roy               1 1.0988223 15.38351      2     28 3.107e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

------------------------------------------

Term: Factor1:Factor2

 Response transformation matrix:
        Factor11:Factor21 Factor11:Factor22
cond_1a                 1                 0
cond_2a                 0                 1
cond_3a                -1                -1
cond_1b                -1                 0
cond_2b                 0                -1
cond_3b                 1                 1

Sum of squares and products for the hypothesis:
                  Factor11:Factor21 Factor11:Factor22
Factor11:Factor21          179585.9            384647
Factor11:Factor22          384647.0            823858

Sum of squares and products for error:
                  Factor11:Factor21 Factor11:Factor22
Factor11:Factor21          92445.33          45639.49
Factor11:Factor22          45639.49          89940.37

Multivariate Tests: Factor1:Factor2
                 Df test stat approx F num Df den Df     Pr(>F)    
Pillai            1  0.901764 128.5145      2     28 7.7941e-15 ***
Wilks             1  0.098236 128.5145      2     28 7.7941e-15 ***
Hotelling-Lawley  1  9.179605 128.5145      2     28 7.7941e-15 ***
Roy               1  9.179605 128.5145      2     28 7.7941e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Univariate Type III Repeated-Measures ANOVA Assuming Sphericity

                      SS num Df Error SS den Df          F    Pr(>F)    
(Intercept)     66996846      1    30942     29 62792.4662 < 2.2e-16 ***
Factor1             6430      1    23794     29     7.8372  0.009009 **
Factor2            26561      2    40475     58    19.0310  4.42e-07 ***
Factor1:Factor2   206266      2    45582     58   131.2293 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Mauchly Tests for Sphericity

                Test statistic p-value
Factor2                0.96023 0.56654
Factor1:Factor2        0.99975 0.99648


Greenhouse-Geisser and Huynh-Feldt Corrections
 for Departure from Sphericity

                 GG eps Pr(>F[GG])    
Factor2         0.96175  6.876e-07 ***
Factor1:Factor2 0.99975  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                  HF eps   Pr(>F[HF])
Factor2         1.028657 4.420005e-07
Factor1:Factor2 1.073774 2.965002e-22

Again, the ANOVA table isn’t too pretty.

This obviously isn’t the most exciting post in the world, but it’s a nice bit of code to have in your back pocket if you’re doing experimental analyses in python.

Grouping NBA Players

In basketball, we typically talk about 5 positions: point guard, shooting guard, small forward, power forward, and center. Based on this, one might expect NBA players to fall into 5 distinct groups: point guards perform similarly to other point guards, shooting guards perform similarly to other shooting guards, etc. Is this the case? Do NBA players fall neatly into position groups?

To answer this question, I will look at how NBA players “group” together. For example, there might be a group of players who collect lots of rebounds, shoot poorly from behind the 3 point line, and block lots of shots. I might call these players forwards. If we allow player performance to create groups, what will these groups look like?

To group players, I will use k-means clustering (https://en.wikipedia.org/wiki/K-means_clustering).

When choosing a clustering algorithm, it’s important to think about how the algorithm defines clusters. k-means minimizes the distance between data points (players, in my case) and the centers of K clusters. Because distance is measured between each point and its cluster center, k-means assumes clusters are spherical. When thinking about clusters of NBA players, do I think these clusters will be spherical? If not, then I might want to try a different clustering algorithm.

For now, I will assume generally spherical clusters and use k-means. At the end of this post, I will comment on whether this assumption seems valid.

#import libraries and tell ipython we want inline figures rather than interactive figures.
import matplotlib.pyplot as plt, pandas as pd, numpy as np, matplotlib as mpl, requests, time

%matplotlib inline
pd.options.display.mpl_style = 'default' #use pandas' default matplotlib styling
plt.style.use('ggplot') #im addicted to ggplot. so pretty.
mpl.rcParams['font.family'] = ['Bitstream Vera Sans']

We need data. Collecting the data will require a couple steps. First, I will create a matrix of all players who ever played in the NBA (via the NBA.com API).

url = 'http://stats.nba.com/stats/commonallplayers?IsOnlyCurrentSeason=0&LeagueID=00&Season=2015-16'
# the critical part of the URL is IsOnlyCurrentSeason=0. The 0 means all seasons.
response = requests.get(url) #get data
headers = response.json()['resultSets'][0]
players = response.json()['resultSets'][0]['rowSet']
players_df = pd.DataFrame(players, columns=headers['headers']) #turn data into dataframe

In the 1979-1980 season, the NBA started using the 3-point line. The 3-point line has dramatically changed basketball, so players performed differently before it existed. While this change in play was not instantaneous, it does not make sense to include players from before the 3-point line.

players_df['TO_YEAR'] = players_df['TO_YEAR'].apply(lambda x:int(x)) #change data in the TO_YEAR column from strings to numbers
players_df = players_df[players_df['TO_YEAR']>1979] #only use players after 3 pt line

I have a list of all the players after 1979, but I want data about all these players. When grouping the players, I am not interested in how much a player played. Instead, I want to know HOW a player played. To remove variability associated with playing time, I will gather data that is standardized for 36 minutes of play. For example, if a player averages 4 points and 12 minutes a game, this player averages 12 points per 36 minutes.
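To make the per-36 idea concrete, here’s a tiny sketch with made-up numbers (the NBA.com endpoint used below already returns per-36 values, so this is illustration only):

#illustration only: per-36 scaling from made-up raw averages
import pandas as pd
raw = pd.DataFrame({'PTS': [4.0, 20.0], 'MIN': [12.0, 36.0]})
raw['PTS_PER36'] = raw['PTS']/raw['MIN']*36 #4 points in 12 minutes -> 12 points per 36
print(raw)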

Below, I have written a function that will collect every player’s performance per 36 minutes. The function collects data one player at a time, so it’s VERY slow. If you want the data, it can be found on my github (https://github.com/dvatterott/nba_project).

def careerPer36(playerID):
    url = 'http://stats.nba.com/stats/playercareerstats?LeagueID=00&PerMode=Per36&PlayerID='+str(playerID)
    header_data = { #I pulled this header from the py goldsberry library
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'en-US,en;q=0.8',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'\
            ' AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 '\
            'Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9'\
            ',image/webp,*/*;q=0.8',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive'
        }
    try:
        response = requests.get(url,headers = header_data) #get the data
        #print playerID #testing purposes
    except:
        time.sleep(5) #sometime the nba api gets mad about all the requests, so i take a quick break
        response = requests.get(url,headers = header_data)
    headers = response.json()['resultSets'][1] #find headers
    players = response.json()['resultSets'][1]['rowSet'] #actual data
    players_df = pd.DataFrame(players, columns=headers['headers']) #create dataframe
    return players_df

#df = pd.DataFrame()
#player_list = players_df['PERSON_ID']
#df = df.append([careerPer36(x) for x in player_list]) #BEWARE. this takes forever.
#df.index = range(np.size(df,0))
#df.to_pickle('nba_career_stats_'+time.strftime("%Y_%b_%d", time.gmtime())+'.pkl')
df = pd.read_pickle('nba_career_stats_2016_Feb_14.pkl')
df.columns
Index([u'PLAYER_ID', u'LEAGUE_ID',   u'TEAM_ID',        u'GP',        u'GS',
             u'MIN',       u'FGM',       u'FGA',    u'FG_PCT',      u'FG3M',
            u'FG3A',   u'FG3_PCT',       u'FTM',       u'FTA',    u'FT_PCT',
            u'OREB',      u'DREB',       u'REB',       u'AST',       u'STL',
             u'BLK',       u'TOV',        u'PF',       u'PTS'],
      dtype='object')

Great! Now we have data that is scaled to 36 minutes of play (per-36 data) for every player between 1979 and 2016. Above, I printed out the columns. I don’t want all this data. For instance, I do not care about how many minutes a player played. Also, some of the data is redundant. For instance, if I know a player’s field goal attempts (FGA) and field goal percentage (FG_PCT), I can calculate the number of made field goals (FGM). I removed the data columns that seem redundant because I do not want redundant data exerting too much influence on the grouping process.
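The redundancy really is just arithmetic; a quick sketch with made-up numbers:

#made-up numbers: FGM is recoverable from FGA and FG_PCT, so keeping all three is redundant
fga, fg_pct = 10.0, 0.45
fgm = fga*fg_pct #4.5 made field goals per 36 minutes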

Below, I create new data columns for 2 point field goal attempts and 2 point field goal percentage. I also remove all players who played less than 50 games. I do this because these players have not had the opportunity to establish consistent performance.

df = df[df['GP']>50] #only players with more than 50 games.
df = df.fillna(0) #some players have "None" in some cells. Turn these into 0s
df['FG2M'] = df['FGM']-df['FG3M'] #find number of 2pt field goals
df['FG2A'] = df['FGA']-df['FG3A'] #2 point field goal attempts
df['FG2_PCT'] = df['FG2M']/df['FG2A'] #2 point field goal percentage
saveIDs = df['PLAYER_ID'] #save player IDs for later
df = df.drop(['PLAYER_ID','LEAGUE_ID','TEAM_ID','GP','GS','MIN','FGM','FGA','FG_PCT','FG3M','FTM','REB','PTS','FG2M'],1) #drop unnecessary columns

It’s always important to visualize the data, so let’s get an idea of what we’re working with!

The plot below is called a scatter matrix. This type of plot will appear again, so let’s go through it carefully. Each subplot has the feature (stat) labeled on its row, which serves as that subplot’s y-axis; the column feature serves as the x-axis. For example, the subplot in the second column of the first row plots 3-point field goal attempts by 3-point field goal percentage. As you can see, players that have higher 3-point percentages tend to take more 3-pointers… makes sense.

On the diagonals, I plot the kernel density estimate of the sample histogram. More players fall into areas where the line is higher on the y-axis. For instance, no players shoot better than ~45% from behind the 3-point line.

One interesting part about scatter matrices is that the plots below the diagonal are reflections of the plots above the diagonal. For example, the data in the second column of the first row and the first column of the second row are the same. The only difference is that the axes have switched.

axs = pd.tools.plotting.scatter_matrix(df, alpha=0.2, figsize=(12, 12), diagonal='kde'); #the diagonal will show kernel density
[ax.set_yticks([]) for ax in axs[:,0]] #turn off the ticks that take up way too much space in such a crammed figure
[ax.set_xticks([]) for ax in axs[-1,:]];

There are a couple things to note in the graph above. First, there’s a TON of information there. Second, it looks like there are some strong correlations. For example, look at the subplots depicting offensive rebounds by defensive rebounds.

While I tried to throw out redundant data, I clearly did not throw out all redundant data. For example, players that are good 3-point shooters are probably also good free throw shooters. These players are simply good shooters, and being a good shooter contributes to multiple data columns above.

When I group the data, I do not want an ability such as shooting to contribute too much. I want to group players equally according to all their abilities. Below I use a PCA to separate the variance associated with the different “components” (e.g., shooting ability) of basketball performance.

For an explanation of PCA I recommend this link - https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/.

from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

X = df.as_matrix() #take data out of dataframe
X = scale(X) #standardize the data before giving it to the PCA.
#I standardize the data because some features such as PF or steals have lower magnitudes than other features such as FG2A
#I want both to contribute equally to the PCA, so I make sure they're on the same scale.

pca = PCA() #create PCA object
pca.fit(X) #pull out principal components
var_expl = pca.explained_variance_ratio_ #find amount of variance explained by each component
tot_var_expl = np.array([sum(var_expl[0:i+1]) for i,x in enumerate(var_expl)]) #create vector with cumulative variance

plt.figure(figsize=(12,4)) #create cumulative proportion of variance plot
plt.subplot(1,2,1)
plt.plot(range(1,len(tot_var_expl)+1), tot_var_expl*100,'o-')
plt.axis([0, len(tot_var_expl)+1, 0, 100])
plt.xlabel('Number of PCA Components Included')
plt.ylabel('Percentage of variance explained (%)')

plt.subplot(1,2,2) #create scree plot
plt.plot(range(1,len(var_expl)+1), var_expl*100,'o-')
plt.axis([0, len(var_expl)+1, 0, 100])
plt.xlabel('PCA Component');

On the left, I plot the cumulative amount of variance explained after including each additional PCA component. Using all the components explains all the variability, but notice how little the last few components contribute. It doesn’t make sense to include a component that only explains 1% of the variability… but how many components should we include!?

I chose to include the first 5 components because no component after the 5th explained more than 5% of the variance. This part of the analysis is admittedly arbitrary, but 5% is a relatively conservative cut-off.
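The 5% cut-off can also be written as code; here’s a small sketch using the var_expl array computed above (assuming, as the scree plot suggests, that the first five components each clear the cut-off):

#count the components that individually explain more than 5% of the variance
import numpy as np
n_keep = int(np.sum(var_expl > 0.05)) #should be 5 if the first five components each clear 5%
print('keeping %d components' % n_keep)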

Below is the fun part. We get to look at which features contribute to the different principal components.

  • Assists and 3-point shooting contribute to the first component. I will call this the Outside Skills component.
  • Free throw attempts, assists, turnovers and 2-point field goals contribute to the second component. I will call this the Rim Scoring component.
  • Free throw percentage and 2-point field goal percentage contribute to the third component. I will call this the Pure Points component.
  • 2-point field goal percentage and steals contribute to the fourth component. I will call this the Defensive Big Man component.
  • 3-point shooting and free throws contribute to the fifth component. I will call this the Dead Eye component.

One thing to keep in mind here is that each component explains less variance than the last. So while 3 point shooting contributes to both the 1st and 5th component, more 3 point shooting variability is probably explained by the 1st component.

It would be great if we had a PCA component that was only shooting and another that was only rebounding since we typically conceive these to be different skills. Yet, there are multiple aspects of each skill. For example, a 3-point shooter not only has to be a dead-eye shooter, but also has to find ways to get open. Additionally, being good at “getting open” might be something akin to basketball IQ which would also contribute to assists and steals!

factor_names = ['Outside Skills','Rim Scoring','Pure Points','Defensive Big Man','Dead Eye'] #my component names
loadings_df = pd.DataFrame(pca.components_, columns=df.columns)
#loadings_df[0:5] #all the factor loadings appear below.
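To see which stats drive each component without scanning the full loadings table, one option is to grab the few stats with the largest absolute loadings. Here’s a small sketch using the loadings_df built above (it assumes a pandas version with Series.nlargest):

#print the three stats with the largest absolute loading on each of the 5 components
for i, name in enumerate(factor_names):
    top = loadings_df.iloc[i].abs().nlargest(3) #largest absolute loadings for component i
    print('%s: %s' % (name, ', '.join(top.index)))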

Cool, we have our 5 PCA components. Now let’s transform the data into the 5-component PCA space (from our 13-feature space, e.g., FG3A, FG3_PCT, etc.). To do this, we give each player a score on each of the 5 PCA components.

Next, I want to see how players cluster together based on their scores on these components. First, let’s investigate how using more or less clusters (i.e., groups) explains different amounts of variance.

from scipy.spatial.distance import cdist, pdist, euclidean
from sklearn.cluster import KMeans
from sklearn import metrics


#http://stackoverflow.com/questions/6645895/calculating-the-percentage-of-variance-measure-for-k-means
#The above link was helpful when writing this code.

reduced_data = PCA(n_components=5, whiten=True).fit_transform(X) #transform data into the 5 PCA components space
#kmeans assumes clusters have equal variance, and whitening helps keep this assumption.

k_range = range(2,31) #look at the variance explained by 2 through 30 clusters
k_means_var = [KMeans(n_clusters=k).fit(reduced_data) for k in k_range] #fit kmeans with 2 clusters through 30 clusters

#get labels and calculate silhouette score
labels = [i.labels_ for i in k_means_var]
sil_score = [metrics.silhouette_score(reduced_data,i,metric='euclidean') for i in labels]

centroids = [i.cluster_centers_ for i in k_means_var] #get the center of each cluster
k_euclid = [cdist(reduced_data,cent,'euclidean') for cent in centroids] #calculate distance between each item and each cluster center
dist = [np.min(ke,axis=1) for ke in k_euclid] #get the distance between each item and its cluster

wcss = [sum(d**2) for d in dist] #within cluster sum of squares
tss = sum(pdist(reduced_data)**2/reduced_data.shape[0]) #total sum of squares
bss = tss-wcss #between cluster sum of squares

plt.clf()
plt.figure(figsize=(12,4)) #create cumulative proportion of variance plot
plt.subplot(1,2,1)
plt.plot(k_range, bss/tss*100,'o-')
plt.axis([0, np.max(k_range), 0, 100])
plt.xlabel('Number of Clusters')
plt.ylabel('Percentage of variance explained (%)');

plt.subplot(1,2,2) #create silhouette score plot
plt.plot(k_range, np.transpose(sil_score)*100,'o-')
plt.axis([0, np.max(k_range), 0, 40])
plt.xlabel('Number of Clusters');
plt.ylabel('Average Silhouette Score*100');

As you can see in the left hand plot, adding more clusters explains more of the variance, but there are diminishing returns. Each additional cluster explains a little less variance than the last (much like each PCA component explained less variance than the previous component).

The particularly interesting point here is the point where the second derivative is greatest, when the amount of change changes the most (the elbow). The elbow occurs at the 6th cluster.
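Eyeballing the elbow works fine here, but for the curious, here’s a rough sketch that picks it automatically from the bss/tss curve computed above by finding where the gain in explained variance drops off the most:

#rough automatic elbow pick: most negative second difference of the explained-variance curve
explained = bss/tss
elbow_k = k_range[int(np.argmin(np.diff(explained, 2))) + 1] #+1 re-centers the second difference
print('elbow at %d clusters' % elbow_k) #a sanity check rather than a rule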

Perhaps not coincidentally, 6 clusters also has the highest silhouette score (right hand plot). The silhouette score compares the average distance between a player and the other players in that player’s cluster (a) with the average distance between the player and the players in the next nearest cluster (b), via (b - a) / max(a, b). Silhouette scores range between -1 and 1 (where -1 means the player is likely in the wrong cluster, 0 means the clusters completely overlap, and 1 means the clusters are extremely well separated).

Six clusters has the highest silhouette score at 0.19. 0.19 is not great, and suggests a different clustering algorithm might be better. More on this later.

Because 6 clusters is the elbow and has the highest silhouette score, I will use 6 clusters in my grouping analysis. Okay, now that I decided on 6 clusters lets see what players fall into what clusters!

final_fit = KMeans(n_clusters=6).fit(reduced_data) #fit 6 clusters
df['kmeans_label'] = final_fit.labels_ #label each data point with its clusters
df['PLAYER_ID'] = saveIDs #of course we want to know what players are in what cluster
player_names = [pd.DataFrame(players_df[players_df['PERSON_ID']==x]['DISPLAY_LAST_COMMA_FIRST']).to_string(header=False,index=False) for x in df['PLAYER_ID']]
# because playerID #s mean nothing to me, lets get the names too
df['Name'] = player_names

#lets also create a dataframe with data about where the clusters occur in the 5 component PCA space.
cluster_locs = pd.DataFrame(final_fit.cluster_centers_,columns=['component %s'% str(s) for s in range(np.size(final_fit.cluster_centers_,1))])
cluster_locs.columns = factor_names

Awesome. Now let’s see how all the clusters look. These clusters were created in 5-dimensional space, which is not easy to visualize. Below I plot another scatter matrix. The scatter matrix lets us visualize the clusters in different 2D combinations of the 5D space.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
from scipy.stats import gaussian_kde
plt.clf()

centroids = final_fit.cluster_centers_ #find centroids so we can plot them
colors = ['r','g','y','b','c','m'] #cluster colors
Clusters = ['Cluster 0','Cluster 1','Cluster 2','Cluster 3','Cluster 4','Cluster 5'] #cluster...names

numdata, numvars = reduced_data.shape #players by PCA components
fig, axes = plt.subplots(nrows=numvars, ncols=numvars, figsize=(10,10)) #create a scatter matrix with 5**2 cells
fig.subplots_adjust(hspace=0.05, wspace=0.05)

recs=[]
for col in colors: #make some patches for the legend
    recs.append(mpl.patches.Rectangle((0,0),1,1,fc=col))

fig.legend(recs,Clusters,8,ncol=6) #create legend with patches above

for i,ax in enumerate(axes.flat):
    # Hide all ticks and labels
    plt.setp(ax.get_yticklabels(), visible=False) #tick labels are too much with this many subplots
    plt.setp(ax.get_xticklabels(), visible=False)
    ax.grid(False) #again, too much
    if i%5==0:ax.set_ylabel(factor_names[i//5]) #label outer y axes (integer division keeps the index an int)
    if i>19:ax.set_xlabel(factor_names[i-20]) #label outer x axes

for i, j in zip(*np.triu_indices_from(axes, k=1)):
    for x, y in [(i,j), (j,i)]:
        #plot individual data points and cluster centers
        axes[y,x].plot(reduced_data[:, x], reduced_data[:, y], 'k.', markersize=2)
        axes[y,x].scatter(centroids[:,x], centroids[:,y],marker='x',s=169,linewidth=3,color=colors, zorder=10)

#create kernel density estimation for each PCA factor on the diagonals
for i, label in enumerate(factor_names):
    density = gaussian_kde(reduced_data[:,i])
    density.covariance_factor = lambda : .25
    density._compute_covariance()
    x = np.linspace(min(reduced_data[:,i]),max(reduced_data[:,i])) #span the range of this PCA component
    axes[i,i].plot(x,density(x))

In the plot above, I mark the center of each cluster with an X. For example, Cluster 0 and Cluster 5 are both high in outside skills. Cluster 5 is also high in rim scoring, but low in pure points.

Below I look at the players in each cluster. The first thing I do is identify the player closest to the cluster’s center. I call this player the prototype. It is the player that most exemplifies a cluster.

I then show a picture of this player because… well, I wanted to see who these players were. I print out this player’s stats and the cluster’s centroid location. Finally, I print out the first ten players in this cluster. These are the first ten players alphabetically, not the ten players closest to the cluster center.

1
2
3
4
5
6
7
8
9
10
11
from IPython.display import display
from IPython.display import Image

name = player_names[np.argmin([euclidean(x,final_fit.cluster_centers_[0]) for x in reduced_data])] #find cluster prototype
PlayerID = str(int(df[df['Name']==name]['PLAYER_ID'])) #get players ID number
#player = Image(url = "http://stats.nba.com/media/players/230x185/"+PlayerID+".png")
player = Image(url = 'http://4.bp.blogspot.com/_RaOrchOImw8/S3mNk3exLeI/AAAAAAAAdZk/Hs-81mnXO_E/s400/Lloyd+Daniels.jpg',width=100)
display(player)
#display(df[df['Name']==name]) #prototype's stats
display(cluster_locs[0:1]) #cluster centroid location
df[df['kmeans_label']==0]['Name'][:10] #first ten players in the cluster (alphabetically)

Outside Skills Rim Scoring Pure Points Defensive Big Man Dead Eye
0.830457 -0.930833 0.28203 -0.054093 0.43606
16       Afflalo, Arron
20         Ainge, Danny
40           Allen, Ray
46        Alston, Rafer
50     Aminu, Al-Farouq
53      Andersen, David
54       Anderson, Alan
56      Anderson, Derek
60      Anderson, James
63       Anderson, Kyle
Name: Name, dtype: object

First, let me mention that cluster number is a purely categorical variable, not an ordinal one. If you run this analysis, you will likely create clusters with similar players, but in a different order. For example, your cluster 1 might be my cluster 0.
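
If you want your own runs to at least be repeatable, one option is to fix KMeans’ random seed. This is just a convenience sketch (with an arbitrary seed); the cluster numbers will still be arbitrary and still won’t match mine.

repeatable_fit = KMeans(n_clusters=6, random_state=42).fit(reduced_data) #same labels every run, though the numbering stays arbitrary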

Cluster 0 has the most players (25%; about 490 of the 1965 in this cluster analysis) and is red in the scatter matrix above.

Cluster 0 players are second highest in outside shooting (in the table above you can see their average score on the outside skills component is 0.83). These players are lowest in rim scoring (-0.93), so they do not draw many fouls - they are basically the snipers from the outside.

The prototype is Lloyd Daniels, who takes a fair number of 3s. I wouldn’t call 31% a dominant 3-point percentage, but it’s certainly not bad. Notably, Lloyd Daniels doesn’t seem to do much but shoot threes, as 55% of his shots come from the great beyond.

Cluster 0 notable players include Andrea Bargnani, JJ Barea, Danilo Gallinari, and Brandon Jennings. Some forwards. Some guards. Mostly good shooters.

On to Cluster 1… I probably should have made a function from this code, but I enjoyed picking the players’ pictures too much.
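
For what it’s worth, here is a rough sketch of what that helper might have looked like. It leans on the variables already defined above (player_names, reduced_data, final_fit, cluster_locs, df) plus the euclidean, display, and Image imports, and it takes the hand-picked photo URL as an argument. Treat it as a sketch rather than the code I actually ran.

def show_cluster(cluster_num, img_url=None):
    #the prototype is the player closest to this cluster's centroid
    center = final_fit.cluster_centers_[cluster_num]
    name = player_names[np.argmin([euclidean(x, center) for x in reduced_data])]
    print('Prototype: ' + name)
    if img_url: display(Image(url=img_url, width=100)) #hand-picked photo of the prototype
    display(cluster_locs[cluster_num:cluster_num+1]) #cluster centroid location
    return df[df['kmeans_label']==cluster_num]['Name'][:10] #first ten players in the cluster (alphabetically)

#usage: show_cluster(1, img_url='...')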

1
2
3
4
5
6
7
8
name = player_names[np.argmin([euclidean(x,final_fit.cluster_centers_[1]) for x in reduced_data])]
PlayerID = str(int(df[df['Name']==name]['PLAYER_ID']))
#player = Image(url = "http://stats.nba.com/media/players/230x185/"+PlayerID+".png")
player = Image(url = 'https://allshookdown.files.wordpress.com/2009/06/laettner_200_921121.jpg?w=200&h=300',width=100)
display(player)
#display(df[df['Name']==name])
display(cluster_locs[1:2])
df[df['kmeans_label']==1]['Name'][:10]

Outside Skills Rim Scoring Pure Points Defensive Big Man Dead Eye
-0.340177 1.008111 1.051622 -0.150204 0.599516
1         Abdul-Jabbar, Kareem
4         Abdur-Rahim, Shareef
9                 Adams, Alvan
18               Aguirre, Mark
75      Antetokounmpo, Giannis
77            Anthony, Carmelo
85             Arenas, Gilbert
121                 Baker, Vin
133           Barkley, Charles
148            Bates, Billyray
Name: Name, dtype: object

Cluster 1 is green in the scatter matrix and includes about 14% of players.

Cluster 1 is highest on the rim scoring, pure points, and Dead Eye components. These players get the ball in the hoop.

Christian Laettner is the prototype. He’s a solid scoring forward.

Gilbert Arenas stands out among the first ten names. I was tempted to think of this cluster as big men, but it really seems to be players who shoot, score, and draw fouls.

Notable Cluster 1 players include James Harden, Kevin Garnett, Kevin Durant, Tim Duncan, Kobe Bryant, LeBron James, Kevin Martin, Shaq, Anthony Randolph??, Kevin Love, Derrick Rose, and Michael Jordan.

1
2
3
4
5
6
7
8
name = player_names[np.argmin([euclidean(x,final_fit.cluster_centers_[2]) for x in reduced_data])]
PlayerID = str(int(df[df['Name']==name]['PLAYER_ID']))
#player = Image(url = "http://stats.nba.com/media/players/230x185/"+PlayerID+".png")
player = Image(url = 'http://imageocd.com/images/nba10/doug-west-wallpaper-height-weight-position-college-high-school/doug-west.jpg',width=100)
display(player)
#display(df[df['Name']==name])
display(cluster_locs[2:3])
df[df['kmeans_label']==2]['Name'][:10]

Outside Skills Rim Scoring Pure Points Defensive Big Man Dead Eye
0.013618 0.101054 0.445377 -0.347974 -1.257634
2      Abdul-Rauf, Mahmoud
3       Abdul-Wahad, Tariq
5           Abernethy, Tom
10           Adams, Hassan
14         Addison, Rafael
24            Alarie, Mark
27      Aldridge, LaMarcus
31     Alexander, Courtney
35           Alford, Steve
37            Allen, Lavoy
Name: Name, dtype: object

Cluster 2 is yellow in the scatter matrix and includes about 17% of players.

Lots of big men who are not outside shooters and don’t draw many fouls. These players are strong 2 point shooters and free throw shooters. I think of these players as mid-range shooters. Many of the more recent Cluster 2 players are forwards since mid-range guards do not have much of a place in the current NBA.

Cluster 2’s prototype is Doug West. Doug West shoots well from the free throw line and on 2-point attempts, but not the 3-point line. He does not draw many fouls or collect many rebounds.

Notable Cluster 2 players include LaMarcus Aldridge, Tayshaun Prince, Thaddeus Young, and Shaun Livingston.

1
2
3
4
5
6
7
8
name = player_names[np.argmin([euclidean(x,final_fit.cluster_centers_[3]) for x in reduced_data])]
PlayerID = str(int(df[df['Name']==name]['PLAYER_ID']))
#player = Image(url = "http://stats.nba.com/media/players/230x185/"+PlayerID+".png")
player = Image(url = 'http://4.bp.blogspot.com/_f5MWZq1BJXI/TCDRy3v6m9I/AAAAAAAAACc/Ux8M7uiadoc/s400/cato.jpg',width=100)
display(player)
#display(df[df['Name']==name])
display(cluster_locs[3:4])
df[df['kmeans_label']==3]['Name'][:10]

Outside Skills Rim Scoring Pure Points Defensive Big Man Dead Eye
-1.28655 -0.467105 -0.133546 0.905368 0.000679
7            Acres, Mark
8            Acy, Quincy
13         Adams, Steven
15          Adrien, Jeff
21        Ajinca, Alexis
26         Aldrich, Cole
34     Alexander, Victor
45       Alston, Derrick
51         Amundson, Lou
52       Andersen, Chris
Name: Name, dtype: object

Cluster 3 is blue in the scatter matrix and includes about 16% of players.

Cluster 3 players do not have outside skills such as assists and 3-point shooting (they’re last in outside skills). They do not draw many fouls or shoot well from the free throw line. These players do not shoot often, but have a decent shooting percentage. This is likely because they only shoot when wide open next to the hoop.

Cluster 3 players are highest on the defensive big man component. They block lots of shots and collect lots of rebounds.

The Cluster 3 prototype is Kelvin Cato. Cato is not an outside shooter, and he only averages 7.5 shots per 36 minutes, but he makes those shots at a decent clip. Cato averages about 10 rebounds per 36.

Notable Cluster 3 players include Andrew Bogut, Tyson Chandler, Andre Drummond, Kawhi Leonard??, Dikembe Mutombo, and Hassan Whiteside.

1
2
3
4
5
6
7
8
name = player_names[np.argmin([euclidean(x,final_fit.cluster_centers_[4]) for x in reduced_data])]
PlayerID = str(int(df[df['Name']==name]['PLAYER_ID']))
#player = Image(url = "http://stats.nba.com/media/players/230x185/"+PlayerID+".png")
player = Image(url = 'http://www.thenolookpass.com/wp-content/uploads/2012/01/IMG-724x1024.jpg', width=100) #a photo just for fun
display(player)
#display(df[df['Name']==name])
display(cluster_locs[4:5])
df[df['kmeans_label']==4]['Name'][:10]

Outside Skills Rim Scoring Pure Points Defensive Big Man Dead Eye
-0.668445 0.035927 -0.917479 -1.243347 0.244897
0         Abdelnaby, Alaa
17          Ager, Maurice
28       Aleksinas, Chuck
33         Alexander, Joe
36          Allen, Jerome
48          Amaechi, John
49          Amaya, Ashraf
74          Anstey, Chris
82         Araujo, Rafael
89     Armstrong, Brandon
Name: Name, dtype: object

Cluster 4 is cyan in the scatter matrix above and includes the fewest players (about 13%).

Cluster 4 players are not high on outside skills. They are average on rim scoring. They do not score many points, and they don’t fill up the defensive side of the stat sheet. These players don’t seem like all-stars.

Looking at Doug Edwards’ stats - certainly not a 3-point shooter. I guess a good description of Cluster 4 players might be… NBA-caliber bench warmers.

Cluster 4’s notable players include Yi Jianlian and Anthony Bennett… yeesh.

1
2
3
4
5
6
7
8
9
name = player_names[np.argmin([euclidean(x,final_fit.cluster_centers_[5]) for x in reduced_data])]
#PlayerID = str(int(df[df['Name']==name]['PLAYER_ID']))
PlayerID = str(76993) #Two Gerald Hendersons!
#player = Image(url = "http://stats.nba.com/media/players/230x185/"+PlayerID+".png", width=100)
player = Image(url = 'http://cdn.fansided.com/wp-content/blogs.dir/18/files/2014/04/b__gerald_henderson_82.jpg', width=100)
display(player)
#display(df[df['PLAYER_ID']==76993])
display(cluster_locs[5:])
df[df['kmeans_label']==5]['Name'][:10]

Outside Skills Rim Scoring Pure Points Defensive Big Man Dead Eye
0.890984 0.846109 -0.926444 0.735306 -0.092395
12          Adams, Michael
30         Alexander, Cory
41             Allen, Tony
62         Anderson, Kenny
65      Anderson, Mitchell
78           Anthony, Greg
90      Armstrong, Darrell
113           Bagley, John
126          Banks, Marcus
137         Barrett, Andre
Name: Name, dtype: object

Cluster 5 is magenta in the scatter matrix and includes 16% of players.

Cluster 5 players are highest in outside skills and second highest in rim scoring, yet these players are dead last in pure points. It seems they score around the rim but do not draw many fouls. They are also second highest on the defensive big man component.

Gerald Henderson Sr. is the prototype. Henderson is a good 3-point and free-throw shooter but does not draw many fouls. He has lots of assists and steals.

Of interest mostly because it generates an error in my code, Gerald Henderson Jr. is in Cluster 2 - the mid-range shooters.

Notable Cluster 5 players include Muggsy Bogues, MCW, Jeff Hornacek, Magic Johnson, Jason Kidd, Steve Nash, Rajon Rondo, and John Stockton. Lots of guards.

In the cell below, I plot the percentage of players in each cluster.

1
2
3
4
plt.clf()
plt.hist(df['kmeans_label'], normed=True,bins=[0,1,2,3,4,5,6],rwidth = 0.5) #width-1 bins, so the normed bar heights are proportions
plt.xticks([0.5,1.5,2.5,3.5,4.5,5.5],['Group 0','Group 1','Group 2','Group 3','Group 4','Group 5'])
plt.ylabel('Proportion of Players in Each Cluster');
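
As a quick sanity check on the percentages quoted above, pandas can compute them directly from the labels; a one-line sketch using the same dataframe:

print(df['kmeans_label'].value_counts(normalize=True).sort_index()*100) #percent of players in each cluster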

I began this post by asking whether player position is the most natural way to group NBA players. The clustering analysis here suggests not.

Here’s my take on the clusters: Cluster 0 is pure shooters, Cluster 1 is talented scorers, Cluster 2 is mid-range shooters, Cluster 3 is defensive big men, Cluster 4 is bench warmers, and Cluster 5 is distributors. We might call the “positions” shooters, scorers, rim protectors, and distributors.

It’s possible that our notion of position comes more from defensive performance than offensive performance. On defense, a player must have a particular size and agility to guard a particular opposing player. Because of this, a team will want a range of sizes and agility - strong men to defend the rim and quick men to defend agile ball handlers. Box scores are notoriously bad at describing defensive performance, which could account for the lack of “positions” in my clusters.

I did not include player height and weight in this analysis. I imagine height and weight might have produced clusters that resemble the traditional positions. I chose not to include height and weight because these are player attributes, not player performance.

After looking through all the groups, one thing that stands out to me is the lack of specialization. For example, we did not find a single cluster of incredible 3-point shooters. Cluster 1 includes many great shooters, but it’s not populated exclusively by great shooters. It would be interesting to see whether adding more clusters to the analysis could find more specific groups, such as big men who can shoot from the outside (e.g., Dirk) or high-volume scorers (e.g., Kobe).

I tried to list some of the aberrant cluster choices among the notable players to give you an idea of the amount of error in the clustering. These aberrant choices are not errors; they are simply an artifact of how k-means defines clusters. Using a different clustering algorithm would produce different clusters. On that note, the silhouette score of this clustering model is not great, yet the algorithm definitely found similar players, so it’s not worthless. Nonetheless, clusters of NBA players might not be spherical, which k-means assumes and which would prevent a high silhouette score. Trying a different algorithm without the spherical-clusters assumption would definitely be worthwhile.

Throughout this entire analysis, I was tempted to think about group membership as a predictor of a player’s future performance. For instance, when I saw Karl-Anthony Towns in the same cluster as Kareem Abdul-Jabbar, I couldn’t help but think this meant good things for Towns. Right now, that doesn’t seem justified. No group included fewer than 10% of players, so there was not much opportunity for a uniformly “star” group to form. Each group contained some good and some bad players. Could more clusters change this? I plan on examining whether more clusters can improve the algorithm’s ability to find groups of exclusively quality players. If it works, I’ll post it here.

Creating NBA Shot Charts

Here I create shot charts depicting both shooting percentage and the number of shots taken at different court locations, similar to those produced on Austin Clemens’ website (http://www.austinclemens.com/shotcharts/).

To create the shooting charts, I looked to a post by Savvas Tjortjoglou (http://savvastjortjoglou.com/nba-shot-sharts.html). Savvas’ post is great, but his plots only depict the number of shots taken at different locations.

I’m interested in both the number of shots AND the shooting percentage at different locations. This requires a little bit more work. Here’s how I did it.

1
2
3
#import some libraries and tell ipython we want inline figures rather than interactive figures. 
%matplotlib inline
import matplotlib.pyplot as plt, pandas as pd, numpy as np, matplotlib as mpl

First, we have to acquire shooting data about each player. I retrieved the data from NBA.com’s API using code from Savvas Tjortjoglou’s post.

I won’t show you the output of this function. If you’re interested in the details, I recommend Savvas Tjortjoglou’s post.

1
2
3
4
5
6
7
8
9
10
11
12
13
def aqcuire_shootingData(PlayerID,Season):
    import requests
    shot_chart_url = 'http://stats.nba.com/stats/shotchartdetail?CFID=33&CFPARAMS='+Season+'&ContextFilter='\
                    '&ContextMeasure=FGA&DateFrom=&DateTo=&GameID=&GameSegment=&LastNGames=0&LeagueID='\
                    '00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PaceAdjust='\
                    'N&PerMode=PerGame&Period=0&PlayerID='+PlayerID+'&PlusMinus=N&Position=&Rank='\
                    'N&RookieYear=&Season='+Season+'&SeasonSegment=&SeasonType=Regular+Season&TeamID='\
                    '0&VsConference=&VsDivision=&mode=Advanced&showDetails=0&showShots=1&showZones=0'
    response = requests.get(shot_chart_url)
    headers = response.json()['resultSets'][0]['headers']
    shots = response.json()['resultSets'][0]['rowSet']
    shot_df = pd.DataFrame(shots, columns=headers)
    return shot_df

Next, we need to draw a basketball court on which we can draw the shot chart. This basketball court has to use the same coordinate system as NBA.com’s API. For instance, 3-pt shots have to be X units from the hoop and layups have to be Y units from the hoop. Again, I recycle code from Savvas Tjortjoglou (phew! figuring out NBA.com’s coordinate system would have taken me a while).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
def draw_court(ax=None, color='black', lw=2, outer_lines=False):
    from matplotlib.patches import Circle, Rectangle, Arc
    if ax is None:
        ax = plt.gca()
    hoop = Circle((0, 0), radius=7.5, linewidth=lw, color=color, fill=False)
    backboard = Rectangle((-30, -7.5), 60, -1, linewidth=lw, color=color)
    outer_box = Rectangle((-80, -47.5), 160, 190, linewidth=lw, color=color,
                          fill=False)
    inner_box = Rectangle((-60, -47.5), 120, 190, linewidth=lw, color=color,
                          fill=False)
    top_free_throw = Arc((0, 142.5), 120, 120, theta1=0, theta2=180,
                         linewidth=lw, color=color, fill=False)
    bottom_free_throw = Arc((0, 142.5), 120, 120, theta1=180, theta2=0,
                            linewidth=lw, color=color, linestyle='dashed')
    restricted = Arc((0, 0), 80, 80, theta1=0, theta2=180, linewidth=lw,
                     color=color)
    corner_three_a = Rectangle((-220, -47.5), 0, 140, linewidth=lw,
                               color=color)
    corner_three_b = Rectangle((220, -47.5), 0, 140, linewidth=lw, color=color)
    three_arc = Arc((0, 0), 475, 475, theta1=22, theta2=158, linewidth=lw,
                    color=color)
    center_outer_arc = Arc((0, 422.5), 120, 120, theta1=180, theta2=0,
                           linewidth=lw, color=color)
    center_inner_arc = Arc((0, 422.5), 40, 40, theta1=180, theta2=0,
                           linewidth=lw, color=color)
    court_elements = [hoop, backboard, outer_box, inner_box, top_free_throw,
                      bottom_free_throw, restricted, corner_three_a,
                      corner_three_b, three_arc, center_outer_arc,
                      center_inner_arc]
    if outer_lines:
        outer_lines = Rectangle((-250, -47.5), 500, 470, linewidth=lw,
                                color=color, fill=False)
        court_elements.append(outer_lines)

    for element in court_elements:
        ax.add_patch(element)

    ax.set_xticklabels([])
    ax.set_yticklabels([])
    ax.set_xticks([])
    ax.set_yticks([])
    return ax

We want to create an array of shooting percentages across the different locations in our plot. I decided to group locations into evenly spaced hexagons using matplotlib’s hexbin function (http://matplotlib.org/api/pyplot_api.html). This function will count the number of times a shot is taken from a location in each of the hexagons.

The hexagons are evenly spaced across the xy grid. The variable “gridsize” controls the number of hexagons. The variable “extent” controls where the first hexagon and last hexagon are drawn (ordinarily the first hexagon is drawn based on the location of the first shot).

Computing shooting percentages requires counting the number of made and taken shots in each hexagon, so I run hexbin once using all shots taken and once using only the location of made shots. Then I simply divide the number of made shots by taken shots at each location.
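
The reason this bin-by-bin division works is that fixing the extent (and gridsize) pins the hexagon layout, so both hexbin calls produce bins at identical locations. Below is a small toy check with made-up data; the uniform random points and the variable names are purely illustrative.

import numpy as np, matplotlib.pyplot as plt

np.random.seed(0)
all_x, all_y = np.random.uniform(-250, 250, 500), np.random.uniform(-50, 425, 500)
made_x, made_y = all_x[:200], all_y[:200] #pretend the first 200 shots were makes

hb_all = plt.hexbin(all_x, all_y, gridsize=30, extent=(-250, 250, -50, 425)); plt.close()
hb_made = plt.hexbin(made_x, made_y, gridsize=30, extent=(-250, 250, -50, 425)); plt.close()

print(np.allclose(hb_all.get_offsets(), hb_made.get_offsets())) #identical hexagon centers, so the counts line up bin-by-bin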

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
def find_shootingPcts(shot_df, gridNum):
    x = shot_df.LOC_X[shot_df['LOC_Y']<425.1] #i want to make sure to only include shots I can draw
    y = shot_df.LOC_Y[shot_df['LOC_Y']<425.1]

    x_made = shot_df.LOC_X[(shot_df['SHOT_MADE_FLAG']==1) & (shot_df['LOC_Y']<425.1)]
    y_made = shot_df.LOC_Y[(shot_df['SHOT_MADE_FLAG']==1) & (shot_df['LOC_Y']<425.1)]

    #compute number of shots made and taken from each hexbin location
    hb_shot = plt.hexbin(x, y, gridsize=gridNum, extent=(-250,250,425,-50));
    plt.close() #don't want to show this figure!
    hb_made = plt.hexbin(x_made, y_made, gridsize=gridNum, extent=(-250,250,425,-50),cmap=plt.cm.Reds);
    plt.close()

    #compute shooting percentage
    ShootingPctLocs = hb_made.get_array() / hb_shot.get_array()
    ShootingPctLocs[np.isnan(ShootingPctLocs)] = 0 #makes 0/0s=0
    return (ShootingPctLocs, hb_shot)

I really liked how Savvas Tjortjoglou included players’ pictures in his shooting charts, so I recycled this part of his code too. The picture will appear in the bottom right-hand corner of the shooting chart.

1
2
3
4
5
6
7
8
9
def acquire_playerPic(PlayerID, zoom, offset=(250,400)):
    from matplotlib import  offsetbox as osb
    import urllib
    pic = urllib.urlretrieve("http://stats.nba.com/media/players/230x185/"+PlayerID+".png",PlayerID+".png")
    player_pic = plt.imread(pic[0])
    img = osb.OffsetImage(player_pic, zoom)
    #img.set_offset(offset)
    img = osb.AnnotationBbox(img, offset,xycoords='data',pad=0.0, box_alignment=(1,0), frameon=False)
    return img

I want to depict shooting percentage using a sequential colormap - more red circles = better shooting percentage. The “Reds” colormap looks great, but it would depict a 0% shooting percentage as white (http://matplotlib.org/users/colormaps.html), and white circles will not appear in my plots. I want 0% shooting to be slightly pink, so below I modify the Reds colormap.

1
2
3
4
5
6
7
8
9
#cmap = plt.cm.Reds
#cdict = cmap._segmentdata
cdict = {
    'blue': [(0.0, 0.6313725709915161, 0.6313725709915161), (0.25, 0.4470588266849518, 0.4470588266849518), (0.5, 0.29019609093666077, 0.29019609093666077), (0.75, 0.11372549086809158, 0.11372549086809158), (1.0, 0.05098039284348488, 0.05098039284348488)],
    'green': [(0.0, 0.7333333492279053, 0.7333333492279053), (0.25, 0.572549045085907, 0.572549045085907), (0.5, 0.4156862795352936, 0.4156862795352936), (0.75, 0.0941176488995552, 0.0941176488995552), (1.0, 0.0, 0.0)],
    'red': [(0.0, 0.9882352948188782, 0.9882352948188782), (0.25, 0.9882352948188782, 0.9882352948188782), (0.5, 0.9843137264251709, 0.9843137264251709), (0.75, 0.7960784435272217, 0.7960784435272217), (1.0, 0.40392157435417175, 0.40392157435417175)]
}

mymap = mpl.colors.LinearSegmentedColormap('my_colormap', cdict, 1024)
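
As an aside, if you’d rather not hand-edit a cdict, you can probably get a similar “never fully white” map by sampling only the upper portion of the built-in Reds colormap. This is an alternative sketch, not the map used in the plots below.

pink_to_red = mpl.colors.LinearSegmentedColormap.from_list(
    'pink_to_red', plt.cm.Reds(np.linspace(0.2, 1.0, 256))) #start at light pink instead of white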

Okay, now let’s put it all together. The large function below uses the functions above to create a shot chart depicting shooting percentage as the color of a circle (more red = better shooting %) and the number of shots as the size of a circle (larger circle = more shots). One note about the circle sizes: a circle grows with the number of shots taken at its location, but once circles would start to overlap I stop them from growing.

In this function, I compute the shooting percentages and number of shots at each location. Then I draw circles depicting the number of shots taken at that location (circle size) and the shooting percentage at that location (circle color).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
def shooting_plot(shot_df, plot_size=(12,8),gridNum=30):
    from matplotlib.patches import Circle
    x = shot_df.LOC_X[shot_df['LOC_Y']<425.1]
    y = shot_df.LOC_Y[shot_df['LOC_Y']<425.1]

    #compute shooting percentage and # of shots
    (ShootingPctLocs, shotNumber) = find_shootingPcts(shot_df, gridNum)

    #draw figure and court
    fig = plt.figure(figsize=plot_size)#(12,7)
    cmap = mymap #my modified colormap
    ax = plt.axes([0.1, 0.1, 0.8, 0.8]) #where to place the plot within the figure
    draw_court(outer_lines=False)
    plt.xlim(-250,250)
    plt.ylim(400, -25)

    #draw player image
    zoom = np.float(plot_size[0])/(12.0*2) #how much to zoom the player's pic. I have this hackily dependent on figure size
    img = acquire_playerPic(PlayerID, zoom)
    ax.add_artist(img)

    #draw circles
    for i, shots in enumerate(ShootingPctLocs):
        restricted = Circle(shotNumber.get_offsets()[i], radius=shotNumber.get_array()[i],
                            color=cmap(shots),alpha=0.8, fill=True)
        if restricted.radius > 240/gridNum: restricted.radius=240/gridNum
        ax.add_patch(restricted)

    #draw color bar
    ax2 = fig.add_axes([0.92, 0.1, 0.02, 0.8])
    cb = mpl.colorbar.ColorbarBase(ax2,cmap=cmap, orientation='vertical')
    cb.set_label('Shooting %')
    cb.set_ticks([0.0, 0.25, 0.5, 0.75, 1.0])
    cb.set_ticklabels(['0%','25%', '50%','75%', '100%'])

    plt.show()
    return ax

Okay, that’s it! Now, because I’m a T-Wolves fan, I’ll output the shot charts of the top 6 T-Wolves in minutes this year.

1
2
3
PlayerID = '203952' #andrew wiggins
shot_df = aqcuire_shootingData(PlayerID,'2015-16')
ax = shooting_plot(shot_df, plot_size=(12,8));

1
2
3
PlayerID = '1626157' #karl anthony towns
shot_df = aqcuire_shootingData(PlayerID,'2015-16')
ax = shooting_plot(shot_df, plot_size=(12,8));

1
2
3
PlayerID = '203897' #zach lavine
shot_df = aqcuire_shootingData(PlayerID,'2015-16')
ax = shooting_plot(shot_df, plot_size=(12,8));

1
2
3
PlayerID = '203476' #gorgui dieng
shot_df = aqcuire_shootingData(PlayerID,'2015-16')
ax = shooting_plot(shot_df, plot_size=(12,8));

1
2
3
PlayerID = '2755' #kevin martin
shot_df = aqcuire_shootingData(PlayerID,'2015-16')
ax = shooting_plot(shot_df, plot_size=(12,8));

1
2
3
PlayerID = '201937' #ricky rubio
shot_df = aqcuire_shootingData(PlayerID,'2015-16')
ax = shooting_plot(shot_df, plot_size=(12,8));

One concern with my plots is the use of hexbin. It’s a bit hacky. In particular, it does not account for the nonlinearity produced by the 3-point line (some hexbins include both long 2-pt shots and 3-pt shots). It would be nice to limit some bins to 3-pt shots, but I can’t think of a way to do this without hardcoding the locations. One advantage of the hexbin method is that I can easily change the number of bins. I’m not sure I could get equivalent flexibility from a plot that bins 2-pt and 3-pt shots separately.

Another concern is that my plots treat all shots as equal, which is not fair; shooting 40% from the restricted area and shooting 40% from behind the 3-pt line are very different. Austin Clemens accounts for this by plotting shooting percentage relative to the league average. Maybe I’ll implement something similar in the future.