Here is a tidbit of code which replicates SQL’s “NOT IN” clause, while keeping your data with the workers (it will require a shuffle).
I start by creating some small dataframes.
Take a quick look at dataframe a.
id  valueA
1   a
2   b
3   c
And dataframe b.
id  valueB
1   a
4   d
5   e
I create a new column in a that is all ones. I could have used an existing column, but this way I know the column is never null.
id  valueA  inA
1   a       1
2   b       1
3   c       1
I join a and b with a left join, with b on the left. This way all values in b which are not in a have null values in the column “inA”.
id  valueB  valueA  inA
5   e       null    null
1   a       a       1
4   d       null    null
By keeping only the rows of the new dataframe c where “inA” is null, I remove all values of b which were also in a.
id  valueB  valueA  inA
5   e       null    null
4   d       null    null
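Since the original Spark snippets are short, the whole trick can be sketched end-to-end. Here is a pandas version (column names follow the tables above; in PySpark the analogous steps are adding the flag with withColumn, joining with how='left', and filtering where the flag isNull):

```python
import pandas as pd

a = pd.DataFrame({'id': [1, 2, 3], 'valueA': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1, 4, 5], 'valueB': ['a', 'd', 'e']})

# flag every row of a, then left-join b against it; rows of b
# with no match in a end up with a null flag
a['inA'] = 1
c = b.merge(a, on='id', how='left')

# keep only the rows where the flag is null: b's rows "not in" a
not_in_a = c[c['inA'].isnull()]
print(not_in_a['id'].tolist())  # [4, 5]
```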
Each header in this post represents a different technical area. Following the header I describe what I would know before walking into an interview.
SQL is not often used in academia, but it’s probably the most important skill in data science (how do you think you’ll get your data??). It’s used every day by data scientists at every company, and while it’s 100% necessary to know, it’s stupidly boring to learn. But, once you get the hang of it, it’s a fun language because it requires a lot of creativity. To learn SQL, I would start by doing the mode analytics tutorials, then the sql zoo problems. Installing postgres on your personal computer and fetching data in Python with psycopg2 or sqlalchemy is a good idea. After completing all this, move on to query optimization (where the creativity comes into play): check out the explain function and the order of execution. Shameless self-promotion: I made a SQL presentation on what SQL problems to know for job interviews.
Some places use R. Some places use Python. It sucks, but these languages are not interchangeable (an R team will not hire someone who only knows Python). Whatever language you choose, you should know it well because this is a tool you will use every day. I use Python, so what follows is specific to Python.
I learned Python with Codecademy and liked it. If you’re already familiar with Python I would practice whiteboard-style questions. Feeling comfortable with the beginner questions on a site like leetcode or hackerrank would be a good idea. Writing answers while thinking about code optimization is a plus.
Jeff Knupp’s blog has great tidbits about developing in Python; it’s pure gold.
Another good way to learn is to work on your digital profile. If you haven’t already, I would start a blog (I talk more about this in Post 1).
When starting here, the Andrew Ng coursera course is a great intro. While it’s impossible to learn all of it, I love to use elements of statistical learning and its sibling book introduction to statistical learning as a reference. I’ve heard good things about Python Machine Learning but haven’t checked it out myself.
As a psychology major, I felt relatively well prepared in this regard. Experience with linear mixed effects, hypothesis testing, regression, etc. serves Psychology PhDs well. This doesn’t mean you can forget Stats 101 though. Once, I found myself uncomfortably surprised by a very basic probability question.
Here’s a quick list of Statistics/ML algorithms I often use: GLMs and their regularization methods are a must (L1 and L2 regularization probably come up in 75% of phone screens). Hyperparameter search. Cross-validation! Tree-based models (e.g., random forests, boosted decision trees). I often use XGBoost and have found its intro post helpful.
I think you’re better off deeply (pun not intended) learning the basics (e.g., linear and logistic regression) than learning a smattering of newer, fancier methods (e.g., deep learning). This means thinking about linear regression from first principles (what are the assumptions, and given these assumptions can you derive the best-fit parameters of a linear regression?). I can’t tell you how many hours I’ve spent studying Andrew Ng’s first supervised learning lecture for this. It’s good to freshen up on linear algebra, and there isn’t a better way to do this than the 3Blue1Brown videos; they’re amazing. This might seem too introductory/theoretical, but it’s necessary and often comes up in interviews.
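For reference, the derivation interviewers tend to ask about is ordinary least squares in matrix form:

```latex
% minimize the sum of squared errors
L(\beta) = \lVert y - X\beta \rVert^2 = (y - X\beta)^\top (y - X\beta)
% set the gradient with respect to \beta to zero
\nabla_\beta L = -2 X^\top y + 2 X^\top X \beta = 0
% solve for the best-fit parameters (the normal equations)
\hat{\beta} = (X^\top X)^{-1} X^\top y
```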
Be prepared to talk about the bias-variance tradeoff. Everything in ML comes back to the bias-variance tradeoff, so it’s a great interview question. I know some people like to ask candidates about feature selection. I think this question is basically a rephrasing of the bias-variance tradeoff.
Make a github account if you haven’t already. Get used to commits, pushing, and branching. This won’t take long to get the hang of, but, again, it’s something you will use every day.
As much as possible I would watch code etiquette. I know this seems anal, but it matters to some people (myself included), and having pep8-quality code can’t hurt. There are a number of Python modules that will help here. Jeff Knupp also has a great post about linting/automating code etiquette.
Unit tests are a good thing to practice/be familiar with. As usual, Jeff Knupp has a great post on the topic.
I want to mention that getting a data science job is a little like getting a grant. Each time you apply, there is a low chance of getting the job/grant (luckily, there are many more jobs than grants). When creating your application/grant, it’s important to find ways to get people excited about your application/grant (e.g., showing off your statistical chops). This is where code etiquette comes into play. The last thing you want is to diminish someone’s excitement about you because you didn’t include a doc string. Is code etiquette going to remove you from contention for a job? Probably not. But it could diminish someone’s excitement.
One set of skills that I haven’t touched on is cluster computing (e.g., Hadoop, Spark). Unfortunately, I don’t think there is much you can do here. I’ve heard good things about the book Learning Spark, but books can only get you so far. If you apply for a job that wants Spark, I would install Spark on your local computer and play around, but it’s hard to learn cluster computing when you’re not on a cluster. Spark is more or less fancy SQL (aside from the ML aspects), so learning SQL is a good way to prepare for a Spark mindset. I didn’t include cluster computing above, because many teams seem okay with employees learning this on the job.
Not that there’s a lack of content here, but here’s a good list of must-know topics that I used when transitioning from academia to data science.
Before I get started, I want to thank Rick Wolf for providing comments on an earlier version of this post.
This first post is a series of general questions I’ve received. The second post will focus on technical skills required to get a job in data science.
Each header in this post represents a question. Below the header/question I record my response.
Anyone starting this process should know they are starting a marathon. Not a sprint. Making the leap from academia to data science is more than possible, but it takes time and dedication.
I think it can be a disadvantage in the job application process. Most people don’t understand how quantitative Psychology is, so psychology grads have to overcome these stereotypes. This doesn’t mean having a Psychology PhD is a disadvantage when it comes to BEING a data scientist. Having a Psychology PhD can be a huge advantage because Psychology PhDs have experience measuring behavior which is 90% of data science. Every company wants to know what their customers are doing and how to change their customers’ behavior. This is literally what Psychology PhDs do, so Psychology PhDs might have the most pertinent experience of any science PhD.
(I did the Insight Data Science bootcamp)
Apply when you’re good enough to get a phone screen but not good enough to get a job. Don’t count on a boot camp to give you all the skills. Instead, think of boot camps as polishing your skills.
Here is the game plan I would use:
Send out 3-4 job applications and see if you get any hits. If not, think about how you can improve your resume (see post #2), and go about those improvements. After a few iterations of this, you will start getting invitations to do phone screens. At this stage, a boot camp will be useful.
The boot camps are of varying quality. Ask around to get an idea for which boot camps are better or worse. Also, look into how each boot camp gets paid. If you pay tuition, the boot camp will care less about whether you get a job. If the boot camp gets paid through recruiting fees or collecting tuition from your paychecks, it is more invested in your job.
Yes, I consider this a must (and so do others). It’s a good opportunity to practice data science, and, more importantly, it’s a good opportunity to show off your skills.
Most people (including myself) host their page on github and generate the html with a static site generator. I use octopress, which works great. Most people seem to use pelican. I would recommend pelican because it’s built in Python. I haven’t used it, but a quick google search led me to this tutorial on building a github site with pelican.
I wish I’d sent more of my posts to friends/colleagues. Peer review is always good for a variety of reasons. I’d be more than happy to review posts for anyone reading this blog.
First, no one in industry cares about publications. People might notice if the journal is Science/Nature but most will not. Spend a few hours thinking about how to describe your academic accomplishments as technical skills. For example, as a Postdoc, I was on a Neurophysiology project that required writing code to collect, ingest, and transform electrophysiology data. In academia, none of this code mattered. In industry, it’s the only thing that matters. What I built was a data pipeline, and this is a product many companies desire.
We all have examples like this, but they’re not obvious because academics don’t know what companies want. Think of your data pipelines, your interactive experiments, your scripted analytics.
Transforming academic work into skills that companies desire will take a bit of creativity (I am happy to help with this), but remember that your goal here is to express how the technical skills you used in academia will apply to what you will do as a data scientist.
Many people (including myself) love to say they can learn fast. While this is an important skill, it’s hard to measure and it calls attention to what you do not know. In general, avoid it.
I think a better question than what industry is what size of team/company you want to work on. At a big company you will have a more specific job with more specific requirements (and probably more depth of knowledge). At a smaller company, you will be expected to have a broader skill set. This matters in terms of what you want in a job and what skills you have. Having industry specific knowledge is awesome, but most academics have never worked in an industry so by definition they don’t have industry specific knowledge. Unfortunately, we just have to punt on this aspect of the job application.
No matter what your job is, having a good boss is important. If you get a funny feeling about a potential boss in the interview process, don’t take the job.
Some companies are trying to hire data scientists but don’t want to change their company. By this I mean they want their data scientists to work in excel. Excel is a great tool, but it’s not a tool I would want to use every day. If you feel the same way, then keep an eye out for this.
Hopefully, I can save someone the same pain by writing this blog post.
I decided to use Cron to launch the weekly jobs. Actually launching a weekly job on Cron was not difficult. Check out the Ubuntu Cron manual for a good description on how to use Cron.
What took me forever was realizing that Cron jobs have an extremely limited path. Because of this, specifying the complete path to executed files and their executors is necessary.
Below I describe how I used an ec2 instance (Ubuntu 16.04) to automatically launch this weekly job.
First, here is what my Cron job list looks like (call “crontab -e” in the terminal).
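The exact paths and schedule below are illustrative rather than my original entries, but a crontab along these lines — with SHELL set explicitly and every path written out in full, because cron’s default environment is minimal — is what I mean:

```shell
SHELL=/bin/bash
# run every Monday at 08:00; all paths are spelled out in full
0 8 * * 1 /home/ubuntu/launch_job.sh >> /home/ubuntu/cron.log 2>&1
```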
The important thing to note here is that I am creating the variable SHELL, and $HOME is replaced by the actual path to my home directory.
Next is the shell script called by Cron.
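A minimal sketch of such a script (file names are placeholders; the key line is sourcing .bash_profile so the Python script sees my environment variables):

```shell
#!/bin/bash
# pull in environment variables that cron does not provide
source /home/ubuntu/.bash_profile
# full paths to the interpreter and script, again because of cron's limited PATH
/usr/bin/python /home/ubuntu/launch_emr_job.py >> /home/ubuntu/job.log 2>&1
```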
Again, $HOME is replaced with the actual path to my home directory.
I had to make this shell script and the python script called within it executable (call “chmod +x” in the terminal). The reason that I used this shell script rather than directly launching the python script from Cron is I wanted access to environment variables in my bash_profile. In order to get access to them, I had to source bash_profile.
Finally, below I have the python file that executes the weekly job that I wanted. I didn’t include the code that actually launches our emr cluster because that wasn’t the hard part here, but just contact me if you would like to see it.
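The EMR-launching details are omitted, but a sketch of the script’s shape — full path to aws plus logging — might look like this (cluster arguments, paths, and names are placeholders, not my actual configuration):

```python
import logging
import subprocess

logging.basicConfig(filename="launch_emr.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

# full path to the aws executable (from `which aws`); cron's PATH won't find it
AWS = "/usr/local/bin/aws"

def build_launch_command(aws_path=AWS):
    # placeholder arguments for `aws emr create-cluster`
    return [aws_path, "emr", "create-cluster",
            "--name", "weekly-job",
            "--release-label", "emr-5.8.0",
            "--instance-type", "m4.large",
            "--instance-count", "3"]

def main():
    cmd = build_launch_command()
    logging.info("launching cluster: %s", " ".join(cmd))
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        logging.info(out.decode())
    except (OSError, subprocess.CalledProcessError) as err:
        logging.error("launch failed: %s", err)

if __name__ == "__main__":
    main()
```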
While the code is not included here, I use aws cli to launch my emr cluster, and I had to specify the path to aws (call “which aws” in the terminal) when making this call.
You might have noticed the logging I am doing in this script. I found logging both within this python script and piping the output of this script to additional logs helpful when debugging.
The Ubuntu Cron manual I linked above makes it perfectly clear that my Cron path issues are common, but I wanted to post my solution in case other people needed a little guidance.
This struck me as a great opportunity to do some quick data science. For this post, I scraped the names (from wikipedia) and ratings (from TMDb) of all American TV shows. I did the same for major American movies, so that I could have a comparison group (maybe all content is better or worse). The ratings are given by TMDb’s users and are scores between 1 and 10 (where 10 is a great show/movie and 1 is a lousy show/movie).
All the code for this post can be found on my github.
I decided to operationalize my “golden age of TV” hypothesis as the average TV show is better now than previously. This would be expressed as a positive slope (beta coefficient) when building a linear regression that outputs the rating of a show given the date on which the show first aired. My wife predicted a slope near zero or negative (shows are no better or worse than previously).
Below, I plot the ratings of TV shows and movies across time. Each show is a dot in the scatter plot. Show rating (average rating given by TMDb) is on the y-axis. The date of the show’s first airing is on the x-axis. When I encountered shows with the same name, I just tacked a number onto the end. For instance, show “x” would become show “x_1.” The size of each point in the scatter plot is the show’s “popularity”, which is a bit of a black box, but it’s given by TMDb’s API. TMDb does not give a full description of how they calculate popularity, but they do say it’s a function of how many times an item is viewed on TMDb, how many times an item is rated, and how many times the item has been added to a watch or favorite list. I decided to depict it here just to give the figures a little more detail. The larger the dot, the more popular the show.
Here’s a plot of all TV shows across time.
To test the “golden age of TV” hypothesis, I coded up a linear regression in javascript (below). I put the regression’s output as a comment at the end of the code. Before stating whether the hypothesis was rejected or not, I should note that I removed shows with fewer than 10 votes because these shows had erratic ratings.
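My original regression was written in javascript; an equivalent sketch in Python, with toy data standing in for the scraped ratings, looks like this:

```python
def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# toy rows: (first_air_year, rating, vote_count)
shows = [(1995, 8.1, 120), (2001, 7.4, 45), (2008, 7.0, 300),
         (2012, 6.8, 80), (2016, 6.5, 500), (2017, 6.9, 9)]

# drop shows with fewer than 10 votes; their ratings are erratic
kept = [(year, rating) for year, rating, votes in shows if votes >= 10]
slope = ols_slope([y for y, _ in kept], [r for _, r in kept])
print(slope)  # negative for this toy data: no "golden age" trend
```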
As you can see, there is no evidence that TV is better now than previously. In fact, if anything, this dataset says that TV is worse (but more on this later).
I wanted to include movies as a comparison to TV. Here’s a plot of all movies across time.
It’s important to note that I removed all movies with fewer than 1000 votes. This is completely 100% unfair, BUT I am very proud of my figures here and things get a little laggy when including too many movies in the plot. Nonetheless, movies seem to be getting worse over time! More dramatically than TV shows!
Okay, so this was a fun little analysis, but I have to come out and say that I wasn’t too happy with my dataset and the conclusions we can draw from this analysis are only as good as the dataset.
The first limitation is that recent content is much more likely to receive a rating than older content, which could systematically bias the ratings of older content (e.g., only good shows from before 2000 receive ratings). It’s easy to imagine how this would lead us to believe that all older content is better than it actually was.
Also, TMDb seems to have IMDB-type tastes, by which I mean it’s dominated by young males. For instance, while I don’t like the show “Keeping Up with the Kardashians,” it’s definitely not the worst show ever. Also, “Girls” is an amazing show which gets no respect here. The quality of a show is in the eye of the beholder, which in this case seems to be boys.
I would have used Rotten Tomatoes’ API, but they don’t provide access to TV ratings.
Even with all these caveats in mind, it’s hard to defend my “golden age of TV” hypothesis. Instead, it seems like there is just more content being produced, which leads to more good shows (yay!), but the average show is no better or worse than previously.
I’ve been using Kodi/XBMC since 2010. It provides a flexible and (relatively) intuitive interface for interacting with content through your TV (much like an apple TV). One of the best parts of Kodi is the addons: these are apps that you can build or download. For instance, I use the NBA League Pass addon for watching Wolves games. I’ve been looking for a reason to build my own Kodi addon for years.
Enter PBS NewsHour. If you’re not watching PBS NewsHour, I’m not sure what you’re doing with your life because it’s the shit. It rocks. PBS NewsHour disseminates all their content on youtube and their website. For the past couple years, I’ve been watching their broadcasts every morning through the Youtube addon. This works fine, but it’s clunky. I decided to streamline watching the NewsHour by building a Kodi addon for it.
I used this tutorial to build a Kodi addon that accesses the PBS NewsHour content through the youtube addon. This addon can be found on my github. The addon works pretty well, but it includes links to all NewsHour’s content, and I only want the full episodes. I am guessing I could have modified this addon to get what I wanted, but I really wanted to build my own addon from scratch.
The addon I built is available on my github. To build my addon, I used this tutorial, and some code from this github repository. Below I describe how the addon works. I only describe the file default.py because this file does the majority of the work, and I found the linked tutorials did a good job explaining the other files.
I start by importing libraries that I will use. Most of these libraries are used for scraping content off the web. I then create some basic variables to describe the addon’s name (addonID), its name in kodi (base_url), the number used to refer to it (addon_handle; I am not sure how this number is used), and current arguments sent to my addon (args).
The next function, getRequest, gathers html from a website (specified by the variable url). The dictionary httpHeaders tells the website a little about myself, and how I want the html. I use urllib2 to get a compressed version of the html, which is decompressed using zlib.
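The original used Python 2’s urllib2 (Kodi ran Python 2 at the time); here is a Python 3 sketch of the same idea, with illustrative header values:

```python
import urllib.request
import zlib

# tell the site who we are and that we accept compressed html
httpHeaders = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html',
    'Accept-Encoding': 'gzip, deflate',
}

def getRequest(url, headers=httpHeaders):
    """Fetch a page, decompressing it if the server compressed the response."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        page = resp.read()
        if resp.headers.get('Content-Encoding') in ('gzip', 'deflate'):
            # MAX_WBITS + 32 lets zlib auto-detect the gzip/zlib header
            page = zlib.decompress(page, zlib.MAX_WBITS + 32)
    return page
```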
The hardest part of building this addon was finding video links. I was able to find a github repo with code for identifying links to PBS’s videos, but PBS initially posts their videos on youtube. I watch PBS NewsHour the morning after it airs, so I needed a way to watch these youtube links. I started this post by saying I wanted to avoid using Kodi’s youtube addon, but I punted and decided to use the youtube addon to play these links. Below is a function for finding the youtube id of a video.
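A sketch of such a function — the exact markup PBS uses may differ, so the regex here is illustrative:

```python
import re

def find_youtube_id(html):
    """Pull an 11-character youtube video id out of an embed url, if present."""
    match = re.search(r'youtube\.com/embed/([\w-]{11})', html)
    return match.group(1) if match else None
```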
This next function actually fetches the videos (the hard part of building this addon). This function fetches the html of the website that has PBS’s video. It then searches the html for “coveplayerid,” which is PBS’s name for the video. I use this name to create a url that will play the video. I get the html associated with this new url, and search it for a json file that contains the video. I grab this json file, and voilà, I have the video’s url! In the final part of the code, I request a higher-quality version of the video than PBS would give me by default.
If I fail to find “coveplayerid,” then I know this is a video with a youtube link, so I grab the youtube id. Some pages have a coveplayerid class, but no actual coveplayerid. I also detect these cases and find the youtube id when it occurs.
This next function identifies full episodes that have aired in the past week. It’s the meat of the addon. The function gets the html of PBS NewsHour’s page, and finds all links in a sidebar where PBS lists their past week’s episodes. I loop through the links and create a menu item for each one. These menu items are python objects that Kodi can display to users. The items include a label/title (the name of the episode), an image, and a url that Kodi can use to find the video url.
The most important part of this listing is the url I create. This url gives Kodi all the information I just described, associates the link with an addon, and tells Kodi that the link is playable. In the final part of the function, I pass the list of links to Kodi.
Okay, that’s the hard part. The rest of the code implements the functions I just described. The function below is executed when a user chooses to play a video. It gets the url of the video, and gives this to the xbmc function that will play the video. The only hiccup here is I check whether the link is for the standard PBS video type or not. If it is, then I give the link directly to Kodi. If it’s not, then this is a youtube link and I launch the youtube plugin with my youtube video id.
This final function is launched whenever a user calls the addon or executes an action in the addon (that’s why I call the function in the final line of code here). params is an empty dictionary if the addon is being opened. params being empty causes the addon to call list_videos, creating the list of episodes that PBS has aired in the past week. If the user selects one of the episodes, then router is called again, but this time the argument is the url of the selected item. This url is passed to the play_video function, which plays the video for the user!
That’s my addon! I hope this tutorial helps people create future Kodi addons. Definitely reach out if you have questions. Also, make sure to check out the NewsHour soon and often. It’s the bomb.
At Insight, I built Sifting the Overflow, a chrome extension which you can install from the google chrome store. Sifting the Overflow identifies the most helpful parts of answers to questions about the programming language Python on StackOverflow.com. To create Sifting the Overflow, I trained a recurrent neural net (RNN) to identify “helpful” answers, and when you use the browser extension on a stackoverflow page, this RNN rates the helpfulness of each sentence of each answer. The sentences that my model believes to be helpful are highlighted so that users can quickly find the most helpful parts of these pages.
I wrote a quick post here about how I built Sifting the Overflow, so check it out if you’re interested. The code is also available on my github.
In the Monty Hall problem, there is a car behind one of three doors. There are goats behind the other two doors. The contestant picks one of the three doors. Monty Hall (the game show host) then reveals that one of the two unchosen doors has a goat behind it. The question is whether the contestant should change the door they picked or keep their choice.
My first intuition was that it doesn’t matter whether the contestant changes their choice because it’s equally probable that the car is behind either of the two unopened doors, but I’ve been told this is incorrect! Instead, the contestant is more likely to win the car if they change their choice.
How can this be? Well, I decided to create a simple simulation of the Monty Hall problem in order to prove to myself that there really is an advantage to changing the chosen door and (hopefully) gain an intuition into how this works.
Below I’ve written my little simulation. A jupyter notebook with this code is available on my github.
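The simulation boils down to a few lines; here is a compact sketch of the logic (my notebook version tracks the same win counts across many simulated games):

```python
import random

def play_round(switch):
    """One Monty Hall game; returns True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    choice = random.choice(doors)
    # Monty opens a door that is neither the contestant's pick nor the car
    opened = random.choice([d for d in doors if d != choice and d != car])
    if switch:
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == car

def win_rate(switch, n_games=100000):
    return sum(play_round(switch) for _ in range(n_games)) / n_games

print(win_rate(True))   # ~0.67
print(win_rate(False))  # ~0.33
```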
Here I plot the results.
Probability of choosing correctly if change choice: 0.67
Probability of choosing correctly if do not change choice: 0.33
Probability of difference arising from chance: 0.00000
Clearly, the contestant should change their choice!
So now, just to make sure I am not crazy, I decided to simulate the Monty Hall problem with the contestant choosing what door to open after Monty Hall opens a door with a goat.
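The same kind of sketch works for this second scenario, where the goat door is revealed before the contestant commits to a door:

```python
import random

def play_after_reveal(switch):
    """Monty opens a goat door first; the contestant then picks and may switch."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    # Monty opens one of the goat doors before any choice is made
    opened = random.choice([d for d in doors if d != car])
    remaining = [d for d in doors if d != opened]
    choice = random.choice(remaining)
    if switch:
        choice = next(d for d in remaining if d != choice)
    return choice == car

def win_rate_after_reveal(switch, n_games=100000):
    return sum(play_after_reveal(switch) for _ in range(n_games)) / n_games
```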
Probability of choosing correctly if change choice: 0.51
Probability of choosing correctly if do not change choice: 0.49
Probability of difference arising from chance: 0.57546
Now, there is clearly no difference between whether the contestant changes their choice or not.
So what is different about these two scenarios?
In the first scenario, the contestant makes a choice before Monty Hall reveals which of the two unchosen options is incorrect. Here’s the intuition I’ve gained by doing this: because Monty Hall cannot reveal what is behind the chosen door, when Monty Hall reveals what is behind one of the unchosen doors, this has no impact on how likely the car is to appear behind the chosen door. Yet, the probability that the car is behind the revealed door drops to 0 (because Monty Hall shows there’s a goat behind it), and the total probability must be conserved, so the second unchosen door receives any belief that the car was behind the revealed door! Thus, the unchosen and unrevealed door becomes 66% likely to contain the car! I am still not 100% convinced of this new intuition, but it seems correct given these simulations!
I recently presented at the annual meeting of the Society for Neuroscience, so I wanted to do a quick post describing my findings.
The reinforcement learning literature postulates that we go in and out of exploratory states in order to learn about our environments and maximize the reward we gain in these environments. For example, you might try different foods in order to find the food you most prefer. But, not all novelty seeking behavior results from reward maximization. For example, I often read new books. Maybe reading a new book triggers a reward circuit response, but it certainly doesn’t lead to immediate rewards.
In this poster we used a free viewing task to examine whether an animal would exhibit a novelty preference when it was not associated with any possible rewards. We found the animal looked at (paid attention to) novel items more often than he looked at familiar items, but this preference for paying attention to novel items fluctuated over time. Sometimes the animal had a large preference for looking at the novel items and sometimes he had no preference for novel items.
Neurons that we recorded in the dlPFC and area 7a encoded whether the animal was currently in a state where he preferred looking at novel items or not, and this encoding persisted across the entire trial period. Importantly, while neurons in these areas also encoded whether the animal was currently looking at a novel item or not, this encoding was distinct from the encoding of the current preference state. These results demonstrate that the animal had simultaneous neural codes representing whether he was acutely attending to novel items and his general preference for attending to novel items or not. Importantly, these neural codes existed even though there were no explicit reward associations.
In this tutorial I generate fake data in order to help gain insight into the mechanics underlying PCA.
Below I create my first feature by sampling from a normal distribution. I create a second feature by adding a noisy normal distribution to the first feature multiplied by two. Because I generated the data here, I know it’s composed of two latent variables, and PCA should be able to identify these latent variables.
I generate the data and plot it below.
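A sketch of the data generation (the sample size and noise scale here are my illustrative choices; the plotting code is omitted):

```python
import numpy as np

np.random.seed(0)
n = 200
feature1 = np.random.normal(size=n)
# second feature: twice the first plus normally distributed noise
feature2 = 2 * feature1 + np.random.normal(scale=0.5, size=n)
data = np.column_stack([feature1, feature2])
```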
The first step before doing PCA is to normalize the data. This centers each feature (each feature will have a mean of 0) and divides data by its standard deviation (changing the standard deviation to 1). Normalizing the data puts all features on the same scale. Having features on the same scale is important because features might be more or less variable because of measurement rather than the latent variables producing the feature. For example, in basketball, points are often accumulated in sets of 2s and 3s, while rebounds are accumulated one at a time. The nature of basketball puts points and rebounds on different scales, but this doesn’t mean that the latent variables scoring ability and rebounding ability are more or less variable.
Below I normalize and plot the data.
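The normalization step is just centering and scaling each column (shown here on random data; plotting omitted):

```python
import numpy as np

def standardize(data):
    # subtract each feature's mean and divide by its standard deviation
    return (data - data.mean(axis=0)) / data.std(axis=0)

# quick check on random data with nonzero mean and large scale
rng = np.random.RandomState(1)
z = standardize(rng.normal(loc=5.0, scale=3.0, size=(100, 2)))
```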
After standardizing the data, I need to find the eigenvectors and eigenvalues. The eigenvectors point in the direction of a component and eigenvalues represent the amount of variance explained by the component. Below, I plot the standardized data with the eigenvectors plotted with their eigenvalues as the vector’s distance from the origin.
As you can see, the blue eigenvector is longer and points in the direction with the most variability. The purple eigenvector is shorter and points in the direction with less variability.
As expected, one component explains far more variability than the other component (because both my features share variance from a single latent Gaussian distribution).
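The eigendecomposition itself is a couple of lines on the covariance matrix of the standardized data (data generation here mirrors the sketch above; plotting omitted):

```python
import numpy as np

def pca_eig(standardized):
    # covariance matrix of the (already standardized) features
    cov = np.cov(standardized, rowvar=False)
    # eigh is appropriate for symmetric matrices like a covariance matrix
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    return eig_vals, eig_vecs

rng = np.random.RandomState(2)
x = rng.normal(size=300)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=300)])
z = (data - data.mean(axis=0)) / data.std(axis=0)
eig_vals, eig_vecs = pca_eig(z)
```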
Next I order the eigenvectors according to the magnitude of their eigenvalues. This orders the components so that the components that explain more variability occur first. I then transform the data so that they’re axis-aligned. This means the first component explains variability on the x-axis and the second component explains variance on the y-axis.
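Ordering by eigenvalue and projecting the data onto the eigenvectors looks like this (again with the illustrative data from above):

```python
import numpy as np

def pca_transform(standardized):
    cov = np.cov(standardized, rowvar=False)
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    order = np.argsort(eig_vals)[::-1]        # biggest eigenvalue first
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
    return standardized.dot(eig_vecs)         # axis-aligned component scores

rng = np.random.RandomState(3)
x = rng.normal(size=300)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=300)])
z = (data - data.mean(axis=0)) / data.std(axis=0)
scores = pca_transform(z)
```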
Finally, just to make sure the PCA was done correctly, I will call PCA from the sklearn library, run it, and make sure it produces the same results as my analysis.
(1.0, 0.0)
(1.0, 0.0)
In 2012, Krizhevsky et al. released a convolutional neural network that completely blew away the field at the ImageNet challenge. This model is called “Alexnet,” and 2012 marks the beginning of neural networks’ resurgence in the machine learning community.
Alexnet’s domination was not only exciting for the machine learning community. It was also exciting for the visual neuroscience community whose descriptions of the visual system closely matched alexnet (e.g., HMAX). Jim DiCarlo gave an awesome talk at the summer course describing his research comparing the output of neurons in the visual system and the output of “neurons” in alexnet (you can find the article here).
I find the similarities between the visual system and convolutional neural networks exciting, but check out the depictions of Alexnet and the visual system above. Alexnet is depicted in the upper image; the visual system is depicted in the lower image. Comparing the two images is not entirely fair, but the visual system is obviously vastly more complex than Alexnet.
In my project, I applied a known complexity of the biological visual system to a convolutional neural network. Specifically, I incorporated visual attention into the network. Visual attention refers to our ability to focus cognitive processing onto a subset of the environment. Check out this video for an incredibly 90s demonstration of visual attention.
In this post, I demonstrate that implementing a basic version of visual attention in a convolutional neural net improves performance of the CNN, but only when classifying noisy images, and not when classifying relatively noiseless images.
Code for everything described in this post can be found on my github page. In creating this model, I cribbed code from both Jacob Gildenblat and this implementation of alexnet.
I implemented my model using the Keras library with a Theano backend, and I tested my model on the MNIST database. The MNIST database is composed of images of handwritten numbers. The task is to design a model that can accurately guess what number is written in the image. This is a relatively easy task, and the best models are over 99% accurate.
I chose MNIST because it's an easy problem, which allows me to use a small network. A small network is both easy to train and easy to understand, which is good for an exploratory project like this one.
Above, I depict my model. This model has two convolutional layers. Following the convolutional layers is a feature averaging layer, which borrows methods from a recent paper out of the Torralba lab and computes the average activity of units covering each location. The output of this feature averaging layer is then passed along to a fully connected layer. The fully connected layer “guesses” what the most likely digit is. My goal when I first created this network was to use this “guess” to guide where the model focused processing (i.e., attention), but I found guided models are erratic during training.
Instead, my current model directs attention to all locations that are predictive of all digits. I haven't toyed too much with in-between models (models that direct attention to locations that are predictive of the N most likely digits).
So what does it mean to “direct attention” in this model? Here, directing attention means that neurons covering “attended” locations are more active than neurons covering unattended locations. I apply attention to the input of the second convolutional layer. The attentionally weighted signal passes through the second convolutional layer and on to the feature averaging layer. The feature averaging layer feeds the fully connected layer, which then produces a final guess about what digit is present.
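A toy sketch of this kind of multiplicative attention (my own simplification in NumPy, not the post's Keras code): the attention map scales every channel of the feature map entering the next layer, so attended locations produce more active units.

```python
import numpy as np

rng = np.random.RandomState(0)
feature_maps = rng.rand(4, 12, 12)   # 4 channels from a first conv layer
attention = np.ones((12, 12))        # comparison model: all ones, no effect
attention[4:8, 4:8] = 2.0            # hypothetical attended region, amplified

# attention multiplies every channel at every spatial location
attended = feature_maps * attention[None, :, :]
```

With an all-ones map this reduces exactly to the comparison model described below.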
I first tested this model on the plain MNIST set. For testing, I wanted to compare my model to a model without attention. My comparison model is the same as the model with attention, except that the attention-directing signal is a matrix of ones, meaning that it doesn't have any effect on the model's activity. I use this comparison model because it has the same architecture as the model with attention.
I depict the results of my attentional and comparison models below. On the x-axis is the test phase (10k trials) following each training epoch (60k trials). On the y-axis is percent accuracy during the test phase. I did 3 training runs with both sets of models. All models gave fairly similar results, which led to small error bars (these depict standard error). The results are … disappointing. As you can see, both the model with attention and the comparison model perform similarly. There might be an initial impact of attention, but this impact is slight.
This result was a little disappointing (since I'm an attention researcher and consider attention an important part of cognition), but it might not be so surprising given the task. If I gave you the task of naming digits, it would be virtually effortless; probably so effortless that you would not have to pay very much attention to the task. You could probably talk on the phone or text while doing this task. Basically, I might have failed to find an effect of attention because this task is so easy that it does not require attention.
I decided to try my network when the task was a little more difficult. To make the task more difficult, I added random noise to each image (thank you to Nancy Kanwisher for the suggestion). This trick of adding noise to images is frequently used in psychophysical attention experiments, so it would be fitting if it worked here.
The figure above depicts model performance on noisy images. The models are the same as before, but this time the model with attention is far superior to the comparison model. Good news for attention researchers! This work suggests that visual attentional mechanisms similar to those in the brain may be beneficial in convolutional neural networks, and this effect is particularly strong when the images are noisy.
This work bears superficial similarity to recent language translation and question answering models. Models like the cited one report using a biologically inspired version of attention, and I agree they do, but they do not use attention in the same way that I am here. I believe this difference demonstrates a problem with what we call “attention.” Attention is not a single cognitive process. Instead, it's a family of cognitive processes that we've simply given the same name. That's not to say these forms of attention are completely distinct, but they likely involve different information transformations and probably even different brain regions.
Much of this post reuses code from the previous posts, so I skim over some of the repeated code.
As usual, I will post all code as a jupyter notebook on my github.
1 2 3 4 5 6 7 

Load the data. Reminder: this data is available on my github.
1 2 3 4 5 6 7 8 

Load more data, and normalize the data for the PCA transformation.
1 2 3 4 5 6 7 8 

In the past I used k-means to group players according to their performance (see my post on grouping players for more info). Here, I use a Gaussian mixture model (GMM) to group the players. I use the GMM because it assigns each player a “soft” label rather than a “hard” label. By soft label I mean that a player simultaneously belongs to several groups. For instance, Russell Westbrook belongs to both my “point guard” group and my “scorers” group. K-means uses hard labels, where each player can only belong to one group. I think the GMM provides a more accurate representation of players, so I've decided to use it in this post. Maybe in a future post I will spend more time describing it.
For anyone wondering, the GMM groupings looked pretty similar to the k-means groupings.
1 2 3 4 5 6 7 8 9 10 11 12 13 
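Soft labels with sklearn's GaussianMixture can be sketched like this (toy data standing in for the player features; the number of components here is arbitrary):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# toy "player" data drawn from three loose clusters
stats = np.vstack([rng.randn(50, 2) + center
                   for center in ([0, 0], [5, 0], [0, 5])])

gmm = GaussianMixture(n_components=3, random_state=0).fit(stats)

# each row is one player's membership probabilities across all groups
soft_labels = gmm.predict_proba(stats)
```

Unlike k-means' single hard assignment, each row here is a probability distribution over groups, which is what lets a player belong to several groups at once.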

In the past I have attempted to predict win shares per 48 minutes. I am using win shares as a dependent variable again, but this time I want to categorize players.
Below I create a histogram of players’ win shares per 48.
I split players into 4 groups, which I will refer to as “poor,” “below average,” “above average,” and “great”: poor players are the bottom 10% in win shares per 48, below average players are the 10th–50th percentiles, above average players are the 50th–90th percentiles, and great players are the top 10%. This assignment scheme is relatively arbitrary; the model performs similarly with different assignment schemes.
1 2 3 4 5 6 7 8 

[0.096314496314496317,
0.40196560196560199,
0.39950859950859952,
0.10221130221130222]
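The percentile split can be sketched with pandas (synthetic win-share values as a stand-in; the cut points are the 10th, 50th, and 90th percentiles, as described above):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
ws48 = pd.Series(rng.randn(1000))  # stand-in for win shares per 48

# cut points at the 10th, 50th, and 90th percentiles
cuts = ws48.quantile([0.10, 0.50, 0.90])
labels = pd.cut(ws48, bins=[-np.inf, *cuts, np.inf],
                labels=['poor', 'below average', 'above average', 'great'])
proportions = labels.value_counts(normalize=True)
```

The proportions come out near the [0.10, 0.40, 0.40, 0.10] split shown above, by construction.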
My goal is to use rookie year performance to classify players into these 4 categories. I have a big matrix with lots of data about rookie year performance, but the reason that I grouped players using the GMM is that I suspect players in the different groups have different “paths” to success. I am including the groupings in my classification model and computing interaction terms. The interaction terms will allow rookie performance to produce different predictions for the different groups.
By including interaction terms, I include quite a few predictor features. I’ve printed the number of predictor features and the number of predicted players below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 

(1703, 1432)
(1703,)
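One way such interaction terms could be built (my own construction, not necessarily the post's exact feature engineering) is to multiply every rookie stat by every soft group membership:

```python
import numpy as np

rng = np.random.RandomState(0)
rookie_stats = rng.rand(100, 5)   # hypothetical rookie-year features
groups = rng.rand(100, 3)         # hypothetical soft GMM memberships
groups /= groups.sum(axis=1, keepdims=True)

# interaction terms: every stat multiplied by every group probability
interactions = (rookie_stats[:, :, None] * groups[:, None, :]).reshape(100, -1)

# final design matrix: stats, group memberships, and all interactions
X = np.hstack([rookie_stats, groups, interactions])
```

This is how a modest number of raw stats can balloon into the 1432 predictor features printed above: each stat gets one extra copy per group.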
Now that I have all the features, it’s time to try and predict which players will be poor, below average, above average, and great. To create these predictions, I will use a logistic regression model.
Because I have so many predictors, correlation between predictor features and overfitting the data are major concerns. I use regularization and cross-validation to combat these issues.
Specifically, I am using l2 regularization and 5-fold cross-validation. Within the cross-validation, I am trying to estimate how much regularization is appropriate.
Some important notes: I am using “balanced” class weights, which tells the model that it is worse to incorrectly predict the poor and great players than the below average and above average players. I do this because I don't want the model to completely ignore the less frequent classifications. Second, I use the multinomial multi_class option because it limits the number of models I have to fit.
1 2 3 4 5 6 7 8 9 

0.738109219025
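A sketch of this kind of model with sklearn (synthetic data standing in for the rookie features; note that recent sklearn versions handle the multinomial case by default, so I omit the deprecated multi_class argument):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# synthetic 4-class problem standing in for the rookie data
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# l2 penalty; 5-fold CV picks the regularization strength from Cs candidates;
# 'balanced' weights keep the rare classes from being ignored
model = LogisticRegressionCV(Cs=5, cv=5, penalty='l2',
                             class_weight='balanced', max_iter=2000)
model.fit(X, y)
accuracy = model.score(X, y)
```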
Okay, the model did pretty well, but let's look at where the errors are coming from. To visualize the model's accuracy, I am using a confusion matrix. In a confusion matrix, every item on the diagonal is a correctly classified item, and every item off the diagonal is incorrectly classified. The color bar's axis is the percent correct, so the dark blue squares represent cells with more items.
It seems the model is best at predicting poor players and great players. It makes more errors when trying to predict the more average players.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
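A minimal sketch of a row-normalized confusion matrix (toy labels, not the model's actual predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2, 3]
y_pred = [0, 1, 1, 1, 2, 2, 3, 3]

cm = confusion_matrix(y_true, y_pred)

# normalize each row so cells show the fraction of that true class
cm_pct = cm.astype(float) / cm.sum(axis=1, keepdims=True)
```

Plotting `cm_pct` with something like matplotlib's imshow gives the percent-correct heatmap described above, with correct classifications on the diagonal.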

Let's look at what the model predicts for this year's rookies. Below I modified two functions that I wrote for a previous post. The first function finds a particular year's draft picks. The second function produces predictions for each draft pick.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 

Below I create a plot depicting the model's predictions. On the y-axis are the four classifications. On the x-axis are the players from the 2015 draft. Each cell in the plot is the probability of a player belonging to one of the classifications. Again, dark blue means a classification is more likely. Good news for us TWolves fans! The model loves KAT.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 

The data produced by sportsvu camera systems used to be freely available on NBA.com but was recently removed (I have no idea why). Luckily, the data for about 600 games are available on neilmj's github. In this post, I show how to create a video recreation of a given basketball play using the sportsvu data.
This code is also available as a jupyter notebook on my github.
1 2 3 4 5 

The data is provided as a json. Here's how to import the python json library and load the data. I'm a TWolves fan, so the game I chose is a TWolves game.
1 2 3 

Let’s take a quick look at the data. It’s a dictionary with three keys: gamedate, gameid, and events. Gamedate and gameid are the date of this game and its specific id number, respectively. Events is the structure with data we’re interested in.
1


[u'gamedate', u'gameid', u'events']
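The loading step can be sketched like this. I use an inline stand-in string here so the example is self-contained; with a real game file it would be `json.load(open(path))`, and the game id below is a placeholder, not a real game.

```python
import json

# tiny stand-in with the same top-level structure as a sportsvu game file
raw = '{"gamedate": "2016-01-15", "gameid": "0021500000", "events": []}'
data = json.loads(raw)

top_level_keys = list(data.keys())  # gamedate, gameid, events
```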
Let's take a look at the first event. The first event has an associated eventid number; we will use these later. There's also data for each player on the visiting and home teams, which we will use later too. Finally, and most importantly, there are the “moments.” There are 25 moments for each second of the “event” (the data is sampled at 25 Hz).
1


[u'eventId', u'visitor', u'moments', u'home']
Here's the first moment of the first event. The first number is the quarter. The second number is the time of the event in milliseconds. The third number is the number of seconds left in the quarter (the 1st quarter hasn't started yet, so 12 * 60 = 720). The fourth number is the number of seconds left on the shot clock. I am not sure what the fifth number (None) represents.
The final entry is an 11x5 matrix. The first row describes the ball. The first two columns are the teamID and the playerID of the ball (−1 for both, because the ball does not belong to a team and is not a player). The 3rd and 4th columns are the x and y coordinates of the ball. The final column is the height of the ball (the z coordinate).
The next 10 rows describe the 10 players on the court. The first 5 players belong to the home team and the last 5 players belong to the visiting team. Each row has a player's teamID, playerID, and x, y, and z coordinates (although I don't think players' z coordinates ever change).
1


[1,
1452903036782,
720.0,
24.0,
None,
[[-1, -1, 44.16456, 26.34142, 5.74423],
[1610612760, 201142, 45.46259, 32.01456, 0.0],
[1610612760, 201566, 10.39347, 24.77219, 0.0],
[1610612760, 201586, 25.86087, 25.55881, 0.0],
[1610612760, 203460, 47.28525, 17.76225, 0.0],
[1610612760, 203500, 43.68634, 26.63098, 0.0],
[1610612750, 708, 55.6401, 25.55583, 0.0],
[1610612750, 2419, 47.95942, 31.66328, 0.0],
[1610612750, 201937, 67.28725, 25.10267, 0.0],
[1610612750, 203952, 47.28525, 17.76225, 0.0],
[1610612750, 1626157, 49.46814, 24.24193, 0.0]]]
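Unpacking one moment can be sketched as follows. The moment below is a truncated stand-in built from the dump above (the ball's teamID/playerID of −1 mark it as not belonging to a team or player in the raw feed).

```python
# one moment: quarter, timestamp (ms), game clock, shot clock,
# an unknown field, and the 11x5 position matrix
moment = [1, 1452903036782, 720.0, 24.0, None,
          [[-1, -1, 44.16456, 26.34142, 5.74423],
           [1610612760, 201142, 45.46259, 32.01456, 0.0]]]  # truncated

quarter, timestamp_ms, game_clock, shot_clock, _, positions = moment

ball = positions[0]
ball_x, ball_y, ball_z = ball[2], ball[3], ball[4]
players = positions[1:]  # one row per player on the court
```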
Alright, so we have the sportsvu data, but it's not clear what each event is. Luckily, the NBA also provides play-by-play (pbp) data. I write a function for acquiring play-by-play game data. This function collects (and trims) the play-by-play data for a given sportsvu data set.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 

Below I show what the play-by-play data looks like. There's a column for event number (eventnum); these event numbers match up with the event numbers from the sportsvu data, so we will use them later for seeking out specific plays in the sportsvu data. There's a column for the event type (eventmsgtype). This column has a number describing what occurred in the play. I list these number codes in the comments below.
There are also short text descriptions of the plays in the home description and visitor description columns. Finally, I use the team column to represent the primary team involved in a play.
I stole the idea of using play-by-play data from Raji Shah.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 

   EVENTNUM  EVENTMSGTYPE  HOMEDESCRIPTION                                     VISITORDESCRIPTION          TEAM
0         0            12  None                                                None                        None
1         1            10  Jump Ball Adams vs. Towns: Tip to Ibaka             None                        OKC
2         2             5  Westbrook Out of Bounds Lost Ball Turnover (P1...   None                        OKC
3         3             2  None                                                MISS Wiggins 16' Jump Shot  MIN
4         4             4  Westbrook REBOUND (Off:0 Def:1)                     None                        OKC
When viewing the videos, it's nice to know which players are on the court. I like to depict this by labeling each player with his jersey number. Here I create a dictionary that contains each player's id number (these are assigned by nba.com) as the key and his jersey number as the associated value.
1 2 3 4 5 
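That dictionary can be sketched with a dict comprehension. The player entries below are stand-ins, and I'm assuming `'playerid'` and `'jersey'` keys in the event's player lists, which may differ from the actual json.

```python
# hypothetical slice of an event's home-team player list
home_players = [{'playerid': 201937, 'jersey': '9'},
                {'playerid': 1626157, 'jersey': '32'}]

# map nba.com player id -> jersey number
id_to_jersey = {p['playerid']: p['jersey'] for p in home_players}
```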

Alright, almost there! Below I write some functions for creating the actual video! First, there’s a short function for placing an image of the basketball court beneath our depiction of players moving around. This image is from gmf05’s github, but I will provide it on mine too.
Much of this code is either straight from gmf05’s github or slightly modified.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 

The event that I want to depict is event 41. In this event, Karl-Anthony Towns misses a shot, grabs his own rebound, and puts it back in.
1


    EVENTNUM  EVENTMSGTYPE  HOMEDESCRIPTION  VISITORDESCRIPTION      TEAM
37        41             1  None             Towns 1' Layup (2 PTS)  MIN
We need to find where event 41 is in the sportsvu data structure, so I created a function for finding the location of a particular event. I then create a matrix with position data for the ball and a matrix with position data for each player for event 41.
1 2 3 4 5 6 7 8 9 10 11 12 

Okay. We're actually there! Now we get to create the video. We have to create figure and axes objects for the animation to draw on. Then I place a picture of the basketball court on this plot. Finally, I create the circle and text objects that will move around throughout the video (depicting the ball and players). The locations of these objects are then updated in the animation loop.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 

I’ve been told this video does not work for all users. I’ve also posted it on youtube.
By depicting the shooting data continuously, I lose the ability to represent one dimension: I can no longer use the size of circles to depict shot frequency at a location. Nonetheless, I thought it would be fun to create these charts.
I explain how to create them below. I’ve also included the ability to compare a player’s shooting performance to the league average.
In my previous shot charts, I queried nba.com's API when creating a player's shot chart, but querying nba.com's API for every shot taken in 2015–16 takes a little while (for computing the league average), so I've uploaded this data to my github and load the league data from a file rather than querying nba.com's API.
This code is also available as a jupyter notebook on my github.
1 2 3 

Here, I create a function for querying shooting data from NBA.com’s API. This is the same function I used in my previous post regarding shot charts.
You can find a player's ID number by going to the player's nba.com page and looking at the page address. There is a python library that you can use for querying player IDs (and other data from the nba.com API), but I've found this library to be a little shaky.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 

Create a function for drawing the nba court. This function was taken directly from Savvas Tjortjoglou’s post on shot charts.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 

Write a function for acquiring each player's picture. This isn't essential, but it makes things look nicer. This function takes a playerID number and the amount to zoom in on an image as the inputs. By default, it places the image at the location (500, 500).
1 2 3 4 5 6 7 8 9 

Here is where things get a little complicated. Below I write a function that divides the shooting data into a 25x25 matrix. Each shot taken within the x-y coordinates encompassed by a given bin counts towards the shot count in that bin. In this way, the method I am using here is very similar to my previous hexbins (circles). So the difference just comes down to how I present the data rather than how I preprocess it.
This function takes a dataframe with a vector of shot locations in the X plane, a vector with shot locations in the Y plane, a vector with shot type (2 pointer or 3 pointer), and a vector with ones for made shots and zeros for missed shots. The function by default bins the data into a 25x25 matrix, but the number of bins is editable. The 25x25 bins are then expanded to encompass a 500x500 space.
The output is a dictionary containing matrices for shots made, attempted, and points scored in each bin location. The dictionary also has the player’s ID number.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
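The binning step can be sketched with `np.histogram2d`. The shot locations below are random stand-ins, and the court extents and bin counts are assumptions, not the post's exact values.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(-250, 250, 1000)   # stand-in shot x locations
y = rng.uniform(-50, 450, 1000)    # stand-in shot y locations
made = rng.randint(0, 2, 1000)     # 1 = made, 0 = missed

court_range = [[-250, 250], [-50, 450]]

# attempts per bin: a 25x25 grid over the court
attempts, xedges, yedges = np.histogram2d(x, y, bins=25, range=court_range)

# makes per bin: same grid, weighted by the made/missed vector
makes, _, _ = np.histogram2d(x, y, bins=25, range=court_range, weights=made)

# expand the 25x25 matrix into a 500x500 space (each bin becomes 20x20)
attempts_big = np.kron(attempts, np.ones((20, 20)))
```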

Below I load the league average data. I also have the code that I used to originally download the data and to preprocess it.
1 2 3 4 5 6 7 8 9 

I really like playing with the different color maps, so here is a new color map I created for these shot charts.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
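Building a custom colormap can be sketched with matplotlib's `LinearSegmentedColormap.from_list` (the colors below are my own placeholders, not the post's palette):

```python
from matplotlib.colors import LinearSegmentedColormap

# hypothetical blue -> white -> red map for the shot charts
cmap = LinearSegmentedColormap.from_list(
    'shot_chart', ['#0000ff', '#ffffff', '#ff0000'])

rgba = cmap(0.5)  # sample the middle of the map; returns an RGBA tuple
```

The resulting `cmap` object can be passed to any matplotlib plotting call via the `cmap=` argument.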

Below, I write a function for creating the nba shot charts. The function takes a dictionary with matrices for shots attempted, made, and points scored. The matrices should be 500x500. By default, the shot chart depicts the number of shots taken across locations, but it can also depict the number of shots made, field goal percentage, and points scored across locations.
The function uses a gaussian kernel with a standard deviation of 5 to smooth the data (make it look pretty). Again, this is editable. By default the function plots a player's raw data, but it will plot how a player compares to the league average if the input includes a matrix of league average data.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 
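The smoothing step can be sketched with scipy's Gaussian filter (a random stand-in matrix here; sigma of 5, as described above):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.RandomState(0)
shot_matrix = rng.poisson(1.0, (500, 500)).astype(float)  # stand-in counts

# smooth with a gaussian kernel, standard deviation of 5
smoothed = gaussian_filter(shot_matrix, sigma=5)
```

Smoothing spreads each bin's count over its neighborhood, which is what turns the blocky 500x500 grid into the continuous-looking charts.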

Alright, that's that. Now let's create some plots. I am a TWolves fan, so I will plot data from Karl-Anthony Towns.
First, here is the default plot: attempts.
1 2 3 

Here are KAT's shots made.
1 2 3 

Here’s field goal percentage. I don’t like this one too much. It’s hard to use similar scales for attempts and field goal percentage even though I’m using standard deviations rather than absolute scales.
1 2 3 

Here’s points across the court.
1 2 3 

Here's how KAT's attempts compare to the league average. You can see the TWolves' midrange-heavy offense.
1 2 3 

How KAT's shots made compare to the league average.
1 2 3 

How KAT’s field goal percentage compares to league average. Again, the scale on these is not too good.
1 2 3 

And here is how KAT’s points compare to league average.
1 2 3 

When first created, 1-layer neural networks brought about quite a bit of excitement, but this excitement quickly dissipated when researchers realized that 1-layer neural networks could only solve a limited set of problems.
Researchers knew that adding an extra layer enabled neural networks to solve much more complex problems, but they didn't know how to train these more complex networks.
In the previous post, I described “backpropagation,” but that wasn't the part of backpropagation that really changed the history of neural networks. What really changed neural networks is backpropagation with an extra layer. This extra layer enabled researchers to train more complex networks. The extra layers are called hidden layers. In this post, I will describe backpropagation with a hidden layer.
To describe backpropagation with a hidden layer, I will demonstrate how neural networks can solve the XOR problem.
In this example of the XOR problem there are four items. Each item is defined by two values. If these two values are the same, then the item belongs to one group (blue here). If the two values are different, then the item belongs to another group (red here).
Below, I have depicted the XOR problem. The goal is to find a model that can distinguish between the blue and red groups based on an item’s values.
This code is also available as a jupyter notebook on my github.
1 2 3 4 5 6 7 8 9 10 

Again, each item has two values. An item's first value is represented on the x-axis. An item's second value is represented on the y-axis. The red items belong to one category and the blue items belong to another.
This is a nonlinear problem because no linear function can segregate the groups. For instance, a horizontal line could segregate the upper and lower items and a vertical line could segregate the left and right items, but no single linear function can segregate the red and blue items.
We need a nonlinear function to separate the groups, and neural networks can emulate a nonlinear function that segregates them.
While this problem may seem relatively simple, it gave the initial neural networks quite a hard time. In fact, this is the problem that depleted much of the original enthusiasm for neural networks.
Neural networks can easily solve this problem, but they require an extra layer. Below I depict a network with an extra layer (a 2-layer network). To depict the network, I use a repository available on my github.
1 2 3 4 5 6 7 

Notice that this network now has 5 total neurons. The two units at the bottom are the input layer. The activity of an input unit is the value of the input (same as the inputs in my previous post). The two units in the middle are the hidden layer. The activity of the hidden units is calculated in the same manner as the output units from my previous post. The unit at the top is the output layer. The activity of this unit is found in the same manner as in my previous post, but with the activity of the hidden units replacing the input units.
Thus, when the neural network makes its guess, the only difference is we have to compute an extra layer’s activity.
The goal of this network is for the output unit to have an activity of 0 when presented with an item from the blue group (inputs are same) and to have an activity of 1 when presented with an item from the red group (inputs are different).
One additional aspect of neural networks that I haven't discussed is that each non-input unit can have a bias. You can think about a bias as a propensity for the unit to become active or not to become active. For instance, a unit with a positive bias is more likely to be active than a unit with no bias.
I will implement bias as an extra line feeding into each unit. The weight of this line is the bias, and the bias line is always active, meaning this bias is always present.
Below, I seed this neural network with a random set of weights.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 

Above we have our network. The depiction of the two bias weights is a little confusing: 0.8 belongs to one bias line and 0.5 belongs to the other.
Let's go through one example of our network receiving an input and making a guess. Let's say the input is [0 1]. This means $x_1 = 0$ and $x_2 = 1$. The correct answer in this case is 1.
First, we have to calculate each hidden unit's input. Remember we can write the input as
$net = \sum_i x_i w_i$
With a bias, we can rewrite it as
$net = bias \cdot 1 + \sum_i x_i w_i$
Specifically, for the first hidden unit $h_1$,
$net_{h_1} = bias_{h_1} \cdot 1 + x_1 w_{x_1 \to h_1} + x_2 w_{x_2 \to h_1}$
Remember, the first term in the equation above is the bias term. Let's see what this looks like in code.
1 2 3 

[-1.27669634 -1.07035845]
Note that by using np.dot, I can calculate both hidden units' inputs in a single line of code.
Next, we have to find the activity of the units in the hidden layer. I will translate input into activity with a logistic function, as I did in the previous post:
$activity = \frac{1}{1 + e^{-net}}$
Let's see what this looks like in code.
1 2 3 4 5 

[ 0.2181131 0.25533492]
So far, so good: the logistic function has transformed the negative inputs into values near 0.
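The net-input and logistic steps can be sketched together like this. The weights below are hypothetical (not the post's seeded ones); the input [0 1] matches the example above, and the first column of each weight row is the bias, whose input line is always 1.

```python
import numpy as np

def logistic(x):
    # squash a net input into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-x))

# hypothetical weights: one row per hidden unit; column 0 is the bias weight
weights = np.array([[-0.5, 0.3, -0.8],
                    [-0.2, -0.6, 0.1]])
inputs = np.array([1.0, 0.0, 1.0])  # bias input, x1 = 0, x2 = 1

net_hidden = np.dot(weights, inputs)   # both hidden units' inputs at once
hidden_activity = logistic(net_hidden)
```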
Now we have to compute the output unit's activity:
$net_O = bias_O \cdot 1 + a_{h_1} w_{h_1 \to O} + a_{h_2} w_{h_2 \to O}$
where $a_{h_1}$ and $a_{h_2}$ are the hidden unit activities we just computed. Plugging in the numbers gives the output unit's net input. Now the code for computing $net_O$ and the Output unit's activity.
1 2 3 4 5 6 

net_Output
[-0.66626595]
Output
[ 0.33933346]
Okay, that's the network's guess for one input… nowhere near the correct answer (1). Let's look at what the network predicts for the other input patterns. Below I create a feedforward neural network and plot the neural net's guesses for the four input patterns.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 

In the plot above, I have Input 1 on the x-axis and Input 2 on the y-axis. So if the input is [0,0], the network produces the activity depicted in the lower left square; if the input is [1,0], the network produces the activity depicted in the lower right square. If the network produces an output of 0, then the square will be blue; if the network produces an output of 1, then the square will be red. As you can see, the network produces outputs between 0.25 and 0.5… nowhere near the correct answers.
So how do we update the weights in order to reduce the error between our guess and the correct answer?
First, we will do backpropagation between the output and hidden layers. This is exactly the same as backpropagation in the previous post.
In the previous post I described how our goal is to decrease error by changing the weights between units. The equation below expresses how the error changes as we change the weight between $h_1$ and the Output unit:
$\frac{\partial Error}{\partial w_{h_1 \to O}} = -(target - a_O) \cdot a_O(1 - a_O) \cdot a_{h_1}$
Now multiply this weight adjustment by the learning rate $\alpha$ (flipping the sign, since we want to descend the error gradient):
$\Delta w_{h_1 \to O} = \alpha \cdot (target - a_O) \cdot a_O(1 - a_O) \cdot a_{h_1}$
Finally, we apply the weight adjustment to $w_{h_1 \to O}$:
$w_{h_1 \to O} \leftarrow w_{h_1 \to O} + \Delta w_{h_1 \to O}$
Now let's do the same thing for all of the output unit's incoming weights (including its bias) in the code.
1 2 3 4 5 6 7 8 

[[0.21252673 0.96033892 0.29229558]]
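The output-layer update can be sketched numerically. The activities below are hypothetical, chosen to roughly match the magnitudes in the forward pass above; the first entry of the activity vector is the always-on bias input.

```python
import numpy as np

# hypothetical values from a forward pass
target = 1.0
output = 0.34                               # output unit's activity
hidden_acts = np.array([1.0, 0.22, 0.26])   # bias input, then two hidden units

# delta rule for a logistic output unit:
# delta = (target - output) * derivative of the logistic at the output
delta = (target - output) * output * (1 - output)

learning_rate = 0.5
weight_change = learning_rate * delta * hidden_acts  # one change per weight
```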
The hidden layer changes things when we do backpropagation. Above, we computed the new weights using the output unit's error. Now, we want to find how adjusting a weight changes the error, but this weight connects an input to the hidden layer rather than connecting to the output layer. This means we have to propagate the error backwards to the hidden layer.
We will describe backpropagation for the line connecting $x_1$ and $h_1$ as
$\frac{\partial Error}{\partial w_{x_1 \to h_1}} = \frac{\partial Error}{\partial a_{h_1}} \cdot \frac{\partial a_{h_1}}{\partial net_{h_1}} \cdot \frac{\partial net_{h_1}}{\partial w_{x_1 \to h_1}}$
Pretty similar. We just replaced the Output unit with $h_1$. The interpretation (starting with the final term and moving left) is that changing $w_{x_1 \to h_1}$ changes $h_1$'s input. Changing $h_1$'s input changes $h_1$'s activity. Changing $h_1$'s activity changes the error. This last assertion (the first term) is where things get complicated. Let's take a closer look at this first term:
$\frac{\partial Error}{\partial a_{h_1}} = \frac{\partial Error}{\partial net_O} \cdot \frac{\partial net_O}{\partial a_{h_1}}$
Changing $h_1$'s activity changes the input to the Output unit, and changing the Output unit's input changes the error. Hmmm, still not quite there yet. Let's look at how changes to the Output unit's input change the error:
$\frac{\partial Error}{\partial net_O} = \frac{\partial Error}{\partial a_O} \cdot \frac{\partial a_O}{\partial net_O}$
You can probably see where this is going. Changing the Output unit's input changes the Output unit's activity. Changing the Output unit's activity changes the error. There we go.
Okay, this got a bit heavy, but here comes some good news. Compare the two terms of the equation above to the first two terms of our original backpropagation equation. They're the same! Now let's look at the remaining term:
$\frac{\partial net_O}{\partial a_{h_1}} = w_{h_1 \to O}$
Again, I am glossing over how to derive these partial derivatives. For a more complete explanation, I recommend Chapter 8 of Rumelhart and McClelland's PDP book. Nonetheless, this means we can take the output of our delta_output function, multiply it by $w_{h_1 \to O}$, and we have the first term of our backpropagation equation! Note that we want $w_{h_1 \to O}$ to be the weight used in the forward pass, not the updated weight.
The last two terms of our backpropagation equation are the same as in our original backpropagation equation:
$\frac{\partial a_{h_1}}{\partial net_{h_1}} = a_{h_1}(1 - a_{h_1})$ (this is specific to logistic activation functions), and
$\frac{\partial net_{h_1}}{\partial w_{x_1 \to h_1}} = x_1$
Let's try and write this all out:
$\frac{\partial Error}{\partial w_{x_1 \to h_1}} = \left[ -(target - a_O) \cdot a_O(1 - a_O) \cdot w_{h_1 \to O} \right] \cdot a_{h_1}(1 - a_{h_1}) \cdot x_1$
It's not short, but it's doable. Let's plug in the numbers.
Not too bad. Now let's see the code.
1 2 3 4 

[[0.25119612 0.50149299 0.77809147]
[0.80193714 0.23946929 0.84467792]]
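The hidden-layer update can be sketched like this (hypothetical forward-pass values; the structure follows the chain rule described above, with the output delta propagated back through the hidden-to-output weights):

```python
import numpy as np

def logistic_deriv(activity):
    # derivative of the logistic function, written in terms of its output
    return activity * (1 - activity)

# hypothetical values from a forward pass
target, output = 1.0, 0.34
hidden_acts = np.array([0.22, 0.26])    # hidden unit activities
w_hidden_out = np.array([0.5, -0.4])    # forward-pass weights, hidden -> output
inputs = np.array([1.0, 0.0, 1.0])      # bias input, x1 = 0, x2 = 1

# delta at the output, then propagated back to each hidden unit
delta_out = (target - output) * logistic_deriv(output)
delta_hidden = delta_out * w_hidden_out * logistic_deriv(hidden_acts)

# one weight change per input->hidden connection (2 hidden units x 3 inputs)
learning_rate = 0.5
weight_change = learning_rate * np.outer(delta_hidden, inputs)
```

Note that the column for $x_1$ is all zeros because $x_1 = 0$ in this example: a weight on a silent input line gets no update.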
Alright! Let's implement all of this into a single model and train it on the XOR problem. Below I create a neural network that includes both a forward pass and an optional backpropagation pass.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 

Okay, that's the network. Below, I train the network until its answers are very close to the correct answers.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 

Let's see how the error changed across training.
1 2 3 4 

Really cool. The network starts with volatile error, sometimes being nearly correct and sometimes being completely incorrect. Then, after about 5,000 iterations, the network starts down the slow path of perfecting an answer scheme. Below, I create a plot depicting the network's activity for the different input patterns.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 

Again, the Input 1 value is on the x-axis and the Input 2 value is on the y-axis. As you can see, the network guesses 1 when the inputs are different and 0 when the inputs are the same. Perfect! Below I depict the network with these correct weights.
1 2 3 4 5 6 7 8 9 10 

The network finds a pretty cool solution. Both hidden units are relatively active, but one hidden unit sends a strong positive signal and the other sends a strong negative signal. The output unit has a negative bias, so if neither input is on, it will have an activity around 0. If both input units are on, then the hidden unit that sends a positive signal is inhibited, and the output unit will have activity near 0. Otherwise, the hidden unit with a positive signal gives the output unit an activity near 1.
This is all well and good, but if you try to train this network with random weights, you might find that it sometimes produces an incorrect set of weights. This happens because the network runs into a local minimum: a point where any change in the weights would increase the error, so the network is left with a suboptimal set of weights.
Below I hand-pick a set of weights that produces a local optimum.
Using these weights as the starting point, let's see what the network does with training.
As you can see, the network never reduces the error. Let's see how the network answers the different input patterns.
Looks like the network produces the correct answer in some cases but not others. The network is particularly confused when Input 2 is 0. Below I depict the weights after “training.” As you can see, they have not changed much from where they started before training.
This network was unable to push itself out of the local optimum. While local optima are a problem, there are a couple of things we can do to avoid them. First, we should always train a network multiple times with different random weights in order to test for local optima. If the network continually finds local optima, we can increase the learning rate. A larger learning rate lets the network escape local optima in some cases, but this should be done with care: too large a learning rate can also prevent the network from finding the global minimum.
Alright, that’s it. Obviously the neural network behind alpha go is much more complex than this one, but I would guess that while alpha go is much larger the basic computations underlying it are similar.
Hopefully these posts have given you an idea for how neural networks function and why they’re so cool!
With the recent success of neural networks, I thought it would be useful to write a few posts describing the basics of neural networks.
First, what are neural networks? Neural networks are a family of machine learning algorithms that can learn data's underlying structure. Neural networks are composed of many neurons that perform simple computations. By performing many simple computations, neural networks can answer even the most complicated problems.
Let's get started.
As usual, I will post this code as a jupyter notebook on my github.
When talking about neural networks, it's nice to visualize the network with a figure. For drawing the neural networks, I forked a repository from miloharper and made some changes so that the repository could be imported into Python and so that I could label the network. Here is my forked repository.
Above is our neural network. It has two input neurons and a single output neuron. In this example, I’ll give the network an input of [0 1]. This means Input A will receive an input value of 0 and Input B will have an input value of 1.
The input is the input unit’s activity. This activity is sent to the Output unit, but the activity changes when traveling to the Output unit. The weights between the input and output units change the activity. A large positive weight between the input and output units causes the input unit to send a large positive (excitatory) signal. A large negative weight between the input and output units causes the input unit to send a large negative (inhibitory) signal. A weight near zero means the input unit does not influence the output unit.
In order to know the Output unit’s activity, we need to know its input. I will refer to the output unit’s input as . Here is how we can calculate
a more general way of writing this is
Let’s pretend the inputs are [0 1] and the Weights are [0.25 0.5]. Here is the input to the output neuron 
Thus, the input to the output neuron is 0.5. A quick way of programming this is with the function numpy.dot, which finds the dot product of two vectors (or matrices). This might sound a little scary, but in this case it's just multiplying the items by each other and then summing everything up, like we did above.
0.5
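The stripped snippet presumably looked something like this minimal sketch:

```python
import numpy as np

inputs = np.array([0, 1])
weights = np.array([0.25, 0.5])
net_input = np.dot(inputs, weights)   # 0 * 0.25 + 1 * 0.5
print(net_input)  # → 0.5
```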
All this is good, but we haven't actually calculated the output unit's activity; we have only calculated its input. What makes neural networks able to solve complex problems is that they include a nonlinearity when translating the input into activity. In this case, we will translate the input into activity by putting the input through a logistic function.
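The exact code was stripped, but the logistic function is standard; a minimal sketch:

```python
import numpy as np

def logistic(x):
    # squashes any real-valued input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(logistic(0.5))  # → 0.6224593312018546
```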
Let's take a look at a logistic function.
As you can see above, the logistic function used here transforms negative values into values near 0 and positive values into values near 1. Thus, when a unit receives a negative input it has activity near 0, and when a unit receives a positive input it has activity near 1. The most important aspect of this activation function is that it's nonlinear: it's not a straight line.
Now let's see the activity of our output neuron. Remember, the net input is 0.5.
0.622459331202
The activity of our output neuron is depicted as the red dot.
So far I’ve described how to find a unit’s activity, but I haven’t described how to find the weights of connections between units. In the example above, I chose the weights to be 0.25 and 0.5, but I can’t arbitrarily decide weights unless I already know the solution to the problem. If I want the network to find a solution for me, I need the network to find the weights itself.
In order to find the weights of connections between neurons, I will use an algorithm called backpropagation. In backpropagation, we have the neural network guess the answer to a problem and adjust the weights so that the guess gets closer and closer to the correct answer. Backpropagation is the method by which we reduce the distance between guesses and the correct answer. After many iterations of guesses by the neural network and weight adjustments through backpropagation, the network can learn an answer to a problem.
Let's say we want our neural network to give an answer of 0 when the left input unit is active and an answer of 1 when the right unit is active. In this case, the inputs I will use are [1,0] and [0,1]. The corresponding correct answers will be [0] and [1].
Let's see how close our network is to the correct answer. I am using the weights from above ([0.25, 0.5]).
[0.56217650088579807, 0.62245933120185459]
The guesses are in blue and the answers are in red. As you can tell, the guesses and the answers look almost nothing alike. Our network likes to guess around 0.6 while the correct answer is 0 in the first example and 1 in the second.
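The stripped code presumably computed the two guesses like this minimal sketch (the weights and inputs are the ones given above):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

weights = np.array([0.25, 0.5])
inputs = [np.array([1, 0]), np.array([0, 1])]
guesses = [logistic(np.dot(x, weights)) for x in inputs]
# the guesses come out close to [0.562, 0.622], matching the output above
```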
Let's look at how backpropagation reduces the distance between our guesses and the correct answers.
First, we want to know how the amount of error changes with an adjustment to a given weight. We can write this as
This change in error with changes in the weights has a number of different subcomponents.
Through the chain rule we know
This might look scary, but with a little thought it should make sense (starting with the final term and moving left): when we change the weight of a connection to a unit, we change the input to that unit. When we change the input to a unit, we change its activity (written Output above). When we change a unit's activity, we change the amount of error.
Let’s break this down using our example. During this portion, I am going to gloss over some details about how exactly to derive the partial derivatives. Wikipedia has a more complete derivation.
In the first example, the input is [1,0] and the correct answer is [0]. Our network’s guess in this example was about 0.56.
Please note that this is specific to our example with a logistic activation function.
To summarize:
This is the direction we want to move in, but taking large steps in this direction can prevent us from finding the optimal weights. For this reason, we reduce our step size with a parameter called the learning rate, which is bound between 0 and 1.
Here is how we can write our change in weights
This is known as the delta rule.
We will set the learning rate to be 0.5. Here is how we will calculate the new weight.
Thus, the weight is shrinking, which will move the output towards 0. Below I write the code to implement our backpropagation.
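A minimal sketch of a single backpropagation step as described — the delta rule, with an outer product to spread the change; the variable names are my own:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

alpha = 0.5                               # learning rate
weights = np.array([0.25, 0.5])
x = np.array([1.0, 0.0])                  # first input pattern
target = 0.0                              # its correct answer

output = logistic(np.dot(x, weights))     # the network's guess, ~0.562
# delta: error scaled by the logistic derivative, output * (1 - output)
delta = (target - output) * output * (1.0 - output)
# the outer product spreads the change to every line feeding the output unit
weights = weights + alpha * np.outer(delta, x).ravel()
```

After this step, the first weight has shrunk from 0.25 while the second (whose input was 0) is unchanged.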
Above I use the outer product of our delta function and the input in order to spread the weight changes to all lines connecting to the output unit.
Okay, hopefully you made it through that. I promise that's as bad as it gets. Now that we've gotten through the nasty stuff, let's use backpropagation to find an answer to our problem.
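The stripped training loop presumably resembled this self-contained sketch; the stopping criterion here is my assumption, chosen to roughly match the guesses reported further down:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

weights = np.array([0.25, 0.5])
alpha = 0.5
examples = [(np.array([1.0, 0.0]), 0.0), (np.array([0.0, 1.0]), 1.0)]

errors = []
error = 1.0
while error > 0.1:                  # hypothetical stopping criterion
    error = 0.0
    for x, target in examples:
        output = logistic(np.dot(x, weights))
        delta = (target - output) * output * (1.0 - output)
        weights += alpha * np.outer(delta, x).ravel()
        error += abs(target - output)
    errors.append(error)            # track error for plotting later
```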
It seems our code has found an answer, so let's see how the amount of error changed as the code progressed.
It looks like the while loop executed about 1000 iterations before converging. As you can see, the error decreases: quickly at first, then slowly as the weights zero in on the correct answer. Let's see how our guesses compare to the correct answers.
[array([ 0.05420561]), array([ 0.95020512])]
Not bad! Our guesses are much closer to the correct answers than before we started running the backpropagation procedure! Now, you might say, “HEY! But you haven’t reached the correct answers.” That’s true, but note that achieving the values of 0 and 1 with a logistic function is only possible at inputs of negative and positive infinity, respectively. Because of this, we treat 0.05 as 0 and 0.95 as 1.
Okay, all this is great, but that was a really simple problem, and I said that neural networks could solve interesting problems!
Well… this post is already longer than I anticipated. I will follow up this post with another explaining how we can expand neural networks to solve more interesting problems.
After my previous post, I started to get a little worried about my career prediction model. Specifically, I started to wonder whether my model was underfitting or overfitting the data. Underfitting occurs when the model has too much “bias” and cannot accommodate the data’s shape. Overfitting occurs when the model is too flexible and can account for all variance in a data set, even variance due to noise. In this post, I will quickly recreate my player prediction model and investigate whether underfitting and overfitting are problems.
Because this post largely repeats a previous one, I haven’t written quite as much about the code. If you would like to read more about the code, see my previous posts.
As usual, I will post all code as a jupyter notebook on my github.
Load the data. Reminder  this data is still available on my github.
Load more data, and normalize it for the PCA transformation.
Use k-means to group players according to their performance. See my post on grouping players for more info.
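The stripped block presumably used scikit-learn's KMeans; below is a dependency-free numpy sketch of the k-means idea on made-up 2-D data, just to illustrate the alternating assign/update steps:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain numpy k-means: alternate point assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# three well-separated synthetic clusters (stand-ins for player-stat vectors)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2)) for c in (0, 3, 6)])
labels, centroids = kmeans(X, k=3)
```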
Run a separate regression on each group of players. I calculate mean absolute error (a variant of mean squared error) for each model. I used mean absolute error because it's on the same scale as the data and easier to interpret. I will use this later to evaluate just how accurate these models are. Quick reminder: I am trying to predict career WS/48 with MANY predictor variables from rookie-year performance, such as rebounding and scoring statistics.
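The group-wise approach can be sketched with plain numpy on stand-in data; the real features, group labels, and WS/48 values come from the basketball data, which I fake here:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                           # stand-in rookie features
true_coef = np.array([0.02, -0.01, 0.03, 0.0, 0.01])
y = X @ true_coef + rng.normal(scale=0.02, size=300)    # stand-in career WS/48
groups = rng.integers(0, 6, size=300)                   # stand-in cluster labels

maes = []
for g in range(6):
    mask = groups == g
    A = np.column_stack([np.ones(mask.sum()), X[mask]])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)   # OLS fit for this group
    maes.append(np.mean(np.abs(y[mask] - A @ coef)))     # mean absolute error
```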
More quick reminders: predicted performances are on the y-axis, actual performances are on the x-axis, and the red line is the identity line. Thus far, everything has been exactly the same as my previous post (although my group labels are different).
I want to investigate whether the model is overfitting the data. If the model is overfitting the data, then the error should go up when training and testing with different datasets (because the model was fitting itself to noise, and noise changes when the datasets change). To investigate whether the model overfits the data, I will evaluate whether the model “generalizes” via cross-validation.
The reason I'm worried about overfitting is that I used a LOT of predictors in these models, and the number of predictors might have allowed the model to fit noise in the predictors.
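Cross-validation can be sketched as follows; this is a generic k-fold illustration on synthetic data, not the post's exact code (which presumably used scikit-learn's cross-validation helpers):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)

def cv_mae(X, y, k=5):
    """Mean absolute error estimated with k-fold cross-validation."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    maes = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # fit on the training folds only, score on the held-out fold
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        maes.append(np.mean(np.abs(y[test] - X[test] @ coef)))
    return float(np.mean(maes))
```

If the model were fitting noise, this held-out MAE would be noticeably worse than the in-sample MAE, which is exactly the comparison printed below.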
Group 0
Initial Mean Absolute Error: 0.0161
Cross Validation MAE: 0.0520
Group 1
Initial Mean Absolute Error: 0.0251
Cross Validation MAE: 0.0767
Group 2
Initial Mean Absolute Error: 0.0202
Cross Validation MAE: 0.0369
Group 3
Initial Mean Absolute Error: 0.0200
Cross Validation MAE: 0.0263
Group 4
Initial Mean Absolute Error: 0.0206
Cross Validation MAE: 0.0254
Group 5
Initial Mean Absolute Error: 0.0244
Cross Validation MAE: 0.0665
Above I print out each model's initial mean absolute error and its mean absolute error when fitting cross-validated data.
The models definitely have more error when cross-validated. The change in error is worse in some groups than others. For instance, error dramatically increases in Group 1. Keep in mind that the scoring measure here is mean absolute error, so error is on the same scale as WS/48. An average error of 0.04 in WS/48 is sizable, leaving me worried that the models overfit the data.
Unfortunately, Group 1 is the “scorers” group, so the group with the most interesting players is where the model fails most…
Next, I will look into whether my models underfit the data. I am worried that my models underfit the data because I used linear regression, which has very little flexibility. To investigate this, I will plot the residuals of each model. Residuals are the error between my model’s prediction and the actual performance.
Linear regression assumes that residuals are uncorrelated and evenly distributed around 0. If this is not the case, then the linear regression is underfitting the data.
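A small synthetic illustration of this diagnostic (not the post's code): fitting a straight line to curved data leaves residuals that correlate with the target, which is the signature of underfitting.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = x**2 + rng.normal(scale=0.05, size=200)   # curved data

# fit a straight line (an underfit for this data)
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef

# underfitting shows up as structure: residuals correlate with the target
corr = np.corrcoef(y, residuals)[0, 1]
```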
Residuals are on the y-axis and career performances are on the x-axis. Negative residuals are overpredictions (the player is worse than my model predicts) and positive residuals are underpredictions (the player is better than my model predicts). I don't test this, but the residuals appear VERY correlated. That is, the model tends to overestimate bad players (players with WS/48 less than 0.0) and underestimate good players. Just to clarify, uncorrelated residuals would have no apparent slope.
This means the model is making systematic errors and not fitting the actual shape of the data. I’m not going to say the model is damned, but this is an obvious sign that the model needs more flexibility.
No model is perfect, but this model definitely needs more work. I’ve been playing with more flexible models and will post these models here if they do a better job predicting player performance.
Many have attempted to predict NBA players' success via regression-style approaches. Notable models I know of include Layne Vashro's model, which uses combine and college performance to predict career performance. Layne Vashro's model is a quasi-Poisson GLM. I tried a similar approach, but had the most success when using WS/48 and OLS. I will discuss this a little more at the end of the post.
A jupyter notebook of this post can be found on my github.
I collected all the data for this project from basketball-reference.com. I posted the functions for collecting the data on my github. The data is also posted there. Beware, the data collection scripts take a while to run.
This data includes per-36 stats and advanced statistics such as usage percentage. I simply took all the per-36 and advanced statistics from a player's page on basketball-reference.com.
The variable I am trying to predict is average WS/48 over a player's career. There's no perfect box-score statistic when it comes to quantifying a player's performance, but WS/48 seems relatively solid.
The predicted variable looks pretty Gaussian, so I can use ordinary least squares. This will be nice because, while OLS is not flexible, it's highly interpretable. At the end of the post I'll mention some more complex models that I will try.
Above, I remove some predictors from the rookie data. Let's run the regression!
OLS Regression Results
==============================================================================
Dep. Variable: WS/48 R-squared: 0.476
Model: OLS Adj. R-squared: 0.461
Method: Least Squares F-statistic: 31.72
Date: Sun, 20 Mar 2016 Prob (F-statistic): 2.56e-194
Time: 15:29:43 Log-Likelihood: 3303.9
No. Observations: 1690 AIC: -6512.
Df Residuals: 1642 BIC: -6251.
Df Model: 47
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]

const 0.2509 0.078 3.223 0.001 0.098 0.404
x1 0.0031 0.001 6.114 0.000 0.004 0.002
x2 0.0004 9.06e05 4.449 0.000 0.001 0.000
x3 0.0003 8.12e05 3.525 0.000 0.000 0.000
x4 1.522e05 4.73e06 3.218 0.001 5.94e06 2.45e05
x5 0.0030 0.031 0.096 0.923 0.057 0.063
x6 0.0109 0.019 0.585 0.559 0.026 0.047
x7 0.0312 0.094 0.331 0.741 0.216 0.154
x8 0.0161 0.027 0.594 0.553 0.037 0.069
x9 0.0054 0.018 0.292 0.770 0.041 0.031
x10 0.0012 0.007 0.169 0.866 0.013 0.015
x11 0.0136 0.023 0.592 0.554 0.031 0.059
x12 0.0099 0.018 0.538 0.591 0.046 0.026
x13 0.0076 0.054 0.141 0.888 0.098 0.113
x14 0.0094 0.012 0.783 0.433 0.014 0.033
x15 0.0029 0.002 1.361 0.174 0.001 0.007
x16 0.0078 0.009 0.861 0.390 0.010 0.026
x17 0.0107 0.019 0.573 0.567 0.047 0.026
x18 0.0062 0.018 0.342 0.732 0.042 0.029
x19 0.0095 0.017 0.552 0.581 0.024 0.043
x20 0.0111 0.004 2.853 0.004 0.003 0.019
x21 0.0109 0.018 0.617 0.537 0.024 0.046
x22 0.0139 0.006 2.165 0.030 0.026 0.001
x23 0.0024 0.005 0.475 0.635 0.008 0.012
x24 0.0022 0.001 1.644 0.100 0.000 0.005
x25 0.0125 0.012 1.027 0.305 0.036 0.011
x26 0.0006 0.000 1.782 0.075 0.001 5.74e05
x27 0.0011 0.001 1.749 0.080 0.002 0.000
x28 0.0012 0.003 0.487 0.626 0.004 0.006
x29 0.1824 0.089 2.059 0.040 0.009 0.356
x30 0.0288 0.025 1.153 0.249 0.078 0.020
x31 0.0128 0.011 1.206 0.228 0.034 0.008
x32 0.0046 0.008 0.603 0.547 0.020 0.010
x33 0.0071 0.005 1.460 0.145 0.017 0.002
x34 0.0131 0.012 1.124 0.261 0.010 0.036
x35 0.0023 0.001 2.580 0.010 0.004 0.001
x36 0.0077 0.013 0.605 0.545 0.033 0.017
x37 0.0069 0.004 1.916 0.055 0.000 0.014
x38 0.0015 0.001 2.568 0.010 0.003 0.000
x39 0.0002 0.002 0.110 0.912 0.005 0.004
x40 0.0109 0.017 0.632 0.528 0.045 0.023
x41 0.0142 0.017 0.821 0.412 0.048 0.020
x42 0.0217 0.017 1.257 0.209 0.012 0.056
x43 0.0123 0.102 0.121 0.904 0.188 0.213
x44 0.0441 0.018 2.503 0.012 0.010 0.079
x45 0.0406 0.018 2.308 0.021 0.006 0.075
x46 0.0410 0.018 2.338 0.020 0.075 0.007
x47 0.0035 0.003 1.304 0.192 0.002 0.009
==============================================================================
Omnibus: 42.820 Durbin-Watson: 1.966
Prob(Omnibus): 0.000 Jarque-Bera (JB): 54.973
Skew: 0.300 Prob(JB): 1.16e-12
Kurtosis: 3.649 Cond. No. 1.88e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.88e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
There's a lot to look at in the regression output (especially with this many features). For an explanation of all the different parts of the regression, take a look at this post. Below is a quick plot of predicted WS/48 against actual WS/48.
The blue line above is NOT the best-fit line. It's the identity line. I plot it to help visualize where the model fails. The model seems to fail primarily at the extremes: it tends to overestimate the worst players.
All in all, this model does a remarkably good job given its simplicity (linear regression), but it also leaves a lot of variance unexplained.
One reason this model might miss some variance is that there's more than one way to be a productive basketball player. For instance, Dwight Howard and Steph Curry find very different ways to contribute. One linear regression model is unlikely to successfully predict both players.
In a previous post, I grouped players according to their oncourt performance. These player groupings might help predict career performance.
Below, I will use the same player grouping I developed in my previous post, and examine how these groupings impact my ability to predict career performance.
See my other post for more details about this clustering procedure.
Let’s see how WS/48 varies across the groups.
Some groups perform better than others, but there’s lots of overlap between the groups. Importantly, each group has a fair amount of variability. Each group spans at least 0.15 WS/48. This gives the regression enough room to successfully predict performance in each group.
Now, let's get a bit of a refresher on what the groups are. Again, my previous post has a good description of these groups.
I’ve plotted the groups across a number of useful categories. For information about these categories see basketball reference’s glossary.
Here’s a quick rehash of the groupings. See my previous post for more detail.
On to the regression.
You might have noticed the giant condition number in the regression above. This indicates significant multicollinearity of the features, which isn’t surprising since I have many features that reflect the same abilities.
The multicollinearity doesn't prevent the regression model from making accurate predictions, but it does make the beta weight estimates erratic. With erratic beta weights, it's hard to tell whether the different clusters use different models when predicting career WS/48.
In the following regression, I put the predicting features through a PCA and keep only the first 10 components. Using only the first 10 components keeps the condition number below 20, indicating that multicollinearity is not a problem. I then examine whether the different groups exhibit different patterns of beta weights (whether different models predict the success of the different groups).
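A minimal numpy sketch of this idea (not the post's code): PCA via SVD, keeping 10 components, which brings the design matrix's condition number down even when the raw features contain a collinear column; the synthetic data and the deliberately duplicated column are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))
# append a near-duplicate of column 0 to create strong multicollinearity
X = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=500)])

# PCA via SVD on the centered data; keep the first 10 components
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Xc @ Vt[:10].T

# the component scores form a well-conditioned design matrix
print(np.linalg.cond(components))
```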
Above I plot the beta weights for each principal component across the groupings. This plot is a lot to look at, but I wanted to depict how the beta values change across the groups. They are not drastically different, but they're also not identical. Error bars depict 95% confidence intervals.
Below I fit a regression to each group, but with all the features. Again, multicollinearity will be a problem, but this will not decrease the regression’s accuracy, which is all I really care about.
The plots above depict each regression’s predictions against actual ws/48. I provide each model’s r^2 in the plot too.
Some regressions are better than others. For instance, the regression model does a pretty awesome job predicting the bench warmers… I wonder if this is because they have shorter careers… The regression model does not do a good job predicting the 3-point shooters.
Now onto the fun stuff though.
Below, I create a function for predicting a player's career WS/48. First, I write a function that finds what cluster a player belongs to and what the regression model predicts for this player's career (with 95% confidence intervals).
Here I create a function that creates a list of all the first round draft picks from a given year.
Below I create predictions for each first-round draft pick from 2015. The Spurs' first-round pick, Nikola Milutinov, has yet to play, so I do not create a prediction for him.
The plot above is ordered by draft pick. The error bars depict 95% confidence intervals… which are a little wider than I would like. It's interesting to look at what clusters these players fit into. Lots of 3-point shooters! It could be that rookies play a limited role in the offense, just shooting 3s.
As a T-Wolves fan, I am relatively happy about the high prediction for Karl-Anthony Towns. His predicted WS/48 is between Marc Gasol's and Elton Brand's. Again, the CIs are quite wide, so the model says there's a 95% chance he is somewhere between LeBron James and a player that averages less than 0.1 WS/48.
Karl-Anthony Towns would have the highest predicted WS/48 if it were not for Kevon Looney, whom the model loves. Kevon Looney has not seen much playing time, though, which likely makes his prediction more erratic. Keep in mind I did not use draft position as a predictor in the model.
Sam Dekker has a pretty huge error bar, likely because of his limited playing time this year.
While I fed a ton of features into this model, it's still just a linear regression. The simplicity of the model might prevent me from making more accurate predictions.
I've already started playing with some more complex models. If those work out well, I will post them here. I ended up sticking with a plain linear regression because my vast number of features is a little unwieldy in more complex models. If you're interested (and the models produce better results), check back in the future.
For now, these models explain between 40% and 70% of the variance in career WS/48 from only a player's rookie year. Even predicting 30% of the variance would be pretty remarkable, so I don't want to trash this part of the model. Explaining 65% of the variance is pretty awesome. The model gives us a pretty accurate idea of how these “bench players” will perform. For instance, the future does not look bright for players like Emmanuel Mudiay and Tyus Jones. That's not to say these players are doomed. The model assumes that players will retain their grouping for their entire careers. Emmanuel Mudiay and Tyus Jones might start performing more like distributors as their careers progress. This could result in better careers.
One nice part about this model is it tells us where the predictions are less confident. For instance, it is nice to know that we’re relatively confident when predicting bench players, but not when we’re predicting 3point shooters.
For those curious, I output each group's regression summary below.
OLS Regression Results
==============================================================================
Dep. Variable: WS/48 R-squared: 0.648
Model: OLS Adj. R-squared: 0.575
Method: Least Squares F-statistic: 8.939
Date: Sun, 20 Mar 2016 Prob (F-statistic): 2.33e-24
Time: 10:40:28 Log-Likelihood: 493.16
No. Observations: 212 AIC: -912.3
Df Residuals: 175 BIC: -788.1
Df Model: 36
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]

const 0.1072 0.064 1.682 0.094 0.233 0.019
x1 0.0012 0.001 0.925 0.356 0.001 0.004
x2 0.0005 0.000 2.355 0.020 0.001 7.53e05
x3 0.0005 0.000 1.899 0.059 0.001 2.03e05
x4 3.753e05 1.27e05 2.959 0.004 1.25e05 6.26e05
x5 0.1152 0.088 1.315 0.190 0.288 0.058
x6 0.0240 0.053 0.456 0.649 0.080 0.128
x7 0.4318 0.372 1.159 0.248 1.167 0.303
x8 0.0089 0.085 0.105 0.917 0.159 0.177
x9 0.0479 0.054 0.893 0.373 0.154 0.058
x10 0.0055 0.021 0.265 0.792 0.046 0.035
x11 0.0011 0.076 0.015 0.988 0.152 0.149
x12 0.0301 0.053 0.569 0.570 0.134 0.074
x13 0.7814 0.270 2.895 0.004 0.249 1.314
x14 0.0323 0.028 1.159 0.248 0.087 0.023
x15 0.0108 0.007 1.451 0.149 0.025 0.004
x16 0.0202 0.030 0.676 0.500 0.079 0.039
x17 0.0461 0.039 1.172 0.243 0.124 0.032
x18 0.0178 0.040 0.443 0.659 0.097 0.062
x19 0.0450 0.038 1.178 0.240 0.030 0.121
x20 0.0354 0.014 2.527 0.012 0.008 0.063
x21 0.0418 0.044 0.947 0.345 0.129 0.045
x22 0.0224 0.015 1.448 0.150 0.053 0.008
x23 0.0158 0.008 2.039 0.043 0.031 0.001
x24 0.0058 0.001 4.261 0.000 0.003 0.009
x25 0.0577 0.027 2.112 0.036 0.004 0.112
x26 0.1913 0.267 0.718 0.474 0.717 0.335
x27 0.0050 0.093 0.054 0.957 0.189 0.179
x28 0.0133 0.039 0.344 0.731 0.090 0.063
x29 0.0071 0.015 0.480 0.632 0.036 0.022
x30 0.0190 0.010 1.973 0.050 0.038 5.68e06
x31 0.0221 0.023 0.951 0.343 0.024 0.068
x32 0.0083 0.003 2.490 0.014 0.015 0.002
x33 0.0386 0.031 1.259 0.210 0.022 0.099
x34 0.0153 0.008 1.819 0.071 0.001 0.032
x35 1.734e05 0.001 0.014 0.989 0.002 0.002
x36 0.0033 0.004 0.895 0.372 0.004 0.011
==============================================================================
Omnibus: 2.457 Durbin-Watson: 2.144
Prob(Omnibus): 0.293 Jarque-Bera (JB): 2.475
Skew: 0.007 Prob(JB): 0.290
Kurtosis: 3.529 Cond. No. 1.78e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.78e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS Regression Results
==============================================================================
Dep. Variable: WS/48 R-squared: 0.443
Model: OLS Adj. R-squared: 0.340
Method: Least Squares F-statistic: 4.307
Date: Sun, 20 Mar 2016 Prob (F-statistic): 1.67e-11
Time: 10:40:28 Log-Likelihood: 447.99
No. Observations: 232 AIC: -822.0
Df Residuals: 195 BIC: -694.4
Df Model: 36
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]

const 0.0532 0.090 0.594 0.553 0.230 0.124
x1 0.0020 0.002 1.186 0.237 0.005 0.001
x2 0.0006 0.000 1.957 0.052 0.001 4.47e06
x3 0.0007 0.000 2.559 0.011 0.001 0.000
x4 5.589e05 1.39e05 4.012 0.000 2.84e05 8.34e05
x5 0.0386 0.093 0.414 0.679 0.145 0.222
x6 0.0721 0.051 1.407 0.161 0.173 0.029
x7 0.6259 0.571 1.097 0.274 1.751 0.499
x8 0.0653 0.079 0.822 0.412 0.222 0.091
x9 0.0756 0.051 1.485 0.139 0.025 0.176
x10 0.0046 0.031 0.149 0.881 0.066 0.057
x11 0.0365 0.066 0.554 0.580 0.166 0.093
x12 0.0679 0.051 1.332 0.185 0.033 0.169
x13 0.0319 0.183 0.174 0.862 0.329 0.393
x14 0.0106 0.040 0.262 0.793 0.069 0.090
x15 0.0232 0.017 1.357 0.176 0.057 0.011
x16 0.1121 0.039 2.869 0.005 0.189 0.035
x17 0.0675 0.060 1.134 0.258 0.185 0.050
x18 0.0314 0.059 0.536 0.593 0.147 0.084
x19 0.0266 0.055 0.487 0.627 0.081 0.134
x20 0.0259 0.009 2.827 0.005 0.008 0.044
x21 0.0155 0.050 0.307 0.759 0.115 0.084
x22 0.1170 0.051 2.281 0.024 0.016 0.218
x23 0.0157 0.014 1.102 0.272 0.044 0.012
x24 0.0021 0.003 0.732 0.465 0.003 0.008
x25 0.0012 0.038 0.032 0.974 0.077 0.075
x26 0.8379 0.524 1.599 0.111 0.196 1.871
x27 0.0511 0.113 0.454 0.651 0.273 0.171
x28 0.0944 0.111 0.852 0.395 0.124 0.313
x29 0.0018 0.029 0.061 0.951 0.059 0.055
x30 0.0167 0.017 0.969 0.334 0.051 0.017
x31 0.0377 0.044 0.854 0.394 0.049 0.125
x32 0.0052 0.002 2.281 0.024 0.010 0.001
x33 0.0132 0.037 0.360 0.719 0.059 0.086
x34 0.0650 0.028 2.356 0.019 0.119 0.011
x35 0.0012 0.002 0.668 0.505 0.005 0.002
x36 0.0087 0.008 1.107 0.270 0.007 0.024
==============================================================================
Omnibus: 2.161 Durbin-Watson: 2.000
Prob(Omnibus): 0.339 Jarque-Bera (JB): 1.942
Skew: 0.222 Prob(JB): 0.379
Kurtosis: 3.067 Cond. No. 3.94e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.94e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS Regression Results
==============================================================================
Dep. Variable: WS/48 R-squared: 0.358
Model: OLS Adj. R-squared: 0.270
Method: Least Squares F-statistic: 4.050
Date: Sun, 20 Mar 2016 Prob (F-statistic): 1.93e-11
Time: 10:40:28 Log-Likelihood: 645.12
No. Observations: 298 AIC: -1216.
Df Residuals: 261 BIC: -1079.
Df Model: 36
Covariance Type: nonrobust
==============================================================================
coef std err t P>t [95.0% Conf. Int.]

const 0.0306 0.040 0.763 0.446 0.048 0.110
x1 0.0013 0.001 1.278 0.202 0.003 0.001
x2 0.0003 0.000 1.889 0.060 0.001 1.39e05
x3 0.0002 0.000 1.196 0.233 0.001 0.000
x4 2.388e05 8.83e06 2.705 0.007 6.5e06 4.13e05
x5 0.0643 0.089 0.724 0.470 0.239 0.111
x6 0.0131 0.046 0.286 0.775 0.077 0.103
x7 0.4703 0.455 1.034 0.302 1.366 0.426
x8 0.0194 0.089 0.219 0.827 0.155 0.194
x9 0.0330 0.052 0.638 0.524 0.135 0.069
x10 0.0221 0.013 1.754 0.081 0.047 0.003
x11 0.0161 0.074 0.216 0.829 0.130 0.162
x12 0.0228 0.047 0.489 0.625 0.115 0.069
x13 0.2619 0.423 0.620 0.536 0.570 1.094
x14 0.0303 0.027 1.136 0.257 0.083 0.022
x15 0.0023 0.003 0.895 0.372 0.007 0.003
x16 0.0005 0.023 0.021 0.983 0.045 0.046
x17 0.0206 0.040 0.513 0.608 0.059 0.100
x18 0.0507 0.040 1.271 0.205 0.028 0.129
x19 0.0349 0.037 0.942 0.347 0.108 0.038
x20 0.0210 0.017 1.252 0.212 0.012 0.054
x21 0.0400 0.041 0.964 0.336 0.042 0.122
x22 0.0239 0.009 2.530 0.012 0.042 0.005
x23 0.0140 0.008 1.683 0.094 0.030 0.002
x24 0.0045 0.001 4.594 0.000 0.003 0.006
x25 0.0264 0.026 1.004 0.316 0.025 0.078
x26 0.2730 0.169 1.615 0.107 0.060 0.606
x27 0.0208 0.187 0.111 0.912 0.389 0.348
x28 0.0007 0.015 0.051 0.959 0.029 0.028
x29 0.0168 0.018 0.917 0.360 0.019 0.053
x30 0.0059 0.011 0.524 0.601 0.016 0.028
x31 0.0196 0.028 0.711 0.478 0.074 0.035
x32 0.0035 0.004 0.899 0.370 0.011 0.004
x33 0.0246 0.029 0.858 0.392 0.081 0.032
x34 0.0145 0.005 2.903 0.004 0.005 0.024
x35 0.0017 0.001 1.442 0.150 0.004 0.001
x36 0.0069 0.005 1.514 0.131 0.002 0.016
==============================================================================
Omnibus: 5.509 DurbinWatson: 1.845
Prob(Omnibus): 0.064 JarqueBera (JB): 5.309
Skew: 0.272 Prob(JB): 0.0703
Kurtosis: 3.362 Cond. No. 3.70e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.7e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS Regression Results
==============================================================================
Dep. Variable: WS/48 R-squared: 0.304
Model: OLS Adj. R-squared: 0.248
Method: Least Squares F-statistic: 5.452
Date: Sun, 20 Mar 2016 Prob (F-statistic): 4.41e-19
Time: 10:40:28 Log-Likelihood: 1030.4
No. Observations: 486 AIC: -1987.
Df Residuals: 449 BIC: -1832.
Df Model: 36
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]

const 0.1082 0.033 3.280 0.001 0.043 0.173
x1 0.0018 0.001 2.317 0.021 0.003 0.000
x2 0.0005 0.000 3.541 0.000 0.001 0.000
x3 4.431e05 0.000 0.359 0.720 0.000 0.000
x4 1.71e05 6.08e06 2.813 0.005 5.15e06 2.9e05
x5 0.0257 0.044 0.580 0.562 0.061 0.113
x6 0.0133 0.029 0.464 0.643 0.043 0.070
x7 0.5271 0.357 1.476 0.141 1.229 0.175
x8 0.0415 0.038 1.090 0.277 0.033 0.116
x9 0.0117 0.029 0.409 0.682 0.068 0.044
x10 0.0031 0.018 0.171 0.865 0.032 0.038
x11 0.0253 0.031 0.819 0.413 0.035 0.086
x12 0.0196 0.028 0.687 0.492 0.076 0.036
x13 0.0360 0.067 0.535 0.593 0.096 0.168
x14 0.0096 0.021 0.461 0.645 0.031 0.050
x15 0.0101 0.009 1.165 0.245 0.007 0.027
x16 0.0227 0.015 1.556 0.120 0.006 0.051
x17 0.0413 0.034 1.198 0.232 0.026 0.109
x18 0.0195 0.031 0.623 0.533 0.042 0.081
x19 0.0267 0.029 0.906 0.366 0.085 0.031
x20 0.0199 0.008 2.652 0.008 0.005 0.035
x21 0.0442 0.033 1.325 0.186 0.110 0.021
x22 0.0232 0.025 0.946 0.345 0.025 0.072
x23 0.0085 0.009 0.976 0.330 0.009 0.026
x24 0.0025 0.001 1.782 0.075 0.000 0.005
x25 0.0200 0.019 1.042 0.298 0.058 0.018
x26 0.4937 0.331 1.491 0.137 0.157 1.144
x27 0.1406 0.074 1.907 0.057 0.286 0.004
x28 0.0638 0.049 1.304 0.193 0.160 0.032
x29 0.0252 0.015 1.690 0.092 0.055 0.004
x30 0.0217 0.008 2.668 0.008 0.038 0.006
x31 0.0483 0.020 2.387 0.017 0.009 0.088
x32 0.0036 0.002 2.159 0.031 0.007 0.000
x33 0.0388 0.023 1.681 0.094 0.007 0.084
x34 0.0105 0.011 0.923 0.357 0.033 0.012
x35 0.0028 0.001 1.966 0.050 0.006 1.59e06
x36 0.0017 0.003 0.513 0.608 0.008 0.005
==============================================================================
Omnibus: 5.317 DurbinWatson: 2.030
Prob(Omnibus): 0.070 JarqueBera (JB): 5.115
Skew: 0.226 Prob(JB): 0.0775
Kurtosis: 3.221 Cond. No. 4.51e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.51e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS Regression Results
==============================================================================
Dep. Variable: WS/48 R-squared: 0.455
Model: OLS Adj. R-squared: 0.378
Method: Least Squares F-statistic: 5.852
Date: Sun, 20 Mar 2016 Prob (F-statistic): 4.77e-18
Time: 10:40:28 Log-Likelihood: 631.81
No. Observations: 289 AIC: -1190.
Df Residuals: 252 BIC: -1054.
Df Model: 36
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]

const 0.1755 0.096 1.827 0.069 0.014 0.365
x1 0.0031 0.001 2.357 0.019 0.006 0.001
x2 0.0005 0.000 2.424 0.016 0.001 8.68e05
x3 0.0003 0.000 2.154 0.032 0.001 2.9e05
x4 2.374e05 8.35e06 2.842 0.005 7.29e06 4.02e05
x5 0.0391 0.070 0.556 0.579 0.099 0.177
x6 0.0672 0.040 1.662 0.098 0.012 0.147
x7 0.9503 0.458 2.075 0.039 0.048 1.852
x8 0.0013 0.061 0.021 0.983 0.122 0.119
x9 0.0270 0.041 0.659 0.510 0.108 0.054
x10 0.0072 0.017 0.426 0.671 0.041 0.026
x11 0.0604 0.056 1.083 0.280 0.049 0.170
x12 0.0723 0.041 1.782 0.076 0.152 0.008
x13 1.2499 0.392 3.186 0.002 2.022 0.477
x14 0.0502 0.028 1.776 0.077 0.005 0.106
x15 0.0048 0.011 0.456 0.649 0.016 0.026
x16 0.0637 0.042 1.530 0.127 0.146 0.018
x17 0.0042 0.038 0.112 0.911 0.070 0.078
x18 0.0318 0.038 0.830 0.408 0.044 0.107
x19 0.0220 0.037 0.602 0.548 0.094 0.050
x20 4.535e05 0.009 0.005 0.996 0.018 0.018
x21 0.0176 0.040 0.440 0.660 0.097 0.061
x22 0.0244 0.021 1.182 0.238 0.065 0.016
x23 0.0135 0.012 1.128 0.260 0.010 0.037
x24 0.0024 0.002 1.355 0.177 0.001 0.006
x25 0.0418 0.026 1.583 0.115 0.094 0.010
x26 0.3619 0.328 1.105 0.270 0.283 1.007
x27 0.0090 0.186 0.049 0.961 0.358 0.376
x28 0.0613 0.057 1.068 0.286 0.174 0.052
x29 0.0124 0.016 0.779 0.436 0.019 0.044
x30 0.0042 0.011 0.379 0.705 0.018 0.026
x31 0.0108 0.026 0.412 0.681 0.062 0.041
x32 0.0014 0.002 0.588 0.557 0.003 0.006
x33 0.0195 0.029 0.672 0.502 0.038 0.077
x34 0.0168 0.011 1.554 0.121 0.004 0.038
x35 0.0026 0.002 1.227 0.221 0.007 0.002
x36 0.0072 0.004 1.958 0.051 0.014 4.02e05
==============================================================================
Omnibus: 4.277 DurbinWatson: 1.995
Prob(Omnibus): 0.118 JarqueBera (JB): 4.056
Skew: 0.226 Prob(JB): 0.132
Kurtosis: 3.364 Cond. No. 4.24e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.24e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS Regression Results
==============================================================================
Dep. Variable: WS/48 R-squared: 0.476
Model: OLS Adj. R-squared: 0.337
Method: Least Squares F-statistic: 3.431
Date: Sun, 20 Mar 2016 Prob (F-statistic): 1.19e-07
Time: 10:40:28 Log-Likelihood: 330.36
No. Observations: 173 AIC: -586.7
Df Residuals: 136 BIC: -470.1
Df Model: 36
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]

const 0.1822 0.262 0.696 0.488 0.335 0.700
x1 0.0011 0.002 0.491 0.624 0.005 0.003
x2 0.0001 0.000 0.310 0.757 0.001 0.001
x3 6.743e05 0.000 0.220 0.827 0.001 0.001
x4 5.819e06 1.63e05 0.357 0.722 2.65e05 3.81e05
x5 0.0618 0.122 0.507 0.613 0.179 0.303
x6 0.0937 0.074 1.272 0.206 0.052 0.240
x7 0.8422 0.919 0.917 0.361 0.975 2.659
x8 0.1109 0.111 1.001 0.319 0.330 0.108
x9 0.1334 0.075 1.767 0.079 0.283 0.016
x10 0.0357 0.024 1.500 0.136 0.083 0.011
x11 0.1373 0.103 1.335 0.184 0.341 0.066
x12 0.1002 0.075 1.329 0.186 0.249 0.049
x13 0.2963 0.616 0.481 0.631 1.515 0.922
x14 0.0278 0.047 0.588 0.557 0.121 0.066
x15 0.0099 0.015 0.661 0.510 0.040 0.020
x16 0.1532 0.106 1.444 0.151 0.057 0.363
x17 0.1569 0.072 2.168 0.032 0.300 0.014
x18 0.1633 0.068 2.385 0.018 0.299 0.028
x19 0.1550 0.066 2.356 0.020 0.025 0.285
x20 0.0114 0.017 0.688 0.492 0.044 0.021
x21 0.0130 0.076 0.170 0.865 0.164 0.138
x22 0.0202 0.024 0.857 0.393 0.067 0.026
x23 0.0203 0.028 0.737 0.462 0.075 0.034
x24 0.0023 0.004 0.608 0.544 0.010 0.005
x25 0.0546 0.048 1.141 0.256 0.040 0.149
x26 1.0180 0.714 1.426 0.156 2.430 0.394
x27 0.3371 0.203 1.664 0.098 0.064 0.738
x28 0.1286 0.140 0.916 0.361 0.149 0.406
x29 0.0561 0.035 1.607 0.110 0.125 0.013
x30 0.0535 0.020 2.645 0.009 0.093 0.013
x31 0.1169 0.051 2.305 0.023 0.017 0.217
x32 0.0039 0.004 1.030 0.305 0.004 0.011
x33 0.0179 0.055 0.324 0.746 0.091 0.127
x34 0.0081 0.013 0.632 0.529 0.017 0.033
x35 0.0013 0.006 0.229 0.819 0.010 0.013
x36 0.0068 0.007 1.045 0.298 0.020 0.006
==============================================================================
Omnibus: 2.969 DurbinWatson: 2.098
Prob(Omnibus): 0.227 JarqueBera (JB): 2.526
Skew: 0.236 Prob(JB): 0.283
Kurtosis: 3.357 Cond. No. 6.96e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.96e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In this post I show how to execute repeated measures ANOVAs using the rpy2 library, which allows us to move data between Python and R and to execute R commands from Python. I use rpy2 to load the car library and run the ANOVAs.
I will show how to run a one-way repeated measures ANOVA and a two-way repeated measures ANOVA.
Below I use the random library to generate some fake data. I seed the random number generator with one so that this analysis can be replicated.
I will generate 3 conditions which represent 3 levels of a single variable.
The data are generated from a Gaussian distribution. The second condition has a higher mean than the other two conditions.
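A minimal sketch of this kind of data generation, using numpy for convenience (the sample size, group means, and standard deviation here are stand-ins, not the post's actual values):

```python
import numpy as np
import pandas as pd

np.random.seed(1)  # seed with one so the analysis can be replicated

n_subjects = 30  # stand-in sample size
# three levels of a single variable; the second condition has a higher mean
df = pd.DataFrame({
    'cond_1': np.random.normal(600, 30, n_subjects),
    'cond_2': np.random.normal(650, 30, n_subjects),
    'cond_3': np.random.normal(600, 30, n_subjects),
})
```

This wide-format dataframe (one column per condition) is the shape that car's repeated measures machinery expects once the data is handed to R through rpy2.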
Next, I load rpy2 for ipython. I am doing these analyses with ipython in a jupyter notebook (highly recommended).
Here’s how to run the ANOVA. Note that this is a one-way ANOVA with 3 levels of the factor.
Type III Repeated Measures MANOVA Tests:

Term: (Intercept)
Response transformation matrix:
(Intercept)
cond_1 1
cond_2 1
cond_3 1
Sum of squares and products for the hypothesis:
(Intercept)
(Intercept) 102473990
Sum of squares and products for error:
(Intercept)
(Intercept) 78712.7
Multivariate Tests: (Intercept)
Df test stat approx F num Df den Df Pr(>F)
Pillai 1 0.9992 37754.33 1 29 < 2.22e-16 ***
Wilks 1 0.0008 37754.33 1 29 < 2.22e-16 ***
Hotelling-Lawley 1 1301.8736 37754.33 1 29 < 2.22e-16 ***
Roy 1 1301.8736 37754.33 1 29 < 2.22e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Term: Factor
Response transformation matrix:
Factor1 Factor2
cond_1 1 0
cond_2 0 1
cond_3 1 1
Sum of squares and products for the hypothesis:
Factor1 Factor2
Factor1 3679.584 19750.87
Factor2 19750.870 106016.58
Sum of squares and products for error:
Factor1 Factor2
Factor1 40463.19 27139.59
Factor2 27139.59 51733.12
Multivariate Tests: Factor
Df test stat approx F num Df den Df Pr(>F)
Pillai 1 0.7152596 35.16759 2 28 2.303e-08 ***
Wilks 1 0.2847404 35.16759 2 28 2.303e-08 ***
Hotelling-Lawley 1 2.5119704 35.16759 2 28 2.303e-08 ***
Roy 1 2.5119704 35.16759 2 28 2.303e-08 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Univariate Type III Repeated-Measures ANOVA Assuming Sphericity
SS num Df Error SS den Df F Pr(>F)
(Intercept) 34157997 1 26238 29 37754.334 < 2.2e-16 ***
Factor 59964 2 43371 58 40.094 1.163e-11 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Mauchly Tests for Sphericity
Test statistic p-value
Factor 0.96168 0.57866
Greenhouse-Geisser and Huynh-Feldt Corrections
for Departure from Sphericity
GG eps Pr(>F[GG])
Factor 0.96309 2.595e-11 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
HF eps Pr(>F[HF])
Factor 1.03025 1.163294e-11
The ANOVA table isn’t pretty, but it works. As you can see, the ANOVA was wildly significant.
Next, I generate data for a two-way (2x3) repeated measures ANOVA. Condition A is the same data as above. Condition B has a different pattern (level 2 is lower than levels 1 and 3), which should produce an interaction.
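A sketch of the two-way data along the same lines (again with stand-in parameters; level 2 of condition B is shifted down, which should produce the interaction):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
n_subjects = 30  # stand-in sample size

df = pd.DataFrame({
    # condition A: level 2 has the highest mean, as before
    'cond_1a': np.random.normal(600, 30, n_subjects),
    'cond_2a': np.random.normal(650, 30, n_subjects),
    'cond_3a': np.random.normal(600, 30, n_subjects),
    # condition B: level 2 has the lowest mean, producing an interaction
    'cond_1b': np.random.normal(600, 30, n_subjects),
    'cond_2b': np.random.normal(550, 30, n_subjects),
    'cond_3b': np.random.normal(600, 30, n_subjects),
})
```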
Type III Repeated Measures MANOVA Tests:

Term: (Intercept)
Response transformation matrix:
(Intercept)
cond_1a 1
cond_2a 1
cond_3a 1
cond_1b 1
cond_2b 1
cond_3b 1
Sum of squares and products for the hypothesis:
(Intercept)
(Intercept) 401981075
Sum of squares and products for error:
(Intercept)
(Intercept) 185650.5
Multivariate Tests: (Intercept)
Df test stat approx F num Df den Df Pr(>F)
Pillai 1 0.9995 62792.47 1 29 < 2.22e-16 ***
Wilks 1 0.0005 62792.47 1 29 < 2.22e-16 ***
Hotelling-Lawley 1 2165.2575 62792.47 1 29 < 2.22e-16 ***
Roy 1 2165.2575 62792.47 1 29 < 2.22e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Term: Factor1
Response transformation matrix:
Factor11
cond_1a 1
cond_2a 1
cond_3a 1
cond_1b 1
cond_2b 1
cond_3b 1
Sum of squares and products for the hypothesis:
Factor11
Factor11 38581.51
Sum of squares and products for error:
Factor11
Factor11 142762.3
Multivariate Tests: Factor1
Df test stat approx F num Df den Df Pr(>F)
Pillai 1 0.2127533 7.837247 1 29 0.0090091 **
Wilks 1 0.7872467 7.837247 1 29 0.0090091 **
Hotelling-Lawley 1 0.2702499 7.837247 1 29 0.0090091 **
Roy 1 0.2702499 7.837247 1 29 0.0090091 **

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Term: Factor2
Response transformation matrix:
Factor21 Factor22
cond_1a 1 0
cond_2a 0 1
cond_3a 1 1
cond_1b 1 0
cond_2b 0 1
cond_3b 1 1
Sum of squares and products for the hypothesis:
Factor21 Factor22
Factor21 91480.01 77568.78
Factor22 77568.78 65773.02
Sum of squares and products for error:
Factor21 Factor22
Factor21 90374.60 56539.06
Factor22 56539.06 87589.85
Multivariate Tests: Factor2
Df test stat approx F num Df den Df Pr(>F)
Pillai 1 0.5235423 15.38351 2 28 3.107e-05 ***
Wilks 1 0.4764577 15.38351 2 28 3.107e-05 ***
Hotelling-Lawley 1 1.0988223 15.38351 2 28 3.107e-05 ***
Roy 1 1.0988223 15.38351 2 28 3.107e-05 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Term: Factor1:Factor2
Response transformation matrix:
Factor11:Factor21 Factor11:Factor22
cond_1a 1 0
cond_2a 0 1
cond_3a 1 1
cond_1b 1 0
cond_2b 0 1
cond_3b 1 1
Sum of squares and products for the hypothesis:
Factor11:Factor21 Factor11:Factor22
Factor11:Factor21 179585.9 384647
Factor11:Factor22 384647.0 823858
Sum of squares and products for error:
Factor11:Factor21 Factor11:Factor22
Factor11:Factor21 92445.33 45639.49
Factor11:Factor22 45639.49 89940.37
Multivariate Tests: Factor1:Factor2
Df test stat approx F num Df den Df Pr(>F)
Pillai 1 0.901764 128.5145 2 28 7.7941e-15 ***
Wilks 1 0.098236 128.5145 2 28 7.7941e-15 ***
Hotelling-Lawley 1 9.179605 128.5145 2 28 7.7941e-15 ***
Roy 1 9.179605 128.5145 2 28 7.7941e-15 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Univariate Type III Repeated-Measures ANOVA Assuming Sphericity
SS num Df Error SS den Df F Pr(>F)
(Intercept) 66996846 1 30942 29 62792.4662 < 2.2e-16 ***
Factor1 6430 1 23794 29 7.8372 0.009009 **
Factor2 26561 2 40475 58 19.0310 4.42e-07 ***
Factor1:Factor2 206266 2 45582 58 131.2293 < 2.2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Mauchly Tests for Sphericity
Test statistic p-value
Factor2 0.96023 0.56654
Factor1:Factor2 0.99975 0.99648
Greenhouse-Geisser and Huynh-Feldt Corrections
for Departure from Sphericity
GG eps Pr(>F[GG])
Factor2 0.96175 6.876e-07 ***
Factor1:Factor2 0.99975 < 2.2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
HF eps Pr(>F[HF])
Factor2 1.028657 4.420005e-07
Factor1:Factor2 1.073774 2.965002e-22
Again, the ANOVA table isn’t too pretty.
This obviously isn’t the most exciting post in the world, but it’s a nice bit of code to have in your back pocket if you’re doing experimental analyses in Python.
To answer this question, I will look at how NBA players “group” together. For example, there might be a group of players who collect lots of rebounds, shoot poorly from behind the 3 point line, and block lots of shots. I might call these players forwards. If we allow player performance to create groups, what will these groups look like?
To group players, I will use K-means clustering (https://en.wikipedia.org/wiki/K-means_clustering).
When choosing a clustering algorithm, it’s important to think about how the algorithm defines clusters. K-means minimizes the distance between data points (players in my case) and the centers of K different clusters. Because distance is measured between each point and its cluster’s center, K-means assumes clusters are spherical. When thinking about clusters of NBA players, do I think these clusters will be spherical? If not, then I might want to try a different clustering algorithm.
For now, I will assume generally spherical clusters and use K-means. At the end of this post, I will comment on whether this assumption seems valid.
We need data. Collecting the data requires a couple of steps. First, I will create a matrix of all players who ever played in the NBA (via the NBA.com API).
In the 1979-1980 season, the NBA introduced the 3-point line. The 3-point line has dramatically changed basketball, so players performed differently before it existed. While this change in play was not instantaneous, it does not make sense to include players from before the 3-point era.
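The filter itself is a one-line pandas operation. A toy sketch (the `TO_YEAR` column and the rows are my stand-ins; the real list comes from the NBA.com API):

```python
import pandas as pd

# stand-in player list; the real one comes from the NBA.com API
players = pd.DataFrame({
    'PLAYER_ID': [1, 2, 3],
    'Name': ['Old Timer', 'Straddler', 'Modern Player'],
    'TO_YEAR': [1975, 1985, 2015],
})

# keep only players whose careers reached the 3-point era (1979-1980 on)
players = players[players['TO_YEAR'] >= 1980].reset_index(drop=True)
```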
I have a list of all the players after 1979, but I want data about all these players. When grouping the players, I am not interested in how much a player played. Instead, I want to know HOW a player played. To remove variability associated with playing time, I will gather data that is standardized to 36 minutes of play. For example, if a player averages 4 points and 12 minutes a game, this player averages 12 points per 36 minutes.
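The per-36 conversion itself is just a rescaling. As a quick sketch (the helper name is mine):

```python
def per_36(stat, minutes_per_game):
    """Scale a per-game stat to a 36-minute basis."""
    return stat * 36.0 / minutes_per_game

# a player averaging 4 points in 12 minutes scores 12 points per 36
points_per_36 = per_36(4, 12)
```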
Below, I have written a function that will collect every player’s performance per 36 minutes. The function collects data one player at a time, so it’s VERY slow. If you want the data, it can be found on my github (https://github.com/dvatterott/nba_project).
Index([u'PLAYER_ID', u'LEAGUE_ID', u'TEAM_ID', u'GP', u'GS',
u'MIN', u'FGM', u'FGA', u'FG_PCT', u'FG3M',
u'FG3A', u'FG3_PCT', u'FTM', u'FTA', u'FT_PCT',
u'OREB', u'DREB', u'REB', u'AST', u'STL',
u'BLK', u'TOV', u'PF', u'PTS'],
dtype='object')
Great! Now we have data that is scaled to 36 minutes of play (per-36 data) for every player between 1979 and 2016. Above, I printed out the columns. I don’t want all this data. For instance, I do not care about how many minutes a player played. Also, some of the data is redundant: if I know a player’s field goal attempts (FGA) and field goal percentage (FG_PCT), I can calculate the number of made field goals (FGM). I removed the columns that seemed redundant, because I do not want redundant data exercising too much influence on the grouping process.
Below, I create new columns for 2-point field goal attempts and 2-point field goal percentage. I also remove all players who played fewer than 50 games, since these players have not had the opportunity to establish consistent performance.
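In pandas, deriving those columns and applying the games-played filter might look like this (the little table is a stand-in for the real per-36 data):

```python
import pandas as pd

# stand-in per-36 stats; the real table comes from the collection step above
df = pd.DataFrame({
    'GP':   [82, 30, 60],
    'FGM':  [8.0, 5.0, 6.0],
    'FGA':  [16.0, 12.0, 14.0],
    'FG3M': [2.0, 1.0, 0.0],
    'FG3A': [5.0, 4.0, 1.0],
})

# 2-point attempts and percentage, derived from the total and 3-point numbers
df['FG2A'] = df['FGA'] - df['FG3A']
df['FG2_PCT'] = (df['FGM'] - df['FG3M']) / df['FG2A']

# drop players with fewer than 50 games
df = df[df['GP'] >= 50].reset_index(drop=True)
```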
It’s always important to visualize the data, so let’s get an idea of what we’re working with!
The plot below is called a scatter matrix. This type of plot will appear again, so let’s go through it carefully. Each subplot has the feature (stat) labeled on its row, which serves as its y-axis; the column feature serves as the x-axis. For example, the subplot in the second column of the first row plots 3-point field goal attempts by 3-point field goal percentage. As you can see, players that have higher 3-point percentages tend to take more 3-pointers… makes sense.
On the diagonals, I plot the kernel density estimate of the sample histogram. More players fall into areas where the line is higher on the y-axis. For instance, no players shoot better than ~45% from behind the 3-point line.
One interesting thing about scatter matrices is that the plots below the diagonal are a reflection of the plots above the diagonal. For example, the data in the second column of the first row and the first column of the second row are the same; the only difference is that the axes have switched.
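A scatter matrix with kernel density estimates on the diagonal is one call in pandas. A sketch on stand-in data (the real plot uses the per-36 stat columns):

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.RandomState(0)
# stand-in data with a few stat-like columns
df = pd.DataFrame(rng.rand(100, 3), columns=['FG3A', 'FG3_PCT', 'REB'])

# scatter plots off the diagonal, KDEs on the diagonal
axes = scatter_matrix(df, figsize=(8, 8), diagonal='kde')
```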
There are a couple of things to note in the graph above. First, there’s a TON of information there. Second, it looks like there are some strong correlations. For example, look at the subplots depicting offensive rebounds by defensive rebounds.
While I tried to throw out redundant data, I clearly did not throw out all of it. For example, players that are good 3-point shooters are probably also good free throw shooters. These players are simply good shooters, and being a good shooter contributes to multiple data columns above.
When I group the data, I do not want a single ability such as shooting to contribute too much. I want to group players equally according to all their abilities. Below I use principal component analysis (PCA) to separate the variance associated with the different “components” (e.g., shooting ability) of basketball performance.
For an explanation of PCA I recommend this link: https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/.
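A sketch of the PCA step with scikit-learn (stand-in data; standardizing first keeps any one stat from dominating, and the 5%-of-variance cutoff mirrors the one used later in the post):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(200, 13)  # stand-in for the 13 per-36 stat columns

# put every feature on the same scale before PCA
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
cum_var = np.cumsum(pca.explained_variance_ratio_)  # cumulative variance explained

# keep every component explaining more than 5% of the variance
n_keep = int(np.sum(pca.explained_variance_ratio_ > 0.05))
scores = PCA(n_components=n_keep).fit_transform(X_std)  # players' component scores
```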
On the left, I plot the amount of variance explained after including each additional PCA component. Using all the components explains all the variability, but notice how little the last few components contribute. It doesn’t make sense to include a component that only explains 1% of the variability… but how many components should we include!?
I chose to include the first 5 components because no component after the 5th explained more than 5% of the variance. This part of the analysis is admittedly arbitrary, but 5% is a relatively conservative cutoff.
Below is the fun part of the data. We get to look at what features contribute to the different principal components.
One thing to keep in mind here is that each component explains less variance than the last. So while 3-point shooting contributes to both the 1st and 5th components, more 3-point shooting variability is probably explained by the 1st component.
It would be great if we had a PCA component that was only shooting and another that was only rebounding, since we typically conceive of these as different skills. Yet, there are multiple aspects of each skill. For example, a 3-point shooter not only has to be a dead-eye shooter, but also has to find ways to get open. Additionally, being good at “getting open” might be something akin to basketball IQ, which would also contribute to assists and steals!
Cool, we have our 5 PCA components. Now let’s transform the data into our 5-component PCA space (from our 13-feature space: FG3A, FG3_PCT, etc.). To do this, we give each player a score on each of the 5 PCA components.
Next, I want to see how players cluster together based on their scores on these components. First, let’s investigate how using more or fewer clusters (i.e., groups) explains different amounts of variance.
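A sketch of that sweep over cluster counts with scikit-learn (`make_blobs` stands in for the players' 5-component PCA scores):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# stand-in for the players' 5-component PCA scores
X, _ = make_blobs(n_samples=300, n_features=5, centers=6, random_state=1)

ks = range(2, 10)
inertias, sil_scores = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares; shrinks as k grows
    sil_scores.append(silhouette_score(X, km.labels_))

best_k = list(ks)[int(np.argmax(sil_scores))]  # k with the best silhouette
```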
As you can see in the left-hand plot, adding more clusters explains more of the variance, but there are diminishing returns. Each additional cluster explains a little less variance than the last (much like each PCA component explained less variance than the previous component).
The particularly interesting point here is the point where the second derivative is greatest, when the amount of change changes the most (the elbow). The elbow occurs at the 6th cluster.
Perhaps not coincidentally, 6 clusters also has the highest silhouette score (right-hand plot). The silhouette score compares the average distance between a player and all other players in that player’s cluster with the distance between the player and all players in the next nearest cluster. Silhouette scores range between -1 and 1 (where -1 means the player is in the wrong cluster, 0 means the clusters completely overlap, and 1 means the clusters are extremely well separated).
Six clusters has the highest silhouette score at 0.19. 0.19 is not great, and suggests a different clustering algorithm might be better. More on this later.
Because 6 clusters is the elbow and has the highest silhouette score, I will use 6 clusters in my grouping analysis. Okay, now that I’ve decided on 6 clusters, let’s see what players fall into which clusters!
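Fitting the final 6-cluster model is then a single call (again on stand-in data):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# stand-in for the players' 5-component PCA scores
X, _ = make_blobs(n_samples=300, n_features=5, centers=6, random_state=1)

km = KMeans(n_clusters=6, n_init=10, random_state=1).fit(X)
labels = km.labels_              # one cluster id per player
centroids = km.cluster_centers_  # six centers in the 5-D component space
```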
Awesome. Now let’s see how all the clusters look. These clusters were created in 5-dimensional space, which is not easy to visualize. Below I plot another scatter matrix, which lets us visualize the clusters in different 2D combinations of the 5D space.
In the plot above, I mark the center of a given cluster with an X. For example, Cluster 0 and Cluster 5 are both high in outside skills. Cluster 5 is also high in rim scoring, but low in pure points.
Below I look at the players in each cluster. The first thing I do is identify the player closest to the cluster’s center. I call this player the prototype; it is the player that most exemplifies a cluster.
I then show a picture of this player because… well, I wanted to see who these players were. I print out this player’s stats and the cluster’s centroid location. Finally, I print out the first ten players in this cluster. These are the first ten players alphabetically, not the ten players closest to the cluster center.
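Finding a cluster's prototype amounts to picking the member closest to the centroid. A sketch (the helper name is mine; `make_blobs` again stands in for the component scores):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# stand-in for the players' 5-component PCA scores
X, _ = make_blobs(n_samples=300, n_features=5, centers=6, random_state=1)
km = KMeans(n_clusters=6, n_init=10, random_state=1).fit(X)

def prototype(cluster, X, km):
    """Return the index of the point closest to a cluster's centroid."""
    members = np.where(km.labels_ == cluster)[0]
    dists = np.linalg.norm(X[members] - km.cluster_centers_[cluster], axis=1)
    return int(members[np.argmin(dists)])

proto_idx = prototype(0, X, km)  # the "most cluster 0" player
```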
Outside Skills  Rim Scoring  Pure Points  Defensive Big Man  Dead Eye 

0.830457  -0.930833  0.28203  0.054093  0.43606
16 Afflalo, Arron
20 Ainge, Danny
40 Allen, Ray
46 Alston, Rafer
50 Aminu, Al-Farouq
53 Andersen, David
54 Anderson, Alan
56 Anderson, Derek
60 Anderson, James
63 Anderson, Kyle
Name: Name, dtype: object
First, let me mention that cluster number is a purely categorical variable, not an ordinal one. If you run this analysis, you will likely create clusters with similar players, but in a different order. For example, your cluster 1 might be my cluster 0.
Cluster 0 has the most players (25%; about 490 of the 1965 in this cluster analysis) and is red in the scatter matrix above.
Cluster 0 players are second highest in outside shooting (in the table above you can see their average score on the outside skills component is 0.83). These players are lowest in rim scoring (-0.93), so they do not draw many fouls; they are basically snipers from the outside.
The prototype is Lloyd Daniels, who takes a fair number of 3s. I wouldn’t call 31% a dominant 3-point percentage, but it’s certainly not bad. Notably, Lloyd Daniels doesn’t seem to do much but shoot threes, as 55% of his shots come from the great beyond.
Cluster 0 notable players include Andrea Bargnani, JJ Barea, Danilo Gallinari, and Brandon Jennings. Some forwards, some guards, mostly good shooters.
On to Cluster 1… I probably should have made a function from this code, but I enjoyed picking the players pictures too much.
Outside Skills  Rim Scoring  Pure Points  Defensive Big Man  Dead Eye 

0.340177  1.008111  1.051622  0.150204  0.599516 
1 Abdul-Jabbar, Kareem
4 Abdur-Rahim, Shareef
9 Adams, Alvan
18 Aguirre, Mark
75 Antetokounmpo, Giannis
77 Anthony, Carmelo
85 Arenas, Gilbert
121 Baker, Vin
133 Barkley, Charles
148 Bates, Billyray
Name: Name, dtype: object
Cluster 1 is green in the scatter matrix and includes about 14% of players.
Cluster 1 is highest on the rim scoring, pure points, and Dead Eye components. These players get the ball in the hoop.
Christian Laettner is the prototype. He’s a solid scoring forward.
Gilbert Arenas stands out in the first ten names; I was tempted to think of this cluster as big men, but it really seems to be players who shoot, score, and draw fouls.
Cluster 1 notable players include James Harden, Kevin Garnett, Kevin Durant, Tim Duncan, Kobe, LeBron, Kevin Martin, Shaq, Anthony Randolph??, Kevin Love, Derrick Rose, and Michael Jordan.
Outside Skills  Rim Scoring  Pure Points  Defensive Big Man  Dead Eye 

0.013618  0.101054  0.445377  0.347974  1.257634 
2 Abdul-Rauf, Mahmoud
3 Abdul-Wahad, Tariq
5 Abernethy, Tom
10 Adams, Hassan
14 Addison, Rafael
24 Alarie, Mark
27 Aldridge, LaMarcus
31 Alexander, Courtney
35 Alford, Steve
37 Allen, Lavoy
Name: Name, dtype: object
Cluster 2 is yellow in the scatter matrix and includes about 17% of players.
It contains lots of big men who are not outside shooters and don’t draw many fouls. These players are strong 2-point shooters and free throw shooters; I think of them as mid-range shooters. Many of the more recent Cluster 2 players are forwards, since mid-range guards do not have much of a place in the current NBA.
Cluster 2’s prototype is Doug West. Doug West shoots well from the free throw line and on 2-point attempts, but not from the 3-point line. He does not draw many fouls or collect many rebounds.
Cluster 2 notable players include LaMarcus Aldridge, Tayshaun Prince, Thaddeus Young, and Shaun Livingston.
Outside Skills  Rim Scoring  Pure Points  Defensive Big Man  Dead Eye 

-1.28655  0.467105  0.133546  0.905368  0.000679
7 Acres, Mark
8 Acy, Quincy
13 Adams, Steven
15 Adrien, Jeff
21 Ajinca, Alexis
26 Aldrich, Cole
34 Alexander, Victor
45 Alston, Derrick
51 Amundson, Lou
52 Andersen, Chris
Name: Name, dtype: object
Cluster 3 is blue in the scatter matrix and includes about 16% of players.
Cluster 3 players do not have outside skills such as assists and 3-point shooting (they’re last in outside skills). They do not draw many fouls or shoot well from the free throw line. These players do not shoot often, but have a decent shooting percentage, likely because they only shoot when wide open next to the hoop.
Cluster 3 players are highest on the defensive big man component. They block lots of shots and collect lots of rebounds.
The Cluster 3 prototype is Kelvin Cato. Cato is not an outside shooter; he only averages 7.5 shots per 36, but he makes these shots at a decent clip. Cato averages about 10 rebounds per 36.
Notable Cluster 3 players include Andrew Bogut, Tyson Chandler, Andre Drummond, Kawhi Leonard??, Dikembe Mutombo, and Hassan Whiteside.
Outside Skills  Rim Scoring  Pure Points  Defensive Big Man  Dead Eye 

-0.668445  0.035927  -0.917479  -1.243347  0.244897
0 Abdelnaby, Alaa
17 Ager, Maurice
28 Aleksinas, Chuck
33 Alexander, Joe
36 Allen, Jerome
48 Amaechi, John
49 Amaya, Ashraf
74 Anstey, Chris
82 Araujo, Rafael
89 Armstrong, Brandon
Name: Name, dtype: object
Cluster 4 is cyan in the scatter matrix above and includes the least number of players (about 13%).
Cluster 4 players are not high on outside skills. They are average on rim scoring. They do not score many points, and they don’t fill up the defensive side of the stat sheet. These players don’t seem like all-stars.
Looking at Doug Edwards’ stats, he is certainly not a 3-point shooter. I guess a good description of Cluster 4 players might be … NBA-caliber bench warmers.
Cluster 4’s notable players include Yi Jianlian and Anthony Bennett … yeesh.

Outside Skills  Rim Scoring  Pure Points  Defensive Big Man  Dead Eye 

0.890984  0.846109  0.926444  0.735306  0.092395 
12 Adams, Michael
30 Alexander, Cory
41 Allen, Tony
62 Anderson, Kenny
65 Anderson, Mitchell
78 Anthony, Greg
90 Armstrong, Darrell
113 Bagley, John
126 Banks, Marcus
137 Barrett, Andre
Name: Name, dtype: object
Cluster 5 is magenta in the scatter matrix and includes 16% of players.
Cluster 5 players are highest in outside skills and second highest in rim scoring, yet these players are dead last in pure points. It seems they score around the rim but do not draw many fouls. They are also second highest in defensive big man.
Gerald Henderson Sr. is the prototype. Henderson is a good 3-point and free-throw shooter but does not draw many fouls. He has lots of assists and steals.
Of interest mostly because it generates an error in my code, Gerald Henderson Jr. is in Cluster 2, the midrange shooters.
Notable Cluster 5 players include Muggsy Bogues, MCW, Jeff Hornacek, Magic Johnson, Jason Kidd, Steve Nash, Rajon Rondo, and John Stockton. Lots of guards.
In the cell below, I plot the percentage of players in each cluster.

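The plotting cell itself did not survive, so here is a minimal sketch of what it might look like: a bar chart of the share of players in each cluster. It assumes labels from a fitted 6-cluster k-means model; the random labels here are a stand-in.

```python
# Sketch: bar chart of the percentage of players per cluster.
# `labels` stands in for km.labels_ from the fitted KMeans model.
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

labels = pd.Series(np.random.RandomState(1).randint(0, 6, 2000))
pct = labels.value_counts(normalize=True).sort_index() * 100

ax = pct.plot(kind='bar')
ax.set_xlabel('Cluster')
ax.set_ylabel('% of players')
plt.tight_layout()
```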
I began this post by asking whether player position is the most natural way to group NBA players. The clustering analysis here suggests not.
Here’s my take on the clusters: Cluster 0 is pure shooters, Cluster 1 is talented scorers, Cluster 2 is midrange shooters, Cluster 3 is defensive big men, Cluster 4 is bench warmers, and Cluster 5 is distributors. We might call the “positions” shooters, scorers, rim protectors, and distributors.
It’s possible that our notion of position comes more from defensive performance than offensive. On defense, a player must have a particular size and agility to guard a particular opposing player. Because of this, a team will want a range of sizes and agility: strong men to defend the rim and quick men to defend agile ball handlers. Box scores are notoriously bad at describing defensive performance. This could account for the lack of “positions” in my clusters.
I did not include player height and weight in this analysis. I imagine height and weight might have produced clusters that resemble the traditional positions. I chose not to include them because these are player attributes, not player performance.
After looking through all the groups, one thing that stands out to me is the lack of specialization. For example, we did not find a single cluster of incredible 3-point shooters. Cluster 1 includes many great shooters, but it’s not populated exclusively by great shooters. It would be interesting to see whether adding additional clusters to the analysis could find more specific groups, such as big men who can shoot from the outside (e.g., Dirk) or high-volume scorers (e.g., Kobe).
I tried to list some of the aberrant cluster choices among the notable players to give you an idea of the amount of error in the clustering. These aberrant choices are not errors; they are simply an artifact of how k-means defines clusters. Using a different clustering algorithm would produce different clusters. On that note, the silhouette score of this clustering model is not great, yet the algorithm definitely found similar players, so it’s not worthless. Nonetheless, clusters of NBA players might not be spherical, which would prevent a high silhouette score. Trying a different algorithm without the spherical-clusters assumption would definitely be worthwhile.
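For readers who want to check a clustering the same way, the silhouette score is a one-liner in scikit-learn. This is a sketch on synthetic stand-in data, not the post’s actual player data; the score ranges from -1 to 1, with values near 0 indicating overlapping clusters.

```python
# Sketch: compute the silhouette score of a k-means clustering.
# X is a stand-in for the five component scores per player.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
X = rng.randn(300, 5)

km = KMeans(n_clusters=6, random_state=0, n_init=10).fit(X)
score = silhouette_score(X, km.labels_)
print(round(score, 3))  # near 0 here, since the stand-in data has no real clusters
```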
Throughout this entire analysis, I was tempted to think about group membership as a predictor of a player’s future performance. For instance, when I saw Karl-Anthony Towns in the same cluster as Kareem Abdul-Jabbar, I couldn’t help but think this meant good things for Towns. Right now, this doesn’t seem justified. No group included less than 10% of players, so there was not much opportunity for a uniformly “star” group to form. Each group contained some good and some bad players. Could more clusters change this? I plan on examining whether more clusters can improve the clustering algorithm’s ability to find groups of exclusively quality players. If it works, I’ll post it here.