National Basketball Association

Data Science Intern

When: Jun 2019 - Aug 2019
Where: New York, New York

As a data science intern for the NBA, I split my time between the New York and Secaucus offices over the summer of 2019, working in my dream role as an integral part of the Stats & Analytics team within the Media Operations & Technology branch.

Team Structure

The Media Operations and Technology (MOT) arm of the NBA Properties group maintains the digital products that consumers and businesses interface with. This includes, but is not limited to, the NBA.com and WNBA.com websites, as well as other NBA subsidiaries such as the NBA Store and the 2K League.
One of the sub-teams within MOT is the Stats Technology team, which works cross-functionally on the data applications behind internal and external tools. The most important of these tools is NBA.com/Stats, and it was my main focus for the summer.

Case Study 1: Ball Screens + Data Engineering

To the casual fan of basketball, ball screens may be quite easy to spot during a game. For the uninitiated, the quick graphic below shows the moment in a game when a ball screen occurs:

Aggregating the total number of screens a team sets in a game is an important stat-line reported during media broadcasts, and it has tremendous business value to teams attempting to build strategies around it. That makes the NBA.com/Stats product the perfect place to report these values.

Great! So what's the problem?

What may be surprising to some is that this one- or two-second exchange can be quite difficult for algorithms to detect. The NBA's trusted source for player positional data is Second Spectrum, which has its own proprietary algorithm for detecting ball screens during a game, but my manager was doubtful that the numbers were being reported accurately.
That left unanswered questions about the data our stats team was ingesting: "Are the ball screens coming in from Second Spectrum an accurate measure? Are they overestimated? Underestimated?" and "Is it worth it for the NBA to run its own algorithm to detect ball screens?" These were the questions I played a big role in answering.

Quality Assurance and Analysis

The first empirical study was to measure the error rates of the Second Spectrum (2S) labeled screens. The best way to do so was to frame the problem as binary classification, with {0 = no screen, 1 = screen} as the two possible states for each "measured" time period in which a ball screen was marked by 2S.
I worked through the video playback of 10 NBA regular season and playoff games and manually labeled the time periods flagged by 2S that I determined to be screens. I also noted times when the algorithm missed ball screens, though these were rare, since the algorithm tended to overestimate.
I then compared these manual labels to 2S's values. Treating the 2S-labeled screens as the predictions being evaluated, a simple calculation (sketched in code after the list) revealed the following:
  1. Precision (the ratio of true positives to the sum of true positives and false positives) averaged a strikingly low ~0.72 (~0.09 variance) over the 10 games measured, which meant that, on average, 28% of the ball screens labeled by the 2S algorithm were INCORRECT within a single game.
  2. A later comparison against ball screens labeled by employees of the NBA Replay Center showed a statistically significant difference at the 95% confidence level using a two-tailed, paired two-sample t-test.
(see more information on how to set up the test I used here)
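To make the numbers above concrete, here is a minimal sketch of how the per-game precision and the paired t-test could be computed. The game-level counts are purely illustrative placeholders rather than the actual labels from that summer, and the test uses scipy's standard paired t-test.

```python
import numpy as np
from scipy import stats

# Hypothetical per-game counts from comparing 2S-labeled screens against manual labels:
# tp = 2S screens confirmed on video, fp = 2S screens judged incorrect.
games = [
    {"tp": 52, "fp": 21},
    {"tp": 47, "fp": 18},
    # ... one entry per game reviewed
]

precisions = [g["tp"] / (g["tp"] + g["fp"]) for g in games]
print("mean precision:", np.mean(precisions))
print("variance:      ", np.var(precisions))

# Paired comparison of per-game screen counts: 2S totals vs. Replay Center totals.
# (Illustrative numbers only.)
ss_counts = np.array([73, 65, 80, 71, 69, 77, 74, 68, 75, 70])
replay_counts = np.array([58, 51, 66, 55, 54, 62, 60, 53, 61, 57])

t_stat, p_value = stats.ttest_rel(ss_counts, replay_counts)  # two-tailed paired t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 -> significant at the 95% level
```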
So, is it worth it for the NBA to create its own ball-screen detection algorithm? Given the low precision rates measured across the 10 sampled games, the answer was clear: we needed to build one ourselves.

Slimming Down Batch Delay Times

Batch delay times were a big hurdle for the NBA to navigate. For context, the NBA receives data from Second Spectrum in daily intervals: a large batched data dump from Second Spectrum's AWS Redshift store to the NBA's data warehouse in Google BigQuery. This extraction process is managed through Cloud Composer, Google's managed Apache Airflow service.
Daily dumps were not necessarily a problem on their own, because the NBA had no immediate need for the player tracking data until the following day. However, with a new ball-screen tracking algorithm on the horizon, the NBA stats team now needed the data closer to real-time so the algorithm could run. A very brief snapshot of the data engineering process is shown below.

Batch processing was done at roughly nine-hour intervals through cron-scheduled jobs in Cloud Composer, which would trigger the extraction process into the NBA's data warehouse.
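In Airflow terms, the original setup looked roughly like the sketch below: a cron-scheduled DAG in Cloud Composer (Airflow 1.10-era syntax) kicking off the Redshift-to-BigQuery extraction. The DAG name, schedule, and extraction callable are illustrative stand-ins, not the NBA's actual pipeline code.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract_tracking_data(**context):
    # Placeholder for the real extraction step: pull the latest Second Spectrum
    # dump out of Redshift and load it into the BigQuery warehouse.
    pass


# Cron-scheduled batch job, roughly every nine hours (00:00, 09:00, 18:00 UTC).
with DAG(
    dag_id="second_spectrum_daily_extract",  # hypothetical name
    start_date=datetime(2019, 6, 1),
    schedule_interval="0 */9 * * *",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_tracking_data",
        python_callable=extract_tracking_data,
        provide_context=True,
    )
```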

Through some careful planning with the engineers at Second Spectrum and the NBA, I determined that the architecture below would be feasible for our ultimate goal:

There are a lot of changes here: while the diagram says "stream processing," this was still technically a 10-minute batch process, meant to simulate receiving player tracking data every QUARTER instead of every day.

Event triggers would be handled through listeners in Google Cloud Functions (serverless) and Pub/Sub, subscribing to changes that happen within 2S's AWS buckets.
The most challenging part of writing the Python logic for this task was wiring a Google event listener up to an AWS product. Who would have thought...
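As a rough illustration of the GCP side of that architecture, a Pub/Sub-triggered Cloud Function could look like the sketch below. The message fields and the downstream loader are hypothetical, and the tricky cross-cloud plumbing that gets S3 change notifications onto the Pub/Sub topic in the first place is deliberately left out.

```python
import base64
import json


def handle_tracking_update(event, context):
    """Pub/Sub-triggered Cloud Function (1st gen signature).

    Fires whenever a notification about a new Second Spectrum drop lands on the
    topic, then hands the file reference off to the loading step.
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # Hypothetical message fields describing the newly arrived tracking file.
    game_id = payload["game_id"]
    quarter = payload["quarter"]
    source_uri = payload["source_uri"]  # e.g. the object that just landed in 2S's bucket

    print(f"New tracking data for game {game_id}, Q{quarter}: {source_uri}")
    load_into_warehouse(source_uri)


def load_into_warehouse(source_uri):
    # Stand-in for the real extraction/load logic into the NBA's warehouse.
    pass
```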
Sadly, my internship ended before I was able to see this project through to completion, but I am proud that I was at least able to run some testing on the new architecture before I left.

Business of Basketball (BoB) Intern Project: Tentpole Activation

Another fun project I had the privilege to be a part of was working with the other NBA interns in my class to come up with new business strategies for engaging fans at tentpole events (such as the NBA All-Star Game).

Video of the Presentation (28 min)

Wireframes of the App