Further exploration of IMDb TV show rating data

I wanted to revisit my previous post, which used linear regression to pick out the best episodes of a TV show to watch, and extend the approach to multiple TV shows. Fitting a linear regression of episode rating against episode number within each season gives a residual for every episode, and from those it is quick to find the maximum and minimum residual for the whole show. I took this a step further and also calculated each episode's overall position in the show's run. For example, here are all the episodes of Master of None with their residuals:

Season	Episode	Name	Residual	count	appearance
1	1	Plan B	-0.28	1	0.05
1	2	Parents	0.21	2	0.1
1	3	Hot Ticket	0.01	3	0.15
1	4	Indians on TV	0.21	4	0.2
1	5	The Other Man	-0.09	5	0.25
1	6	Nashville	0.31	6	0.3
1	7	Ladies and Gentlemen	-0.39	7	0.35
1	8	Old People	-0.09	8	0.4
1	9	Mornings	0.11	9	0.45
1	10	Finale	0.01	10	0.5
2	1	The Thief	0.44	11	0.55
2	2	Le Nozze	-0.36	12	0.6
2	3	Religion	-0.27	13	0.65
2	4	First Date	0.13	14	0.7
2	5	The Dinner Party	0.02	15	0.75
2	6	New York, I Love You	0.42	16	0.8
2	7	Door #3	-0.89	17	0.85
2	8	Thanksgiving	0.31	18	0.9
2	9	Amarsi Un Po	0.30	19	0.95
2	10	Buona Notte	-0.10	20	1

We can see that the episode with the highest residual is S2E1 “The Thief” and the episode with the lowest residual is S2E7 “Door #3”. For every TV show I numbered the episodes in broadcast order (the count column) and divided by the total number of episodes to generate an index (the appearance column), so the pilot sits close to 0 (0.05 for a 20-episode show) and the series finale is at 1.0. I then took the maximum and minimum residual values for each show and plotted them against that index. For example, here is a plot of just Master of None:
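The indexing and selection step can be checked by hand with a few lines of Python. This sketch uses the residuals copied from the Master of None table above; the `S1E1`-style labels and variable names are my own shorthand.

```python
# Residuals copied from the Master of None table, in broadcast order.
residuals = [
    ("S1E1", -0.28), ("S1E2", 0.21), ("S1E3", 0.01), ("S1E4", 0.21),
    ("S1E5", -0.09), ("S1E6", 0.31), ("S1E7", -0.39), ("S1E8", -0.09),
    ("S1E9", 0.11), ("S1E10", 0.01),
    ("S2E1", 0.44), ("S2E2", -0.36), ("S2E3", -0.27), ("S2E4", 0.13),
    ("S2E5", 0.02), ("S2E6", 0.42), ("S2E7", -0.89), ("S2E8", 0.31),
    ("S2E9", 0.30), ("S2E10", -0.10),
]

total = len(residuals)

# Attach the fractional position: episode count divided by total episodes.
indexed = [
    (label, residual, (i + 1) / total)
    for i, (label, residual) in enumerate(residuals)
]

# The two points that get plotted for this show.
best = max(indexed, key=lambda row: row[1])
worst = min(indexed, key=lambda row: row[1])

print(best)   # ('S2E1', 0.44, 0.55)
print(worst)  # ('S2E7', -0.89, 0.85)
```

Each show therefore contributes exactly two points to the plot: one at or above the zero line and one at or below it.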

To obtain data on as many shows as I could, I used this IMDb list of shows with over 5000 votes and selected the first 1200 shows as a dataset. I pulled the episode ratings from the OMDb API as I did before, calculated the same values as for Master of None above, and plotted them in the same way (mouse over a point for more information):
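For anyone wanting to reproduce the data collection, here is a minimal sketch of building an OMDb season request and turning the response into rating rows. It assumes the OMDb season lookup (`t`, `Season`, and `apikey` query parameters, returning an `Episodes` list with `Title`, `Episode`, and `imdbRating` fields). The helper names and the sample payload are hypothetical, and this is not necessarily how the original script worked.

```python
from urllib.parse import urlencode

OMDB_URL = "http://www.omdbapi.com/"

def season_url(title, season, api_key):
    """Build the request URL for one season of one show."""
    query = urlencode({"t": title, "Season": season, "apikey": api_key})
    return OMDB_URL + "?" + query

def parse_season(payload):
    """Turn a decoded season response into (season, episode, title, rating) rows.

    Episodes without a rating (reported as "N/A") are skipped.
    """
    rows = []
    season = int(payload["Season"])
    for ep in payload.get("Episodes", []):
        rating = ep.get("imdbRating", "N/A")
        if rating == "N/A":
            continue
        rows.append((season, int(ep["Episode"]), ep["Title"], float(rating)))
    return rows

# Hypothetical payload shaped like a season response; the titles and
# ratings are placeholders, not real data.
sample = {
    "Season": "1",
    "Episodes": [
        {"Title": "Episode A", "Episode": "1", "imdbRating": "8.1"},
        {"Title": "Episode B", "Episode": "2", "imdbRating": "N/A"},
        {"Title": "Episode C", "Episode": "3", "imdbRating": "7.9"},
    ],
}

url = season_url("Master of None", 1, "YOUR_KEY")
parsed = parse_season(sample)
print(url)
print(parsed)
```

In practice the URL would be fetched once per season and per show, and the parsed rows fed into the per-season regression.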

Two things immediately jump out at me:

The density of points right around the zero line suggests that linear regression is a reasonable fit for this type of analysis, and that most people rate episodes generally in line with the overall trend for that particular season.
There seems to be a tendency for people to really love or really hate the series finale of a TV show, which shows up in the sheer number of points at 1. Possibly people are using the finale to express their view of the show as a whole, or maybe they really were that happy or unhappy with the final episode itself.

I put some of the main code I used in a GitHub repository.

Zach Stednick
