Visualizing Flight Data for the 2017 Seattle Mariners

Remember this map that Facebook created of friend connections back in 2011?

I thought it was pretty cool back then and I still think it's pretty cool. I wanted to make a similar map but was not sure where to start. I could have made the same kind of visualization, but I recently quit Facebook, so I can no longer export my friends' data to use for making maps. My next thought was visualizing travel routes such as flight information. I am trying to reduce my carbon footprint, which meant I only flew five times in 2017 and have flown exactly zero times so far in 2018. Then I thought, you know who does fly a lot? The Seattle Mariners.

The first step was to collect all the Mariners game data. Fortunately, Baseball Reference has all that data in an easily accessible HTML table.

The next step was to geolocate all the stadiums, which can be a bit tedious. Fortunately, GitHub user the55 created a nice JSON file of all the stadiums and posted it as a gist. I then used an R library called geosphere, which implements the Haversine formula, to calculate the distance between two stadiums.
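For readers who want to see what the Haversine formula actually computes, here is a minimal Python sketch. It is not the R code from this analysis: the function name, the Earth radius constant, and the approximate stadium coordinates are all assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; other tools may use a different value


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates (assumed, not taken from the stadium gist).
seattle = (47.59, -122.33)
houston = (29.76, -95.36)
print(round(haversine_km(*seattle, *houston)), "km")
```

The formula treats the Earth as a sphere, which is good enough for drawing flight arcs on a map.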

My initial attempt here:

In order to make the image look similar to the Facebook connection map, I ended up using this Flowing Data post quite a bit to figure out how to add the lines and change the background color:

Finally, because there were so many trips from Seattle to American League West opponents, I ended up adding a bit of noise, or jitter, to the stadium locations so that the flight paths would not perfectly overlap each other.
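The jitter idea can be sketched in a few lines of Python. This is only an illustration, not the code used for the map: the function name, the uniform noise, and the half-degree scale are assumptions.

```python
import random


def jitter(lat, lon, scale=0.5, rng=None):
    """Shift a coordinate by up to +/- scale degrees in each direction."""
    if rng is None:
        rng = random.Random()
    return (lat + rng.uniform(-scale, scale), lon + rng.uniform(-scale, scale))


# Passing a seeded generator keeps the picture reproducible between runs.
rng = random.Random(2017)
for trip in range(3):
    print(jitter(47.59, -122.33, rng=rng))
```

Each trip gets its own slightly shifted endpoint, so repeated routes fan out into a bundle of lines instead of stacking on top of one another.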

Looking back at 2017 reminded me that the Mariners finished 78-84. Here’s hoping for a better season in 2018!

If interested, I put all the code for this analysis here

Further Analysis of the 2017-18 WA State Legislature

This is my second post looking at the data from the 2017-18 Washington State Legislative Session. The first part can be read here.

After some time looking at different bills that did pass, I started to wonder if a bill was more likely to pass if it had more sponsors. First I took the 647 bills passed by the Legislature and signed into law by the Governor and looked up how many co-sponsors each bill had:

Then I took every bill that was introduced but did not become law and counted up the sponsors for these:

So it appears that the number of sponsors is not particularly predictive of a bill becoming law. The three bills introduced in the Senate with the highest number of sponsors were:

Bill | Sponsor count | Summary
5598 | 40 | Granting relatives, including but not limited to grandparents, the right to seek visitation with a child through the courts.
6037 | 28 | Concerning the uniform parentage act.
5375 | 27 | Renaming the cancer research endowment authority to the Andy Hill cancer research endowment.

And in the House:

Bill | Sponsor count | Summary
2282 | 52 | Protecting an open internet in Washington state.
1714 | 45 | Concerning nursing staffing practices at hospitals.
1400 | 42 | Creating Washington state aviation special license plates.

In November 2017, Manka Dhingra won a special election and the Washington State Senate flipped from Republican-held to Democrat-held. Initially I wanted to compare the number of bills passed by a Republican-held Senate versus a Democrat-held Senate, but there were too many extraneous variables, such as passing a budget and a shorter session in 2018. Instead, I decided to focus on the number of Yea votes by bill:

Many of the bills passed with overwhelming support. It is refreshing to see that there is quite a bit of bipartisanship in Washington State in 2018.

As always, analysis code on GitHub

Visualizing the 2017-18 WA State Legislature

In his 2018 State of the State speech, Washington State Governor Jay Inslee made an impassioned appeal for a carbon tax and proposed one in Washington State Senate bill 6203. Because of this, I paid more attention to the activities of the Washington State Legislature than I ever had before, and I found it fascinating.

First off, let's start with the website for the state Legislature. Here is a screenshot of the Washington State Legislature page for SB 6203, which is the bill I was most interested in:

The website is very resource dense and well worth the time to explore when the Legislature is in session. Every piece of proposed legislation shows the same amount of information, and the site makes it easy to find and contact your legislators about a particular bill. The site also has livestreams of committee hearings and displays vote counts on bills in almost real time as the votes are tallied on both the Senate and the House floor.

Is Washington State unique in this regard? Of course not. Here is a screenshot of an interesting bill in the Legislature of the State of California:

Finally, here is a screenshot of a House bill on the United States Congress website:

Does ease of use of the website increase participation in the civic process at the state level? That is a difficult question to answer but personally I am glad I get to use the Washington State one instead of the California State Legislature webpage.

The 2017-18 Washington State Legislative Session ended on March 8, 2018 and Governor Inslee then had 21 days to sign bills into law or veto them.

The conclusion of the 2017-18 Session made me wonder what happened to the bills that were introduced and how many of them actually became law. In addition to a great website, the Washington State Legislature also has an excellent set of Web Services that allow for programmatically capturing metrics and data about activities in the state Legislature. One way to easily visualize the path bills take is with a Sankey Diagram (no relation to this Sankey though).

Here is a smaller image of the diagram with a larger version here

Code to generate this figure available on my GitHub repo

Has the Pac-12 Network Decreased UW Home Football Game Attendance UPDATED

Following up on my earlier post, how much has the Pac-12 Network affected game attendance? I updated my previous data set with the past two seasons so that it now covers 2008-2017. I relied on home game attendance as reported by Wikipedia and also used Wikipedia to determine which TV network broadcast each home game. In an ideal world I would be able to make better comparisons using the Nielsen rating for each game; however, my guess is that data does not come as cheaply or as easily as data from Wikipedia. This analysis neglects various other factors such as time of kickoff, game day temperature, opponent, ranking of UW, ranking of opponent, and so on. My main intention was simply to show home game attendance versus TV network for all games:

And attendance for Pac-12 only opponents versus TV network:

Based on the available data it appears that attendance during home games has been influenced and possibly decreased by the Pac-12 Network but it is difficult to say for sure while ignoring so many external factors. With a significant budget deficit still a major issue, one can only hope that losses from game day ticket sales are made up for with Pac-12 Network advertising revenue.

States With Multiple Football Teams in the AP Top 25

With WSU beating Oregon and UW beating UC Berkeley, the State of Washington is poised to have two football teams in the top ten of NCAA Division I football rankings. Naturally this got me thinking, how often does this happen and how many states have had this same achievement?

To answer this I used the weekly results of the Associated Press poll, which started in 1936. Thanks to our good friends at Wikipedia, I was able to get AP Poll results for every week.

I found that 25 states had at least one week where two teams from that state were in the AP Poll. However, the more I thought about it, the more I realized this was slightly biased because some states might only have one team (e.g. Wyoming) while other states might have two Division I teams that are never both great at the same time (e.g. Montana). I tightened my restrictions a bit and only looked at the top 10 teams from each AP Poll.

Surprisingly, of the 25 states with at least two teams in the AP top 25 Poll, 21 of those states had a week with at least two teams from that state in the AP top 10. I made a summary table with the most recent year each state achieved this distinction listed:

state year
Louisiana 1936
Maryland 1955
North Carolina 1957
New York 1958
Illinois 1963
Indiana 1979
Pennsylvania 1982
Colorado 1994
Kansas 1995
Washington 1997
Ohio 2009
Oregon 2012
Florida 2013
South Carolina 2013
Georgia 2014
Mississippi 2014
California 2015
Alabama 2016
Michigan 2016
Texas 2016
Oklahoma 2017
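A minimal Python sketch of the check behind a table like this one follows. It assumes each weekly poll is a list of (team, state) pairs ordered by rank and that polls are grouped by year; that data layout, the function names, and the placeholder teams are all hypothetical and are not taken from my repository.

```python
from collections import Counter


def states_with_multiple(poll, top_n=10, min_teams=2):
    """States with at least min_teams teams among the first top_n entries of one poll."""
    counts = Counter(state for _team, state in poll[:top_n])
    return {state for state, n in counts.items() if n >= min_teams}


def most_recent_year(polls_by_year, top_n=10, min_teams=2):
    """Map each qualifying state to the latest year in which it qualified."""
    latest = {}
    for year in sorted(polls_by_year):
        for poll in polls_by_year[year]:
            for state in states_with_multiple(poll, top_n, min_teams):
                latest[state] = year
    return latest


# Made-up polls with placeholder teams, only to show the shape of the input.
polls = {
    2016: [[("Team A", "State X"), ("Team B", "State X"), ("Team C", "State Y")]],
    2017: [[("Team D", "State Z"), ("Team E", "State Z"), ("Team A", "State X")]],
}
print(most_recent_year(polls))
```

Raising `min_teams` to 3 turns the same function into a check for states with three teams in the top 10.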

Then I wondered whether there was ever a week when a state had three teams in the AP top 10. Sure enough, four states have achieved this:

state year
California 1952
Indiana 1967
Florida 2005
Texas 2015

As always, all of my code for this is on GitHub

Further Exploration of IMDb TV Show Rating Data

I wanted to revisit my previous post and continue looking at linear regression for determining the best episodes of a TV show to watch. I started to think about how to look at this data for multiple TV shows. Performing a linear regression on show rating by episode number within a season quickly allows us to determine the maximum and minimum residual across all the show's episodes. I took this a step further and calculated where in the show's run each episode falls. For example, here are all the episodes of Master of None with their residual values:

Season | Episode | Name | Residual | count | appearance
1 | 1 | Plan B | -0.28 | 1 | 0.05
1 | 2 | Parents | 0.21 | 2 | 0.1
1 | 3 | Hot Ticket | 0.01 | 3 | 0.15
1 | 4 | Indians on TV | 0.21 | 4 | 0.2
1 | 5 | The Other Man | -0.09 | 5 | 0.25
1 | 6 | Nashville | 0.31 | 6 | 0.3
1 | 7 | Ladies and Gentlemen | -0.39 | 7 | 0.35
1 | 8 | Old People | -0.09 | 8 | 0.4
1 | 9 | Mornings | 0.11 | 9 | 0.45
1 | 10 | Finale | 0.01 | 10 | 0.5
2 | 1 | The Thief | 0.44 | 11 | 0.55
2 | 2 | Le Nozze | -0.36 | 12 | 0.6
2 | 3 | Religion | -0.27 | 13 | 0.65
2 | 4 | First Date | 0.13 | 14 | 0.7
2 | 5 | The Dinner Party | 0.02 | 15 | 0.75
2 | 6 | New York, I Love You | 0.42 | 16 | 0.8
2 | 7 | Door #3 | -0.89 | 17 | 0.85
2 | 8 | Thanksgiving | 0.31 | 18 | 0.9
2 | 9 | Amarsi Un Po | 0.30 | 19 | 0.95
2 | 10 | Buona Notte | -0.10 | 20 | 1
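The residual, count, and appearance columns can be reproduced roughly as in the Python sketch below. It fits one least-squares line per season and divides each episode's running count by the total number of episodes. The function names and the sample ratings are made up for illustration, and this is not the code from my repository.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def episode_table(seasons):
    """seasons: one list of episode ratings per season (each needs 2+ episodes).

    Returns rows of (season, episode, residual, count, appearance).
    """
    total = sum(len(ratings) for ratings in seasons)
    rows, count = [], 0
    for season_no, ratings in enumerate(seasons, start=1):
        xs = list(range(1, len(ratings) + 1))
        slope, intercept = fit_line(xs, ratings)
        for x, rating in zip(xs, ratings):
            count += 1
            residual = rating - (slope * x + intercept)
            rows.append((season_no, x, round(residual, 2), count, count / total))
    return rows


# Made-up ratings for a two-season show.
rows = episode_table([[8.1, 8.4, 8.0, 8.6], [8.8, 8.2, 8.5]])
best = max(rows, key=lambda row: row[2])
worst = min(rows, key=lambda row: row[2])
print(best, worst)
```

Taking the best and worst row per show gives the two points that each show contributes to the plots.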

We can see that the episode with the highest residual is S2E1 “The Thief” and the episode with the lowest residual is S2E7 “Door #3”. For every TV show I took all the episodes and calculated their position as a fraction of the total number of episodes (the appearance column above) to generate an index, so the pilot sits near 0 and the series finale at 1.0. I then took the maximum and minimum residual values for each show and plotted them against that index. For example, here is a plot of just Master of None:

To obtain data on as many shows as I could I used this IMDb list of shows with over 5000 votes and selected the first 1200 shows as a dataset. I then reused the OMDb API as I did before. I then calculated the same values as I did for Master of None above and plotted them in a similar manner (use the mouseover for more information on each point):

Two things immediately jump out at me:

  1. The density of points right around the zero line shows that linear regression is a pretty good metric to use for this type of analysis and that most people rate the show generally in line with the overall trend for that particular season.

  2. There seems to be a tendency for people to really love or really hate the series finale of TV shows, and this shows up in the sheer number of points at 1. Possibly this is people expressing their overall view of the show as a whole, or maybe people really were happy or unhappy with the series finale.

I put some of the main code I used in a GitHub repository

Smarter Binge Watching With Linear Regression

I am not much of a binge watcher but I do enjoy quality TV shows which is why I think GraphTV is so great. GraphTV plots the IMDb user ratings for every episode and then performs a linear regression of the episode rating by the episode number to create a trend line which helps you see if the show gets better or worse over the course of the season.

This is nice but it can get difficult to use GraphTV for shows like Golden Girls and downright impossible for shows like The Simpsons.

To solve this I created a GitHub repo, binge-trendy. Because the trend line is fit to the IMDb user rating data, the question becomes which episodes IMDb users rate differently than the regression model predicts, which is the residual, or deviation from the trend line. Since I am only interested in episodes that are rated higher than the regression model would have predicted, I only look at episodes with a positive residual.
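The filter itself is small. Here is a minimal Python sketch, assuming the ratings for one season are already in a list ordered by episode number. The function name and the sample ratings are made up, and this is a simplified stand-in rather than the binge-trendy code.

```python
from statistics import mean


def episodes_above_trend(ratings):
    """Return 1-based episode numbers whose rating sits above the season trend line."""
    xs = range(1, len(ratings) + 1)
    mean_x, mean_y = mean(xs), mean(ratings)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ratings))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # Keep only episodes with a positive residual (actual minus predicted).
    return [x for x, y in zip(xs, ratings) if y - (slope * x + intercept) > 0]


# Made-up ratings for a five-episode season.
print(episodes_above_trend([7.0, 8.5, 7.5, 9.0, 8.0]))
```

Running this once per season yields a short watch list for each season.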

For example, here are the episodes with a positive residual for Golden Girls season 4:

Season Episode Name
4 1 Yes, We Have No Havanas
4 2 The Days and Nights of Sophia Petrillo
4 6 Sophia’s Wedding: Part 1
4 9 Scared Straight
4 11 The Auction
4 14 Love Me Tender
4 15 Valentine’s Day
4 19 Till Death Do We Volley
4 20 High Anxiety
4 22 Sophia’s Choice
4 23 Rites of Spring
4 24 Foreign Exchange

I realize the code is not great (pylint currently gives it a 6.05), but if there is one thing I have learned in software:

Standing Up for Net Neutrality

Currently there are many political issues that demand attention; however, in my opinion none would affect more people than the possible destruction of net neutrality.

Net neutrality is simply the principle that all data on the internet should be treated the same. It does not matter if you are visiting Fox News or Mother Jones: the data and content from both of these websites (as well as from every other website) should be treated as equal, and that data should be served equally by all Internet Service Providers. Losing net neutrality could lead to an internet that favors one of these two sites based on which site is willing to pay more. I chose these sites because they are such polar opposites, but at the same time we live in a country that allows for such opposites to have equal protection of freedom of speech. I may disagree with the content of a particular website, but I do not think it should be served any differently than a site I do agree with. Destruction of net neutrality will lead to greater influence wielded by larger corporations and could stifle smaller websites and startups.

Fortunately, there is still time to act. On July 12, various online communities and users will come together to stand tall and sound the alarm about the FCC’s attack on net neutrality. Join us here!

Pronto Post-mortem

Pronto bike share ends this Friday, March 31st, and I will miss it for sure. I wrote about why Pronto mattered to me and I even rode a Pronto bike in the 25 mile Obliteride last year:

In October 2015, there was a Pronto-sponsored contest to visualize bikeshare ride data. I created an entry which, although it did not win, was a nice introduction for me to mapping with D3. As I was checking out the Pronto site one last time today, I noticed that they had updated their publicly available dataset to include 2016 ride data as well as the 2015 ride data.

To me it seems that Pronto had a hard time expanding and encouraging repeat riders. Unfortunately we do not have the membership data, but if we can assume that people who did not ride much in 2015 did not renew their membership in 2016, then it looks pretty clear that Pronto was hurting more than people thought. Also, it looked like they had good success in 2015 with getting people to buy day passes, especially during peak tourist season in the summer, and were able to replicate that success in 2016. I feel there is a need for a dedicated bike share in Seattle; however, this iteration of bike share does not appear to be the solution we need.

I went back to the Pronto site to fetch all my data because of an idea I worked on then abandoned last summer. The idea was for a website that was basically Strava for Pronto whereby you compared your ride time data to everyone else’s and mapped out how fast you were compared to them. Pronto did not make it easy to download all your trip data so I ended up having to write a webscraper to get out my own data (Hint, hint Pronto 2.0!) which I put at this GitHub repository. I never was able to get my project past the personal level and my ultimate goal was to simply make a map of the route and add in plots. Here is an example of all trips from Fairview Ave N & Ward St to Westlake Ave & 6th Ave:

I’m sad that I can’t work on this project anymore, but maybe with Pronto 2.0 I will be able to revisit this idea.

Have There Been More Upsets in the NCAA Tournament Recently?

I have been following the NCAA Men’s Basketball Tournament for as long as I can remember and with Selection Sunday coming up, I wondered if there have been more or fewer upsets in recent tournaments. To look at this visually I used a hypothetical perfect bracket as a reference (i.e. #1 seed beats #16 seed, #2 seed beats #15 seed, all the way to #1 seed beating a #2 seed). I took the sum of all the winning seed numbers at each round in the Regional Tournament and used that as the denominator for comparison with the other Regional Tournaments for that particular year.
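To make the reference bracket concrete, here is a minimal Python sketch. In a perfect 16-team region the better seed always wins, so the winning seeds are 1 through 8 in the first round, 1 through 4 in the second, and so on. Dividing the actual sum of winning seeds by that reference sum is my reading of the comparison described above, and the function names and the sample region are made up for illustration.

```python
def perfect_round_sums(num_seeds=16):
    """Sum of winning seeds per round when the better seed always wins."""
    sums = []
    survivors = num_seeds
    while survivors > 1:
        survivors //= 2
        sums.append(sum(range(1, survivors + 1)))
    return sums


def upset_ratios(winning_seeds_by_round, num_seeds=16):
    """Actual sum of winning seeds divided by the perfect-bracket sum, per round."""
    perfect = perfect_round_sums(num_seeds)
    return [sum(winners) / ref for winners, ref in zip(winning_seeds_by_round, perfect)]


# Made-up region: a #12 and a #11 seed win in the first round, and #11 wins again.
region = [[1, 2, 3, 4, 12, 11, 7, 8], [1, 2, 11, 4], [1, 2], [1]]
print(perfect_round_sums())   # [36, 10, 3, 1]
print(upset_ratios(region))
```

A ratio of 1.0 means the round went exactly to seed, and anything above 1.0 means lower-seeded teams advanced.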

I went back in time as far as I could, which turned out to be 2007: that year the Tournament finally harmonized the names of the Regional tournaments as East, West, South, and Midwest, which made for easier comparison across years.

Clearly there have been quite a lot of upsets in the past ten years, especially within the Midwest Region.

I then went back and looked at all games back to 1985 when the Tournament first expanded to 64 teams. For this I did not have all the Regional Tournament information so I just looked at all the games (except the Final Four).

The aggregate data is pretty volatile year over year as well with a low in 2007. If anything, this shows we should be in for another great year of NCAA tournament basketball complete with some (hopefully many) exciting upsets.