Which Star Trek movie is the best?
The short answer is the 2009 Star Trek reboot. The top five Star Trek movies, based on average user ratings on IMDb and Rotten Tomatoes as well as critic ratings compiled by Metacritic and Rotten Tomatoes, are as follows:
Movie | average rating |
---|---|
Star Trek (2009) | 4.32 stars |
Star Trek: First Contact | 4.08 stars |
Star Trek Into Darkness | 4.04 stars |
Star Trek II: The Wrath of Khan | 4.04 stars |
Star Trek IV: The Voyage Home | 3.79 stars |
To many Trekkies, the entries for First Contact and The Wrath of Khan won’t come as a surprise. These movies are generally regarded as high points in the franchise. However, what astonished me was just how successful the 2009 reboot was. Star Trek never before managed to churn out three decent movies in a row. There is hope that the next film will be similarly good.
This blog post shows how to get at all this information.
Data, the final frontier. These are the adventures of a data scientist. His continuing mission: to explore strange new patterns, to seek out new insights and new visualisations, to boldly find out what no one has found out before…
Data acquisition: using IMDb API and web scraping
Start off by loading all the necessary modules. Because of IMDbPY, I use Python 2.7.
import imdb  # to access imdb API
import pandas as pd  # for data array handling
from BeautifulSoup import BeautifulSoup  # for website parsing and scraping (rotten tomatoes)
import requests  # for http access
import re  # for regular expressions
from ggplot import *  # for plotting
import urllib2  # for accessing url object (movie covers)
import matplotlib.pyplot as plt  # for plotting
from matplotlib.offsetbox import (OffsetImage, AnnotationBbox)
Next, we collect all Star Trek movies. We shall use the IMDb search function for that.
imdb_http = imdb.IMDb()  # create imdb API object
StarTrek = imdb_http.search_movie('Star Trek')  # general search for Star Trek among movie and series titles
Because we are not interested in series or video games, we extract only the movies with a simple list comprehension.
STF = [i for i in StarTrek if i.data['kind'] == 'movie']
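To see what this comprehension does without querying IMDb, here is a minimal sketch that uses plain dictionaries as stand-ins for the search results. The sample entries and the helper name `keep_movies` are made up for illustration and are not real API output.

```python
# Hypothetical stand-ins for IMDb search results (not real API output).
search_results = [
    {'title': 'Star Trek', 'kind': 'movie'},
    {'title': 'Star Trek: The Next Generation', 'kind': 'tv series'},
    {'title': 'Star Trek Online', 'kind': 'video game'},
    {'title': 'Star Trek Into Darkness', 'kind': 'movie'},
]


def keep_movies(results):
    """Return only the entries whose kind is 'movie'."""
    return [r for r in results if r['kind'] == 'movie']


movie_titles = [r['title'] for r in keep_movies(search_results)]
print(movie_titles)
```

Series and video games are dropped, and the order of the remaining entries is unchanged.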
Unfortunately, IMDb’s search function is not flawless and misses 5 movies. We search for them individually and add them to the list STF, which holds the movies.
StarTrekIII = imdb_http.search_movie('Star Trek III the Search for Spock')
StarTrekIV = imdb_http.search_movie('Star Trek IV the voyage home')
StarTrekV = imdb_http.search_movie('Star Trek V the Final Frontier')
StarTrekVI = imdb_http.search_movie('Star Trek VI the Undiscovered Country')
StarTrekFC = imdb_http.search_movie('Star Trek First Contact')
STF.extend([StarTrekIII[0], StarTrekIV[0], StarTrekV[0], StarTrekVI[0], StarTrekFC[0]])
Now, we can loop through each movie and add all the information we want to a pandas data frame df. The IMDb API gives access to IMDb user ratings and Metacritic ratings. However, in order to get Rotten Tomatoes ratings we turn to web scraping using BeautifulSoup.
Note that different ratings use different scales. I decided to turn all of them into an intuitive 6-point scale (zero to five stars).
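The rescaling can be sketched as two small functions. This is only an illustration of the arithmetic used in the loop below: IMDb user ratings run from 1 to 10, while Metacritic and Rotten Tomatoes scores run from 0 to 100. The function names are my own and not part of any library.

```python
def imdb_to_stars(rating):
    """Map an IMDb user rating (1 to 10) linearly onto 0 to 5 stars."""
    return (rating - 1) / 9.0 * 5


def percent_to_stars(score):
    """Map a 0 to 100 score (Metascore, Tomatometer) onto 0 to 5 stars."""
    return score / 20.0


print(imdb_to_stars(10))     # best possible IMDb rating
print(percent_to_stars(50))  # mid-range percentage score
```

Subtracting 1 before scaling matters because the lowest possible IMDb rating is 1, not 0; without it, no movie could ever land on zero stars.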
df = pd.DataFrame(columns=['date', 'IMDb_rating', 'Metacritic_rating', 'title', 'image_url'])  # initialise data frame
for i in range(len(STF)):  # for each Star Trek movie
    imdb_http.update(STF[i])  # IMDb: augment movie info
    x = imdb_http.get_movie_critic_reviews(STF[i].movieID)  # Meta critic
    # rotten tomato: prepare website parsing
    tomato_base_url = 'https://www.rottentomatoes.com/m/'
    tomato_url = tomato_base_url + re.sub(':', '', re.sub(' ', '_', str(STF[i]['title'])))
    if 'Star Trek' not in STF[i]['title']:  # fix first contact problem
        tomato_url = tomato_base_url + re.sub(':', '', re.sub(' ', '_', 'Star Trek ' + str(STF[i]['title'])))
    elif 'Khan' in STF[i]['title']:  # fix wrath of khan problem
        tomato_url = tomato_base_url + re.sub(':', '_II', re.sub(' ', '_', str(STF[i]['title'])))
    soup = BeautifulSoup(requests.get(tomato_url).text)  # rotten tomatoes: website parse tree
    # add data to pandas data frame
    if 'year' in STF[i].data.keys() and bool(x['data']):  # filter out movies in production and those without MC data
        df = df.append(pd.DataFrame(data={
            'date': STF[i].data['year'],
            'IMDb_rating': [((STF[i].data['rating'] - 1) / 9.0) * 5],  # normalised to 5 star system
            'Metacritic_rating': [int(x['data']['metascore']) / 20.0],  # normalised to 5 star system
            'Tomatometer': [
                int(min(soup.find('span', {'class': 'meter-value superPageFontColor'}).contents[0])) / 20.0],  # rotten tomatoe score (normalised to 5 star system)
            'Tomato_user': [
                int(filter(str.isdigit, str(soup.find('span', {'class': 'superPageFontColor'}).contents[0]))) / 20.0],  # tomato audience score (normalised to 5 star system)
            'title': STF[i].data['title'],
            'image_url': STF[i]['cover url']}))
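The URL construction inside the loop is easier to follow when pulled out into its own function. The sketch below mirrors the three cases from the loop (plain title, title without the franchise prefix, and the Khan special case). The function name `tomato_url` and the example titles are my own choices for illustration, and I have not checked whether the resulting pages still exist on Rotten Tomatoes.

```python
import re

TOMATO_BASE_URL = 'https://www.rottentomatoes.com/m/'


def tomato_url(title):
    """Build the Rotten Tomatoes URL that the loop would request for a title."""
    slug = re.sub(' ', '_', title)  # spaces become underscores
    if 'Star Trek' not in title:
        # title lacks the franchise prefix: add it, then drop colons
        return TOMATO_BASE_URL + re.sub(':', '', 'Star_Trek_' + slug)
    if 'Khan' in title:
        # the colon is replaced by the missing sequel number
        return TOMATO_BASE_URL + re.sub(':', '_II', slug)
    return TOMATO_BASE_URL + re.sub(':', '', slug)  # default: just drop colons


print(tomato_url('Star Trek Into Darkness'))
print(tomato_url('Star Trek: The Wrath of Khan'))
```

Keeping the special cases in one place also makes it easier to add another exception, should a further title not follow the usual pattern.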
At this point one might want to save the data, for example by calling df.to_csv('Star_Trek_movie_data.csv', sep=';'). The result can be downloaded here.
Data visualisation: using ggplot and matplotlib
I start off by using the ggplot module, as I am very familiar with the syntax from R.
p = ggplot(aes(x='date', y='IMDb_rating'), data=df) + geom_point() + geom_line(size=5, color='orange') + theme_bw()  # basic plot
p = p + geom_line(aes(x='date', y='Metacritic_rating'), data=df, size=5, color='purple')
p = p + geom_line(aes(x='date', y='Tomatometer'), data=df, size=5, color='grey')
p = p + geom_line(aes(x='date', y='Tomato_user'), data=df, size=5, color='blue')
p = p + ylim(0, 5) + xlim(1975, 2016) + xlab(' ') + ylab(' ') + ggtitle('Star Trek movie ratings')  # make axes pretty
The plot p now holds essentially all the information we need. But it is not pretty yet, as you can see by calling print p, which is what I did to produce the figure above. For visual gimmicks we shall leave ggplot and turn to matplotlib.
The module matplotlib works very much like figure production in MATLAB. So, the figure should not be stored in a variable like p above, but instead be kept open.
p.make() # exporting the figure to use it in matplotlib
The first thing to improve is to tell the reader what the different lines represent. I personally believe that it is best practice to avoid separate legends and, instead, use intuitive explanations in the figure itself.
plt.text(2017.5, 0, '@ri', color='black')  # keep figure open for this to work
plt.text(2017.5, 4.25, 'Rotten\nTomatoes', color='grey')  # keep figure open for this to work
plt.text(2017.5, 3.75, 'Rotten\nTomatoes\nusers', color='blue')  # keep figure open for this to work
plt.text(2017.5, 3.4, 'IMDb users', color='orange')  # keep figure open for this to work
plt.text(2017.5, 3.2, 'Metacritic', color='purple')  # keep figure open for this to work
The result:
How to tell the viewer which movie is where? The film posters are the easiest way to achieve this.
Including an image in a plot is not straightforward. I will use the annotation box approach and hide the box itself behind the image.
First off though, we need the drawing area called axes or ax.
ax = plt.gca()
Next we define a new function add_image() to place an image from url at the coordinates xy of drawing area ax_ with the image zoom imzoom.
def add_image(ax_, url, xy, imzoom):
    if 'http' in url:  # image on internet
        f = urllib2.urlopen(url)
    else:  # image in working directory
        f = url
    arr_img = plt.imread(f, format='jpg')
    imagebox = OffsetImage(arr_img, zoom=imzoom)
    imagebox.image.axes = ax_
    ab = AnnotationBbox(imagebox, xy, xybox=(0., 0.), boxcoords="offset points", pad=-0.5)  # hide box behind image
    ax_.add_artist(ab)
    return ax_
Adding each film poster is now easy. For the ordinate (y-axis) position, by the way, I chose the average rating.
for i in range(len(df)):  # for each Star Trek movie
    add_image(ax, df['image_url'][i], [df['date'][i], sum(df.iloc[i, 1:5]) / 4.0], 0.3)
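Note that the positional slice `df.iloc[i, 1:5]` only gives the average rating if the four rating columns sit in positions 1 to 4 of the data frame. A more defensive variant selects the ratings by name. Here is a minimal sketch of that idea with a plain dictionary standing in for one row; the example values are made up and `average_rating` is a hypothetical helper, not part of pandas.

```python
RATING_COLUMNS = ['IMDb_rating', 'Metacritic_rating', 'Tomatometer', 'Tomato_user']


def average_rating(row):
    """Average the four star ratings of one movie, selected by name."""
    return sum(row[name] for name in RATING_COLUMNS) / float(len(RATING_COLUMNS))


# Made-up example row, not real data.
example_row = {'date': 2000, 'title': 'Example', 'IMDb_rating': 4.0,
               'Metacritic_rating': 3.0, 'Tomatometer': 4.5, 'Tomato_user': 3.5}
print(average_rating(example_row))
```

Because the columns are named explicitly, the result does not change if the column order of the data frame does.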
I think the x-axis could be simplified.
ax.xaxis.set_ticks(range(1980, 2020, 10)) # minimal x-axis style
I replace the y-axis by star symbols. All in the interest of avoiding text.
ax.yaxis.set_visible(False)  # no numerical y-axis at all
add_image(ax, 'grey_star.jpg', [1975, 0], 0.05)  # zero stars on sort of y-axis
for i in range(1, 6):  # for each star rating from 1 onwards
    for j in range(i):  # for each individual star
        add_image(ax, 'gold_star.jpg', [1975 + j * 0.7, i], 0.05)
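The nested loop places one star at height 1, two stars at height 2, and so on, each shifted 0.7 years to the right of the previous one. To check the layout without drawing anything, the coordinates can be computed on their own. This is just a sketch; `star_positions` and its parameter names are hypothetical.

```python
def star_positions(max_stars=5, x0=1975, dx=0.7):
    """Return (x, y) coordinates for the gold stars: i stars on row y = i."""
    positions = []
    for i in range(1, max_stars + 1):  # one row per rating level
        for j in range(i):             # i stars in that row
            positions.append((x0 + j * dx, i))
    return positions


positions = star_positions()
print(len(positions))  # total number of gold stars to draw
```

Each tuple could then be passed to add_image() as the xy argument.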
What stands out the most from the figure is just how awful Star Trek V was. Let’s highlight this.
ax.annotate('The absolute worst movie:\n' + df[df['IMDb_rating'] == min(df['IMDb_rating'])]['title'].iloc[0],
            xy=(1988, 1.5), xytext=(1978, 1), arrowprops=dict(facecolor='black', shrink=0.05))
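The label text is built by looking up the title of the movie with the lowest IMDb rating. The same lookup on a plain list of dictionaries looks as follows. This is only a sketch: the titles and ratings are placeholders rather than the scraped data, and `lowest_rated` is a hypothetical helper.

```python
# Made-up ratings for illustration only, not the scraped data.
movies = [
    {'title': 'Movie A', 'IMDb_rating': 3.4},
    {'title': 'Movie B', 'IMDb_rating': 1.9},
    {'title': 'Movie C', 'IMDb_rating': 2.8},
]


def lowest_rated(rows, key='IMDb_rating'):
    """Return the title of the row with the smallest rating."""
    return min(rows, key=lambda row: row[key])['title']


label = 'The absolute worst movie:\n' + lowest_rated(movies)
print(label)
```

The line break in the label keeps the annotation narrow, so it fits next to the arrow.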
Finally, the figure dimensions are not optimised for social media like Twitter, and the margins are simply too big. The last statements deal with these problems:
fig = plt.gcf()  # get current figure to show it
fig.set_size_inches(1024 / 70, 512 / 70)  # reset the figure size to twitter standard
fig.savefig('Star Trek movie ratings_dates.png', dpi=96, bbox_inches='tight')  # save figure
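A note on the numbers: matplotlib sizes figures in inches, and the saved image has (inches × dpi) pixels before any cropping by bbox_inches='tight'. Under Python 2, 1024 / 70 and 512 / 70 are integer divisions, so the figure is set to 14 by 7 inches. The sketch below shows the conversion in both directions; the helper names are my own. If the aim is an image of exactly 1024 by 512 pixels, dividing by the dpi passed to savefig would be the natural choice, though the tight bounding box will still trim the result.

```python
def inches_for_pixels(width_px, height_px, dpi):
    """Figure size in inches that yields the requested pixel size at a given dpi."""
    return width_px / float(dpi), height_px / float(dpi)


def pixels_for_inches(width_in, height_in, dpi):
    """Pixel size of a figure saved at the given dpi (before any cropping)."""
    return int(round(width_in * dpi)), int(round(height_in * dpi))


print(pixels_for_inches(14, 7, 96))      # size produced by the statements above
print(inches_for_pixels(1024, 512, 96))  # inches needed for 1024 x 512 pixels
```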