The TMDb dataset we are going to analyze contains information about 10,000 movies collected from the Movie Database, TMDb. It provides details about revenue, budget, votes, cast, production companies among others related to movies that span a period of more than 50 years.
The following questions will be answered as we analyze the data.
# import the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Initialize seaborn and magic commands
sns.set()
%matplotlib inline
# load tmdb-movies csv into a dataframe
movies_df = pd.read_csv('tmdb-movies.csv')
# Examine first 5 rows
movies_df.head(5)
# check the number of rows and columns
movies_df.shape
# Get a quick statistics of the dataset
movies_df.describe()
# View more details about the dataset
movies_df.info()
# check the columns with empty values
movies_df.isnull().sum()
# check for number of duplicates
movies_df.duplicated().sum()
We observe from our initial checks on our dataset that there are duplicates and some missing values.
We are going to remove the duplicates and replace missing values in the following columns: cast, director, keywords, genres, production_companies
We will also drop some columns that will not be relevant to the questions we want to address i.e. imdb_id, homepage, tagline and overview
We saw that the budget, revenue, budget_adj, revenue_adj columns had a min value of 0.0 which implies some of the rows in these columns had a value of 0.0 which is not desirable for our analysis. These rows will be removed as well.
# remove duplicates
movies_df.drop_duplicates(inplace=True)
# verify duplicates removal
movies_df.duplicated().sum()
# confirm the new number of rows
movies_df.shape
# drop imdb_id, homepage, tagline and overview columns.
movies_df.drop(['imdb_id', 'homepage','tagline', 'overview'], axis=1, inplace=True )
movies_df.shape
# remove rows if the budget or revenue or budget_adj or revenue_adj is 0.0
movies_df = movies_df.loc[(movies_df.budget * movies_df.revenue * movies_df.budget_adj* movies_df.revenue_adj) != 0]
movies_df.shape
# fill the missing values in the following columns with the values indicated
# cast = 'no_cast'
# director = 'no_director'
# keywords = 'no_keywords'
# genres = 'no_genres'
# production_companies = 'no_production_companies'
columns_to_fillna = ['cast', 'director', 'keywords', 'genres', 'production_companies']
for column in columns_to_fillna:
movies_df[column] = movies_df[column].fillna('no_'+column)
# check the number of null values again
movies_df.isnull().sum()
def generate_plot(x_val, y_val, fig_size, title, x_label, y_label):
"""
This functions takes inputs for a bar graph and produces a plot based on the inputs
"""
plt.subplots(figsize=fig_size)
sns.barplot(x_val, y_val)
plt.title(title, fontsize=30)
plt.xlabel(x_label, fontsize=20)
plt.ylabel(y_label, fontsize=20);
def generate_value_and_count(data):
"""
This functions takes a column and separates the pipe-separated values and return a dict of
the value and the number of times it occurs
"""
val_list = [val.split('|') for val in data]
top_val_list = []
for new_val in val_list:
for single in new_val:
top_val_list.append(single)
# get the value and count of each item in the top_val_list
val_and_count = dict()
for i in top_val_list:
val_and_count[i] = val_and_count.get(i, 0)+1
return val_and_count
I used vote_average as the metric for popularity with the understanding that the average votes given to a movies reflects its popularity among consumers
Drama, Western, Romance were the popular genres in the earliest decade.
Western, Drama, Thriller became popular in the most recent.
We therefore observe that both Western and Drama genres remained popular in both past and recent times.
The last is a visualization of the various genres associated with the 5 top popular movies for each year
# Get the Most Popular Genre in Each Release Year
popular_genre_per_year = movies_df.groupby(['release_year'])[['vote_average','genres']].max()
plt.subplots(figsize=(25, 20))
graph = sns.barplot(
popular_genre_per_year.index,
popular_genre_per_year['vote_average'],
hue=popular_genre_per_year['genres'],
dodge=False,
palette='muted',
)
graph.set_xticklabels(graph.get_xticklabels(),
rotation=90,
fontweight='light',fontsize='xx-large'
)
graph.axes.set_title("The Yearly Most Popular Movie Genre From 1960 to 2015",fontsize=40)
graph.set_xlabel("Genres",fontsize=30)
graph.set_ylabel("Popularity",fontsize=30);
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.1, prop={'size': 20});
# Find the most popular genre in the first earliest decade
earliest_decade = popular_genre_per_year.head(10)['genres']
earliest_decade_genres = generate_value_and_count(data=earliest_decade)
earliest_decade_genres_df = pd.DataFrame.from_dict(earliest_decade_genres, orient="index")
# Generate the Plot
generate_plot(x_val=earliest_decade_genres_df[0],
y_val=earliest_decade_genres_df.index,
fig_size=(20,10),
title='Most Popular Genres In Earliest Decade',
x_label='Number of Movies',
y_label='Genre'
)
# Find the most popular genre in the most recent decade
recent_decade = popular_genre_per_year.tail(10)['genres']
recent_decade_genres = generate_value_and_count(data=recent_decade)
recent_decade_genres_df = pd.DataFrame.from_dict(recent_decade_genres, orient="index")
# Generate the Plot
generate_plot(x_val=recent_decade_genres_df[0],
y_val=recent_decade_genres_df.index,
fig_size=(20,10),
title='Most Popular Genres In Recent Decade',
x_label='Number of Movies',
y_label='Genre'
)
# Group The Genres by Year and Find the Most Popular Ones in Each Year using vote_average as the popularity metric
for year in np.arange(1960,2016, 10): # Interval can be changed if all the years need to appear.
df = movies_df.query('release_year == @year').groupby('genres').mean().sort_values(by=['vote_average'], ascending=False).head(5)
generate_plot(x_val=df.index,
y_val=df['vote_average'],
fig_size=(20,5),
title=f"5 Most Popular Genres Each {year}",
x_label=year,
y_label='Popularity'
)
The change in revenue from year to year does not seem to follow any particular pattern. This could be due to unequal number of movies produced in each year, directly affecting the total sum of revenue generated by year.
The difference in budget and revenue shows a significant increase in the last decade as captured the dataset. This could be attributed to efficient and improved technologies and distribution channels in recent times.
A movie with a high budget is likely to help in all marketing and sales efforts to boost revenue generated.
# Find the Sum of the Various Columns According to the Year the Movies Were Released
yearly_movies_sum = movies_df.groupby('release_year').sum()
# Find The Yearly Change in Revenue
yearly_movies_sum.apply(lambda x:x.diff().fillna(0))[['revenue']].plot()
plt.xlabel('Year Released')
plt.ylabel('Change in Revenue(USD)')
plt.title('Yearly Change In Revenue');
# Compare the Yearly sum for budget_adj and revenue_adj
yearly_movies_sum[['budget_adj', 'revenue_adj']].plot()
plt.xlabel('Year Released')
plt.ylabel('Revenue & Budget (USD)')
plt.title('Revenue By Release Year');
# Compare the Yearly sum for budget and revenue
yearly_movies_sum[['budget', 'revenue']].plot()
plt.xlabel('Year Released')
plt.ylabel('Revenue & Budget (USD)')
plt.title('Revenue By Release Year');
I chose movies with a Revenue Greater than the 90th Percentile as High Revenue Movies. I looked at the distribution of the other columns to get a fair idea of the properties associated with the high revenue movies
# I chose movies with a Revenue Greater than the 90th Percentile as High Revenue Movies
# Calculate the 90th Percentile Revenue
ninety_percentile = np.percentile(movies_df['revenue_adj'], 90)
# Filter the movies with movies with revenue greater than 90th Percentile
highest_revenue_movies = movies_df.query('revenue_adj > @ninety_percentile')
# Check the Budget for of High Revenue Movies
highest_revenue_movies['budget_adj'].hist()
plt.xlabel('Budget (USD)')
plt.ylabel('Number of Movies')
plt.title('Number of Movies By Budget');
# Check the Vote Average (Popularity Metric) of High Revenue Movies
highest_revenue_movies['vote_average'].hist()
plt.xlabel('Average Votes')
plt.ylabel('Number of Movies')
plt.title('Number of Movies By Average Vote');
# Check the Runtime of High Revenue Movies - How Long The Movies span
highest_revenue_movies['runtime'].hist()
plt.xlabel('Runtime')
plt.ylabel('Number of Movies')
plt.title('Number of Movies By Runtime');
# Check the Release Year of High Revenue Movies
highest_revenue_movies['release_year'].hist()
plt.xlabel('Year of Release')
plt.ylabel('Number of Movies')
plt.title('Number of Movies By Year of Release');
# Find the top 20 revenue movies
top_20_revenue = highest_revenue_movies.sort_values(by=['revenue_adj'], ascending=False).head(20)
We realize that for some movies, they had more than 1 director, directing the movies. I therefore looked at the number of times an individual director, directed or was part of the team that directed a movie.
Steven Spielberg was first, directing 3 top revenue movies, followed James Cameron, Josh Whedon and Wolfgang Reitherman with 3 movies each.
# Find the Director That Featured on Most of the Highest Revenue Movies
director_counts = generate_value_and_count(data=top_20_revenue['director'])
director_count_df = pd.DataFrame.from_dict(director_counts, orient="index")
# Generate the Plot
generate_plot(x_val=director_count_df[0],
y_val=director_count_df.index,
fig_size=(20,10),
title='Number of Top Movies Per Director',
x_label='Number Of Top Movies',
y_label='Director'
)
Over here we first pick the actors who featured in each of the top 20 revenue movies. We then find the number of movies that all the individual actors featured in. The top 3 who were most present in the top 20 high revenue movies were Carrie Fisher, Mark Hamill and Harrison Ford
# Find the Actor That Featured on Most of the Highest Revenue Movies
actor_counts = generate_value_and_count(data=top_20_revenue['cast'])
actor_count_df = pd.DataFrame.from_dict(actor_counts, orient="index").sort_values(by=[0], ascending=False).head(20)
# Generate the Plot
generate_plot(x_val=actor_count_df[0],
y_val=actor_count_df.index,
fig_size=(15,10),
title='Top Movies Per Actor',
x_label='Number Of Top Movies',
y_label='Actors'
)
Twentieth Century Fox Film Corporation was first followed by Lucas Film and Universal Pictures as the production companies that produced or were part of the production of the top 20 revenue movies. We used the number of occurences of a production company in all the top 20 revenue movies as the metric for this.
# Find the Production Company That Featured in Most of the Highest Revenue Movies
company_counts = generate_value_and_count(data=top_20_revenue['production_companies'])
company_count_df = pd.DataFrame.from_dict(company_counts, orient="index")
# Generate the Plot
generate_plot(x_val=company_count_df[0],
y_val=company_count_df.index,
fig_size=(10,10),
title='Top Movies Per Production Company',
x_label='Number Of Top Movies',
y_label='Production Company'
)
We found out that Avatar is the Highest Revenue Movie from 1960 to 2015
# Generate plots for highest revenue movies
generate_plot(x_val=top_20_revenue['revenue_adj'],
y_val=top_20_revenue['original_title'],
fig_size=(10,10),
title='Top Revenue Movies From 1960 to 2015',
x_label='Revenue(USD)',
y_label='Movie Title'
)
We tried to answer all the questions we asked ourselves at the start of this analysis though we realize that the data that was left after cleaning the data was very small.
The other challenge we observe is the unequal distribution of the number of movies produced each year. It follows that, the comparison among the various years especially in terms of total revevenue generated would not be fair.
This is due to the fact that a lot of the data on revenue was 0 and not desirable and all those corresponding rows had to be dropped. All the observations are therefore tentative.
We observe that: