IMDB Movie Analysis

Let us look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
skim(movies)
Data summary
Name movies
Number of rows 2961
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 1 83 0 2907 0
genre 0 1 5 11 0 17 0
director 0 1 3 32 0 1366 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.00e+03 9.95e+00 1920.0 2.00e+03 2.00e+03 2.01e+03 2.02e+03 ▁▁▁▂▇
duration 0 1 1.10e+02 2.22e+01 37.0 9.50e+01 1.06e+02 1.19e+02 3.30e+02 ▃▇▁▁▁
gross 0 1 5.81e+07 7.25e+07 703.0 1.23e+07 3.47e+07 7.56e+07 7.61e+08 ▇▁▁▁▁
budget 0 1 4.06e+07 4.37e+07 218.0 1.10e+07 2.60e+07 5.50e+07 3.00e+08 ▇▂▁▁▁
cast_facebook_likes 0 1 1.24e+04 2.05e+04 0.0 2.24e+03 4.60e+03 1.69e+04 6.57e+05 ▇▁▁▁▁
votes 0 1 1.09e+05 1.58e+05 5.0 1.99e+04 5.57e+04 1.33e+05 1.69e+06 ▇▁▁▁▁
reviews 0 1 5.03e+02 4.94e+02 2.0 1.99e+02 3.64e+02 6.31e+02 5.31e+03 ▇▁▁▁▁
rating 0 1 6.39e+00 1.05e+00 1.6 5.80e+00 6.50e+00 7.10e+00 9.30e+00 ▁▁▆▇▁

There is no missing values as seen in the result below. Not all the entries are distinct as the n_unique values do not match, indicating that several values were repeated. Based on the number of rows 2961 and n_unique 2907, we can conclude that 54 rows are duplicated.

Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:

gross : The gross earnings in the US box office, not adjusted for inflation budget: The movie’s budget cast_facebook_likes: the number of facebook likes cast memebrs received votes: the number of people who voted for (or rated) the movie in IMDB reviews: the number of reviews for that movie rating: IMDB average rating

unique_movies <- movies %>%
  distinct(title,genre,director,year,.keep_all=TRUE)
new_unique_movies <- movies %>%
  summarise("number of unique movies"=n()) 

A table with the count of movies by genre, ranked in descending order is produced:

count_movie_genre <- unique_movies %>%
  group_by(genre) %>%
  count(sort=TRUE)

colnames(count_movie_genre) <- c("Genre", "Number of Movies")

count_movie_genre %>% 
  kable()
Genre Number of Movies
Comedy 844
Action 719
Drama 484
Adventure 281
Crime 198
Biography 135
Horror 128
Animation 35
Fantasy 26
Documentary 25
Mystery 15
Sci-Fi 7
Family 3
Musical 2
Romance 2
Western 2
Thriller 1

A table with the average gross earning and budget (gross and budget) by genre is produced. Let us calculate a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. Ranked genres by this return_on_budget in descending order!

return_movie_genre <- unique_movies %>%
  group_by(genre) %>% 
  summarise(mean_gross = mean(gross),mean_budget = mean(budget)) %>% 
  mutate(return_on_budget = 
           ((mean_gross - mean_budget)/mean_budget)) %>% 
  arrange(-return_on_budget)

colnames(return_movie_genre) <- c("Genre", "Mean gross [$]", "Mean budget [$]", "Return on budget")
  
return_movie_genre %>% 
  kable()
Genre Mean gross [$] Mean budget [$] Return on budget
Musical 9.21e+07 3189500 27.871
Family 1.49e+08 14833333 9.056
Western 2.08e+07 3465000 5.009
Documentary 1.74e+07 5887852 1.947
Horror 3.78e+07 13804379 1.737
Fantasy 4.19e+07 18484615 1.267
Comedy 4.25e+07 24458506 0.737
Mystery 6.91e+07 41500000 0.665
Animation 9.84e+07 61701429 0.595
Biography 4.52e+07 28543696 0.584
Adventure 9.44e+07 64692313 0.458
Drama 3.68e+07 25832605 0.423
Crime 3.76e+07 26527405 0.417
Romance 3.13e+07 25107500 0.245
Action 8.63e+07 70774558 0.219
Sci-Fi 2.98e+07 27607143 0.079
Thriller 2.47e+03 300000 -0.992

A table that shows the top 15 directors who have created the highest gross revenue in the box office is produced.

 top_15_directors <- unique_movies %>%
  select(director, gross) %>% 
  group_by(director) %>% 
  summarise(total_gross = sum(gross), 
            mean_gross = mean(gross), 
            median_gross = median(gross), 
            sd_gross = sd(gross)) %>% 
  top_n(15,total_gross) %>% 
  arrange(-total_gross)

colnames(top_15_directors) <- c("Director", "Total gross [$]", "Mean gross [$]", "Median gross [$]", "Standard deviation of gross [$]")
  
top_15_directors %>% 
  kable()
Director Total gross [$] Mean gross [$] Median gross [$] Standard deviation of gross [$]
Steven Spielberg 4.01e+09 1.75e+08 1.64e+08 1.01e+08
Michael Bay 2.20e+09 1.83e+08 1.68e+08 1.26e+08
James Cameron 1.91e+09 3.18e+08 1.76e+08 3.09e+08
Christopher Nolan 1.81e+09 2.27e+08 1.97e+08 1.87e+08
George Lucas 1.74e+09 3.48e+08 3.80e+08 1.46e+08
Robert Zemeckis 1.62e+09 1.25e+08 1.01e+08 9.13e+07
Tim Burton 1.56e+09 1.11e+08 6.98e+07 9.93e+07
Sam Raimi 1.44e+09 1.80e+08 1.38e+08 1.75e+08
Clint Eastwood 1.38e+09 7.25e+07 4.67e+07 7.55e+07
Francis Lawrence 1.36e+09 2.72e+08 2.82e+08 1.35e+08
Ron Howard 1.34e+09 1.11e+08 1.02e+08 8.19e+07
Gore Verbinski 1.33e+09 1.90e+08 1.23e+08 1.54e+08
Andrew Adamson 1.14e+09 2.84e+08 2.80e+08 1.21e+08
Shawn Levy 1.13e+09 1.03e+08 8.55e+07 6.55e+07
Ridley Scott 1.13e+09 8.06e+07 4.78e+07 6.88e+07

Finally, ratings. Let’s produce a table that describes how ratings are distributed by genre. We don’t want just the mean, but also, min, max, median, SD and some kind of a histogram or density graph that visually shows how ratings are distributed.

dist_by_genre <- unique_movies %>%
  select(genre, rating) %>% 
  group_by(genre) %>% 
  summarise(mean_rating = mean(rating), 
            min_rating = min(rating), 
            max_rating = max(rating), 
            sd_rating = sd(rating)) 

colnames(dist_by_genre) <- c("Genre", "Mean rating", "Min rating", "Max rating", "Standard deviation of rating")


dist_by_genre %>% 
  kable()
Genre Mean rating Min rating Max rating Standard deviation of rating
Action 6.23 2.1 9.0 1.039
Adventure 6.51 2.3 8.6 1.106
Animation 6.65 4.5 8.0 0.968
Biography 7.11 4.5 8.9 0.760
Comedy 6.11 1.9 8.8 1.024
Crime 6.92 4.8 9.3 0.853
Documentary 6.66 1.6 8.5 1.767
Drama 6.74 2.1 8.8 0.915
Family 6.50 5.7 7.9 1.217
Fantasy 6.08 4.3 7.9 0.953
Horror 5.79 3.6 8.5 0.987
Musical 6.75 6.3 7.2 0.636
Mystery 6.84 4.6 8.5 0.910
Romance 6.65 6.2 7.1 0.636
Sci-Fi 6.66 5.0 8.2 1.094
Thriller 4.80 4.8 4.8 NA
Western 5.70 4.1 7.3 2.263
unique_movies %>%
  ggplot(aes(x=rating)) + 
  geom_histogram() +
  facet_wrap(vars(genre)) +
  labs(title = "Rating statistics of each genre", x = "Rating", y = "Count")

Let us examine the relationship between gross and cast_facebook_likes. A scatterplot has been produced discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office.

unique_movies %>%
  ggplot(aes(y=gross, x=cast_facebook_likes)) + 
  geom_point() +
  labs(title = "Gross vs. Cast facebook likes", x = "Cast Facebook Likes", y = "Gross [$]")+
  NULL

It’s unlikely that number of facebook likes the cast received will be a good predictor of how much money a movie will make. X-axis is the independent variable “cast_facebook_likes”, Y-axis is the dependent variable “gross”. There is no obvious correlation between the two variables.

Let us examine the relationship between gross and budget. A scatterplot has been produced to discuss whether budget is likely to be a good predictor of how much money a movie will make at the box office.

unique_movies %>%
  ggplot(aes(y=gross, x=budget)) + 
  geom_point() +
  labs(title = "Gross vs. Budget", x = "Budget [$]", y = "Gross [$]")+
  NULL

It can be inferred from the plot that budget is a weak predictor of how much money a movie will make. There is a lot of variation and scatter in the data thus the correlation is very weak.

Let us examine the relationship between gross and rating. A scatterplot has been produced, faceted by genre and let’d discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office.

unique_movies %>%
  ggplot(aes(y=gross, x=rating)) + 
  geom_point() +
  facet_wrap(vars(genre))+
  labs(title = "Gross vs. the rating of the movie genre", x = "Rating", y = "Gross [$]")+
  NULL

In the genre where there are enough data instances, the rating can be a good predictor for how much money a movie will make at the box office. This is expected because movies that are more popular are more likely to make more money. In the Action and Adventure genres, where there are lots of data points, the trend seems to be exponential. In the genres with few data points, it’s hard to determine what the relationship is between the two variables.

IMDB ratings: Differences between directors

Let us run a hypothesis test, using both the t.test command and the infer package to simulate from a null distribution, where we assume zero difference between the two. H0:μA−μB=0 H1:μA−μB≠0 t-stat is equal to 3 and p-value is equal to 0.01. Therefore, we can reject the null hypothesis. Therefore, we can conclude that we are 95% confident that movies of Stephen Spielberg have higher average ratings than those of Tim Burton. Let us load the data and examine its structure!

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
#calculate the week4 and week5 confidence intervals
rating_comparison <- movies %>% 
  filter(director=="Steven Spielberg" | director=="Tim Burton") %>% 
  group_by(director) %>% 
  summarise(avg_rating = mean(rating),
            sd_rating = sd(rating, na.rm=TRUE),
            count_rating = n(),
            se_rating = sd_rating / sqrt(count_rating),
            ci_rating_up = avg_rating + qt(.975, count_rating-1)*se_rating,
            ci_rating_dw = avg_rating - qt(.975, count_rating-1)*se_rating
            )

#plot the confidence intervals
rating_comparison %>% 
  ggplot(aes(x=avg_rating, y=director, color=director))+
    geom_rect(fill="grey",alpha=0.5, color = "grey",
            aes(xmin=max(ci_rating_dw),
                xmax=min(ci_rating_up),
                ymin=-Inf,
                ymax=Inf))+
  geom_errorbarh(aes(xmin=ci_rating_dw,xmax=ci_rating_up))+
  geom_point(aes(x=avg_rating, y=director), size=3)+
  geom_text(aes(label=round(avg_rating, digits=2)), vjust = -1.5)+
  labs(title="Do Spielberg and Burton have the same mean IMDB ratings?",
       subtitle = "95% confidence intervals overlap",
       x = "Mean IMDB Rating")

#calculate via t-test
comparison_rating <- movies %>% 
  filter(director=="Steven Spielberg" | director=="Tim Burton")

t.test(rating ~ director, data = comparison_rating)
## 
##  Welch Two Sample t-test
## 
## data:  rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means between group Steven Spielberg and group Tim Burton is not equal to 0
## 95 percent confidence interval:
##  0.16 1.13
## sample estimates:
## mean in group Steven Spielberg       mean in group Tim Burton 
##                           7.57                           6.93
#calculate using infer package
set.seed(1234)
ratings_in_null <- comparison_rating %>% 
  
  specify(rating ~ director) %>% 
  
  hypothesise(null="independence") %>% 
  
  generate(reps = 100, type = "permute") %>%
  
  calculate(stat = "diff in means", order = c("Tim Burton",
                                              "Steven Spielberg"))

#calculate the observed difference
observed_difference <- comparison_rating %>%
  specify(rating ~ director) %>%
  calculate(stat = "diff in means")

#calculate the p-value of the differences
ratings_in_null %>% 
  get_pvalue(obs_stat=observed_difference, direction="both")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1    0.02