IMDB Movie Analysis

Let us look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)

## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …

skim(movies)

Data summary

Name	movies
Number of rows	2961
Number of columns	11
_______________________
Column type frequency:
character	3
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
title	1	1	83	2907
genre	1	5	11	17
director	1	3	32	1366

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2.00e+03	9.95e+00	1920.0	2.00e+03	2.00e+03	2.01e+03	2.02e+03	▁▁▁▂▇
duration	1	1.10e+02	2.22e+01	37.0	9.50e+01	1.06e+02	1.19e+02	3.30e+02	▃▇▁▁▁
gross	1	5.81e+07	7.25e+07	703.0	1.23e+07	3.47e+07	7.56e+07	7.61e+08	▇▁▁▁▁
budget	1	4.06e+07	4.37e+07	218.0	1.10e+07	2.60e+07	5.50e+07	3.00e+08	▇▂▁▁▁
cast_facebook_likes	1	1.24e+04	2.05e+04	0.0	2.24e+03	4.60e+03	1.69e+04	6.57e+05	▇▁▁▁▁
votes	1	1.09e+05	1.58e+05	5.0	1.99e+04	5.57e+04	1.33e+05	1.69e+06	▇▁▁▁▁
reviews	1	5.03e+02	4.94e+02	2.0	1.99e+02	3.64e+02	6.31e+02	5.31e+03	▇▁▁▁▁
rating	1	6.39e+00	1.05e+00	1.6	5.80e+00	6.50e+00	7.10e+00	9.30e+00	▁▁▆▇▁

There is no missing values as seen in the result below. Not all the entries are distinct as the n_unique values do not match, indicating that several values were repeated. Based on the number of rows 2961 and n_unique 2907, we can conclude that 54 rows are duplicated.

Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:

gross : The gross earnings in the US box office, not adjusted for inflation budget: The movie’s budget cast_facebook_likes: the number of facebook likes cast memebrs received votes: the number of people who voted for (or rated) the movie in IMDB reviews: the number of reviews for that movie rating: IMDB average rating

unique_movies <- movies %>%
  distinct(title,genre,director,year,.keep_all=TRUE)

new_unique_movies <- movies %>%
  summarise("number of unique movies"=n())

A table with the count of movies by genre, ranked in descending order is produced:

count_movie_genre <- unique_movies %>%
  group_by(genre) %>%
  count(sort=TRUE)

colnames(count_movie_genre) <- c("Genre", "Number of Movies")

count_movie_genre %>% 
  kable()

Genre	Number of Movies
Comedy	844
Action	719
Drama	484
Adventure	281
Crime	198
Biography	135
Horror	128
Animation	35
Fantasy	26
Documentary	25
Mystery	15
Sci-Fi	7
Family	3
Musical	2
Romance	2
Western	2
Thriller	1

A table with the average gross earning and budget (gross and budget) by genre is produced. Let us calculate a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. Ranked genres by this return_on_budget in descending order!

return_movie_genre <- unique_movies %>%
  group_by(genre) %>% 
  summarise(mean_gross = mean(gross),mean_budget = mean(budget)) %>% 
  mutate(return_on_budget = 
           ((mean_gross - mean_budget)/mean_budget)) %>% 
  arrange(-return_on_budget)

colnames(return_movie_genre) <- c("Genre", "Mean gross [$]", "Mean budget [$]", "Return on budget")
  
return_movie_genre %>% 
  kable()

Genre	Mean gross [$]	Mean budget [$]	Return on budget
Musical	9.21e+07	3189500	27.871
Family	1.49e+08	14833333	9.056
Western	2.08e+07	3465000	5.009
Documentary	1.74e+07	5887852	1.947
Horror	3.78e+07	13804379	1.737
Fantasy	4.19e+07	18484615	1.267
Comedy	4.25e+07	24458506	0.737
Mystery	6.91e+07	41500000	0.665
Animation	9.84e+07	61701429	0.595
Biography	4.52e+07	28543696	0.584
Adventure	9.44e+07	64692313	0.458
Drama	3.68e+07	25832605	0.423
Crime	3.76e+07	26527405	0.417
Romance	3.13e+07	25107500	0.245
Action	8.63e+07	70774558	0.219
Sci-Fi	2.98e+07	27607143	0.079
Thriller	2.47e+03	300000	-0.992

A table that shows the top 15 directors who have created the highest gross revenue in the box office is produced.

 top_15_directors <- unique_movies %>%
  select(director, gross) %>% 
  group_by(director) %>% 
  summarise(total_gross = sum(gross), 
            mean_gross = mean(gross), 
            median_gross = median(gross), 
            sd_gross = sd(gross)) %>% 
  top_n(15,total_gross) %>% 
  arrange(-total_gross)

colnames(top_15_directors) <- c("Director", "Total gross [$]", "Mean gross [$]", "Median gross [$]", "Standard deviation of gross [$]")
  
top_15_directors %>% 
  kable()

Director	Total gross [$]	Mean gross [$]	Median gross [$]	Standard deviation of gross [$]
Steven Spielberg	4.01e+09	1.75e+08	1.64e+08	1.01e+08
Michael Bay	2.20e+09	1.83e+08	1.68e+08	1.26e+08
James Cameron	1.91e+09	3.18e+08	1.76e+08	3.09e+08
Christopher Nolan	1.81e+09	2.27e+08	1.97e+08	1.87e+08
George Lucas	1.74e+09	3.48e+08	3.80e+08	1.46e+08
Robert Zemeckis	1.62e+09	1.25e+08	1.01e+08	9.13e+07
Tim Burton	1.56e+09	1.11e+08	6.98e+07	9.93e+07
Sam Raimi	1.44e+09	1.80e+08	1.38e+08	1.75e+08
Clint Eastwood	1.38e+09	7.25e+07	4.67e+07	7.55e+07
Francis Lawrence	1.36e+09	2.72e+08	2.82e+08	1.35e+08
Ron Howard	1.34e+09	1.11e+08	1.02e+08	8.19e+07
Gore Verbinski	1.33e+09	1.90e+08	1.23e+08	1.54e+08
Andrew Adamson	1.14e+09	2.84e+08	2.80e+08	1.21e+08
Shawn Levy	1.13e+09	1.03e+08	8.55e+07	6.55e+07
Ridley Scott	1.13e+09	8.06e+07	4.78e+07	6.88e+07

Finally, ratings. Let’s produce a table that describes how ratings are distributed by genre. We don’t want just the mean, but also, min, max, median, SD and some kind of a histogram or density graph that visually shows how ratings are distributed.

dist_by_genre <- unique_movies %>%
  select(genre, rating) %>% 
  group_by(genre) %>% 
  summarise(mean_rating = mean(rating), 
            min_rating = min(rating), 
            max_rating = max(rating), 
            sd_rating = sd(rating)) 

colnames(dist_by_genre) <- c("Genre", "Mean rating", "Min rating", "Max rating", "Standard deviation of rating")


dist_by_genre %>% 
  kable()

Genre	Mean rating	Min rating	Max rating	Standard deviation of rating
Action	6.23	2.1	9.0	1.039
Adventure	6.51	2.3	8.6	1.106
Animation	6.65	4.5	8.0	0.968
Biography	7.11	4.5	8.9	0.760
Comedy	6.11	1.9	8.8	1.024
Crime	6.92	4.8	9.3	0.853
Documentary	6.66	1.6	8.5	1.767
Drama	6.74	2.1	8.8	0.915
Family	6.50	5.7	7.9	1.217
Fantasy	6.08	4.3	7.9	0.953
Horror	5.79	3.6	8.5	0.987
Musical	6.75	6.3	7.2	0.636
Mystery	6.84	4.6	8.5	0.910
Romance	6.65	6.2	7.1	0.636
Sci-Fi	6.66	5.0	8.2	1.094
Thriller	4.80	4.8	4.8	NA
Western	5.70	4.1	7.3	2.263

unique_movies %>%
  ggplot(aes(x=rating)) + 
  geom_histogram() +
  facet_wrap(vars(genre)) +
  labs(title = "Rating statistics of each genre", x = "Rating", y = "Count")

Let us examine the relationship between gross and cast_facebook_likes. A scatterplot has been produced discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office.

unique_movies %>%
  ggplot(aes(y=gross, x=cast_facebook_likes)) + 
  geom_point() +
  labs(title = "Gross vs. Cast facebook likes", x = "Cast Facebook Likes", y = "Gross [$]")+
  NULL

It’s unlikely that number of facebook likes the cast received will be a good predictor of how much money a movie will make. X-axis is the independent variable “cast_facebook_likes”, Y-axis is the dependent variable “gross”. There is no obvious correlation between the two variables.

Let us examine the relationship between gross and budget. A scatterplot has been produced to discuss whether budget is likely to be a good predictor of how much money a movie will make at the box office.

unique_movies %>%
  ggplot(aes(y=gross, x=budget)) + 
  geom_point() +
  labs(title = "Gross vs. Budget", x = "Budget [$]", y = "Gross [$]")+
  NULL

It can be inferred from the plot that budget is a weak predictor of how much money a movie will make. There is a lot of variation and scatter in the data thus the correlation is very weak.

Let us examine the relationship between gross and rating. A scatterplot has been produced, faceted by genre and let’d discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office.

unique_movies %>%
  ggplot(aes(y=gross, x=rating)) + 
  geom_point() +
  facet_wrap(vars(genre))+
  labs(title = "Gross vs. the rating of the movie genre", x = "Rating", y = "Gross [$]")+
  NULL

In the genre where there are enough data instances, the rating can be a good predictor for how much money a movie will make at the box office. This is expected because movies that are more popular are more likely to make more money. In the Action and Adventure genres, where there are lots of data points, the trend seems to be exponential. In the genres with few data points, it’s hard to determine what the relationship is between the two variables.

IMDB ratings: Differences between directors

Let us run a hypothesis test, using both the t.test command and the infer package to simulate from a null distribution, where we assume zero difference between the two. H0:μA−μB=0 H1:μA−μB≠0 t-stat is equal to 3 and p-value is equal to 0.01. Therefore, we can reject the null hypothesis. Therefore, we can conclude that we are 95% confident that movies of Stephen Spielberg have higher average ratings than those of Tim Burton. Let us load the data and examine its structure!

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)

## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …

#calculate the week4 and week5 confidence intervals
rating_comparison <- movies %>% 
  filter(director=="Steven Spielberg" | director=="Tim Burton") %>% 
  group_by(director) %>% 
  summarise(avg_rating = mean(rating),
            sd_rating = sd(rating, na.rm=TRUE),
            count_rating = n(),
            se_rating = sd_rating / sqrt(count_rating),
            ci_rating_up = avg_rating + qt(.975, count_rating-1)*se_rating,
            ci_rating_dw = avg_rating - qt(.975, count_rating-1)*se_rating
            )

#plot the confidence intervals
rating_comparison %>% 
  ggplot(aes(x=avg_rating, y=director, color=director))+
    geom_rect(fill="grey",alpha=0.5, color = "grey",
            aes(xmin=max(ci_rating_dw),
                xmax=min(ci_rating_up),
                ymin=-Inf,
                ymax=Inf))+
  geom_errorbarh(aes(xmin=ci_rating_dw,xmax=ci_rating_up))+
  geom_point(aes(x=avg_rating, y=director), size=3)+
  geom_text(aes(label=round(avg_rating, digits=2)), vjust = -1.5)+
  labs(title="Do Spielberg and Burton have the same mean IMDB ratings?",
       subtitle = "95% confidence intervals overlap",
       x = "Mean IMDB Rating")

#calculate via t-test
comparison_rating <- movies %>% 
  filter(director=="Steven Spielberg" | director=="Tim Burton")

t.test(rating ~ director, data = comparison_rating)

## 
##  Welch Two Sample t-test
## 
## data:  rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means between group Steven Spielberg and group Tim Burton is not equal to 0
## 95 percent confidence interval:
##  0.16 1.13
## sample estimates:
## mean in group Steven Spielberg       mean in group Tim Burton 
##                           7.57                           6.93

#calculate using infer package
set.seed(1234)
ratings_in_null <- comparison_rating %>% 
  
  specify(rating ~ director) %>% 
  
  hypothesise(null="independence") %>% 
  
  generate(reps = 100, type = "permute") %>%
  
  calculate(stat = "diff in means", order = c("Tim Burton",
                                              "Steven Spielberg"))

#calculate the observed difference
observed_difference <- comparison_rating %>%
  specify(rating ~ director) %>%
  calculate(stat = "diff in means")

#calculate the p-value of the differences
ratings_in_null %>% 
  get_pvalue(obs_stat=observed_difference, direction="both")

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1    0.02

IMDB Movies + Ratings Analysis

2017-10-31T22:26:09-05:00

IMDB Movie Analysis

IMDB ratings: Differences between directors