Let us look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset
movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
skim(movies)
| Name | movies |
| Number of rows | 2961 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| title | 0 | 1 | 1 | 83 | 0 | 2907 | 0 |
| genre | 0 | 1 | 5 | 11 | 0 | 17 | 0 |
| director | 0 | 1 | 3 | 32 | 0 | 1366 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2.00e+03 | 9.95e+00 | 1920.0 | 2.00e+03 | 2.00e+03 | 2.01e+03 | 2.02e+03 | ▁▁▁▂▇ |
| duration | 0 | 1 | 1.10e+02 | 2.22e+01 | 37.0 | 9.50e+01 | 1.06e+02 | 1.19e+02 | 3.30e+02 | ▃▇▁▁▁ |
| gross | 0 | 1 | 5.81e+07 | 7.25e+07 | 703.0 | 1.23e+07 | 3.47e+07 | 7.56e+07 | 7.61e+08 | ▇▁▁▁▁ |
| budget | 0 | 1 | 4.06e+07 | 4.37e+07 | 218.0 | 1.10e+07 | 2.60e+07 | 5.50e+07 | 3.00e+08 | ▇▂▁▁▁ |
| cast_facebook_likes | 0 | 1 | 1.24e+04 | 2.05e+04 | 0.0 | 2.24e+03 | 4.60e+03 | 1.69e+04 | 6.57e+05 | ▇▁▁▁▁ |
| votes | 0 | 1 | 1.09e+05 | 1.58e+05 | 5.0 | 1.99e+04 | 5.57e+04 | 1.33e+05 | 1.69e+06 | ▇▁▁▁▁ |
| reviews | 0 | 1 | 5.03e+02 | 4.94e+02 | 2.0 | 1.99e+02 | 3.64e+02 | 6.31e+02 | 5.31e+03 | ▇▁▁▁▁ |
| rating | 0 | 1 | 6.39e+00 | 1.05e+00 | 1.6 | 5.80e+00 | 6.50e+00 | 7.10e+00 | 9.30e+00 | ▁▁▆▇▁ |
There is no missing values as seen in the result below. Not all the entries are distinct as the n_unique values do not match, indicating that several values were repeated. Based on the number of rows 2961 and n_unique 2907, we can conclude that 54 rows are duplicated.
Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:
gross : The gross earnings in the US box office, not adjusted for inflation budget: The movie’s budget cast_facebook_likes: the number of facebook likes cast memebrs received votes: the number of people who voted for (or rated) the movie in IMDB reviews: the number of reviews for that movie rating: IMDB average rating
unique_movies <- movies %>%
distinct(title,genre,director,year,.keep_all=TRUE)
new_unique_movies <- movies %>%
summarise("number of unique movies"=n())
A table with the count of movies by genre, ranked in descending order is produced:
count_movie_genre <- unique_movies %>%
group_by(genre) %>%
count(sort=TRUE)
colnames(count_movie_genre) <- c("Genre", "Number of Movies")
count_movie_genre %>%
kable()
| Genre | Number of Movies |
|---|---|
| Comedy | 844 |
| Action | 719 |
| Drama | 484 |
| Adventure | 281 |
| Crime | 198 |
| Biography | 135 |
| Horror | 128 |
| Animation | 35 |
| Fantasy | 26 |
| Documentary | 25 |
| Mystery | 15 |
| Sci-Fi | 7 |
| Family | 3 |
| Musical | 2 |
| Romance | 2 |
| Western | 2 |
| Thriller | 1 |
A table with the average gross earning and budget (gross and budget) by genre is produced. Let us calculate a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. Ranked genres by this return_on_budget in descending order!
return_movie_genre <- unique_movies %>%
group_by(genre) %>%
summarise(mean_gross = mean(gross),mean_budget = mean(budget)) %>%
mutate(return_on_budget =
((mean_gross - mean_budget)/mean_budget)) %>%
arrange(-return_on_budget)
colnames(return_movie_genre) <- c("Genre", "Mean gross [$]", "Mean budget [$]", "Return on budget")
return_movie_genre %>%
kable()
| Genre | Mean gross [$] | Mean budget [$] | Return on budget |
|---|---|---|---|
| Musical | 9.21e+07 | 3189500 | 27.871 |
| Family | 1.49e+08 | 14833333 | 9.056 |
| Western | 2.08e+07 | 3465000 | 5.009 |
| Documentary | 1.74e+07 | 5887852 | 1.947 |
| Horror | 3.78e+07 | 13804379 | 1.737 |
| Fantasy | 4.19e+07 | 18484615 | 1.267 |
| Comedy | 4.25e+07 | 24458506 | 0.737 |
| Mystery | 6.91e+07 | 41500000 | 0.665 |
| Animation | 9.84e+07 | 61701429 | 0.595 |
| Biography | 4.52e+07 | 28543696 | 0.584 |
| Adventure | 9.44e+07 | 64692313 | 0.458 |
| Drama | 3.68e+07 | 25832605 | 0.423 |
| Crime | 3.76e+07 | 26527405 | 0.417 |
| Romance | 3.13e+07 | 25107500 | 0.245 |
| Action | 8.63e+07 | 70774558 | 0.219 |
| Sci-Fi | 2.98e+07 | 27607143 | 0.079 |
| Thriller | 2.47e+03 | 300000 | -0.992 |
A table that shows the top 15 directors who have created the highest gross revenue in the box office is produced.
top_15_directors <- unique_movies %>%
select(director, gross) %>%
group_by(director) %>%
summarise(total_gross = sum(gross),
mean_gross = mean(gross),
median_gross = median(gross),
sd_gross = sd(gross)) %>%
top_n(15,total_gross) %>%
arrange(-total_gross)
colnames(top_15_directors) <- c("Director", "Total gross [$]", "Mean gross [$]", "Median gross [$]", "Standard deviation of gross [$]")
top_15_directors %>%
kable()
| Director | Total gross [$] | Mean gross [$] | Median gross [$] | Standard deviation of gross [$] |
|---|---|---|---|---|
| Steven Spielberg | 4.01e+09 | 1.75e+08 | 1.64e+08 | 1.01e+08 |
| Michael Bay | 2.20e+09 | 1.83e+08 | 1.68e+08 | 1.26e+08 |
| James Cameron | 1.91e+09 | 3.18e+08 | 1.76e+08 | 3.09e+08 |
| Christopher Nolan | 1.81e+09 | 2.27e+08 | 1.97e+08 | 1.87e+08 |
| George Lucas | 1.74e+09 | 3.48e+08 | 3.80e+08 | 1.46e+08 |
| Robert Zemeckis | 1.62e+09 | 1.25e+08 | 1.01e+08 | 9.13e+07 |
| Tim Burton | 1.56e+09 | 1.11e+08 | 6.98e+07 | 9.93e+07 |
| Sam Raimi | 1.44e+09 | 1.80e+08 | 1.38e+08 | 1.75e+08 |
| Clint Eastwood | 1.38e+09 | 7.25e+07 | 4.67e+07 | 7.55e+07 |
| Francis Lawrence | 1.36e+09 | 2.72e+08 | 2.82e+08 | 1.35e+08 |
| Ron Howard | 1.34e+09 | 1.11e+08 | 1.02e+08 | 8.19e+07 |
| Gore Verbinski | 1.33e+09 | 1.90e+08 | 1.23e+08 | 1.54e+08 |
| Andrew Adamson | 1.14e+09 | 2.84e+08 | 2.80e+08 | 1.21e+08 |
| Shawn Levy | 1.13e+09 | 1.03e+08 | 8.55e+07 | 6.55e+07 |
| Ridley Scott | 1.13e+09 | 8.06e+07 | 4.78e+07 | 6.88e+07 |
Finally, ratings. Let’s produce a table that describes how ratings are distributed by genre. We don’t want just the mean, but also, min, max, median, SD and some kind of a histogram or density graph that visually shows how ratings are distributed.
dist_by_genre <- unique_movies %>%
select(genre, rating) %>%
group_by(genre) %>%
summarise(mean_rating = mean(rating),
min_rating = min(rating),
max_rating = max(rating),
sd_rating = sd(rating))
colnames(dist_by_genre) <- c("Genre", "Mean rating", "Min rating", "Max rating", "Standard deviation of rating")
dist_by_genre %>%
kable()
| Genre | Mean rating | Min rating | Max rating | Standard deviation of rating |
|---|---|---|---|---|
| Action | 6.23 | 2.1 | 9.0 | 1.039 |
| Adventure | 6.51 | 2.3 | 8.6 | 1.106 |
| Animation | 6.65 | 4.5 | 8.0 | 0.968 |
| Biography | 7.11 | 4.5 | 8.9 | 0.760 |
| Comedy | 6.11 | 1.9 | 8.8 | 1.024 |
| Crime | 6.92 | 4.8 | 9.3 | 0.853 |
| Documentary | 6.66 | 1.6 | 8.5 | 1.767 |
| Drama | 6.74 | 2.1 | 8.8 | 0.915 |
| Family | 6.50 | 5.7 | 7.9 | 1.217 |
| Fantasy | 6.08 | 4.3 | 7.9 | 0.953 |
| Horror | 5.79 | 3.6 | 8.5 | 0.987 |
| Musical | 6.75 | 6.3 | 7.2 | 0.636 |
| Mystery | 6.84 | 4.6 | 8.5 | 0.910 |
| Romance | 6.65 | 6.2 | 7.1 | 0.636 |
| Sci-Fi | 6.66 | 5.0 | 8.2 | 1.094 |
| Thriller | 4.80 | 4.8 | 4.8 | NA |
| Western | 5.70 | 4.1 | 7.3 | 2.263 |
unique_movies %>%
ggplot(aes(x=rating)) +
geom_histogram() +
facet_wrap(vars(genre)) +
labs(title = "Rating statistics of each genre", x = "Rating", y = "Count")
Let us examine the relationship between gross and cast_facebook_likes. A scatterplot has been produced discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office.
unique_movies %>%
ggplot(aes(y=gross, x=cast_facebook_likes)) +
geom_point() +
labs(title = "Gross vs. Cast facebook likes", x = "Cast Facebook Likes", y = "Gross [$]")+
NULL
It’s unlikely that number of facebook likes the cast received will be a good predictor of how much money a movie will make. X-axis is the independent variable “cast_facebook_likes”, Y-axis is the dependent variable “gross”. There is no obvious correlation between the two variables.
Let us examine the relationship between gross and budget. A scatterplot has been produced to discuss whether budget is likely to be a good predictor of how much money a movie will make at the box office.
unique_movies %>%
ggplot(aes(y=gross, x=budget)) +
geom_point() +
labs(title = "Gross vs. Budget", x = "Budget [$]", y = "Gross [$]")+
NULL
It can be inferred from the plot that budget is a weak predictor of how much money a movie will make. There is a lot of variation and scatter in the data thus the correlation is very weak.
Let us examine the relationship between gross and rating. A scatterplot has been produced, faceted by genre and let’d discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office.
unique_movies %>%
ggplot(aes(y=gross, x=rating)) +
geom_point() +
facet_wrap(vars(genre))+
labs(title = "Gross vs. the rating of the movie genre", x = "Rating", y = "Gross [$]")+
NULL
In the genre where there are enough data instances, the rating can be a good predictor for how much money a movie will make at the box office. This is expected because movies that are more popular are more likely to make more money. In the Action and Adventure genres, where there are lots of data points, the trend seems to be exponential. In the genres with few data points, it’s hard to determine what the relationship is between the two variables.
Let us run a hypothesis test, using both the t.test command and the infer package to simulate from a null distribution, where we assume zero difference between the two. H0:μA−μB=0 H1:μA−μB≠0 t-stat is equal to 3 and p-value is equal to 0.01. Therefore, we can reject the null hypothesis. Therefore, we can conclude that we are 95% confident that movies of Stephen Spielberg have higher average ratings than those of Tim Burton. Let us load the data and examine its structure!
movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
#calculate the week4 and week5 confidence intervals
rating_comparison <- movies %>%
filter(director=="Steven Spielberg" | director=="Tim Burton") %>%
group_by(director) %>%
summarise(avg_rating = mean(rating),
sd_rating = sd(rating, na.rm=TRUE),
count_rating = n(),
se_rating = sd_rating / sqrt(count_rating),
ci_rating_up = avg_rating + qt(.975, count_rating-1)*se_rating,
ci_rating_dw = avg_rating - qt(.975, count_rating-1)*se_rating
)
#plot the confidence intervals
rating_comparison %>%
ggplot(aes(x=avg_rating, y=director, color=director))+
geom_rect(fill="grey",alpha=0.5, color = "grey",
aes(xmin=max(ci_rating_dw),
xmax=min(ci_rating_up),
ymin=-Inf,
ymax=Inf))+
geom_errorbarh(aes(xmin=ci_rating_dw,xmax=ci_rating_up))+
geom_point(aes(x=avg_rating, y=director), size=3)+
geom_text(aes(label=round(avg_rating, digits=2)), vjust = -1.5)+
labs(title="Do Spielberg and Burton have the same mean IMDB ratings?",
subtitle = "95% confidence intervals overlap",
x = "Mean IMDB Rating")
#calculate via t-test
comparison_rating <- movies %>%
filter(director=="Steven Spielberg" | director=="Tim Burton")
t.test(rating ~ director, data = comparison_rating)
##
## Welch Two Sample t-test
##
## data: rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means between group Steven Spielberg and group Tim Burton is not equal to 0
## 95 percent confidence interval:
## 0.16 1.13
## sample estimates:
## mean in group Steven Spielberg mean in group Tim Burton
## 7.57 6.93
#calculate using infer package
set.seed(1234)
ratings_in_null <- comparison_rating %>%
specify(rating ~ director) %>%
hypothesise(null="independence") %>%
generate(reps = 100, type = "permute") %>%
calculate(stat = "diff in means", order = c("Tim Burton",
"Steven Spielberg"))
#calculate the observed difference
observed_difference <- comparison_rating %>%
specify(rating ~ director) %>%
calculate(stat = "diff in means")
#calculate the p-value of the differences
ratings_in_null %>%
get_pvalue(obs_stat=observed_difference, direction="both")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.02