Note: Created for the Coursera course “Linear Regression and Modeling” in the Statistics with R sequence by Duke University.
library(ggplot2)
library(dplyr)
library(statsr)

load("movies.Rdata")

The data set comprises 651 randomly sampled movies, described by 32 variables. The sources for the data set are the Rotten Tomatoes and IMDb APIs; URLs to both websites are provided for each movie in the data set. According to the directions for the course assignment, the movies were produced and released before 2016. In fact, all of the movies had theatrical releases before 2015, although some had DVD releases in 2015. Theatrical release years in the sample span 1970 to 2014, a period of 45 years. However, only one movie in the sample was released theatrically in 1970, and none in 1971.
movies %>%
summarize(min(thtr_rel_year), max(thtr_rel_year))

## # A tibble: 1 × 2
##   `min(thtr_rel_year)` `max(thtr_rel_year)`
##                  <dbl>                <dbl>
## 1                 1970                 2014
A histogram of theatrical release years from the sample shows an increase over time. Presumably this reflects the fact that the number of movies produced and released per year has increased over time, so that a simple random sample over this span would contain more movies from later years than from earlier ones. Exactly how the random sample was constructed (simple random sample? stratified sample?) is not documented with the data set. The degree of jaggedness in the histogram is a little puzzling.
ggplot(data = movies, aes(x = thtr_rel_year)) +
geom_histogram(binwidth = 1)

Since random sampling was used, the results are generalizable. However, given the scarcity of movies from earlier eras, one should be cautious about extrapolating results to earlier periods of movie making (such as the 1920s, ’30s, or ’40s).
The data do not derive from an experiment with a randomized control group, so causal conclusions are not warranted.
Question: Which variables in the “movies” data set are predictors for the general popularity of a movie?
The website IMDb provides a rating from 1 to 10 aggregated from individual user votes. The IMDb rating is not a simple arithmetic average, however, but a weighted average computed with an undisclosed method of filtering and weighting, so that not all individual user scores carry the same weight. The Rotten Tomatoes audience score is the percentage of users who have rated the movie positively. Each of these scoring methods yields a single number intended to measure the general popularity of a movie, and many people find these measures useful for deciding whether to see a movie.
The general popularity of a movie sometimes does not cohere with the views of professional movie critics. However, critics are often given early screenings, and their reviews frequently appear before or concurrently with the official release of a movie, giving them the potential to influence its popular reception. To what degree are critics’ reviews associated with these measures of general popularity?
As measures of general popularity, the IMDb rating and the Rotten
Tomatoes audience score should be strongly correlated. This will be
checked. Then the critics_score variable from Rotten
Tomatoes will be used to create a linear model for predicting both IMDb
rating and the Rotten Tomatoes audience score. The effect of other
variables will be considered following a forward selection strategy
toward building the most effective and parsimonious models for
predicting these two measures of general popularity.
As shown below, the IMDb rating and the Rotten Tomatoes audience score are highly correlated.
ggplot(data = movies, aes(x = audience_score, y = imdb_rating)) +
geom_jitter() +
stat_smooth(method = "lm", se = TRUE)

## `geom_smooth()` using formula = 'y ~ x'
Fitting a linear model to imdb_rating versus
audience_score yields a value for R-squared of 0.748 (shown
below), implying a high correlation coefficient of 0.865. This R-squared
value tells us that nearly three quarters of the variation in
imdb_rating is accounted for by the
audience_score value. A linear model seems justified, since
by observation of the scatterplot we can see that (a) there is a linear
association, (b) the spread of points around the fitted line is nearly
normal (with some slight skew toward the negative residuals), and (c)
the variability appears nearly constant (with perhaps some tapering on
the high end).
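The step from R-squared to the correlation coefficient relies on the fact that, for a simple linear regression, R-squared equals the square of the correlation between response and predictor. A minimal sketch verifying this, using the built-in mtcars data as a stand-in rather than the movies data:

```r
# For simple linear regression, R-squared equals the square of the
# correlation coefficient between the response and the predictor.
# Illustrated on the built-in mtcars data (not the movies data).
m <- lm(mpg ~ wt, data = mtcars)
r_squared <- summary(m)$r.squared
r <- cor(mtcars$mpg, mtcars$wt)   # negative here, since the slope is negative
all.equal(r_squared, r^2)         # TRUE
```

Applying cor() directly to imdb_rating and audience_score should reproduce the 0.865 figure quoted above.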
m_imdb_audience <- lm(imdb_rating ~ audience_score, data = movies)
summary(m_imdb_audience)

##
## Call:
## lm(formula = imdb_rating ~ audience_score, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2082 -0.1866 0.0712 0.3093 1.1516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.599992 0.069291 51.95 <2e-16 ***
## audience_score 0.046392 0.001057 43.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.545 on 649 degrees of freedom
## Multiple R-squared: 0.748, Adjusted R-squared: 0.7476
## F-statistic: 1926 on 1 and 649 DF, p-value: < 2.2e-16
What we have seen so far is that in imdb_rating and
audience_score we have two highly correlated measures of
the general popularity of a movie, as expected. Although one might
fairly well “predict” imdb_rating from
audience_score, the practical usefulness of doing so is
undercut by the fact that both variables develop as
measures of general popularity over exactly the same time frame, namely,
after the theatrical release date. What we seek are variables that one
might have access to before imdb_rating and
audience_score are available or have had sufficient time to
develop. Since these two variables are highly correlated, for the
remainder of this exploratory analysis, we will focus on the Rotten
Tomatoes audience_score. Insights gleaned are expected to
apply to imdb_rating as well.
One would expect there to be a positive correlation between the
Rotten Tomatoes critics_score and
audience_score. We see that this is the case with a
scatterplot and the fitted least-squares line.
ggplot(data = movies, aes(x = critics_score, y = audience_score)) +
geom_jitter() +
stat_smooth(method = "lm", se = TRUE)

## `geom_smooth()` using formula = 'y ~ x'
A linear model seems justified. By observation of the scatterplot we can see that (a) there is a linear association, (b) the spread of points around the fitted line is nearly normal, and (c) the variability appears nearly constant.
Other kinds of data in the data set that one would have access to
before the theatrical release date and which might provide some
predictive power are the following: title_type,
genre, runtime, mpaa_rating,
best_actor_win, best_actress_win, and
best_dir_win. We examine these variables now to determine
which to include in building a model.
There are three values for the categorical variable
title_type: Documentary, Feature Film, and TV Movie.
Side-by-side boxplots for these values show that the “Documentary” type
is associated with higher audience scores; so, there might be some
predictive power to be had there.
ggplot(data = movies, aes(x = title_type, y = audience_score)) +
geom_boxplot(fill = "yellow")

Similarly, when one considers the categorical variable
genre, one finds some values that are strongly associated
with certain ranges of audience scores – for example, “Animation”,
“Documentary”, and “Musical & Performing Arts”.
ggplot(data = movies, aes(x = genre, y = audience_score)) +
geom_boxplot(fill = "yellow") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

The category “Documentary” shows up as a value in both
title_type and genre. Clearly, documentaries
are associated with higher audience scores. A table breaking down
title_type by genre shows that there is a
significant overlap between the title_type “Documentary”
and the genre “Documentary” values, although it is not a
complete identity since there are three feature films that are also
classified as documentaries by genre.
movies %>%
group_by(title_type, genre) %>%
summarize(n())

## `summarise()` has grouped output by 'title_type'. You can override using the
## `.groups` argument.
## # A tibble: 16 × 3
## # Groups: title_type [3]
## title_type genre `n()`
## <fct> <fct> <int>
## 1 Documentary Comedy 2
## 2 Documentary Documentary 49
## 3 Documentary Musical & Performing Arts 4
## 4 Feature Film Action & Adventure 65
## 5 Feature Film Animation 9
## 6 Feature Film Art House & International 14
## 7 Feature Film Comedy 85
## 8 Feature Film Documentary 3
## 9 Feature Film Drama 301
## 10 Feature Film Horror 23
## 11 Feature Film Musical & Performing Arts 8
## 12 Feature Film Mystery & Suspense 59
## 13 Feature Film Other 15
## 14 Feature Film Science Fiction & Fantasy 9
## 15 TV Movie Drama 4
## 16 TV Movie Other 1
This kind of overlap means that these two variables will be collinear
to some degree. So, for the purposes of model building we will employ
the genre variable rather than the title_type
variable in what follows.
We would not expect runtime to be strongly associated with
audience score, and a look at a scatterplot confirms this, as shown
below. Although one can fit a linear model, there is no obvious linear
relationship, and the fit violates the constant variability requirement:
the points spread widely in the vertical direction at most runtimes,
while a few very long movies sit far from the bulk of the data. As such,
the model is highly susceptible to points of high leverage. So, we will
exclude runtime from model building.
ggplot(data = movies, aes(x = runtime, y = audience_score)) +
geom_point() +
stat_smooth(method = "lm", se = TRUE)

## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
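The leverage concern raised above can be quantified with hatvalues(). A hedged sketch on the built-in mtcars data (a stand-in, since the runtime model itself is not reused here):

```r
# Leverage measures how far an observation's predictor value lies from
# the bulk of the x-values; points with leverage well above the average
# p/n can pull the fitted line toward themselves.
m <- lm(mpg ~ wt, data = mtcars)
lev <- hatvalues(m)
p <- length(coef(m))   # number of estimated coefficients (2 here)
n <- nrow(mtcars)
# Common rule of thumb: flag observations with leverage > 2 * p / n
names(lev[lev > 2 * p / n])
```

Run against the runtime model, the same check would flag the unusually long movies as the high-leverage points.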
When we consider mpaa_rating, we see that most values
are not strongly associated with a specific range of
audience_score, with the exception of the NC-17 rating.
ggplot(data = movies, aes(x = mpaa_rating, y = audience_score)) +
geom_boxplot(fill = "yellow")

Finally, when we consider best_actor_win,
best_actress_win, and best_dir_win, we don’t
see any strong association. So, we expect that these categorical
variables will not make much difference when included in a model.
ggplot(data = movies, aes(x = best_actor_win, y = audience_score)) +
geom_boxplot(fill = "yellow")

ggplot(data = movies, aes(x = best_actress_win, y = audience_score)) +
geom_boxplot(fill = "yellow")

ggplot(data = movies, aes(x = best_dir_win, y = audience_score)) +
geom_boxplot(fill = "yellow")

In summary, then, besides critics_score, we have seen
that there might be some advantage to including in our model
genre and mpaa_rating, although it is not so
clear since only certain values of those categorical variables were
strongly associated with particular ranges of audience score.
We will proceed to model audience_score by means of a
forward model selection strategy. Once completed, we will then see how
the model performs for imdb_rating. Our first model will
include only critics_score. This will establish a
baseline.
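For comparison, base R can automate a forward search with step(), though it ranks candidate models by AIC rather than by the adjusted R-squared and p-values used in the hand-built selection that follows. A sketch on the built-in mtcars data:

```r
# Forward selection by AIC: start from the intercept-only model and add
# one candidate predictor at a time while AIC keeps improving.
null_model <- lm(mpg ~ 1, data = mtcars)
fwd <- step(null_model,
            scope = ~ wt + hp + cyl,   # candidate predictors
            direction = "forward",
            trace = 0)                 # suppress the step-by-step log
formula(fwd)
```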
m_critics <- lm(audience_score ~ critics_score, data = movies)
summary(m_critics)

##
## Call:
## lm(formula = audience_score ~ critics_score, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.043 -9.571 0.504 10.422 43.544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.43551 1.27561 26.21 <2e-16 ***
## critics_score 0.50144 0.01984 25.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.37 on 649 degrees of freedom
## Multiple R-squared: 0.496, Adjusted R-squared: 0.4952
## F-statistic: 638.7 on 1 and 649 DF, p-value: < 2.2e-16
We see that critics_score is significant with a very
small p-value, and the adjusted R-squared obtained is 0.4952, which
establishes our baseline for adjusted R-squared.
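Adjusted R-squared is a natural baseline metric here because, unlike plain R-squared, it can decrease when a predictor is added. It is computed as 1 − (1 − R²)(n − 1)/(n − p − 1), where p is the number of predictors; a sketch verifying this against lm() on the built-in mtcars data:

```r
# Adjusted R-squared penalizes each added predictor through the lost
# degree of freedom; it rises only if the new variable explains enough
# additional variance to offset that penalty.
m <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(m)
n <- nrow(mtcars)
p <- 2                                   # number of predictors
adj_by_hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
all.equal(adj_by_hand, s$adj.r.squared)  # TRUE
```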
As a check on our earlier decision not to include
best_actor_win, best_actress_win, and
best_dir_win, we can observe the impact of adding those
variables to the model.
m_wins <- lm(audience_score ~ critics_score + best_actor_win + best_actress_win + best_dir_win, data = movies)
summary(m_wins)

##
## Call:
## lm(formula = audience_score ~ critics_score + best_actor_win +
## best_actress_win + best_dir_win, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.259 -9.345 0.703 10.194 43.384
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.58358 1.29510 25.931 <2e-16 ***
## critics_score 0.50234 0.02009 25.000 <2e-16 ***
## best_actor_winyes -0.71217 1.63509 -0.436 0.663
## best_actress_winyes -0.96452 1.82223 -0.529 0.597
## best_dir_winyes 0.12513 2.30833 0.054 0.957
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.39 on 646 degrees of freedom
## Multiple R-squared: 0.4964, Adjusted R-squared: 0.4933
## F-statistic: 159.2 on 4 and 646 DF, p-value: < 2.2e-16
The added variables do not show up as statistically significant. Moreover, their addition does not increase adjusted R-squared but decreases it slightly, to 0.4933.
We see the same sort of effect when adding mpaa_rating.
None of the indicator variables created to capture the values of the
categorical variable mpaa_rating is significant, and at
0.4946, the adjusted R-squared does not improve on our baseline of
0.4952.
m_mpaa <- lm(audience_score ~ critics_score + mpaa_rating, data = movies)
summary(m_mpaa)

##
## Call:
## lm(formula = audience_score ~ critics_score + mpaa_rating, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.796 -9.621 0.625 9.954 43.431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.99319 3.57506 10.068 <2e-16 ***
## critics_score 0.49252 0.02091 23.550 <2e-16 ***
## mpaa_ratingNC-17 -13.37247 10.69370 -1.251 0.212
## mpaa_ratingPG -2.13623 3.55910 -0.600 0.549
## mpaa_ratingPG-13 -2.58721 3.54766 -0.729 0.466
## mpaa_ratingR -2.31927 3.39670 -0.683 0.495
## mpaa_ratingUnrated 1.11501 3.89299 0.286 0.775
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.38 on 644 degrees of freedom
## Multiple R-squared: 0.4992, Adjusted R-squared: 0.4946
## F-statistic: 107 on 6 and 644 DF, p-value: < 2.2e-16
Adding genre does yield some improvement: the adjusted
R-squared increases slightly to 0.5197. However, only some of the
indicator variables created to capture the values of the categorical
variable genre are significant.
m_genre <- lm(audience_score ~ critics_score + genre, data = movies)
summary(m_genre)

##
## Call:
## lm(formula = audience_score ~ critics_score + genre, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.538 -9.667 0.518 9.525 42.786
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.79792 1.95465 17.803 < 2e-16 ***
## critics_score 0.45828 0.02157 21.248 < 2e-16 ***
## genreAnimation 4.63092 4.98848 0.928 0.35359
## genreArt House & International 5.53544 4.13548 1.339 0.18120
## genreComedy -1.04458 2.29788 -0.455 0.64956
## genreDocumentary 8.38177 2.78179 3.013 0.00269 **
## genreDrama 2.03588 1.96654 1.035 0.30094
## genreHorror -9.11602 3.40089 -2.680 0.00754 **
## genreMusical & Performing Arts 10.23431 4.46876 2.290 0.02234 *
## genreMystery & Suspense -4.02284 2.53702 -1.586 0.11331
## genreOther 2.15897 3.94404 0.547 0.58429
## genreScience Fiction & Fantasy -6.82279 4.98830 -1.368 0.17187
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.02 on 639 degrees of freedom
## Multiple R-squared: 0.5278, Adjusted R-squared: 0.5197
## F-statistic: 64.93 on 11 and 639 DF, p-value: < 2.2e-16
We can create new single indicator variables –
documentary, horror, and musical
(for Musical & Performing Arts) – based on the presence of
particular values of the genre variable. We can then add
these variables, without the genre levels that did not show up as
significant, and see whether they recover most of the gain in adjusted
R-squared.
movies <- movies %>%
mutate(documentary = ifelse(genre == "Documentary", "yes", "no")) %>%
mutate(horror = ifelse(genre == "Horror", "yes", "no")) %>%
mutate(musical = ifelse(genre == "Musical & Performing Arts", "yes", "no"))

Adding these new variables in turn – documentary,
horror, and musical – progressively to
critics_score, results in progressively improved adjusted
R-squared scores: from 0.4952 to 0.5029 to 0.5101, and finally, to
0.5131.
m_doc <- lm(audience_score ~ critics_score + documentary, data = movies)
summary(m_doc)

##
## Call:
## lm(formula = audience_score ~ critics_score + documentary, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.117 -9.305 0.629 10.193 43.227
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.03867 1.27880 26.618 < 2e-16 ***
## critics_score 0.48105 0.02062 23.326 < 2e-16 ***
## documentaryyes 7.17472 2.15899 3.323 0.00094 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.26 on 648 degrees of freedom
## Multiple R-squared: 0.5045, Adjusted R-squared: 0.5029
## F-statistic: 329.8 on 2 and 648 DF, p-value: < 2.2e-16
m_doc_hor <- lm(audience_score ~ critics_score + documentary + horror, data = movies)
summary(m_doc_hor)

##
## Call:
## lm(formula = audience_score ~ critics_score + documentary + horror,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.388 -9.272 0.515 9.855 42.632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.70720 1.28613 26.986 < 2e-16 ***
## critics_score 0.47575 0.02054 23.164 < 2e-16 ***
## documentaryyes 6.96372 2.14429 3.248 0.00122 **
## horroryes -9.79338 3.01925 -3.244 0.00124 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.15 on 647 degrees of freedom
## Multiple R-squared: 0.5124, Adjusted R-squared: 0.5101
## F-statistic: 226.6 on 3 and 647 DF, p-value: < 2.2e-16
m_doc_hor_mus <- lm(audience_score ~ critics_score + documentary + horror + musical, data = movies)
summary(m_doc_hor_mus)

##
## Call:
## lm(formula = audience_score ~ critics_score + documentary + horror +
## musical, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.100 -9.271 0.554 9.928 42.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.78813 1.28266 27.122 < 2e-16 ***
## critics_score 0.47082 0.02059 22.864 < 2e-16 ***
## documentaryyes 7.30842 2.14317 3.410 0.00069 ***
## horroryes -9.65763 3.01051 -3.208 0.00140 **
## musicalyes 9.28238 4.14005 2.242 0.02529 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.11 on 646 degrees of freedom
## Multiple R-squared: 0.5161, Adjusted R-squared: 0.5131
## F-statistic: 172.3 on 4 and 646 DF, p-value: < 2.2e-16
Although this regression model with just four variables has a
slightly lower adjusted R-squared than the 11-variable model formed by
adding all ten genre indicator variables to
critics_score, the difference is slight: 0.5131 versus
0.5197. The advantage is that the model is simpler and all of its
variables have significant p-values. So, using both adjusted R-squared
and p-values as selection criteria for model building, this last model
arguably best balances predictive power with simplicity and should be
considered the most parsimonious model from our forward selection.
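A complementary way to adjudicate between a smaller nested model and a larger one is a partial F-test via anova(); the movies models are not rerun here, so the sketch below uses the built-in mtcars data instead:

```r
# Partial F-test for nested models: a small p-value indicates the extra
# terms in the larger model improve the fit beyond what chance allows.
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp + qsec, data = mtcars)
comparison <- anova(small, large)
comparison$`Pr(>F)`[2]   # p-value for the two added terms
```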
Modeling IMDb rating using only Rotten Tomatoes’
critics_score results in an adjusted R-squared of
0.5846.
m_imdb_crit <- lm(imdb_rating ~ critics_score, data = movies)
summary(m_imdb_crit)

##
## Call:
## lm(formula = imdb_rating ~ critics_score, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.93679 -0.39499 0.04512 0.43875 2.47556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8075715 0.0620690 77.45 <2e-16 ***
## critics_score 0.0292177 0.0009654 30.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6991 on 649 degrees of freedom
## Multiple R-squared: 0.5853, Adjusted R-squared: 0.5846
## F-statistic: 915.9 on 1 and 649 DF, p-value: < 2.2e-16
Using the final set of variables we selected for modeling
audience_score above, we obtain an adjusted R-squared of
0.5955, as shown below.
m_imdb_crit_plus <- lm(imdb_rating ~ critics_score + documentary + horror + musical, data = movies)
summary(m_imdb_crit_plus)

##
## Call:
## lm(formula = imdb_rating ~ critics_score + documentary + horror +
## musical, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.99241 -0.36857 0.03371 0.43952 2.42419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.864606 0.062713 77.569 < 2e-16 ***
## critics_score 0.027801 0.001007 27.613 < 2e-16 ***
## documentaryyes 0.382945 0.104786 3.655 0.000279 ***
## horroryes -0.325780 0.147193 -2.213 0.027227 *
## musicalyes 0.303970 0.202420 1.502 0.133669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6899 on 646 degrees of freedom
## Multiple R-squared: 0.598, Adjusted R-squared: 0.5955
## F-statistic: 240.2 on 4 and 646 DF, p-value: < 2.2e-16
We now use our final audience score model to obtain predictions of audience score for movies released after the data set’s 1970 - 2014 time period.
From the Rotten Tomatoes website, the 2016 movie “10 Cloverfield Lane” obtained a critics score of 90 and is classified as a horror movie. Our predictive model returned an audience score of 68, while the audience score on Rotten Tomatoes was 79. The model’s 95% prediction interval, however, extends from 39 to 96.
cloverfield <- data.frame(critics_score = 90, documentary = "no", horror = "yes", musical = "no")
predict(m_doc_hor_mus, cloverfield, interval = "prediction", level = 0.95)

## fit lwr upr
## 1 67.50425 39.13961 95.8689
From the Rotten Tomatoes website, the movie “Knives Out” has a critics score of 96. Our predictive model for audience score gave a result of 80 (as seen below). The reported audience score on the website is 92, which falls within the model’s 95% prediction interval, extending from 52 to 107 (beyond the top of the scale).
knives_out <- data.frame(critics_score = 96, documentary = "no", horror = "no", musical = "no")
predict(m_doc_hor_mus, knives_out, interval = "prediction", level = 0.95)

## fit lwr upr
## 1 79.9868 52.20568 107.7679
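The wide ranges reported above are prediction intervals, which must cover a single new observation; they are necessarily wider than confidence intervals for the mean response, because they include the residual scatter of individual movies around the fitted line. A sketch of the distinction on the built-in mtcars data:

```r
# interval = "prediction" adds the residual variance to the uncertainty
# in the fitted line, so it is always wider than interval = "confidence".
m <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3.0)
pred_int <- predict(m, new_car, interval = "prediction", level = 0.95)
conf_int <- predict(m, new_car, interval = "confidence", level = 0.95)
(pred_int[, "upr"] - pred_int[, "lwr"]) >
  (conf_int[, "upr"] - conf_int[, "lwr"])   # TRUE
```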
For the question – Which variables in the “movies” data set are predictors for the general popularity of a movie? – we settled upon the Rotten Tomatoes audience score and the IMDb rating as representatives of general popularity; these two variables are highly correlated with one another. To find predictors, we focused on the Rotten Tomatoes audience score and examined several candidates. The obvious one was the Rotten Tomatoes critics score. In addition, whether a movie was a documentary, a horror movie, or belonged to the genre “Musical & Performing Arts” also proved significant and improved adjusted R-squared.
In making predictions, however, our final models carry large uncertainties, so their predictions are not very precise. Variables not included in the data set might help (for example, a more detailed categorization of genres), and a better grasp of how the critics score is determined might also lead to better predictions. These would be avenues for further research.