Note: Created for the Coursera course “Linear Regression and Modeling” in the Statistics with R sequence by Duke University.


Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("movies.Rdata")

Part 1: Data

The data set comprises 651 randomly sampled movies described by 32 variables. The sources for the data set are the Rotten Tomatoes and IMDb APIs; URLs to both websites are provided for each movie in the data set. According to the directions for the course assignment, the movies were produced and released before 2016. In fact, all of the movies had theatrical releases before 2015, although some had DVD releases in 2015. Theatrical release years in the sample range from 1970 to 2014, a span of 45 years. However, only one movie in the sample was released theatrically in 1970, and none were released in 1971.

movies %>%
  summarize(min(thtr_rel_year), max(thtr_rel_year))
## # A tibble: 1 × 2
##   `min(thtr_rel_year)` `max(thtr_rel_year)`
##                  <dbl>                <dbl>
## 1                 1970                 2014

A histogram of theatrical release years for the sample shows an increase over time. Presumably this reflects the fact that the number of movies produced and released per year has increased over time, so that a simple random sample from this time span would include more movies from later years than from earlier ones. Exactly how the random sample was constructed (simple random sample? stratified sample?) is not documented with the data set. The degree of jaggedness in the histogram is a little puzzling.

ggplot(data = movies, aes(x = thtr_rel_year)) + 
  geom_histogram(binwidth = 1)

Since random sampling was used, the results are generalizable. However, given the lack of movies from earlier eras, one should be cautious in extrapolating results to earlier eras of movie making (such as the 1920s, ’30s, or ’40s).

The data do not derive from an experiment with a randomized control group, so causal conclusions are not warranted.


Part 2: Research question

Question: Which variables in the “movies” data set are predictors for the general popularity of a movie?

The website IMDb provides a rating as an aggregated score from 1 to 10 based on individual user votes. The IMDb rating is not an arithmetic average, however, but a weighted average that uses an undisclosed method of filtering and weighting, with the effect that not all individual user scores carry the same weight. The Rotten Tomatoes audience score is the percentage of users who have rated the movie positively. Each of these scoring methods provides a single number intended to measure the general popularity of a movie, and many people find these measures useful for deciding whether to see a movie.

The general popularity of a movie sometimes does not cohere with views about the movie by official movie critics. However, critics are sometimes provided with early screenings and their reviews become available often before or concurrent with the official release of the movie, and thereby have the potential to influence the popular reception of a movie. To what degree are critics’ reviews associated with these measures of general popularity?

As measures of general popularity, the IMDb rating and the Rotten Tomatoes audience score should be strongly correlated. This will be checked. Then the critics_score variable from Rotten Tomatoes will be used to create a linear model for predicting both IMDb rating and the Rotten Tomatoes audience score. The effect of other variables will be considered following a forward selection strategy toward building the most effective and parsimonious models for predicting these two measures of general popularity.


Part 3: Exploratory data analysis

As shown below, the IMDb rating and the Rotten Tomatoes audience score are highly correlated.

ggplot(data = movies, aes(x = audience_score, y = imdb_rating)) + 
  geom_jitter() + 
  stat_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'

Fitting a linear model of imdb_rating versus audience_score yields an R-squared of 0.748 (shown below), implying a high correlation coefficient of 0.865. This R-squared value tells us that nearly three quarters of the variation in imdb_rating is accounted for by audience_score. A linear model seems justified: the scatterplot shows (a) a linear association, (b) residuals that are nearly normally distributed (with slight skew toward the negative residuals), and (c) nearly constant variability (with perhaps some tapering at the high end).

m_imdb_audience <- lm(imdb_rating ~ audience_score, data = movies)
summary(m_imdb_audience)
## 
## Call:
## lm(formula = imdb_rating ~ audience_score, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2082 -0.1866  0.0712  0.3093  1.1516 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.599992   0.069291   51.95   <2e-16 ***
## audience_score 0.046392   0.001057   43.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.545 on 649 degrees of freedom
## Multiple R-squared:  0.748,  Adjusted R-squared:  0.7476 
## F-statistic:  1926 on 1 and 649 DF,  p-value: < 2.2e-16
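Because this is a simple linear regression with a single predictor, the correlation coefficient can be recovered as the square root of R-squared, taking the sign of the slope (positive here). A quick check, assuming the movies data frame is loaded as above:

```r
# Correlation implied by R-squared (simple regression, positive slope):
sqrt(0.748)
# ≈ 0.865

# Or computed directly from the data:
cor(movies$audience_score, movies$imdb_rating)
```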

What we have seen so far is that imdb_rating and audience_score are two highly correlated measures of the general popularity of a movie, as expected. Although one might fairly well “predict” imdb_rating from audience_score, the practical usefulness of doing so is limited by the fact that both variables develop as measures of general popularity over exactly the same time frame, namely after the theatrical release date. What we seek are variables that one might have access to before imdb_rating and audience_score are available or have had sufficient time to develop. Since these two variables are highly correlated, the remainder of this exploratory analysis focuses on the Rotten Tomatoes audience_score; insights gleaned are expected to apply to imdb_rating as well.

One would expect there to be a positive correlation between the Rotten Tomatoes critics_score and audience_score. We see that this is the case with a scatterplot and the fitted least-squares line.

ggplot(data = movies, aes(x = critics_score, y = audience_score)) + 
  geom_jitter() + 
  stat_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'

A linear model seems justified. By observation of the scatterplot we can see that (a) there is a linear association, (b) the spread of points around the fitted line is nearly normal, and (c) the variability appears nearly constant.

Other kinds of data in the data set that one would have access to before the theatrical release date and which might provide some predictive power are the following: title_type, genre, runtime, mpaa_rating, best_actor_win, best_actress_win, and best_dir_win. We examine these variables now to determine which to include in building a model.

There are three values for the categorical variable title_type: Documentary, Feature Film, and TV Movie. Side-by-side boxplots for these values show that the “Documentary” type is associated with higher audience scores; so, there might be some predictive power to be had there.

ggplot(data = movies, aes(x = title_type, y = audience_score)) + 
  geom_boxplot(fill = "yellow")

Similarly, when one considers the categorical variable genre, one finds some values that are strongly associated with certain ranges of audience scores – for example, “Animation”, “Documentary”, and “Musical & Performing Arts”.

ggplot(data = movies, aes(x = genre, y = audience_score)) + 
  geom_boxplot(fill = "yellow") +
  theme(axis.text.x= element_text(angle=90, hjust = 1, vjust = 0.5))

The category “Documentary” appears as a level in both title_type and genre, and documentaries are clearly associated with higher audience scores. A table breaking down title_type by genre shows substantial overlap between the title_type “Documentary” and the genre “Documentary”, although the two are not identical: three feature films are classified as documentaries by genre, and some title_type documentaries fall under other genres.

movies %>% 
  group_by(title_type, genre) %>%
  summarize(n())
## `summarise()` has grouped output by 'title_type'. You can override using the
## `.groups` argument.
## # A tibble: 16 × 3
## # Groups:   title_type [3]
##    title_type   genre                     `n()`
##    <fct>        <fct>                     <int>
##  1 Documentary  Comedy                        2
##  2 Documentary  Documentary                  49
##  3 Documentary  Musical & Performing Arts     4
##  4 Feature Film Action & Adventure           65
##  5 Feature Film Animation                     9
##  6 Feature Film Art House & International    14
##  7 Feature Film Comedy                       85
##  8 Feature Film Documentary                   3
##  9 Feature Film Drama                       301
## 10 Feature Film Horror                       23
## 11 Feature Film Musical & Performing Arts     8
## 12 Feature Film Mystery & Suspense           59
## 13 Feature Film Other                        15
## 14 Feature Film Science Fiction & Fantasy     9
## 15 TV Movie     Drama                         4
## 16 TV Movie     Other                         1

This kind of overlap means that these two variables will be collinear to some degree. So, for the purposes of model building we will employ the genre variable rather than the title_type variable in what follows.

We would not expect runtime to be strongly associated with audience score, and a look at a scatterplot confirms this, as shown below. Although one can fit a linear model, there is no obvious linear relationship, and the fit violates the constant variability condition: the points spread out widely in the vertical direction at nearly every runtime. The model is also highly susceptible to points of high leverage. So, we exclude runtime from model building.

ggplot(data = movies, aes(x = runtime, y = audience_score)) + 
  geom_point() + 
  stat_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).

When we consider mpaa_rating, we see that most values are not strongly associated with a specific range of audience_score with the exception of the NC-17 rating.

ggplot(data = movies, aes(x = mpaa_rating, y = audience_score)) + 
  geom_boxplot(fill = "yellow")

Finally, when we consider best_actor_win, best_actress_win, and best_dir_win, we don’t see any strong association. So, we expect that these categorical variables will not make much difference when included in a model.

ggplot(data = movies, aes(x = best_actor_win, y = audience_score)) + 
  geom_boxplot(fill = "yellow")

ggplot(data = movies, aes(x = best_actress_win, y = audience_score)) + 
  geom_boxplot(fill = "yellow")

ggplot(data = movies, aes(x = best_dir_win, y = audience_score)) + 
  geom_boxplot(fill = "yellow")

In summary, then, besides critics_score, there may be some advantage to including genre and mpaa_rating in our model, although this is uncertain since only certain levels of those categorical variables were strongly associated with particular ranges of audience score.


Part 4: Modeling

Modeling Rotten Tomatoes’ Audience Score

We will proceed to model audience_score by means of a forward model selection strategy. Once completed, we will then see how the model performs for imdb_rating. Our first model will include only critics_score. This will establish a baseline.

m_critics <- lm(audience_score ~ critics_score, data = movies)
summary(m_critics)
## 
## Call:
## lm(formula = audience_score ~ critics_score, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.043  -9.571   0.504  10.422  43.544 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   33.43551    1.27561   26.21   <2e-16 ***
## critics_score  0.50144    0.01984   25.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.37 on 649 degrees of freedom
## Multiple R-squared:  0.496,  Adjusted R-squared:  0.4952 
## F-statistic: 638.7 on 1 and 649 DF,  p-value: < 2.2e-16

We see that critics_score is significant with a very small p-value, and the adjusted R-squared of 0.4952 establishes our baseline.

As a check on our earlier decision not to include best_actor_win, best_actress_win, and best_dir_win, we can observe the impact of adding those variables to the model.

m_wins <- lm(audience_score ~ critics_score + best_actor_win + best_actress_win + best_dir_win, data = movies)
summary(m_wins)
## 
## Call:
## lm(formula = audience_score ~ critics_score + best_actor_win + 
##     best_actress_win + best_dir_win, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.259  -9.345   0.703  10.194  43.384 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         33.58358    1.29510  25.931   <2e-16 ***
## critics_score        0.50234    0.02009  25.000   <2e-16 ***
## best_actor_winyes   -0.71217    1.63509  -0.436    0.663    
## best_actress_winyes -0.96452    1.82223  -0.529    0.597    
## best_dir_winyes      0.12513    2.30833   0.054    0.957    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.39 on 646 degrees of freedom
## Multiple R-squared:  0.4964, Adjusted R-squared:  0.4933 
## F-statistic: 159.2 on 4 and 646 DF,  p-value: < 2.2e-16

The added variables do not show up as statistically significant. Moreover, their addition actually decreases adjusted R-squared slightly, from 0.4952 to 0.4933.

We see the same sort of effect when adding mpaa_rating. None of the indicator variables created for the levels of the categorical variable mpaa_rating are significant, and at 0.4946 the adjusted R-squared does not improve on our baseline of 0.4952.

m_mpaa <- lm(audience_score ~ critics_score + mpaa_rating, data = movies)
summary(m_mpaa)
## 
## Call:
## lm(formula = audience_score ~ critics_score + mpaa_rating, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.796  -9.621   0.625   9.954  43.431 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         35.99319    3.57506  10.068   <2e-16 ***
## critics_score        0.49252    0.02091  23.550   <2e-16 ***
## mpaa_ratingNC-17   -13.37247   10.69370  -1.251    0.212    
## mpaa_ratingPG       -2.13623    3.55910  -0.600    0.549    
## mpaa_ratingPG-13    -2.58721    3.54766  -0.729    0.466    
## mpaa_ratingR        -2.31927    3.39670  -0.683    0.495    
## mpaa_ratingUnrated   1.11501    3.89299   0.286    0.775    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.38 on 644 degrees of freedom
## Multiple R-squared:  0.4992, Adjusted R-squared:  0.4946 
## F-statistic:   107 on 6 and 644 DF,  p-value: < 2.2e-16

Adding genre yields some improvement: adjusted R-squared increases slightly to 0.5197. However, only some of the indicator variables created for the levels of the categorical variable genre are significant.

m_genre <- lm(audience_score ~ critics_score + genre, data = movies)
summary(m_genre)
## 
## Call:
## lm(formula = audience_score ~ critics_score + genre, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.538  -9.667   0.518   9.525  42.786 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    34.79792    1.95465  17.803  < 2e-16 ***
## critics_score                   0.45828    0.02157  21.248  < 2e-16 ***
## genreAnimation                  4.63092    4.98848   0.928  0.35359    
## genreArt House & International  5.53544    4.13548   1.339  0.18120    
## genreComedy                    -1.04458    2.29788  -0.455  0.64956    
## genreDocumentary                8.38177    2.78179   3.013  0.00269 ** 
## genreDrama                      2.03588    1.96654   1.035  0.30094    
## genreHorror                    -9.11602    3.40089  -2.680  0.00754 ** 
## genreMusical & Performing Arts 10.23431    4.46876   2.290  0.02234 *  
## genreMystery & Suspense        -4.02284    2.53702  -1.586  0.11331    
## genreOther                      2.15897    3.94404   0.547  0.58429    
## genreScience Fiction & Fantasy -6.82279    4.98830  -1.368  0.17187    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.02 on 639 degrees of freedom
## Multiple R-squared:  0.5278, Adjusted R-squared:  0.5197 
## F-statistic: 64.93 on 11 and 639 DF,  p-value: < 2.2e-16

We can create new single indicator variables, documentary, horror, and musical (for Musical & Performing Arts), based on the presence of particular levels of the genre variable. We can then add these variables and see whether they, without the indicators that did not show up as significant, recover most of the gain in adjusted R-squared.

movies <- movies %>%
  mutate(documentary = ifelse(genre == "Documentary", "yes", "no")) %>%
  mutate(horror = ifelse(genre == "Horror", "yes", "no")) %>%
  mutate(musical = ifelse(genre == "Musical & Performing Arts", "yes", "no"))

Adding these new variables to critics_score one at a time (documentary, then horror, then musical) progressively improves adjusted R-squared: from the baseline of 0.4952 to 0.5029, then to 0.5101, and finally to 0.5131.

m_doc <- lm(audience_score ~ critics_score + documentary, data = movies)
summary(m_doc)
## 
## Call:
## lm(formula = audience_score ~ critics_score + documentary, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.117  -9.305   0.629  10.193  43.227 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    34.03867    1.27880  26.618  < 2e-16 ***
## critics_score   0.48105    0.02062  23.326  < 2e-16 ***
## documentaryyes  7.17472    2.15899   3.323  0.00094 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.26 on 648 degrees of freedom
## Multiple R-squared:  0.5045, Adjusted R-squared:  0.5029 
## F-statistic: 329.8 on 2 and 648 DF,  p-value: < 2.2e-16
m_doc_hor <- lm(audience_score ~ critics_score + documentary + horror, data = movies)
summary(m_doc_hor)
## 
## Call:
## lm(formula = audience_score ~ critics_score + documentary + horror, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.388  -9.272   0.515   9.855  42.632 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    34.70720    1.28613  26.986  < 2e-16 ***
## critics_score   0.47575    0.02054  23.164  < 2e-16 ***
## documentaryyes  6.96372    2.14429   3.248  0.00122 ** 
## horroryes      -9.79338    3.01925  -3.244  0.00124 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.15 on 647 degrees of freedom
## Multiple R-squared:  0.5124, Adjusted R-squared:  0.5101 
## F-statistic: 226.6 on 3 and 647 DF,  p-value: < 2.2e-16
m_doc_hor_mus <- lm(audience_score ~ critics_score + documentary + horror + musical, data = movies)
summary(m_doc_hor_mus)
## 
## Call:
## lm(formula = audience_score ~ critics_score + documentary + horror + 
##     musical, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.100  -9.271   0.554   9.928  42.620 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    34.78813    1.28266  27.122  < 2e-16 ***
## critics_score   0.47082    0.02059  22.864  < 2e-16 ***
## documentaryyes  7.30842    2.14317   3.410  0.00069 ***
## horroryes      -9.65763    3.01051  -3.208  0.00140 ** 
## musicalyes      9.28238    4.14005   2.242  0.02529 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.11 on 646 degrees of freedom
## Multiple R-squared:  0.5161, Adjusted R-squared:  0.5131 
## F-statistic: 172.3 on 4 and 646 DF,  p-value: < 2.2e-16

Although this regression model with just four variables has a slightly lower adjusted R-squared than the 11-variable model formed by adding all ten genre indicator variables to critics_score (0.5131 versus 0.5197), the difference is slight, the model is simpler, and all of its variables have significant p-values. Arguably, then, using both adjusted R-squared and p-values as selection criteria, this last model best balances predictive power with simplicity and should be considered the most parsimonious model produced by our forward selection technique.
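As a cross-check on these model comparisons, adjusted R-squared can be recomputed by hand from multiple R-squared via the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The helper function below is our own illustration, not part of the course materials:

```r
# Adjusted R-squared from multiple R-squared, n observations, p predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj_r2(0.5161, n = 651, p = 4)   # four-variable model: ~0.5131
adj_r2(0.5278, n = 651, p = 11)  # full genre model:    ~0.5197
```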

Modeling IMDb Rating

Modeling IMDb rating using only Rotten Tomatoes’ critics_score results in an adjusted R-squared of 0.5846.

m_imdb_crit <- lm(imdb_rating ~ critics_score, data = movies)
summary(m_imdb_crit)
## 
## Call:
## lm(formula = imdb_rating ~ critics_score, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.93679 -0.39499  0.04512  0.43875  2.47556 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.8075715  0.0620690   77.45   <2e-16 ***
## critics_score 0.0292177  0.0009654   30.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6991 on 649 degrees of freedom
## Multiple R-squared:  0.5853, Adjusted R-squared:  0.5846 
## F-statistic: 915.9 on 1 and 649 DF,  p-value: < 2.2e-16

Using the final set of variables we selected for modeling audience_score above, we obtain an adjusted R-squared of 0.5955, as shown below. Note, though, that the musical indicator is no longer significant at the 0.05 level in this model.

m_imdb_crit_plus <- lm(imdb_rating ~ critics_score + documentary + horror + musical, data = movies)
summary(m_imdb_crit_plus)
## 
## Call:
## lm(formula = imdb_rating ~ critics_score + documentary + horror + 
##     musical, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.99241 -0.36857  0.03371  0.43952  2.42419 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.864606   0.062713  77.569  < 2e-16 ***
## critics_score   0.027801   0.001007  27.613  < 2e-16 ***
## documentaryyes  0.382945   0.104786   3.655 0.000279 ***
## horroryes      -0.325780   0.147193  -2.213 0.027227 *  
## musicalyes      0.303970   0.202420   1.502 0.133669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6899 on 646 degrees of freedom
## Multiple R-squared:  0.598,  Adjusted R-squared:  0.5955 
## F-statistic: 240.2 on 4 and 646 DF,  p-value: < 2.2e-16

Part 5: Prediction

We now use our final audience score model to obtain predictions of audience score for movies released after the time span of the data set (1970 to 2014).

10 Cloverfield Lane (2016)

From the Rotten Tomatoes website, the 2016 movie “10 Cloverfield Lane” has a critics score of 90 and is classified as a horror movie. Our model predicts an audience score of 68, while the actual audience score on Rotten Tomatoes is 79. The observed score falls within the model’s 95% prediction interval, which ranges from 39 to 96.

cloverfield <- data.frame(critics_score = 90, documentary = "no", horror = "yes", musical = "no")
predict(m_doc_hor_mus, cloverfield, interval = "prediction", level = 0.95)
##        fit      lwr     upr
## 1 67.50425 39.13961 95.8689
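As a sanity check, this point estimate can be reproduced by hand from the coefficients of m_doc_hor_mus reported in Part 4 (the intercept, the critics_score slope, and the horror indicator):

```r
# fit = intercept + slope * critics_score + horror coefficient
34.78813 + 0.47082 * 90 + (-9.65763)
# ≈ 67.5, matching the fit from predict()
```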

Knives Out (2019)

From the Rotten Tomatoes website, the movie “Knives Out” has a critics score of 96. Our model predicts an audience score of 80 (as seen below), while the reported audience score on the website is 92. This falls within the model’s 95% prediction interval, which extends from 52 to 107 (the upper bound exceeding the 0 to 100 scale).

knives_out <- data.frame(critics_score = 96, documentary = "no", horror = "no", musical = "no")
predict(m_doc_hor_mus, knives_out, interval = "prediction", level = 0.95)
##       fit      lwr      upr
## 1 79.9868 52.20568 107.7679

Part 6: Conclusion

For our research question (Which variables in the “movies” data set are predictors for the general popularity of a movie?), we settled upon Rotten Tomatoes’ audience score and the IMDb rating as representatives of general popularity; these two variables are highly correlated with one another. To find predictors, we focused on Rotten Tomatoes’ audience score and examined several candidates. The obvious one was Rotten Tomatoes’ critics score. In addition, indicators for whether a movie is a documentary, a horror movie, or belongs to the genre “Musical & Performing Arts” were significant and improved adjusted R-squared.

In making predictions, however, our final models have large uncertainties, so individual predictions are not very precise. Other variables not included in the data set might help, for example a more detailed categorization of genres. A better grasp of how the critics score is determined might also lead to better predictions. These would be avenues for further research.