Data Wrangling Project -- "We Rate Dogs" on Twitter

Author: Brian Allan Woodcock

Date: March 2021

Course: Data Analyst Nanodegree (Udacity)

Gathering Data

Assessing Data


df_archive


df_api


df_predictions

Among the five cases above in which all three predictions were non-dog, the joint predictions were wrong in only one case, index number 18. The picture shows the back of a dog whose front side is visible only on a computer screen.

There are a lot of tweets of animals that are not dogs. The predictions in df_predictions can be used to help find them, albeit with some degree of error.


df_archive.tweet_id


df_archive.in_reply_to_status_id


df_archive.source


df_archive.retweeted_status_id

There are 181 non-null retweeted_status_id's, but only 112 of them show up among the tweet_id's. Presumably the others are retweets of tweets outside the time period of this selection, or of tweets that were excluded from this selection in some other way. Let's examine a tweet/retweet pair below.

Index number 75 is the tweet and index number 73 is the retweet. Notice that the timestamp of the tweet is the same as the retweeted_status_timestamp, as expected. Also as expected, the expanded_urls look to be the same. Notice too that the retweet begins with "RT @dog_rates:" and then repeats the text of the tweet. Can we find all the retweets not only by looking for retweeted_status_id, but also by using the text variable?

df_archive.expanded_urls

Most of the cases without expanded urls (i.e., the 59 without pictures) are replies (55); only 4 are not replies. The reply comments are interesting, though, because they reveal that sometimes ratings get corrected/upgraded, or as they say, "pupgraded".

These 4 cases should definitely be dropped. The first is a retweet. The other three do not seem to be about a specific dog picture.


df_archive: rating_numerator and rating_denominator


df_archive.name


df_archive: doggo, floofer, pupper, puppo

No, they are not mutually exclusive. Sometimes a "doggo" is also called a "puppo" (1 case). Likewise for the pairs "doggo" and "pupper" (12 cases) and "doggo" and "floofer" (1 case).

This is a picture of a black lab with a sign on its back saying "Support Labs". The use of both descriptors for the same dog clearly created some confusion, leading one commenter to ask "which is it? Doggo or puppo?" However, this case illustrates that both attributes can be applied together, suggesting that these are not values of a single variable, but individual variables.

460: One dog, being called both a pupper (developmental use) and a doggo (generic use).
531: Two different dogs. One a puppy and the other an adult. Using "pupper" and "doggo" as developmental stages.
565: Two different dogs, one a puppy and one an adult.
575: Same dog. "doggo" being used generically.
705: Not a dog, but a hedgehog.
733: Two different dogs. Developmental use of terms.
778: Two different dogs. Developmental use.
822: Two different dogs. Developmental use.
889: Two different dogs. Developmental use.
956: One dog. But non-attributive use of terms.
1063: Same as 822
1113: Same as 778

The picture is of a dog in a log with only its fluffy head visible. The remark is meant to be rhetorical, so the dog is truly being considered a "doggo" by implication. The designation "floofer" is being used non-attributively in jest, though one might quibble about this.

Cleaning Issues

Quality Issues

Tidiness Issues

Cleaning Data

Making Copies
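A minimal sketch of the copying step, assuming the working copies follow the naming used later in this report (df_archive0, df_predictions0; df_api0 is assumed by analogy):

```python
# Keep the gathered data untouched; all cleaning happens on the copies
df_archive0 = df_archive.copy()
df_api0 = df_api.copy()
df_predictions0 = df_predictions.copy()
```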


df_archive: The terms 'doggo', 'floofer', 'pupper', and 'puppo' are often used as values of a single variable, but not always. Exceptions by index number from df_archive are: 191, 460, and 575.

df_archive: doggo, floofer, pupper, and puppo columns should be categorical, not object.

Define

Replace 'None' with np.nan in the four columns. Combine the 'doggo', 'floofer', 'pupper', and 'puppo' columns into a single variable, dog_designation, in which these designations become values of the variable. Rare exceptions with dual designations can be handled by allowing the hyphenated values 'floofer-doggo', 'puppo-doggo', and 'pupper-doggo' to stand in for the corresponding combination.

Explanation: The new column dog_designation will register the designation attributed to a dog by We Rate Dogs, when such a designation was given. When no designation was given, it shall take a null value. So, dog_designation does not signify what a dog really is, but rather what We Rate Dogs designated, if any designation was given. Second, although 'pupper', 'puppo', and 'doggo' are often used to designate mutually exclusive stages in the development of a dog (one might call this the developmental use of these terms), so that a 'pupper' is a really young dog, a 'puppo' an adolescent, and a 'doggo' an adult, the terms are not always used this way. The term 'doggo' sometimes has a generic usage, covering any dog no matter what age. With the generic usage of 'doggo' in play, something can be both a pupper and a doggo -- a "pupper-doggo". Similarly, a dog can be a 'puppo-doggo' or a 'floofer-doggo' when 'doggo' is used in the generic sense. Sometimes, however, dual designations are due to the fact that the tweet concerns a picture with two dogs in it.

Code
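A minimal sketch of one way to carry this out, assuming df_archive0 is the working copy and the four columns contain the literal string 'None' where no designation was given:

```python
import numpy as np

# Order chosen so 'doggo' lands last in dual designations,
# giving 'pupper-doggo', 'puppo-doggo', and 'floofer-doggo'
stage_cols = ['puppo', 'pupper', 'floofer', 'doggo']

# Replace the literal string 'None' with a real missing value
df_archive0[stage_cols] = df_archive0[stage_cols].replace('None', np.nan)

# Join the non-null designations per row; rows with no designation stay null
df_archive0['dog_designation'] = df_archive0[stage_cols].apply(
    lambda row: '-'.join(row.dropna()) or np.nan, axis=1)

# The combined column is categorical; the four originals are no longer needed
df_archive0['dog_designation'] = df_archive0['dog_designation'].astype('category')
df_archive0 = df_archive0.drop(columns=stage_cols)
```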
Test

Merge df_api with df_archive

Define

Merge the two dataframes.

Code
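A sketch of the merge, assuming both working copies carry a tweet_id column (df_api0 is the assumed name of the copy of df_api):

```python
# Left join keeps every archive row, even where no API data was retrieved
df_archive0 = df_archive0.merge(df_api0, on='tweet_id', how='left')
```

Keeping every archive row means favorite_count and retweet_count pick up missing values wherever no API data matched, which would explain why they arrive as float64 (the next issue below).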
Test

New Cleaning Issue: favorite_count and retweet_count should be int, not float64

Define

Change favorite_count and retweet_count to Int64.

Code
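One way to do this, using pandas' nullable integer dtype so that any missing counts are preserved:

```python
# Nullable Int64 stores whole numbers while still allowing missing values
for col in ['favorite_count', 'retweet_count']:
    df_archive0[col] = df_archive0[col].astype('Int64')
```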
Test

df_archive: "timestamp" columns should be datetime (columns 3 and 8)

df_archive: "_id" columns should be string rather than int64 or float64 (columns 0, 1, 2, 6, 7)

Define

Change timestamp and retweeted_status_timestamp to datetime. Change all "_id" columns to string.

Code
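A sketch covering both conversions; selecting the ID columns by their "_id" suffix avoids listing them by hand:

```python
import pandas as pd

# Timestamp columns to datetime
for col in ['timestamp', 'retweeted_status_timestamp']:
    df_archive0[col] = pd.to_datetime(df_archive0[col])

# Every '_id' column to string; the intermediate Int64 cast keeps IDs that
# were stored as floats from ending up in scientific notation
for col in [c for c in df_archive0.columns if c.endswith('_id')]:
    df_archive0[col] = df_archive0[col].astype('Int64').astype('string')
```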
Test

df_archive: source should be split into two variables - source_name and source_url - extracted from the HTML. source_name should be categorical and source_url should be string (or object).

Define

Extract source_name and source_url from the HTML in the source column. Assign the values to the appropriate new columns. Change source_name to categorical; source_url should already be string (or object). Drop the source column.

Code
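A sketch, assuming the source field is an HTML anchor tag whose href is the client URL and whose link text names the client:

```python
# Pull the href and the link text out of the anchor tag
parts = df_archive0['source'].str.extract(r'<a href="([^"]+)"[^>]*>([^<]+)</a>')
df_archive0['source_url'] = parts[0]
df_archive0['source_name'] = parts[1].astype('category')

# The raw HTML column is no longer needed
df_archive0 = df_archive0.drop(columns='source')
```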
Test

df_archive: expanded_urls sometimes contains more than one url, separated by a comma. This defeats the ability to simply click the value to go to the webpage. Sometimes the second url is just a repeat of the first (as in rows 4 and 7); sometimes the two urls are different (as in row 6).

Define

Split expanded_urls into two columns -- url_1 and url_2. Where the urls are the same, only record one of them, leaving a null value for url_2.

Code
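A sketch of the split; rows with a single url get a null url_2, and a second url that merely repeats the first is also nulled out:

```python
import numpy as np

# Split on the comma; this sketch keeps at most the first two urls
urls = df_archive0['expanded_urls'].str.split(',', expand=True)
df_archive0['url_1'] = urls[0]
df_archive0['url_2'] = urls[1] if urls.shape[1] > 1 else np.nan

# Where the second url repeats the first, keep only the first
repeat = df_archive0['url_2'] == df_archive0['url_1']
df_archive0.loc[repeat, 'url_2'] = np.nan

df_archive0 = df_archive0.drop(columns='expanded_urls')
```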
Test

df_archive: 181 retweets are included, but they are not wanted given the goal of keeping only "original ratings (no retweets) that have images".

Define

Drop the 181 rows with retweets. In addition, drop the columns dealing with retweets.

Code
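A sketch; the retweet-related columns are selected by their shared prefix:

```python
# Keep only rows that are not retweets
df_archive0 = df_archive0[df_archive0['retweeted_status_id'].isna()].copy()

# Drop every retweet-related column
retweet_cols = [c for c in df_archive0.columns if c.startswith('retweeted_status')]
df_archive0 = df_archive0.drop(columns=retweet_cols)
```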
Test

df_archive: There are 79 cases of duplicates for expanded_urls, not including NaN's. Is that a problem?

df_archive: There are 59 cases without expanded urls and no original picture. In these cases most (55) are replies. The remaining 4 should be dropped because they are retweets or not about a specific dog picture. Some of the replies mention that the rating is being "pupgraded".

Define

expanded_urls has been converted to url_1 and url_2, and since those columns were populated with urls in sequential order, a null url_1 identifies exactly the rows where expanded_urls was null. Keep the replies among them and drop the 4 rows that are not replies.

Code
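A sketch of the filter, keeping the replies and dropping the handful of non-reply rows that have no picture:

```python
# No picture: url_1 is null; not a reply: in_reply_to_status_id is null
no_picture = df_archive0['url_1'].isna()
not_reply = df_archive0['in_reply_to_status_id'].isna()
df_archive0 = df_archive0[~(no_picture & not_reply)].copy()
```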
Test

Reorder the columns in a more intuitive manner


There are duplicates: 778/1113, 822/1063 by index number in df_archive. 778 and 822 are retweets.

Define

First, check whether the duplicates (778, 822) have already been eliminated. Second, if they have not, eliminate them.

Code
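A quick check; since 778 and 822 are retweets, they should already have disappeared when the retweets were dropped:

```python
# Which of the duplicate rows are still present?
leftovers = [idx for idx in (778, 822) if idx in df_archive0.index]
print('Still present:', leftovers)

# Drop any that survived (dropping an empty list is a no-op)
df_archive0 = df_archive0.drop(index=leftovers)
```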

df_archive: Index number 200 in df_archive should be considered a "doggo" by implication, but not a "floofer".

Define

Change dog_designation for index number 200 to 'doggo' if it is not already.

Code
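A short fix for index 200, assuming 'doggo' is already among the categories of dog_designation (it is used by many other rows):

```python
# Record the implied 'doggo' designation for index 200
if df_archive0.loc[200, 'dog_designation'] != 'doggo':
    df_archive0.loc[200, 'dog_designation'] = 'doggo'
```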
Test

df_archive: There are some cases in which the tweet concerns two different dogs as opposed to a single dog, often a puppy ("pupper") and an adult dog ("doggo") together: 531, 565, 733, 778/1113, 822/1063, 889 by index number in df_archive.

Define
Code
Test

df_archive: Index number 705 in df_archive concerns a hedgehog, not a dog.

Define

Drop the row for index number 705.

Code
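A one-liner, assuming index 705 is still present in the working copy:

```python
# The hedgehog post is not about a dog
df_archive0 = df_archive0.drop(index=705)
```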
Test

df_archive: There are some obvious mistakes in the name column: 'None', 'a', 'the', 'an',...

Define

Extract names from the text column and compare the result with what was provided originally, replacing the original if the result improves on it. Where no name is extracted, fill in with a null value instead of the word 'None'.

Code

Note: The regex pattern used to find names was developed by looking at lines of text from df_archive0 and trying out candidate patterns at the regex tester website https://pythex.org/. As the pattern took shape, it was used to isolate lines of text in df_archive0 that did NOT yield a name; those lines were examined visually for missed patterns, which were then added to the regex and tested again. The code below reflects part of that process, with pattern being updated repeatedly with new regex patterns.
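A sketch of the extraction step only; the pattern shown is an illustration of the kind of expression arrived at by that process, not the project's final pattern:

```python
import re
import numpy as np

# Illustrative pattern: names typically follow phrases like
# 'This is ...', 'Meet ...', 'Say hello to ...', or 'named ...'
pattern = r"(?:[Tt]his is|[Mm]eet|[Ss]ay hello to|named)\s+([A-Z][a-z]+)"

def extract_name(text):
    match = re.search(pattern, text)
    return match.group(1) if match else np.nan

# New candidate column, to be compared against the original name column
df_archive0['name_new'] = df_archive0['text'].apply(extract_name)
```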

Test

None of the new names are in the list of problem names from the old extraction.

The original name extraction resulted in 2109 names but 696 of them were clearly problematic, leaving 2109 - 696 = 1413 non-problematic names. The new extraction results in 1437 (non-problematic) names.

The new name column improves on the old one. It improves on the problematic captures ('None', 'a', 'an', 'the', 'very', 'one', 'O', 'my') of the old one while capturing everything else that the old one did. So, the old one can be replaced with the new.

Clean up

df_archive: rating_denominator has implausible values in some cases (e.g., 50, 80, 170, 0, etc.)

df_archive: rating_numerator has implausible values in some cases (e.g., 420, 75, 80, etc.)

Define

Produce a new extraction from the text column of numerator and denominator values. In some cases, the numerator was provided in the text as a decimal. So, the data type of the numerator column should be "float". Compare the new extraction with the old one and replace the old one if the new is better.

Code
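A sketch of the new extraction; str.extract returns the first ratio in the text, which matches the behaviour noted in the observations below (the pattern itself is illustrative):

```python
import pandas as pd

# Capture the first 'numerator/denominator' ratio; the numerator may be a decimal
ratio = df_archive0['text'].str.extract(r'(\d+(?:\.\d+)?)\s*/\s*(\d+)')
df_archive0['rating_numerator_new'] = ratio[0].astype(float)
df_archive0['rating_denominator_new'] = pd.to_numeric(ratio[1]).astype('Int64')
```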
Test

Comparing the new denominator extraction with the old.

There is one place where the new denominator extraction is null when the old denominator extraction is not. When we look at it we see that it should be null. So, the new extraction gets it right.

When we look at cases where the new denominator extraction differs from the old (see below), we see that the new gets it right. So, the new denominator extraction is an improvement.

Where they differ, the new denominator extraction is better than the old.

Let's compare the new numerator extraction with the old. We already see that in the above three cases the new is better than the old.

In comparing the new numerator with the old we first see the three cases in which the old denominator was not 10 but should have been. Let us examine the text in the other cases.

The new extraction, including decimal values, is an improvement.

Clean Up
Observations

Observation: Where the denominator is not 10, it is because more than one dog is involved. In the above list, 1165 and 1202 should not be included because the text contained two ratios and the extraction picked up the first when the second was really the one wanted. Excluding those two cases, every remaining case with a denominator greater than 10 involves more than one dog. A denominator greater than 10 is therefore a sufficient indicator of more than one dog, but not a necessary one; this is evidenced by case 1165, which includes more than one dog yet awards "13/10 for all".

Observation: Low ratings are often given for something comical or ironic. Often the picture is not a picture of a dog. So, low ratings are an indicator, though not a perfect one, that the picture does not contain a dog.

Observation: There are two unusually high ratings in the set. The first is a rating of 1776/10. This is for a dog dressed up in the colors and symbols of the American flag and posted on July 4, the American Independence Day, where 1776 is the famous year of American independence. The rating is clearly non-serious; rather, it is a patriotic gesture. It should be replaced with pd.NA.

The second is a rating of 420/10. The score of 420 is a reference to the smoking of marijuana and the picture is not that of an actual dog, but of the rapper "Snoop Dogg". This row should be removed.

df_archive0: Drop the row with index 2074. For the row with index 979, replace the numerator and denominator ratings with pd.NA.

Code
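A sketch, using np.nan as the concrete missing value and assuming the cleaned ratings kept the column names rating_numerator and rating_denominator:

```python
import numpy as np

# Index 2074 (the Snoop Dogg post) is dropped outright
df_archive0 = df_archive0.drop(index=2074)

# Index 979 (the 1776/10 Independence Day rating) keeps its row,
# but the rating itself is recorded as missing
df_archive0.loc[979, ['rating_numerator', 'rating_denominator']] = np.nan
```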
Test

df_archive0: Change the rating of 1165 to 13/10 and change 1202 to 11/10.

Code
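A sketch of the two corrections:

```python
# The extraction picked up the wrong ratio for these two rows
df_archive0.loc[1165, ['rating_numerator', 'rating_denominator']] = [13, 10]
df_archive0.loc[1202, ['rating_numerator', 'rating_denominator']] = [11, 10]
```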
Test

Merge (outer) df_predictions0 with df_archive0, using tweet_id as the new index

Code
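A sketch of the merge; the name df_master for the result is illustrative, and the key dtypes are aligned first because tweet_id was converted to string in the archive copy:

```python
# Align key dtypes, then merge so rows present in only one table are kept
df_predictions0['tweet_id'] = df_predictions0['tweet_id'].astype('string')

df_master = (df_archive0
             .merge(df_predictions0, on='tweet_id', how='outer')
             .set_index('tweet_id'))
```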
Test

Save Archive Master

Analyzing and Visualizing

Time Period

The time period for this study of "We Rate Dogs" spans 624 days, from November 2015 until August 2017.

What are the most popular dog names in the posts from "We Rate Dogs" during this time period? We can see below that the answer for the top four names is: 'Charlie', 'Oliver', 'Lucy', and 'Cooper'.
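One way to get the counts, assuming the merged master table is df_master (the illustrative name used above) and missing names are stored as nulls:

```python
# Most frequent dog names; null names are excluded by value_counts
df_master['name'].value_counts().head(4)
```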

Retweet Count vs Favorite Count

One might expect the retweet_count to be strongly correlated with the favorite_count. We will investigate that to see if that expectation is confirmed.

With a correlation of 0.926, the two features favorite_count and retweet_count are strongly positively correlated, as one might expect.

We can perform a linear regression with retweet_count as the response and favorite_count as the predictor.

We can make a scatter plot of retweet_count versus favorite_count, adding in the regression line with parameters calculated above.
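A sketch of the correlation, the fit, and the plot, assuming scipy and matplotlib are available (the project may have used different tooling):

```python
import matplotlib.pyplot as plt
from scipy import stats

# Rows with both counts present
counts = df_master[['favorite_count', 'retweet_count']].dropna().astype(float)

# Correlation and simple linear fit: retweet_count ~ favorite_count
fit = stats.linregress(counts['favorite_count'], counts['retweet_count'])
print(f'r = {fit.rvalue:.3f}, slope = {fit.slope:.3f}, intercept = {fit.intercept:.1f}')

# Scatter plot with the fitted line overlaid
plt.scatter(counts['favorite_count'], counts['retweet_count'], s=5, alpha=0.3)
plt.plot(counts['favorite_count'],
         fit.intercept + fit.slope * counts['favorite_count'], color='red')
plt.xlabel('favorite_count')
plt.ylabel('retweet_count')
plt.show()
```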

Favorite Count vs Rating

"We Rate Dogs" provides ratings as a ratio with a non-negative numerator and a denominator that is often 10 or some multiple of 10 -- for example "13/10". When the denominator is some multiple of 10, it is almost always because more than one dog is involved in the picture(s). For what follows, we will exclude cases with multiple dogs as best we can by only looking at ratings where the denominator is 10. (This technique is not a perfect filter for just posts about a single dog since sometimes a rating might be give as "12/10 for each one". But at least in that case the rating is clearly meant to apply to each individual rather than to the group in aggregate.) By selecting posts only where the denominator is 10, we can summarize the rating easily by simply looking at the rating numerator alone, which we will do.

The mean rating is 10.6 and the median is 11. The bulk of the ratings, the three upper quartiles, are greater than or equal to 10. This is also evident in the histogram below which shows a left-skewed distribution.

We can plot favorite_count versus rating_numerator to see whether there is any apparent relationship. Below we see that higher favorite counts are associated with higher ratings.

We see that the value of 10 bifurcates the ratings into two groups: less than 10 and greater than or equal to 10. We saw previously that the number of ratings increases substantially at 10. Plus, we see here that higher frequency counts only occur for ratings greater than or equal to 10. So, in examining favorite count versus rating further, we will divide the data according to this natural bifurcation.

Nearly all of the favorite counts for the low ratings (< 10) were less than 20000, and most were less than 5000. For the high ratings (>= 10), that entire range only reaches the first gradation above 0 on the plot. The high ratings see a much higher range of favorite counts than the low ratings.

Ratings as Dog Predictors

It was observed informally that low ratings sometimes indicate that a picture does not contain a dog but rather a different kind of animal (a hedgehog, for example). We will investigate using ratings as an indicator for whether the post concerns a dog. The question we would like to investigate here is, "Does 'We Rate Dogs' use low ratings to indicate that the post is not about a dog?" Or more precisely, "How strong an indicator is the rating for the presence of a dog in the post?" To answer this, we would need to know which posts concern a dog and which do not. Apart from recording that information for each post by human observation (a tedious task), we will use the machine learning (ML) prediction data provided to us as a benchmark for whether a post concerns a dog. This is an imperfect benchmark, since the prediction data itself should be tested for reliability. So, the more accurate way of looking at this investigation is as a comparison of the ratings as dog predictors against the machine learning prediction data provided. The ML prediction data provides three ranked predictions -- first, second, and third in confidence. We will compare the ratings as dog predictors with (1) the top ranked predictions of the ML algorithm as well as with (2) the conjunction of the ranked predictions. The latter is a conservative prediction and will only predict a dog when all three ranked predictions say there is a dog (i.e., the conjunction of the predictions); otherwise, the default will be that there is not a dog.
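A sketch of the two benchmarks. The prediction column names p1_dog, p2_dog, and p3_dog (True/False flags for the first, second, and third ranked predictions) are an assumption here:

```python
# Only tweets that actually have image predictions
preds = df_master.dropna(subset=['p1_dog']).copy()

# Benchmark 1: the top ranked prediction alone
preds['top_says_dog'] = preds['p1_dog'].astype(bool)

# Benchmark 2: conservative conjunction -- a dog only if all three agree
preds['all_say_dog'] = (preds['p1_dog'].astype(bool)
                        & preds['p2_dog'].astype(bool)
                        & preds['p3_dog'].astype(bool))
```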

In the plot, the proportion of orange to blue per bar increases roughly as the ratings increase. This means that the proportion of ML predictions of "Dog" to "No Dog" increases roughly as the ratings increase. (Or, from the opposite perspective, lower ratings are stronger indicators of "No Dog".) This increase is particularly strong for ratings less than 10 and is less so (it appears to roughly flatten out) for ratings greater than or equal to 10.

This relationship suggests that the dog ratings themselves can be viewed somewhat as predictors of "Dog" versus "No Dog". We will examine this using logistic regression over a few cases (a sketch of the model-fitting approach follows the list):

1) Rating as a predictor of Top Ranked ML Dog Predictions
2) Rating as a predictor of the Conjunction of ML Dog Predictions
3) Rating as predictor of Top Ranked ML Dog Predictions for low ratings (< 10)
4) Rating as predictor of Top Ranked ML Dog Predictions for high ratings (>= 10)
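A sketch of the fitting approach for case 1, assuming statsmodels is available and using the benchmark columns sketched above; the other cases just swap the response column or filter the rows first:

```python
import numpy as np
import statsmodels.api as sm

# Case 1: rating as a predictor of the top ranked ML dog prediction
data = preds.dropna(subset=['rating_numerator'])
X = sm.add_constant(data['rating_numerator'].astype(float))
y = data['top_says_dog'].astype(int)

model = sm.Logit(y, X).fit()
print(model.summary())

# Odds ratio for a one-unit increase in rating
print(np.exp(model.params['rating_numerator']))
```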

Interpretation
The resulting logistic model is:
$\log_e \Big( \frac{p_i}{1 - p_i} \Big) = -1.933 + 0.289 \times \text{rating}$.

Where $ \frac{p_i}{1 - p_i} $ is the odds of the ML algorithm predicting a "Dog" given a particular $\text{rating}$, the odds ratio is
$\text{Odds Ratio} = \frac{\text{Odds given rating = x + 1}}{\text{Odds given rating = x}}$.
It can be shown that
$\text{Odds Ratio} = \exp(\text{coeff. of rating})$
so that in this case the odds ratio is $\exp(0.289) = 1.34$ which indicates that we expect a multiplicative change of 1.34 in the odds for an increase in rating by 1. In other words, the model predicts a 34% increase in the odds of the ML algorithm predicting a "Dog" for each increase in rating by 1.
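The identity in the last step is quick to verify: since the model gives $\log_e \Big( \frac{p_i}{1 - p_i} \Big) = \beta_0 + \beta_1 \times \text{rating}$, the odds at a rating of $x$ are $\exp(\beta_0 + \beta_1 x)$, and therefore
$\text{Odds Ratio} = \frac{\exp(\beta_0 + \beta_1 (x + 1))}{\exp(\beta_0 + \beta_1 x)} = \exp(\beta_1)$,
the exponential of the rating coefficient. The same identity is used in each of the interpretations that follow.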

Logistic Regression: Rating as predictor of the Conjunction of ML Dog Predictions

Interpretation
The resulting logistic model is:
$\log_e \Big( \frac{p_i}{1 - p_i} \Big) = -1.876 + 0.218 \times \text{rating}$.

Where $ \frac{p_i}{1 - p_i} $ is the odds of the ML algorithm predicting a "Dog" given a particular $\text{rating}$, the odds ratio is
$\text{Odds Ratio} = \frac{\text{Odds given rating = x + 1}}{\text{Odds given rating = x}}$.
It can be shown that
$\text{Odds Ratio} = \exp(\text{coeff. of rating})$
so that in this case the odds ratio is $\exp(0.218) = 1.24$ which indicates that we expect a multiplicative change of 1.24 in the odds for an increase in rating by 1. In other words, the model predicts a 24% increase in the odds of the ML algorithm predicting a "Dog" for each increase in rating by 1.

The odds ratio for the Top Ranked ML Prediction was 1.34 and, in this case, the odds ratio for the Conjunction of ML Predictions was 1.24. So, whichever ML prediction technique we use, the "We Rate Dogs" rating itself seems to track with the ML algorithm in a loose way.

Next we look at how the ratings compare when partitioned into low (< 10) and high (>= 10). We expect that the ratings will be a stronger indicator over the low partition than the high.

Logistic Regression: Rating as predictor of Top Ranked ML Dog Predictions for low ratings (< 10)

Interpretation
The resulting logistic model is:
$\log_e \Big( \frac{p_i}{1 - p_i} \Big) = -3.167 + 0.452 \times \text{rating}$.

Where $ \frac{p_i}{1 - p_i} $ is the odds of the ML algorithm predicting a "Dog" given a particular $\text{rating}$, the odds ratio is
$\text{Odds Ratio} = \frac{\text{Odds given rating = x + 1}}{\text{Odds given rating = x}}$.
It can be shown that
$\text{Odds Ratio} = \exp(\text{coeff. of rating})$
so that in this case the odds ratio is $\exp(0.452) = 1.57$ which indicates that we expect a multiplicative change of 1.57 in the odds for an increase in rating by 1. In other words, the model predicts a 57% increase in the odds of the ML algorithm predicting a "Dog" for each increase in rating by 1.

Logistic Regression: Rating as predictor of Top Ranked ML Dog Predictions for high ratings (>= 10)

Interpretation
The resulting logistic model is:
$\log_e \Big( \frac{p_i}{1 - p_i} \Big) = 0.239 + 0.0978 \times \text{rating}$.

Where $ \frac{p_i}{1 - p_i} $ is the odds of the ML algorithm predicting a "Dog" given a particular $\text{rating}$, the odds ratio is
$\text{Odds Ratio} = \frac{\text{Odds given rating = x + 1}}{\text{Odds given rating = x}}$.
It can be shown that
$\text{Odds Ratio} = \exp(\text{coeff. of rating})$
so that in this case the odds ratio is $\exp(0.0978) = 1.10$ which indicates that we expect a multiplicative change of 1.10 in the odds for an increase in rating by 1. In other words, the model predicts a 10% increase in the odds of the ML algorithm predicting a "Dog" for each increase in rating by 1.

So, when comparing the use of rating as a predictor of the ML algorithm's top ranked prediction giving a verdict of "Dog", the lower ratings yield an odds ratio of 1.57 as compared to 1.10 for the upper ratings. At the lower ratings, differences in rating carry more information about whether the subject of the post is a dog (taking the ML algorithm's predictions as a baseline).