Portfolio | Brian Allan Woodcock

Welcome to my data science project portfolio. Since 2019, I have been acquiring the skills of a data scientist via online courses offered by Udacity, Coursera, and Udemy. Through these courses and accompanying projects, I have learned to code in Python, R, and SQL. Below you will see examples of my use of these languages to gather, clean, wrangle, explore, model, visualize, analyze, and present data. Each course required the completion of a particular kind of project. Details about each project's requirements, including the course it was created for, can be found in the project descriptions. Verifiable certificates of completion for each course are listed on my LinkedIn page under "Licenses & certifications." To learn more about me, visit my website.

July 31, 2022

Exploration of 2019 Wage Income in Minnesota by Data Visualizations

This exploration looked at wage income in Minnesota as reported in the 2019 American Community Survey (ACS) administered through the US Census Bureau. Observations include the following. (a) Wage income was clearly positively correlated with educational attainment. (b) Average wages increased with age into the 40s, plateaued, and then began to decline in the 60s. (c) Whites and Asians had higher average wages than the other major racial/ethnic categories (Blacks, Hispanics, and American Indians). Educational attainment appears to be a factor in explaining this difference. (d) Higher average wages were associated with being married with the spouse present. (e) Differences in average wage income based on sex (males earning higher wages on average than females) existed across the various categories of age, educational attainment, race/ethnicity, and marital status, which suggests that the overall difference is fairly robust.

Project Description (GitHub)

Summary Report

Full Report (From Jupyter Notebook)

Slide Presentation

March 12, 2021

Wrangling and Analyzing Data from “We Rate Dogs” on Twitter

This project looked at tweets from "We Rate Dogs" over a period of 624 days from November 2015 until August 2017. These tweets are often accompanied by pictures as well as ratings. Not all pictures and posts concern dogs, but most do. Data was gathered from the Twitter API. Assessment identified eleven data quality issues and six tidiness issues. After much cleaning and wrangling, a master dataset was produced. Exploration resulted in the following insights. (a) The four most popular dog names in the tweets were 'Charlie', 'Oliver', 'Lucy', and 'Cooper'. (b) Retweet count and favorite count are strongly positively correlated, as one might expect, with a correlation coefficient of 0.926, which corresponds to an R-squared value of 0.857. (c) Higher favorite counts are associated with higher ratings. (d) Finally, using logistic regression, models were devised to answer the question "How strong is the rating as an indicator for the presence in the post of a dog as opposed to some other kind of animal?"
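The link between the correlation coefficient and the R-squared value in insight (b) can be checked with a short Python sketch. It relies on the fact that, for a simple linear regression with a single predictor, R-squared equals the square of Pearson's r. The function name `pearson_r` and the toy lists are illustrative assumptions and are not taken from the project notebook.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Squaring the reported coefficient reproduces the reported R-squared.
reported_r = 0.926
print(round(reported_r ** 2, 3))  # 0.857

# Toy data (hypothetical, not tweet data): a perfectly linear relationship.
toy_retweets = [1, 2, 3, 4]
toy_favorites = [3, 6, 9, 12]
r = pearson_r(toy_retweets, toy_favorites)
print(round(r ** 2, 3))  # 1.0
```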

Project Description (GitHub)

Report of Results

Full Report (From Jupyter Notebook)

Data Wrangling Report

October 8, 2020

Analyze A/B Test Results

Guided by a template, statistical analysis was performed on almost 300,000 rows of website user data detailing whether a user received a new or old (“treatment vs control”) version of a website landing page and whether they “converted” (i.e., bought the company’s product) or not. Analysis was performed in three separate ways: a built-in function, a bootstrapping simulation, and logistic regression. It was determined that the new page did not result in a statistically significant difference in conversion rate.
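As a rough illustration of the bootstrapping approach mentioned above, the following sketch resamples two groups of 0/1 conversion outcomes with replacement and builds a percentile interval for the difference in conversion rates. The data is synthetic, and the function names (`bootstrap_rate_differences`, `percentile_interval`), group sizes, and resample count are assumptions made for this example; the project's actual procedure may differ.

```python
import random


def bootstrap_rate_differences(control, treatment, n_resamples=2000, seed=0):
    """Resample each group with replacement and record treatment minus control rates."""
    rng = random.Random(seed)
    differences = []
    for _ in range(n_resamples):
        control_sample = rng.choices(control, k=len(control))
        treatment_sample = rng.choices(treatment, k=len(treatment))
        control_rate = sum(control_sample) / len(control_sample)
        treatment_rate = sum(treatment_sample) / len(treatment_sample)
        differences.append(treatment_rate - control_rate)
    return differences


def percentile_interval(values, lower=2.5, upper=97.5):
    """Approximate percentile interval taken from the sorted bootstrap values."""
    ordered = sorted(values)
    low_index = int(len(ordered) * lower / 100)
    high_index = min(int(len(ordered) * upper / 100), len(ordered) - 1)
    return ordered[low_index], ordered[high_index]


# Synthetic outcomes (1 = converted, 0 = did not convert); NOT the project data.
control = [1] * 120 + [0] * 880
treatment = [1] * 118 + [0] * 882

diffs = bootstrap_rate_differences(control, treatment)
low, high = percentile_interval(diffs)
print(f"95% bootstrap interval for the difference: [{low:.4f}, {high:.4f}]")
```

If the interval contains zero, as it should for two synthetic groups this similar, the data gives no evidence of a difference in conversion rate at that level.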

Project Description (GitHub)

Full Report (From Jupyter Notebook)

June 5, 2020

Investigating Fertility Rates Across Countries of the World

How have fertility rates across countries of the world changed over time? How have fertility rates been distributed across regions of the world recently (2011)? Are fertility rates correlated with other factors? These questions were investigated using Gapminder data. It was observed that fertility rates across countries have been decreasing over time, tending toward rates between 2 and 3 children per woman. In most regions of the world (the Americas, East Asia & the Pacific, the Middle East & North Africa, and South Asia), average regional fertility rates were between 2 and 3; in Europe and Central Asia, the average was below 2. The notable exception was Sub-Saharan Africa, with an average fertility rate of just under 5. It was observed from plots that fertility rates are negatively correlated with income (GDP per capita), life expectancy, sanitation, female literacy, and contraceptive use. One might think that fertility rates would also be correlated with the percentage of females who have joined the work force, but that was not observed.

Project Description (GitHub)

Full Report (From Jupyter Notebook)

March 23, 2020

Extracting, Smoothing, and Visualizing Weather Trend Data

Temperature data made available by Udacity and obtained via SQL was examined over a period of 264 years (1750 to 2013). The raw temperature data was smoothed and visualized to show global and local (Minneapolis) temperature trends. Smoothing and plotting were accomplished in multiple ways involving SQL, Google Sheets, and R. Observations include the following. (a) Minneapolis has been roughly 3 degrees (Celsius) cooler on average than the global average yearly temperature. (b) The average global yearly temperature hovered around 8 degrees (C) until the early twentieth century. (c) The Minneapolis average yearly temperature stayed between roughly 4 and 5 degrees (C) until the early twentieth century. (d) There is an upward trajectory to the temperature, both globally and locally, starting in the early twentieth century and resulting in an increase in average yearly temperature of 1 to 1.5 degrees Celsius as compared with the previous century.
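The description above does not say which smoothing method was used. A common choice for yearly temperature series is a trailing moving average, sketched below under that assumption. The function name `moving_average`, the window size, and the sample values are hypothetical and are not taken from the project.

```python
def moving_average(values, window):
    """Trailing moving average; returns len(values) - window + 1 smoothed points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running_total = sum(values[:window])
    smoothed = [running_total / window]
    for i in range(window, len(values)):
        # Slide the window forward: add the newest value, drop the oldest.
        running_total += values[i] - values[i - window]
        smoothed.append(running_total / window)
    return smoothed


# Hypothetical yearly averages in degrees Celsius (not the Udacity data).
yearly_temps = [8.0, 9.0, 7.0, 8.0, 9.0, 10.0]
print(moving_average(yearly_temps, window=3))
```

A wider window gives a smoother curve at the cost of fewer output points and a slower response to recent changes.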

Project Description (GitHub)

Full Report (From R Markdown)

December 19, 2019

Multiple Linear Regression Modeling and Movie Prediction using Data from Rotten Tomatoes and IMDb

Data from Rotten Tomatoes and IMDb for movies between 1970 and 2014 was used to make linear regression models, taking Rotten Tomatoes' audience score and the IMDb rating as representatives of general popularity. These variables are highly correlated with one another. The search for predictors focused on Rotten Tomatoes’ audience score. As expected, critics’ score was a good predictor. In addition, whether or not a movie was a documentary, a horror film, or in the genre “Musical & Performing Arts” was also significant and found to make a difference in adjusted R-squared. In making predictions, however, the final models have large uncertainties. So, when the models were used, predictions (for such movies as "Knives Out" and "10 Cloverfield Lane") were not that accurate.
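Adjusted R-squared, the criterion mentioned above, penalizes a model for each extra predictor, so a variable only improves it if it explains enough additional variance. The standard formula is sketched below. The function name and the numbers in the example are illustrative assumptions and are not values from the project.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    degrees_of_freedom = n_observations - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n_observations - 1) / degrees_of_freedom


# Illustrative values only: the same raw R-squared with more predictors
# yields a lower adjusted R-squared.
few_predictors = adjusted_r_squared(0.50, n_observations=101, n_predictors=4)
many_predictors = adjusted_r_squared(0.50, n_observations=101, n_predictors=8)
print(few_predictors > many_predictors)  # True
```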

Project Description (GitHub)

Full Report (From R Markdown)

November 9, 2019

Interactive Python Program to Explore US Bikeshare Data

This interactive Python program provides the user with options for exploring randomly selected bikeshare data from the first six months (January through June) of 2017 for three large U.S. cities: Chicago, New York City, and Washington DC. After selection of the city and the time period of interest by the user, information is provided about the most frequent times of travel, the most popular stations, trip duration statistics, and user statistics.
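One of the statistics described above, the most frequent time of travel, can be computed along the following lines. This is only a sketch: the timestamp format, the function name `most_common_start_hour`, and the sample rows are assumptions for illustration and may not match the program's actual data files or code.

```python
from collections import Counter
from datetime import datetime


def most_common_start_hour(start_times):
    """Return the hour of day (0-23) that occurs most often among the start times."""
    hours = [
        datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").hour  # assumed format
        for stamp in start_times
    ]
    hour, _count = Counter(hours).most_common(1)[0]
    return hour


# Hypothetical sample rows, not taken from the bikeshare files.
sample_start_times = [
    "2017-01-03 08:15:00",
    "2017-02-11 17:42:10",
    "2017-03-20 08:05:33",
    "2017-06-01 08:59:59",
]
print(most_common_start_hour(sample_start_times))  # 8
```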

Project Description (GitHub)

October 17, 2019

Statistical Inference with Data from the General Social Survey

This investigation of General Social Survey data found that there was not a statistically significant difference in U.S. confidence in science based on political stance (left-leaning, moderate, or right-leaning) in 2012.

Project Description (GitHub)

Full Report (From R Markdown)

October 7, 2019

SQL Queries of a Relational Database for a DVD Rental Company

Four questions to explore a DVD rental database were developed and SQL queries were devised to answer them. (1) What is the percentage of overall rentals for each film category grouped by performance quartiles? (2) What is the top rented film overall, and what are the top 5 films rented by customers who rented the top rented film? (3) What are the top 10 films (by rental quantity) rented by the top 100 customers (by rental quantity), and what is the inventory for each of these films? (4) What percentage of sales from family-friendly movies comes from each category of family-friendly movies? A Google slide presentation with visualizations for addressing these questions was created.

Project Description (GitHub)

Slide Presentation

July 30, 2019

Exploratory Data Analysis of 2013 Behavioral Risk Factor Data

Using 2013 data from the Behavioral Risk Factor Surveillance System, this exploratory data analysis examined three questions by means of plots and summary statistics: (1) Are unemployed females aged 18-64 more likely to have health care coverage than unemployed males? The investigation found a difference of 2.9% in favor of females. (2) What is the difference between the mean number of days physical health was not good in the previous month as reported by those who said they were in excellent general health versus those who said they were in poor general health? A large difference of roughly 22 days was found. (3) What are the top three auxiliary forms of exercise reported by female dancers? By male dancers? How do the lists compare? Female dancers listed "walking", "no other activity", and "running" as their top alternatives. Male dancers listed "walking", "no other activity", and "weight lifting" as theirs. So, the lists agree on the first two.

Project Description (GitHub)

Full Report (From R Markdown)
