Data Wrangling and Visualization of WeRateDogs Tweet Data

Introduction

WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though, are almost always greater than 10, e.g., 11/10, 12/10, 13/10. Why? Because "they're good dogs".

WeRateDogs has over 4 million followers and has received international media coverage.

Since real-world datasets rarely come clean, I will be using Python and its libraries to gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, and then clean it. This is called data wrangling!

Data description

Twitter archive: The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column in the archive contains each tweet's text, from which the rating, dog name, and dog stage (i.e., doggo, floofer, pupper, or puppo) were extracted. To produce this "enhanced" Twitter archive, the 5000+ tweets were filtered down to only those with ratings (2366).

Image prediction: Every image in the WeRateDogs Twitter archive was classified into different breeds using a neural network algorithm. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4, since tweets can have up to four images).

Additional tweet data: Additional information about each tweet, such as the number of likes and retweets, will be retrieved from Twitter to support further insights into WeRateDogs tweets.

Data Gathering Section

Before gathering any data, I import all the necessary libraries.

Loading twitter archive dataset

Downloading the image predictions from the Udacity server

Loading image_predictions.tsv file into a dataframe

⚠⚠⚠⚠
Attention!

Since I have already retrieved the data required for this study through the Twitter API, there is no need to run the code below to obtain more Twitter data. All of the data used in this analysis is in the data directory.

Getting additional information about each tweet using the tweet IDs in the Twitter archive dataset

Loading the tweets retrieved from twitter into a dataframe

Assessing Data

Twitter archive file

Observation

Number of NA in the dataset

Ratings column

Descriptive analysis on numerical columns

Observation

Both the minimum and maximum values of the rating numerator and denominator look wrong: ratings are not supposed to be this low or this high!

Observation

Observation

Observation

Tweet id column

Checking for duplicate values on tweet_id
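
One way to run this check is with Series.duplicated(), sketched here on a small made-up sample. The dataframe name `archive` and the ID values are assumptions for illustration, not taken from the notebook.

```python
import pandas as pd

# Hypothetical sample standing in for the Twitter archive
archive = pd.DataFrame({"tweet_id": [1001, 1002, 1003]})

# duplicated() flags every repeat of an earlier value; the sum counts them
n_duplicates = archive["tweet_id"].duplicated().sum()
print(f"Duplicate tweet IDs: {n_duplicates}")
```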

Observation

All tweets are unique; there are no duplicates.

Name column

Observation

Some dog names are not real names, e.g., "a", "None", "such", "the", "this", "unacceptable", and "very". Apart from "None", all of these invalid names share the same pattern: they are lowercase.

Dog stages

Observation

Missing values are stored as the string "None" rather than as NA.

Assessing additional tweet information

Observation

All tweets are unique; there are no duplicates.

Observation

Descriptive analysis on numerical columns

Assessing image prediction

Description of image prediction columns:

Observation

  • Data quality issue: tweet_id is misrepresented, i.e., stored with the wrong data type.
  • Dog breed names are inconsistently capitalized (a mix of uppercase and lowercase).
  • Words in breed names are separated with "_" instead of " ".

Descriptive analysis on numerical columns

Quality issues

Twitter archive table

Tweets table

Image prediction table

Tidiness issues

Twitter archive table

Image prediction table

Data Cleaning

Making a separate copy of each of the three datasets makes it simple to compare the data before and after the tidiness and quality problems are fixed.

Data Quality 1

Define

Use the pandas.to_datetime() function to convert the timestamp column to datetime objects, and change the data type of the tweet_id column to str. Then sort the dataframe in ascending order of timestamp.

Code
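
A minimal sketch of this step on a two-row made-up sample. The name `archive_clean` (for the copy of the archive) and the sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical two-row sample; archive_clean is an assumed name for the archive copy
archive_clean = pd.DataFrame({
    "tweet_id": [1002, 1001],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2015-11-15 22:32:08 +0000"],
})

# Parse the timestamp strings into timezone-aware datetimes
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])
# IDs are labels, not quantities, so store them as strings
archive_clean["tweet_id"] = archive_clean["tweet_id"].astype(str)
# Oldest tweet first
archive_clean = archive_clean.sort_values("timestamp").reset_index(drop=True)
```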

Test

Tidiness issue 1

Define

Extract both time and date from the timestamp object and delete the timestamp column after extraction.

Code
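
The split can be sketched with the `.dt` accessor, assuming the timestamp column already holds datetimes. The new column names `date` and `time` are assumptions, as is the sample data.

```python
import pandas as pd

# Hypothetical sample with an already-parsed timestamp column
archive_clean = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-11-15 22:32:08", "2017-08-01 16:23:56"]),
})

# Pull the calendar date and the clock time into their own columns
archive_clean["date"] = archive_clean["timestamp"].dt.date
archive_clean["time"] = archive_clean["timestamp"].dt.time
# The original column is now redundant
archive_clean = archive_clean.drop(columns="timestamp")
```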

Test

Data Quality 2

Twitter archive: Misrepresentation of NA values

Define

Use the DataFrame.replace() method to replace all the "None" strings in the dog stage columns with np.nan.

Code
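
A short sketch of the replacement, restricted to the four dog stage columns so that no other column is touched. The sample rows are made up.

```python
import numpy as np
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Hypothetical sample: each row has at most one stage, the rest are "None" strings
archive_clean = pd.DataFrame({
    "doggo": ["None", "doggo", "None", "None"],
    "floofer": ["None", "None", "floofer", "None"],
    "pupper": ["pupper", "None", "None", "None"],
    "puppo": ["None", "None", "None", "puppo"],
})

# Swap the placeholder string for a real missing value
archive_clean[stage_cols] = archive_clean[stage_cols].replace("None", np.nan)
```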

Test

Tidiness issue 2

Twitter archive: Dog stage requires only one column

Define

Use the melt() method to gather all dog stage values into one column, and remove the redundant dog stage columns, i.e., 'doggo', 'floofer', 'pupper', 'puppo'.
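
One possible way to do this is sketched below on made-up rows. It assumes the "None" strings have already been replaced with NaN, and it keeps a single stage per tweet (the alphabetically first one if a tweet has several), which is a simplification rather than a rule from the notebook. The column name `dog_stage` is also an assumption.

```python
import numpy as np
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Hypothetical sample: tweet 3 has no stage at all
archive_clean = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": [np.nan, "doggo", np.nan],
    "floofer": [np.nan, np.nan, np.nan],
    "pupper": ["pupper", np.nan, np.nan],
    "puppo": [np.nan, np.nan, np.nan],
})

# Wide to long: one row per (tweet, stage column)
melted = archive_clean.melt(id_vars="tweet_id", value_vars=stage_cols,
                            value_name="dog_stage")
melted = melted.drop(columns="variable")

archive_clean = (
    melted.sort_values("dog_stage")  # NaN sorts last by default
          .drop_duplicates(subset="tweet_id", keep="first")  # real stage wins over NaN
          .sort_values("tweet_id")
          .reset_index(drop=True)
)
```

Tweets without any stage survive with a missing `dog_stage`, so no rows are lost in the reshaping.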

Test

Data Quality 3

Redundant URLs in the source column

Define

Use a regular expression pattern to extract the text between the <a> tags.

Code
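
A sketch of the extraction, assuming each source value is a single HTML anchor element like the made-up samples below. The exact pattern used in the notebook is not shown, so this regular expression is only one possibility.

```python
import pandas as pd

# Hypothetical sample values shaped like HTML anchor elements
archive_clean = pd.DataFrame({
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    ]
})

# Capture whatever sits between the end of the opening tag and the closing tag
archive_clean["source"] = archive_clean["source"].str.extract(
    r">([^<]+)</a>", expand=False
)
```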

Test

Data Quality 4

Twitter archive: Unknown URL at the end of tweet texts.

Define

Remove the URL at the end of each tweet text using the str.replace() method.

Code
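
A minimal sketch with a regular expression anchored to the end of the string, so that only a trailing link is removed. The sample texts and the URL are made up.

```python
import pandas as pd

# Hypothetical sample: one tweet ends with a link, one does not
archive_clean = pd.DataFrame({
    "text": [
        "This is Phineas. 13/10 https://t.co/abc123",
        "No link here 12/10",
    ]
})

# Strip a trailing URL together with the whitespace in front of it
archive_clean["text"] = archive_clean["text"].str.replace(
    r"\s*https?://\S+$", "", regex=True
)
```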

Test

Data Quality 5

Twitter archive: Invalid dog names

Define

Remove all dog names that are lowercase and replace all "None" values with np.nan in the dataframe.

Code
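
A sketch of this cleanup that treats both kinds of invalid name the same way and sets them to NaN. Interpreting "remove" as "set to missing" is an assumption, and the sample names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mixing real names with invalid ones
archive_clean = pd.DataFrame({"name": ["Phineas", "a", "None", "such", "Tilly"]})

# Invalid names are either all lowercase or the literal string "None"
invalid = archive_clean["name"].str.islower() | (archive_clean["name"] == "None")
archive_clean.loc[invalid, "name"] = np.nan
```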

Test

Data Quality 6

Twitter archive: Invalid ratings representation

Define

Extract both the rating numerator and denominator from the text column, making sure not to extract dates as ratings! Then use the astype() method to change their type from str to float.

Code
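
One possible approach is sketched below. It finds every "x/y" fragment in the text and keeps the last one per tweet, on the assumption that date-like fragments such as 9/11 tend to appear before the actual rating. That heuristic, the column names, and the sample texts are all assumptions, and it will not handle tweets that rate several dogs.

```python
import pandas as pd

# Hypothetical sample: a decimal rating, and a tweet with a date-like fragment
archive_clean = pd.DataFrame({
    "text": [
        "This is Bella. She hopes her smile made you smile. 13.5/10",
        "She was the last surviving 9/11 search dog, and our second ever 14/10",
    ]
})

# Numerator may be a decimal; denominator is a whole number
pattern = r"(?P<rating_numerator>\d+(?:\.\d+)?)/(?P<rating_denominator>\d+)"

# extractall returns one row per match, indexed by (original row, match number)
matches = archive_clean["text"].str.extractall(pattern)

# Heuristic (an assumption): the last match in each tweet is the rating
ratings = matches.groupby(level=0).last().astype(float)
archive_clean = archive_clean.join(ratings)
```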

Test

Data Quality 7

Twitter archive: Huge ratings numerator and denominator are invalid

Define

Use the pandas query() method to keep only the rows with a rating denominator of 10 and a rating numerator <= 14. Then rename the rating numerator to final ratings and drop the denominator, since all remaining ratings share the same denominator.

Code
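
A sketch of the filter on made-up ratings. The column names `rating_numerator`, `rating_denominator`, and `final_rating` are assumptions.

```python
import pandas as pd

# Hypothetical sample with two plausible ratings and two outliers
archive_clean = pd.DataFrame({
    "tweet_id": ["1", "2", "3", "4"],
    "rating_numerator": [13.0, 1776.0, 84.0, 12.0],
    "rating_denominator": [10.0, 10.0, 70.0, 10.0],
})

archive_clean = (
    archive_clean
    .query("rating_denominator == 10 and rating_numerator <= 14")
    .rename(columns={"rating_numerator": "final_rating"})
    .drop(columns="rating_denominator")  # constant after the filter
    .reset_index(drop=True)
)
```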

Test

Data Quality 8

Twitter archive: Retweeted tweets and replies not needed.

Define

Remove all retweeted tweets and replies

Code
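
A sketch that treats a row as a retweet or reply when the corresponding status ID is present. The column names `retweeted_status_id` and `in_reply_to_status_id` are assumed here, and the sample values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: tweet 2 is a reply, tweet 3 is a retweet
archive_clean = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "in_reply_to_status_id": [np.nan, 8.1e17, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 7.9e17],
})

is_reply = archive_clean["in_reply_to_status_id"].notna()
is_retweet = archive_clean["retweeted_status_id"].notna()

# Keep only original tweets
archive_clean = archive_clean[~is_reply & ~is_retweet].reset_index(drop=True)
```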

Test

Data Quality 9

Twitter archive: Retweet and reply columns are no longer needed

Define

Drop all retweet and reply columns using the pandas drop() method

Code
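
A sketch that selects the columns by prefix instead of listing them by hand. The prefixes `retweeted_status` and `in_reply_to` are assumptions about how these columns are named.

```python
import pandas as pd

# Hypothetical empty frame carrying the assumed column layout
archive_clean = pd.DataFrame(columns=[
    "tweet_id", "text",
    "in_reply_to_status_id", "in_reply_to_user_id",
    "retweeted_status_id", "retweeted_status_user_id", "retweeted_status_timestamp",
])

# Collect every column that belongs to a retweet or a reply
drop_cols = [
    col for col in archive_clean.columns
    if col.startswith(("retweeted_status", "in_reply_to"))
]
archive_clean = archive_clean.drop(columns=drop_cols)
```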

Test

Data Quality 10

Define

Change the datatype of tweet_id column in both image prediction and tweets additional info table to string using astype() method in pandas

Code
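
A short sketch of the conversion, applied to both tables in one loop. The names `image_clean` and `tweets_clean` are assumed names for the two copies.

```python
import pandas as pd

# Hypothetical samples with numeric IDs
image_clean = pd.DataFrame({"tweet_id": [1001, 1002]})
tweets_clean = pd.DataFrame({"tweet_id": [1001, 1002]})

# Convert the ID column of each table to string
for df in (image_clean, tweets_clean):
    df["tweet_id"] = df["tweet_id"].astype(str)
```

Keeping the IDs as strings in all three tables also means they can later be joined on tweet_id without a type mismatch.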

Test

Tidiness issue 3

Image prediction: Only best image prediction is needed

Define

Take the best image prediction for the image in each tweet. Once the best prediction has been taken, it is reasonable to drop all other columns except the tweet ID.

Code
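
One way to define "best" is sketched here: take the highest-ranked of the three predictions that is actually flagged as a dog, and leave the breed missing when none is. This definition is an assumption, as are the column names (`p1`, `p1_dog`, and so on, plus the new `breed`). The sketch also keeps jpg_url alongside tweet_id, since a later step filters on it.

```python
import pandas as pd

# Hypothetical sample: three tweets, three ranked predictions each
image_clean = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "jpg_url": ["https://example.com/1.jpg", "https://example.com/2.jpg",
                "https://example.com/3.jpg"],
    "p1": ["golden_retriever", "paper_towel", "seat_belt"],
    "p1_dog": [True, False, False],
    "p2": ["Labrador_retriever", "Chihuahua", "web_site"],
    "p2_dog": [True, True, False],
    "p3": ["kuvasz", "pug", "desk"],
    "p3_dog": [True, True, False],
})

# Start from p1 where it is a dog, then fall back to p2, then p3
breed = image_clean["p1"].where(image_clean["p1_dog"])
breed = breed.fillna(image_clean["p2"].where(image_clean["p2_dog"]))
breed = breed.fillna(image_clean["p3"].where(image_clean["p3_dog"]))

# Keep only the identifying columns plus the chosen breed
image_clean = image_clean[["tweet_id", "jpg_url"]].assign(breed=breed)
```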

Test

Data Quality 11

Image prediction: Inconsistent dog breed names

Define

Replace "_" with " " and capitalize all dog breeds.

Code
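
A sketch using the pandas string methods. It uses str.title(), which capitalizes every word; if only the first letter should be uppercase, str.capitalize() is the alternative. The column name `breed` is an assumption.

```python
import pandas as pd

# Hypothetical sample with mixed casing and underscores
image_clean = pd.DataFrame({
    "breed": ["golden_retriever", "Labrador_retriever", "pug"]
})

image_clean["breed"] = (
    image_clean["breed"]
    .str.replace("_", " ", regex=False)  # literal underscore, not a pattern
    .str.title()                         # capitalize each word
)
```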

Test

Tidiness issue 4

There are three dataframes where there should be one, because they all describe tweet details

Define

Merge the three dataframes together based on their tweet ID

Code
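
A sketch of the merge using two chained left joins on tweet_id. The left join is an assumption: it keeps archive tweets that have no image prediction, so they can be inspected or removed afterwards. All dataframe and column names here are illustrative.

```python
import pandas as pd

# Hypothetical samples; tweet 2 has no image prediction
archive_clean = pd.DataFrame({"tweet_id": ["1", "2"], "final_rating": [13.0, 12.0]})
tweets_clean = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "favorite_count": [100, 50],
    "retweet_count": [20, 10],
})
image_clean = pd.DataFrame({
    "tweet_id": ["1"],
    "jpg_url": ["https://example.com/dog1.jpg"],
    "breed": ["Pug"],
})

# Archive is the base table; the other two add columns where IDs match
master = (
    archive_clean
    .merge(tweets_clean, on="tweet_id", how="left")
    .merge(image_clean, on="tweet_id", how="left")
)
```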

Test

Data Quality 12

Tweets without a dog image

Define

Remove all tweet rows without dog images (jpg_url)

Code
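
A sketch with dropna() restricted to the jpg_url column, so that missing values in other columns do not cause rows to be dropped. The dataframe name `master` and the sample rows are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical merged sample: tweet 2 has no image
master = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "jpg_url": ["https://example.com/1.jpg", np.nan, "https://example.com/3.jpg"],
    "name": ["Tilly", "Archie", np.nan],
})

# Only a missing jpg_url disqualifies a row
master = master.dropna(subset=["jpg_url"]).reset_index(drop=True)
```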

Test

Data restructuring

I will rename the following columns:

I will also make tweet_id the index of the dataframe

Column position restructuring

Storing Data

Analyzing and visualizing data

Observation

Confirming the correlation between likes and retweets over time
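
Besides a plot, the strength of the relationship can be summarized with a Pearson correlation coefficient. The sketch below covers only that coefficient, on made-up counts. The column names `favorite_count` and `retweet_count` are assumptions.

```python
import pandas as pd

# Hypothetical like and retweet counts for four tweets
master = pd.DataFrame({
    "favorite_count": [100, 200, 300, 400],
    "retweet_count": [30, 55, 95, 120],
})

# Series.corr uses the Pearson method by default
r = master["favorite_count"].corr(master["retweet_count"])
print(f"Pearson r between likes and retweets: {r:.2f}")
```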

Observation

Ratings distribution

Observation

Dogs with highest ratings

Observation

Most common dog breeds

Top 10 tweeted dog breeds

Observation

Rarely tweeted dog breeds

Observation

Top 5 dogs with highest Likes

Dog with highest likes

The most liked dog is said to be cute and adorable 😀

Top 5 dogs with highest retweets

Dog with highest retweets

The most retweeted dog is said to be found swimming 🥰