We are going to do a lot of data cleaning in this first analysis. Our first goal will be to separate genuine tweets from mass marketing tweets and determine which terms are most popular in these genuine tweets.
import json
import string
import re
from collections import Counter
import pandas as pd
from pandas import DataFrame, Series
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
%matplotlib inline
path = 'wwdc_test.json'
with open(path) as f:
    records = [json.loads(line) for line in f]
df = DataFrame(records)
df.count()
Here is the output:
_id 161813
coordinates 597
created_at 161813
followers_count 161813
friends_count 161813
hashtags 161813
screen_name 161813
source 161813
text 161813
dtype: int64
This is our starting data. That is, we start with 161,813 tweets, and for each tweet we have recorded the time it was created, its text, the screen name, the source, and so on.
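As an aside, newer versions of pandas can load newline-delimited JSON directly with read_json(lines=True), which is equivalent to the json.loads loop above. A self-contained sketch with made-up records:

```python
import json
import pandas as pd

# Build a tiny newline-delimited JSON file in the same shape as wwdc_test.json
sample = [
    {"text": "Hello #WWDC16", "source": "Twitter for iPhone"},
    {"text": "Stream on!", "source": "TweetDeck"},
]
with open("sample.json", "w") as f:
    for rec in sample:
        f.write(json.dumps(rec) + "\n")

# lines=True tells pandas that each line is a separate JSON record
df = pd.read_json("sample.json", lines=True)
print(df.shape)  # (2, 2)
```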
print df['text'].describe()['top']
Here is the output:
RT @pschiller: Stream On!
#WWDC16 https://t.co/tfCf6en942
Apparently this is the most retweeted tweet during WWDC. Philip W. Schiller is the senior vice president of worldwide marketing at Apple. He is a prominent figure in Apple's public presentations.
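For a string column, describe() reports the count, the number of unique values, the most frequent value (top), and how often it occurs (freq), which is how the most-retweeted text surfaces above. A toy illustration:

```python
from pandas import Series

# For object (string) data, describe() returns count/unique/top/freq
texts = Series(["RT @a: hi", "RT @a: hi", "hello", "RT @a: hi"])
summary = texts.describe()
print(summary["top"], summary["freq"])  # RT @a: hi 3
```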
We will now analyze the sources of the tweets. That is, were the tweets posted from an iPhone, an iPad, or some mass-marketing product?
#Sort sources by how often they were used. Many sources (like ShootingStarPorn) do not produce genuine tweets.
#twitterfeed and dlvr.it are platforms for deploying mass tweets, and IFTTT is a favorite of marketers.
#We will want to take these out before doing our analysis.
df.groupby('source').size().sort_values(ascending=False)
Here is the output:
source
Twitter for iPhone 43989
Twitter Web Client 33358
Twitter for Android 11338
TweetDeck 11090
Tweetbot for Mac 10649
IFTTT 10180
Tweetbot for iOS 9753
Twitter for Mac 6747
Twitter for iPad 4798
twitterfeed 3488
RoundTeam 1324
dlvr.it 1063
Hootsuite 1013
Twitter for Windows 885
Mobile Web (M5) 725
Facebook 596
Twitterrific 542
Echofon 478
OS X 476
Buffer 386
Fenix for Android 365
ShootingStarPorn for API1.1 354
iOS Demo Application 354
YoruFukurou 333
Twitter for Windows Phone 309
WordPress.com 305
Instagram 303
...
dtype: int64
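The groupby-size-sort pattern used above is a common pandas idiom for building a frequency table; a minimal sketch with hypothetical sources:

```python
from pandas import DataFrame

toy = DataFrame({"source": ["iPhone", "IFTTT", "iPhone", "dlvr.it", "iPhone", "IFTTT"]})
# Count rows per source, then sort from most to least common
counts = toy.groupby("source").size().sort_values(ascending=False)
print(counts)
# iPhone     3
# IFTTT      2
# dlvr.it    1
```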
#Let's create a list of organic or genuine sources
organic_sources = ['Twitter for iPhone', 'Twitter Web Client', 'Twitter for Android', 'TweetDeck',
                   'Tweetbot for Mac', 'Tweetbot for iOS', 'Twitter for Mac', 'Twitter for iPad',
                   'Twitter for Windows', 'Twitterrific', 'Echofon', 'OS X',
                   'Twitter for Windows Phone', 'WordPress.com', 'Google',
                   'Facebook', 'Instagram']
We are now going to get rid of many tweets: retweets, tweets that are not from genuine sources, and tweets designed only to attract more followers on Twitter.
#Let's remove retweets, remove inorganic sources, and retain only text-unique tweets.
uniques = df.drop_duplicates(inplace=False, subset='text')
organics = uniques[~uniques['text'].str.startswith('RT')]
organics = organics[~organics['text'].str.startswith('rt')]
# In case RT was placed further in the text than the beginning.
organics = organics[~organics['text'].str.contains(' RT ', case=False)]
#Delete follower-farming and off-topic spam tweets (e.g. from www.followersfree.net)
spam_terms = ['LinkedIn', '#XboxE3', '#mondaymotivation', '#gettingtoknowReedfans',
              'XXL', 'Lil', 'Iniesta', 'Freshman']
for term in spam_terms:
    organics = organics[~organics['text'].str.contains(term, case=False)]
#Only keep organic sources
organics = organics[organics['source'].isin(organic_sources)]
print organics.count()
Here is the output:
_id 65018
coordinates 127
created_at 65018
followers_count 65018
friends_count 65018
hashtags 65018
screen_name 65018
source 65018
text 65018
dtype: int64
Thus, only 65,018 of our starting 161,813 tweets are original tweets. As you may have guessed, the majority of tweets with the WWDC hashtag are not individuals tweeting their own thoughts about the conference; there is a lot of marketing and product placement, and I probably didn't remove all of it.
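The dedupe-and-filter pipeline above, shrunk to a four-row example with hypothetical tweets, shows how each step removes rows:

```python
from pandas import DataFrame

toy = DataFrame({
    "text":   ["RT @x: wow", "great keynote", "great keynote", "buy followers now"],
    "source": ["Twitter for iPhone", "Twitter for iPhone", "TweetDeck", "dlvr.it"],
})
uniq = toy.drop_duplicates(subset="text")              # drops the duplicate text (3 rows left)
org = uniq[~uniq["text"].str.startswith("RT")]         # drops the retweet (2 rows left)
org = org[org["source"].isin(["Twitter for iPhone"])]  # keeps organic sources (1 row left)
print(len(org))  # 1
```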
Now we are going to make a histogram of the most popular terms in the text of WWDC tweets.
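The core of the counting step can be sketched first with a naive whitespace tokenizer and a couple of hypothetical tweets (the real code below uses NLTK's TweetTokenizer, which additionally strips @handles and shortens elongated words):

```python
from collections import Counter

tweets = ["Siri on the Mac!", "the new Siri api"]
words = []
for t in tweets:
    # Lowercase, drop '#' so hashtags count as plain words, split on whitespace
    words.extend(t.lower().replace("#", " ").split())
word_counts = Counter(words)
print(word_counts.most_common(2))
```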
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
organic_words = []
for index, row in organics.iterrows():
    organic_words.extend(tknzr.tokenize(re.sub(r'#', ' ', row['text'].lower())))
count_words = Counter(organic_words)
#Remove stopwords and punctuation
my_punctuations = list(string.punctuation)
my_punctuations.extend(["'s", "``", "...", "n't", "''", "amp", "..", u'\u2019',
                        u'\u201c', u'\ud83c', u'\u201d', u'\u2026'])
my_stopwords = stopwords.words('english')
my_stopwords.extend(["it's"])
#Iterate over a copy of the keys so we can safely pop while looping
for word in list(count_words.keys()):
    if word in my_stopwords or word in my_punctuations:
        count_words.pop(word, None)
        #print "Popping", word
print count_words.most_common(100)
Here is the output:
[(u'wwdc', 64396), (u'2016', 44727), (u'apple', 19415), (u'ios', 8082),
(u'new', 6422), (u'10', 6190), (u'\ud83d', 6080), (u'siri', 4790), (u'16', 4409), (u'watch', 4272),
(u'app', 3989), (u'macos', 3421), (u'watchos', 3403), (u'keynote', 2869), (u'like', 2822),
(u'live', 2462), (u'3', 2342), (u'apps', 2323), (u'mac', 2288), (u'music', 2111), (u'developers', 2091),
(u'messages', 2046), (u'today', 1985), (u'time', 1857), (u"apple's", 1855), (u"i'm", 1719),
(u'swift', 1694), (u'get', 1619), (u'iphone', 1598), (u'features', 1536), (u'imessage', 1477),
(u'one', 1463), (u'coming', 1442), (u'os', 1417), (u'really', 1388), (u'looks', 1369),
(u'finally', 1361), (u'great', 1289), (u'tim', 1283), (u'going', 1272), (u'feature', 1257),
(u'tv', 1251), (u'cook', 1225), (u'watching', 1206), (u'see', 1163), (u'news', 1147), (u'pay', 1128),
(u'sierra', 1108), (u'home', 1106), (u'tvos', 1093), (u'photos', 1075), (u'good', 1071),
(u'updates', 1070), (u'love', 1065), (u'emoji', 1047), (u'maps', 1043), (u'use', 1041),
(u'ipad', 1022), (u'cool', 1009), (u'people', 989), (u'thing', 986), (u'screen', 979),
(u'moment', 932), (u'much', 928), (u'right', 922), (u'oh', 893), (u'orlando', 892), (u'big', 886),
(u'still', 882), (u"can't", 878), (u'year', 870), (u"don't", 846), (u'x', 837), (u'silence', 834),
(u'google', 821), (u'wait', 819), (u'\ude02', 806), (u'want', 795), (u'make', 788), (u'awesome', 777),
(u'update', 774), (u'2', 765), (u'go', 751), (u'event', 750), (u'emojis', 741), (u'need', 732),
(u'getting', 732), (u'well', 712), (u'touch', 711), (u'nice', 703), (u'developer', 702),
(u'next', 695), (u'playgrounds', 693), (u'universal', 690), (u'web', 688), (u'stage', 686),
(u'clipboard', 686), (u'better', 686), (u'excited', 684), (u"that's", 683)]
Let me emphasize that much of the code above was about cleaning the tweets to the point where this became a reasonable list of words. In case you are curious, u'\ud83d' and u'\ude02' are not whole characters but UTF-16 surrogate halves that the tokenizer split apart: u'\ud83d' begins many emoji, and the pair u'\ud83d\ude02' together encodes the "face with tears of joy" emoji (U+1F602). The most common topics after the generic "wwdc" and "apple" terms are iOS and Siri. Let's create a D3 visualization for this information.
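As an aside on those escapes: since they are UTF-16 surrogate halves rather than standalone characters, the pair can be recombined into the full emoji. A small Python 3 demonstration (not part of the original analysis):

```python
# u'\ud83d' and u'\ude02' are a UTF-16 surrogate pair: together they
# encode the single code point U+1F602, "face with tears of joy".
pair = "\ud83d" + "\ude02"
# Round-trip through UTF-16 to merge the surrogates into one character
emoji = pair.encode("utf-16", "surrogatepass").decode("utf-16")
print(emoji == "\U0001F602")  # True
```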
This list already says a lot. For instance, iOS (8,082 mentions) and 10 (6,190 mentions, the new iOS version) each appear roughly twice as often as watchOS (3,403) or macOS (3,421), which gives some quantitative backing to the claim that Apple is now more focused on the iPhone than on its other devices, such as computers.
Time series analysis of WWDC twitter data is next.