WWDC Analysis 1

We are going to do a lot of data cleaning in this first analysis. Our first goal is to separate genuine tweets from mass-marketing tweets and to determine which terms are most popular among the genuine ones.

import json
import string
import re
from collections import Counter
import pandas as pd
from pandas import DataFrame, Series
import nltk
from nltk.corpus import stopwords  # requires the NLTK stopwords corpus: nltk.download('stopwords')
import matplotlib.pyplot as plt
%matplotlib inline

path = 'wwdc_test.json'
# Each line of the file is one tweet stored as a JSON object
record = [json.loads(line) for line in open(path)]
df = DataFrame(record)

df.count()

Here is the output:

_id                161813
coordinates           597
created_at         161813
followers_count    161813
friends_count      161813
hashtags           161813
screen_name        161813
source             161813
text               161813
dtype: int64

This is our starting data. That is, we start with 161,813 tweets, and you can see that for each tweet we have recorded the time it was created, the text, the screen_name, the source, and so on.
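As a quick sanity check on the load, we can peek at a few columns of the first rows (a minimal sketch; the column selection here is just illustrative):

# Inspect a few columns of the first rows and confirm the row count
print df[['created_at', 'screen_name', 'source']].head()
print len(df)   # 161813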

print df['text'].describe()['top']

Here is the output:

RT @pschiller: Stream On!
#WWDC16 https://t.co/tfCf6en942

Apparently this was the most retweeted tweet during WWDC (retweets share identical text, so describe()['top'] picks out the most repeated tweet). Philip W. Schiller is the senior vice president of worldwide marketing at Apple and a prominent figure in Apple's public presentations.

We will now analyze the sources of the tweets: were they sent from an iPhone, an iPad, or from some mass-marketing tool?

#Sort sources by frequency of use.  Many of the sources (like ShootingStarPorn) do not produce genuine tweets.
#Also, twitterfeed and dlvr.it are platforms for deploying mass tweets, and IFTTT is popular with marketers.
#We will want to take these out before doing our analysis.
df.groupby('source').size().sort_values(ascending=False)

Here is the output:

source
Twitter for iPhone             43989
Twitter Web Client             33358
Twitter for Android            11338
TweetDeck                      11090
Tweetbot for Mac               10649
IFTTT                          10180
Tweetbot for iOS                9753
Twitter for Mac                 6747
Twitter for iPad                4798
twitterfeed                     3488
RoundTeam                       1324
dlvr.it                         1063
Hootsuite                       1013
Twitter for Windows              885
Mobile Web (M5)                  725
Facebook                         596
Twitterrific                     542
Echofon                          478
OS X                             476
Buffer                           386
Fenix for Android                365
ShootingStarPorn for API1.1      354
iOS Demo Application             354
YoruFukurou                      333
Twitter for Windows Phone        309
WordPress.com                    305
Instagram                        303
...  
dtype: int64

#Let's create a list of organic or genuine sources
organic_sources = ['Twitter for iPhone', 'Twitter Web Client', 'Twitter for Android', 'TweetDeck', 'Tweetbot for Mac',
                   'Tweetbot for iOS', 'Twitter for Mac', 'Twitter for iPad', 'twitterfeed', 'Twitter for Windows',
                   'Twitterrific', 'Echofon', 'OS X', 'Twitter for Windows Phone', 'WordPress.com', 'Google',
                   'Facebook', 'Instagram']

We are now going to get rid of a large share of the tweets. That is, we will delete retweets, tweets that are not from genuine sources, and tweets that exist only to attract more followers on Twitter.


#Let's remove retweets, remove inorganic sources, and retain only text-unique tweets.

uniques = df.drop_duplicates(inplace=False, subset='text')

organics = uniques[uniques['text'].str.startswith('RT')==False ]
organics = organics[organics['text'].str.startswith('rt')==False ]
# In case RT was placed further in the text than the beginning.
organics = organics[ organics['text'].str.contains(' RT ', case=False)==False ]

#Delete follower-spam tweets (e.g. ones plugging www.followersfree.net) that piggyback on unrelated trending topics
organics = organics[ organics['text'].str.contains('LinkedIn', case=False)==False ]
organics = organics[ organics['text'].str.contains('#XboxE3', case=False)==False ]
organics = organics[ organics['text'].str.contains('#mondaymotivation', case=False)==False ]
organics = organics[ organics['text'].str.contains('#gettingtoknowReedfans', case=False)==False ]
organics = organics[ organics['text'].str.contains('XXL', case=False)==False ]
organics = organics[ organics['text'].str.contains('Lil', case=False)==False ]
organics = organics[ organics['text'].str.contains('Iniesta', case=False)==False ]
organics = organics[ organics['text'].str.contains('Freshman', case=False)==False ]

#Only keep organic sources
organics = organics[organics['source'].isin(organic_sources)]

print organics.count()

Here is the output:

_id                65018
coordinates          127
created_at         65018
followers_count    65018
friends_count      65018
hashtags           65018
screen_name        65018
source             65018
text               65018
dtype: int64

Thus, we see that only 65,018 of our starting 161,813 tweets are original tweets. As you may have guessed, the majority of tweets tagged #wwdc are not individuals tweeting their own thoughts about the conference; there is a lot of marketing and product placement, and I probably did not remove all of it.
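For reference, that is roughly 40% of the original sample; a quick check using the DataFrames defined above:

# Fraction of tweets retained after de-duplication and filtering
retained = float(len(organics)) / len(df)
print "Retained {:.1%} of the original tweets".format(retained)   # ~40.2%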

Now we are going to build a frequency count of the most popular terms in the text of the WWDC tweets.

from nltk.tokenize import TweetTokenizer
# strip_handles drops @mentions; reduce_len shortens runs of repeated characters
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
organic_words = []
for index, row in organics.iterrows():
    # Lower-case the tweet and replace '#' with a space so hashtags tokenize as plain words
    organic_words.extend(tknzr.tokenize(re.sub(r'#', ' ', row['text'].lower())))
count_words = Counter(organic_words)

#Remove stopwords and punctuation
my_punctuations = list(string.punctuation)
my_punctuations.extend(["'s", "``", "...", "n't", "''", "amp", "..",
                        u'\u2019', u'\u201c', u'\ud83c', u'\u201d', u'\u2026'])
my_stopwords = stopwords.words('english')
my_stopwords.extend(["it's"])

for word in list(count_words.keys()):
    if word in my_stopwords or word in my_punctuations:
        count_words.pop(word, None)
        #print "Popping", word
print count_words.most_common(100)

Here is the output:

[(u'wwdc', 64396), (u'2016', 44727), (u'apple', 19415), (u'ios', 8082), 
(u'new', 6422), (u'10', 6190), (u'\ud83d', 6080), (u'siri', 4790), (u'16', 4409), (u'watch', 4272), 
(u'app', 3989), (u'macos', 3421), (u'watchos', 3403), (u'keynote', 2869), (u'like', 2822), 
(u'live', 2462), (u'3', 2342), (u'apps', 2323), (u'mac', 2288), (u'music', 2111), (u'developers', 2091), 
(u'messages', 2046), (u'today', 1985), (u'time', 1857), (u"apple's", 1855), (u"i'm", 1719), 
(u'swift', 1694), (u'get', 1619), (u'iphone', 1598), (u'features', 1536), (u'imessage', 1477), 
(u'one', 1463), (u'coming', 1442), (u'os', 1417), (u'really', 1388), (u'looks', 1369), 
(u'finally', 1361), (u'great', 1289), (u'tim', 1283), (u'going', 1272), (u'feature', 1257), 
(u'tv', 1251), (u'cook', 1225), (u'watching', 1206), (u'see', 1163), (u'news', 1147), (u'pay', 1128), 
(u'sierra', 1108), (u'home', 1106), (u'tvos', 1093), (u'photos', 1075), (u'good', 1071), 
(u'updates', 1070), (u'love', 1065), (u'emoji', 1047), (u'maps', 1043), (u'use', 1041), 
(u'ipad', 1022), (u'cool', 1009), (u'people', 989), (u'thing', 986), (u'screen', 979), 
(u'moment', 932), (u'much', 928), (u'right', 922), (u'oh', 893), (u'orlando', 892), (u'big', 886), 
(u'still', 882), (u"can't", 878), (u'year', 870), (u"don't", 846), (u'x', 837), (u'silence', 834), 
(u'google', 821), (u'wait', 819), (u'\ude02', 806), (u'want', 795), (u'make', 788), (u'awesome', 777), 
(u'update', 774), (u'2', 765), (u'go', 751), (u'event', 750), (u'emojis', 741), (u'need', 732), 
(u'getting', 732), (u'well', 712), (u'touch', 711), (u'nice', 703), (u'developer', 702), 
(u'next', 695), (u'playgrounds', 693), (u'universal', 690), (u'web', 688), (u'stage', 686), 
(u'clipboard', 686), (u'better', 686), (u'excited', 684), (u"that's", 683)]

Let me emphasize that a lot of the code above was about cleaning the tweets to the point where this becomes a reasonable list of words. In case you are curious, u'\ud83d' and u'\ude02' are the two halves of the UTF-16 surrogate pair for the "face with tears of joy" emoji (u'\ud83d' is also the leading half of many other emoji, which is why it appears so often). After the generic "wwdc", "2016", and "apple" terms, the most common topics are iOS and Siri. Let's create a D3 visualization of this information.
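One simple way to feed these counts to D3 is to dump the top terms to a JSON file that the visualization can load. A minimal sketch; the file name top_terms.json and the top-25 cutoff are illustrative choices, not part of the analysis above:

# Export the most common terms to JSON for a D3 bar chart.
# 'top_terms.json' and the top-25 cutoff are arbitrary choices.
top_terms = [{'term': term, 'count': count}
             for term, count in count_words.most_common(25)]
with open('top_terms.json', 'w') as f:
    json.dump(top_terms, f)

On the D3 side, the file can then be loaded with d3.json and bound to a bar chart in the usual way.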



This list already says a lot. For instance, "ios" (together with "10", the iOS version announced at the keynote) is mentioned roughly twice as often as "macos" or "watchos", which gives some quantitative backing to the claim that Apple is now more focused on the iPhone than on its other devices, such as the Mac.
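We can read that comparison straight off the counter (a quick check against the numbers quoted above):

# Compare how often the OS names are mentioned
print count_words['ios'], count_words['macos'], count_words['watchos']   # 8082 3421 3403
print round(float(count_words['ios']) / count_words['macos'], 2)         # ~2.36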

Time series analysis of the WWDC Twitter data is next.