WWDC Analysis 3

We are now going to do a basic sentiment analysis of the WWDC Twitter data. That is, an analysis of how positive or negative tweets are during WWDC. I'm sure there are more sophisticated ways to do this, but this is just a first pass.

We start by reading in our organic/genuine tweets from our first analysis.

#Load organic tweets from pkl file
import pandas as pd

organics = pd.read_pickle("organics.pkl")

We are now going to attach a score to each tweet describing how positive or negative the tweet is. Here is how we do it. We take a dictionary, AFINN-111, which maps words to integers: positive words like "love" have a positive integer score and negative words like "hate" have a negative integer score. Then for each tweet we split the text into words and sum the scores of the words that appear in the dictionary. The total is called the sentiment score of the tweet.
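As a toy illustration of the scheme (the scores below are made up for the example, not the real AFINN-111 values):

```python
# Hypothetical mini-dictionary; the real AFINN-111 scores may differ.
toy_dict = {"love": 3, "hate": -3, "boring": -1}

tweet = "i love wwdc but hate the lines"
score = sum(toy_dict.get(word, 0) for word in tweet.split())
print(score)  # "love" (+3) and "hate" (-3) cancel out: 0
```

Words not in the dictionary ("wwdc", "the", ...) simply contribute 0, which is why so many tweets end up with a neutral score.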

#Load in sentiment file AFINN-111 as a dictionary mapping term -> integer score
sentiment_dict = {}
with open('AFINN-111.txt') as sent_file:
    for line in sent_file:
        term, score = line.split("\t")
        sentiment_dict[term] = int(score)
    
import re
import string

from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

def sentiment_count(text, sentiment_dict):
    # Initialize
    sent_score = 0.
    word_count = 0.
    sent_buck = {'positive': 0., 'negative': 0.}

    tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
    tokens = tknzr.tokenize(re.sub(r'#', ' ', text.lower()))

    #Remove stopwords and punctuation
    my_punctuations = list(string.punctuation)
    my_punctuations.extend(["'s", "``", "...", "n't", "''", "amp", "..",
                            u'\u2019', u'\u201c', u'\ud83c', u'\u201d', u'\u2026'])
    my_stopwords = stopwords.words('english')
    my_stopwords.extend(["it's"])

    for word in tokens:
        if word in my_stopwords:
            continue
        if word in my_punctuations:
            continue
        if word in sentiment_dict:
            if sentiment_dict[word] > 0:
                sent_buck['positive'] += float(sentiment_dict[word])
            elif sentiment_dict[word] < 0:
                sent_buck['negative'] += float(sentiment_dict[word])
        word_count += 1.

    if word_count == 0:
        sent_score = 0
    else:
        sent_score = sent_buck['positive'] + sent_buck['negative']

    return sent_score

organics['sentiment'] = organics['text'].apply(sentiment_count, args=(sentiment_dict,))

organics['sentiment'].describe()

Here is the output:

count    65018.000000
mean         0.710342
std          2.056135
min        -20.000000
25%          0.000000
50%          0.000000
75%          2.000000
max         19.000000
Name: sentiment, dtype: float64

This means that the average sentiment is positive, which is good for Apple. The most negative tweet had a score of -20 and the most positive tweet had a score of 19. Let's take a look at some example texts and sentiment scores.

#Well, we can argue about the quality of these sentiment scores,
#but it's not terrible for a first try.
print(organics[['text', 'sentiment']][:20])

Here is the output:


                                                 text  sentiment
0               woke up right on time for the #wwdc16        0.0
1   This is gonna be big and exciting..... Siri is...        4.0
2   Live From Apple's WWDC 2016: If you can't live...        0.0
3   And I am ready to attend #WWDC2016 virtually :...        0.0
4   Where to watch the live stream for today's App...        0.0
7   Live From Apple's WWDC 2016: If you can't live...        0.0
9   Where to watch the live stream for today's App...        0.0
11  in other amazing news,\n\n#WWDC is today!!! ...          4.0
14                        Is gonna be awesome #WWDC16        4.0
15  Live From Apple's WWDC 2016: If you can't live...        0.0
17  5 of the juiciest rumors about #Apple's #WWDC2...        0.0
18  When we are old and gray, weighed down by fath...       -1.0
21  My big conspiracy theory for #WWDC2016 @SAP @I...        0.0
23  Live From Apple's WWDC 2016: If you can't live...        0.0
27  After looking at the pictures I don't think I ...        0.0
33  Live From Apple's WWDC 2016: If you can't live...        0.0
39  Live From Apple's WWDC 2016: If you can't live...        0.0
43   praying for a new MacBook 13 inch #WWDC2016             1.0
44  Live From Apple's WWDC 2016: If you can't live...        0.0
46  Am following the right people for all the #wwd...        0.0

Let's take a look at the most negative tweet and the most positive tweet during WWDC.


print("Most Negative Tweet: \n")
min_index = organics['sentiment'].idxmin()
print(organics.loc[min_index, 'text'])

print("Most Positive Tweet: \n")
max_index = organics['sentiment'].idxmax()
print(organics.loc[max_index, 'text'])

Here is the output:

Most Negative Tweet:
NO NEW MAC HARDWARE?! WHAT THE FUCK. WHAT THE FUCK. FUCK. FUCK. FUCK. FUUUUCK. #WWDC2016

Most Positive Tweet:
@rwenderlich the most amazing, super cool, super great, most interesting WWDC ever. I think users are going to love it.

These look like reasonable picks for the most negative and most positive tweets during WWDC. The negative tweeter does indeed seem upset.

We are now going to create a time series visualization based on our sentiment scores of tweets.

#create formatted data for time series
sentiments_data = organics[['created_at', 'sentiment', 'text']].copy()
sentiments_data['created_at'] = [tweetTime['$date'] for tweetTime in sentiments_data['created_at']]
sentiments_data['created_at'] = pd.to_datetime(sentiments_data['created_at'])
sentiments_data = sentiments_data.set_index('created_at', drop=False)

# Time series of tweet sentiment
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

pal = sns.dark_palette("green", 3, reverse=True)

# Different resampling rates
x1 = sentiments_data['sentiment'].resample('1min').mean()
x2 = sentiments_data['sentiment'].resample('2min').mean()
x5 = sentiments_data['sentiment'].resample('5min').mean()

x1.plot(color=pal[0], lw=3, alpha=.5)
x2.plot(color=pal[1], lw=3, alpha=.75)
x5.plot(color=pal[2], lw=1.5)


fig = plt.gcf()
ax = plt.gca()
fig.set_size_inches(12, 8)

# Labels
plt.xlabel('Date',fontsize=20)
plt.ylabel('Text Sentiment Score', fontsize=20)
plt.title('Average Tweet Sentiment', fontsize=20)
# Legend
leg = plt.legend(['1 min', '2 min', '5 min'], fontsize=12, title='Resampling Rate')
plt.setp(leg.get_title(),fontsize='15')
# Axes
plt.setp(ax.get_xticklabels(), fontsize=14, family='sans-serif')
plt.setp(ax.get_yticklabels(), fontsize=18, family='sans-serif')
plt.tight_layout()
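If the resample calls above look opaque: resampling groups the tweets into fixed-width time bins and averages the scores within each bin. Here is a self-contained sketch with fabricated timestamps (not the real data):

```python
import pandas as pd

# Four fake tweet times: three fall in the 17:00 minute, one in 17:01.
times = pd.to_datetime([
    "2016-06-13 17:00:10",
    "2016-06-13 17:00:40",
    "2016-06-13 17:00:55",
    "2016-06-13 17:01:30",
])
scores = pd.Series([2.0, 0.0, 4.0, -1.0], index=times)

# Each 1-minute bin gets the mean of the scores falling inside it.
per_minute = scores.resample('1min').mean()
print(per_minute)  # 17:00 -> (2 + 0 + 4) / 3 = 2.0, 17:01 -> -1.0
```

Coarser bins (2 min, 5 min) smooth out more of the minute-to-minute noise, which is why the three curves in the plot differ.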


Here is what is going on. At 17:00, WWDC starts and average Twitter sentiment increases. Looking at the tweets, it seems like a lot of people are just excited WWDC is starting. There are a lot of tweets regarding the moment of silence for the Orlando shootings. The large drop around 17:20 is during the watchOS demo. There are a lot of tweets saying there were no serious advancements to watchOS (i.e. Minnie Mouse does not count as an advancement). Many negative tweets during this time used the word "boring." The large jump around 18:40 is when Universal Clipboard is announced. Also, 18:45 is when Apple Pay for the web and Siri for Mac are announced. Apparently, people loved this. The drop after 19:00 corresponds to the mixture of tweets expressing amazement or, alternatively, disappointment over the event as a whole.

There is a lot more to do with the WWDC Twitter data. I hope to post more soon.