This is a question on general usage/best practice/best transformation
method to use for
a sentiment analysis on tweets...

Input:
        Tweets (e.g, "@xyz, sorry but this movie is poorly scripted
http://t.co/uyser876";) - large data set, ie. 1 billion tweets
        Sentiment dictionary (e.g, "sorry" -> positive score 0, negative
score 0.97) - fixed data set, 200K words

Output:
        tweet   positive total          negative total
        1       0.00                    3.4
        2       0.875                   0.12
        ...

The implementation idea I have (and it worked) is

- turn the sentiment dictionary into data frame (with a schema of "word
posScore NegScore"), register as a table
- turn tweets into a dataframe ("body")
- Using Spark SQL to find matches and aggregate scores, like this:

SELECT t.body, sum(s.PosScore), sum(s.NegScore)
          FROM TweetsDF t, sentiment_dictionaryDF s
          WHERE not (t.body is null)
                and locate(upper(s.SynsetTerms), upper(t.body)) > 0
          GROUP BY t.body LIMIT 100

This works, BUT EXTREMELY SLOW.

Though the 200K-word dictionary isn't that big, I supposed the LOCATE
function is really slow...

Or using SQL is completely the wrong tool to use here?

How else should I transform tweet and/or sentiment dictionary to speed up
the code?

                                                                                
                                                              
                                                                                
                                                              
                                                                                
                                                              
                                                                                
                                                              
                                                                                
                                                              
                                                                                
                                                              
                                   JESSE CHEN                                   
                                                              
                                   Big Data Performance | IBM Analytics         
                                                              
                                   Email:   [email protected]                   
                                                              
                                                                                
                                                              
                                                                                
                                                              

Reply via email to