This is a question about general usage / best practice: what is the best transformation to use for sentiment analysis on tweets?
Input:
- Tweets (e.g., "@xyz, sorry but this movie is poorly scripted
  http://t.co/uyser876") - a large data set, i.e., about 1 billion tweets
- Sentiment dictionary (e.g., "sorry" -> positive score 0, negative
  score 0.97) - a fixed data set of 200K words
Output:
tweet   positive total   negative total
1       0.00             3.4
2       0.875            0.12
...
The implementation idea I have (and it worked) is:
- turn the sentiment dictionary into a DataFrame (with the schema "word
  posScore negScore") and register it as a table
- turn the tweets into a DataFrame ("body")
- use Spark SQL to find matches and aggregate scores, like this:
SELECT t.body, SUM(s.PosScore), SUM(s.NegScore)
FROM TweetsDF t, sentiment_dictionaryDF s
WHERE t.body IS NOT NULL
  AND LOCATE(UPPER(s.SynsetTerms), UPPER(t.body)) > 0
GROUP BY t.body
LIMIT 100
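In other words, for every tweet the query does one substring search per dictionary entry. A minimal pure-Python sketch of that per-tweet work (the sample tweet is from above; the "POORLY" entry and its scores are made up for illustration):

```python
# Sketch of what the cross join + LOCATE(...) > 0 predicate does per tweet.
# The dictionary entries here are illustrative, not the real data set.
sentiment = {
    "SORRY": (0.0, 0.97),   # word -> (posScore, negScore), from the example
    "POORLY": (0.0, 0.5),   # hypothetical entry for illustration
}

def score(body):
    """Return (positive total, negative total) for one tweet body."""
    pos = neg = 0.0
    upper = body.upper()
    # One substring scan per dictionary entry: O(|dictionary|) searches
    # per tweet, which is why 1B tweets x 200K words is so expensive.
    for word, (p, n) in sentiment.items():
        if upper.find(word) >= 0:   # same test as LOCATE(word, body) > 0
            pos += p
            neg += n
    return pos, neg

tweet = "@xyz, sorry but this movie is poorly scripted http://t.co/uyser876"
print(score(tweet))
```

This is only meant to show the shape of the work; Spark distributes it, but the per-tweet cost is the same.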
This works, but it is EXTREMELY slow.
Though the 200K-word dictionary isn't that big, I suspect the LOCATE
function is really slow...
Or is using SQL completely the wrong tool to use here?
How else should I transform the tweets and/or the sentiment dictionary to
speed up the code?
JESSE CHEN
Big Data Performance | IBM Analytics
Email: [email protected]
