Hi all,

We are working on a data model for storing tweets for multiple streams
(where a stream is defined by a number of keyword filters on the full
twitter firehose), and retrieving the tweets by timestamp and hashtag. My
question is whether the following data model would a good way for doing
that, where I'm creating a column name index for the hashtags.

ColumnFamily: streamTweets
     key: streamID + dayTimestamp (creating daily buckets for each stream)
     columns => name: TimeUUID, value: tweet json (storing all the tweets
for this stream in a wide row with a TimeUUID)

ColumnFamily: streamHashTagTweets
     key: streamID + dayTimestamp + hashTag (e.g. 123_2013-04-02_cassandra)
     columns => name: TimeUUID (referencing the TimeUUID value in the
streamTweets ColumnFamily), value: tweetID

Retrieving the latest 1000 tweets (of a given day) is trivial by requesting
the streamTweets columnFamily. Getting the latest tweets for a given
hashtag would mean you have to get the TimeUUIDs from the
streamHashTagTweets first, and then do a second get call on the
streamTweets with the former TimeUUIDs as the list of columns we like to
retrieve (column index).

Is referencing column names (TimeUUIDs) a smart thing to do when we have
wide rows spanning millions of columns? It seems easier (one reference
call) to do this, then it is to reference key values and running a
multi-get to get all the rows (we want to get up to 1000 tweets).
Referencing key values requires another columnFamily for tweets (key:
tweetId, columns: 1 column with data).

Of course we could instead denormalize the data and store the tweet also in
the streamHashTagTweet columns, but we want to do the same thing for other
indexes as well (topics, twitter usernames, links, etc), so it quickly adds
up in required storage space. Next to that we will request tweets by these
secondary indexes quite infrequently, while the tweets by timestamp will be
requested heavily.

Given we are estimating to store many TBs of tweets, we would prefer
setting up machines with spinning disks (2TB per node) to save costs.

We would love to hear your feedback.

Cheers,
Coen

Reply via email to