Thanks for the feedback! We will go ahead with implementing and deploying the proposed model and test it out; a rough sketch of what we have in mind is below.
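Using pycassa against Cassandra 1.2 (the keyspace and host names below are placeholders, not anything from this thread), the newest-first read on streamTweets would look roughly like this:

import json

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

# Placeholder keyspace/host; adjust for the real cluster.
pool = ConnectionPool('tweets_ks', server_list=['localhost:9160'])
stream_tweets = ColumnFamily(pool, 'streamTweets')

def latest_tweets(stream_id, day, count=1000):
    """Return the newest `count` tweets for one daily bucket, newest first.

    With a plain TimeUUIDType comparator we pass column_reversed=True.
    If the CF is created with ReversedType(TimeUUIDType), as Aaron
    suggests, a normal forward slice returns newest-first instead.
    """
    row_key = '%s_%s' % (stream_id, day)  # e.g. '123_2013-04-02'
    columns = stream_tweets.get(row_key,
                                column_count=count,
                                column_reversed=True)
    return [json.loads(v) for v in columns.values()]

tweets = latest_tweets('123', '2013-04-02')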
Cheers,
Coen

On Thu, Apr 11, 2013 at 12:21 PM, aaron morton <aa...@thelastpickle.com> wrote:

>> Retrieving the latest 1000 tweets (of a given day) is trivial by
>> requesting the streamTweets columnFamily.
>
> If you normally want to get the most recent items, use a reverse comparator
> on the column name.
> See http://thelastpickle.com/2011/10/03/Reverse-Comparators/
>
>> Getting the latest tweets for a given hashtag would mean you have to get
>> the TimeUUIDs from the streamHashTagTweets first, and then do a second get
>> call on the streamTweets with the former TimeUUIDs as the list of columns
>> we would like to retrieve (column index).
>
> Your choices here depend on what sort of queries are the most frequent and
> how much disk space you have.
>
> Your current model makes sense if the stream by day is the most frequent
> query and you want to conserve disk space. If disk space is not an issue,
> you can denormalise further and store the tweet JSON.
>
> If you have potentially many streamHashTagTweets rows where a single tweet
> is replicated, it may make sense to stick with the current design to reduce
> disk use.
>
>> (we want to get up to 1000 tweets).
>
> If you want to get 1000 of anything from Cassandra, please break the multiget
> up into multiple calls. Each row request becomes a task in the thread pools
> on RF nodes. If you have a small-ish cluster, one client asking for 1000
> rows will temporarily block other clients and hurt request throughput.
>
>> Referencing key values requires another columnFamily for tweets (key:
>> tweetId, columns: 1 column with data).
>
> This will be a more efficient (aka faster) read than reading from a
> wide row.
>
>> Next to that, we will request tweets by these secondary indexes quite
>> infrequently, while the tweets by timestamp will be requested heavily.
>
> If the hot path is the streamTweets calls, denormalise into that, normalise
> the tweet storage into its own CF, and reference the tweets from the
> streamHashTagTweets. Having a canonical store of the events / tweets /
> entities addressable by their business key can give you more flexibility.
>
>> Given we are estimating to store many TBs of tweets, we would prefer
>> setting up machines with spinning disks (2TB per node) to save costs.
>
> If you have spinning disks and 1G networking, the rule of thumb is 300GB to
> 500GB per node. See previous discussions about size per node.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 10/04/2013, at 2:00 AM, Coen Stevens <beatle...@gmail.com> wrote:
>
>> Hi all,
>>
>> We are working on a data model for storing tweets for multiple streams
>> (where a stream is defined by a number of keyword filters on the full
>> Twitter firehose), and retrieving the tweets by timestamp and hashtag. My
>> question is whether the following data model would be a good way of doing
>> that, where I'm creating a column name index for the hashtags.
>>
>> ColumnFamily: streamTweets
>>   key: streamID + dayTimestamp (creating daily buckets for each stream)
>>   columns => name: TimeUUID, value: tweet json (storing all the tweets
>>   for this stream in a wide row with a TimeUUID)
>>
>> ColumnFamily: streamHashTagTweets
>>   key: streamID + dayTimestamp + hashTag (e.g. 123_2013-04-02_cassandra)
>>   columns => name: TimeUUID (referencing the TimeUUID value in the
>>   streamTweets ColumnFamily), value: tweetID
>>
>> Retrieving the latest 1000 tweets (of a given day) is trivial by
>> requesting the streamTweets columnFamily. Getting the latest tweets for a
>> given hashtag would mean you have to get the TimeUUIDs from the
>> streamHashTagTweets first, and then do a second get call on the
>> streamTweets with the former TimeUUIDs as the list of columns we would
>> like to retrieve (column index).
>>
>> Is referencing column names (TimeUUIDs) a smart thing to do when we have
>> wide rows spanning millions of columns? It seems easier (one reference
>> call) to do this than it is to reference key values and run a multi-get
>> to get all the rows (we want to get up to 1000 tweets). Referencing key
>> values requires another columnFamily for tweets (key: tweetId, columns:
>> 1 column with data).
>>
>> Of course we could instead denormalize the data and store the tweet also
>> in the streamHashTagTweet columns, but we want to do the same thing for
>> other indexes as well (topics, Twitter usernames, links, etc.), so it
>> quickly adds up in required storage space. Next to that, we will request
>> tweets by these secondary indexes quite infrequently, while the tweets by
>> timestamp will be requested heavily.
>>
>> Given we are estimating to store many TBs of tweets, we would prefer
>> setting up machines with spinning disks (2TB per node) to save costs.
>>
>> We would love to hear your feedback.
>>
>> Cheers,
>> Coen
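PS: the two-step hashtag lookup would look roughly like this, with the second read broken into small chunks per Aaron's advice above (same pycassa assumptions as the sketch at the top; the chunk size of 100 is a guess we will tune):

import json

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('tweets_ks', server_list=['localhost:9160'])
stream_tweets = ColumnFamily(pool, 'streamTweets')
hashtag_tweets = ColumnFamily(pool, 'streamHashTagTweets')

def latest_hashtag_tweets(stream_id, day, hashtag, count=1000, chunk=100):
    # Step 1: read the TimeUUID column names from the hashtag index row.
    index_key = '%s_%s_%s' % (stream_id, day, hashtag)  # e.g. '123_2013-04-02_cassandra'
    index = hashtag_tweets.get(index_key,
                               column_count=count,
                               column_reversed=True)
    timeuuids = list(index.keys())

    # Step 2: fetch those columns from the wide streamTweets row in
    # small chunks rather than one 1000-column request, so a single
    # client call does not monopolise the replicas' read stages.
    row_key = '%s_%s' % (stream_id, day)
    tweets = []
    for i in range(0, len(timeuuids), chunk):
        cols = stream_tweets.get(row_key, columns=timeuuids[i:i + chunk])
        tweets.extend(json.loads(v) for v in cols.values())
    return tweets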