On Thu, May 17, 2012 at 8:55 AM, jason kowalewski <jay.kowalew...@gmail.com> wrote:
> We have been attempting to change our data model to provide more
> performance in our cluster.
>
> Currently there are a couple of ways to model the data, and I was
> wondering if some people out there could help us out.
>
> We are storing time-series data currently keyed by a user id. This
> approach is leading to some hot-spotting of nodes, likely because the key
> distribution is not representative of the usage pattern. Currently we are
> using super columns (the super column name is the timestamp), which we
> intend to dispose of as well with this data model redesign.
>
> The first idea we had is that we can shard the data into time buckets
> using composite row keys:
>
> UserId:<TimeBucket> : {
>     <timestamp>:<colname> = <col value1>,
>     <timestamp>:<colname2> = <col value2>
>     ... and so on
> }
>
> We can then use a wide row index for tracking these in the future:
>
> <TimeBucket> : {
>     <userId> = null
> }
>
> With this first approach the data would always be retrieved by the
> composite row key.
>
> Alternatively, we could just do wide rows using composite columns:
>
> UserId : {
>     <timestamp>:<colname> = <col value1>,
>     <timestamp>:<colname2> = <col value2>
>     ... and so on
> }
>
> The second approach has less granular keys, but it makes it easier to
> group historical time series rather than sharding the data into buckets.
> It also depends solely on range slices over the columns to retrieve the
> data.
>
> Is there a speed advantage to a row point get in the first approach vs.
> range scans over the columns in the second approach? In the first
> approach each bucket would have no more than 200 events. In the second
> approach we would expect the number of columns to be in the thousands to
> hundreds of thousands... Our reads currently (using super columns) are
> PAINFULLY slow - the cluster is constantly timing out on many nodes and
> disk i/o is very high.
>
> Also, instead of having each column name as a new composite column, is it
> better to serialize the multiple values into some format (json, binary,
> etc.) to reduce the number of disk seeks when paging over this
> time-series data?
>
> Thanks for any ideas out there!
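(Not part of the original mail: to make the two candidate models concrete, here is a rough
sketch of both layouts expressed in CQL 3 terms. This assumes a Cassandra version with CQL 3
support, which is newer than the Thrift/super-column setup being replaced, and every table
and column name here - events_bucketed, events_wide, event_time, col_name, col_value - is
made up for illustration.)

    -- Model A: bucketed partitions. (user_id, time_bucket) together form the row key,
    -- so each partition holds at most one bucket's worth of events (~200 per the mail).
    CREATE TABLE events_bucketed (
        user_id     text,
        time_bucket text,        -- e.g. '2012-05-17' for a day-sized bucket
        event_time  timestamp,
        col_name    text,
        col_value   text,
        PRIMARY KEY ((user_id, time_bucket), event_time, col_name)
    );

    -- Model B: one wide partition per user. event_time clusters the columns inside it,
    -- so reads become range slices over a partition that can grow to hundreds of
    -- thousands of columns.
    CREATE TABLE events_wide (
        user_id    text,
        event_time timestamp,
        col_name   text,
        col_value  text,
        PRIMARY KEY (user_id, event_time, col_name)
    );

    -- Model A read: a point get on one small, bounded bucket.
    SELECT event_time, col_name, col_value
      FROM events_bucketed
     WHERE user_id = 'u123' AND time_bucket = '2012-05-17';

    -- Model B read: a range slice within one very wide row.
    SELECT event_time, col_name, col_value
      FROM events_wide
     WHERE user_id = 'u123'
       AND event_time >= '2012-05-17' AND event_time < '2012-05-18';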
You didn't say what your queries look like, but the way I did it was:

<userid>|<stat_name>|<timebucket> : {
    <timestamp> = <value>
}

This provides very efficient reads for a given user/stat combination. If I
need to get multiple stats per user, I just use more threads on the client
side. I'm not using composite row keys (it's just AsciiType), as that can
lead to hotspots on disk. My timestamps are also just plain Unix epochs,
since they take less space than something like TimeUUID. (A rough CQL
sketch of this layout is below.)

-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"
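(Again, not from the original reply: a rough CQL 3 rendering of the layout described above,
assuming hypothetical names - user_stats, row_key, ts, value - and an example stat called
page_views. The pipe-delimited string stands in for the plain AsciiType row key, and the
bigint column mirrors the plain Unix-epoch timestamps.)

    -- One row per user/stat/bucket; the whole 'userid|stat_name|timebucket' string is a
    -- single ascii row key rather than a composite key.
    CREATE TABLE user_stats (
        row_key ascii,    -- e.g. 'u123|page_views|2012-05-17'
        ts      bigint,   -- Unix epoch seconds; smaller on disk than a TimeUUID
        value   text,
        PRIMARY KEY (row_key, ts)
    );

    -- Reading one user/stat/bucket combination is a slice of a single, narrow row.
    -- The ts bounds are epoch seconds for one (hypothetical) day-sized bucket.
    SELECT ts, value
      FROM user_stats
     WHERE row_key = 'u123|page_views|2012-05-17'
       AND ts >= 1337212800 AND ts < 1337299200;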