On Thu, May 17, 2012 at 8:55 AM, jason kowalewski <jay.kowalew...@gmail.com> wrote:
> We have been attempting to change our data model to provide more
> performance in our cluster.
>
> Currently there are a couple of ways to model the data, and I was
> wondering if some people out there could help us out.
>
> We are storing time-series data currently keyed by a user id. This
> approach is leading to some hot-spotting of nodes, likely because the key
> distribution is not representative of the usage pattern. Currently we are
> using super columns (the super column name is the timestamp), which we
> intend to dispose of as well with this data model redesign.
>
> The first idea we had is that we can shard the data into time buckets
> using composite row keys:
>
> UserId:<TimeBucket> : {
>     <timestamp>:<colname> = <col value1>,
>     <timestamp>:<colname2> = <col value2>
>     ... and so on
> }
>
> We can then use a wide row index for tracking these in the future:
>
> <TimeBucket> : {
>     <userId> = null
> }
>
> With this first approach the data would always be retrieved by the
> composite row key.
>
> Alternatively, we could just do wide rows using composite columns:
>
> UserId : {
>     <timestamp>:<colname> = <col value1>,
>     <timestamp>:<colname2> = <col value2>
>     ... and so on
> }
>
> The second approach has less granular keys, but it makes it easier to
> group historical time series rather than sharding the data into buckets.
> It also depends solely on range slices over the columns to retrieve the
> data.
>
> Is there a speed advantage to a row point get in the first approach vs.
> range scans over the columns in the second approach? In the first
> approach each bucket would have no more than 200 events. In the second
> approach we would expect the number of columns to be in the thousands to
> hundreds of thousands... Our reads currently (using super columns) are
> PAINFULLY slow - the cluster is constantly timing out on many nodes and
> disk i/o is very high.
>
> Also, instead of having each column name as a new composite column, is it
> better to serialize the multiple values into some format (json, binary,
> etc.) to reduce the number of disk seeks when paging over this
> time-series data?
>
> Thanks for any ideas out there!
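(Not part of the original mail: to make the two candidate models concrete, here is a rough
sketch of both layouts expressed in CQL 3 terms. This assumes a Cassandra version with CQL 3
support, which is newer than the Thrift/super-column setup being replaced, and every table
and column name here - events_bucketed, events_wide, event_time, col_name, col_value - is
made up for illustration.)

    -- Model A: bucketed partitions. (user_id, time_bucket) together form the row key,
    -- so each partition holds at most one bucket's worth of events (~200 per the mail).
    CREATE TABLE events_bucketed (
        user_id     text,
        time_bucket text,        -- e.g. '2012-05-17' for a day-sized bucket
        event_time  timestamp,
        col_name    text,
        col_value   text,
        PRIMARY KEY ((user_id, time_bucket), event_time, col_name)
    );

    -- Model B: one wide partition per user. event_time clusters the columns inside it,
    -- so reads become range slices over a partition that can grow to hundreds of
    -- thousands of columns.
    CREATE TABLE events_wide (
        user_id    text,
        event_time timestamp,
        col_name   text,
        col_value  text,
        PRIMARY KEY (user_id, event_time, col_name)
    );

    -- Model A read: a point get on one small, bounded bucket.
    SELECT event_time, col_name, col_value
      FROM events_bucketed
     WHERE user_id = 'u123' AND time_bucket = '2012-05-17';

    -- Model B read: a range slice within one very wide row.
    SELECT event_time, col_name, col_value
      FROM events_wide
     WHERE user_id = 'u123'
       AND event_time >= '2012-05-17' AND event_time < '2012-05-18';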
You didn't say what your queries look like, but the way I did it was:

<userid>|<stat_name>|<timebucket> : {
    <timestamp> = <value>
}

This provides very efficient reads for a given user/stat combination. If I
need to get multiple stats per user, I just use more threads on the client
side. I'm not using composite row keys (it's just AsciiType), as that can
lead to hotspots on disk. My timestamps are also just plain Unix epochs,
since they take less space than something like TimeUUID. (A rough CQL
sketch of this layout is below.)

-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"
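(Again, not from the original reply: a rough CQL 3 rendering of the layout described above,
assuming hypothetical names - user_stats, row_key, ts, value - and an example stat called
page_views. The pipe-delimited string stands in for the plain AsciiType row key, and the
bigint column mirrors the plain Unix-epoch timestamps.)

    -- One row per user/stat/bucket; the whole 'userid|stat_name|timebucket' string is a
    -- single ascii row key rather than a composite key.
    CREATE TABLE user_stats (
        row_key ascii,    -- e.g. 'u123|page_views|2012-05-17'
        ts      bigint,   -- Unix epoch seconds; smaller on disk than a TimeUUID
        value   text,
        PRIMARY KEY (row_key, ts)
    );

    -- Reading one user/stat/bucket combination is a slice of a single, narrow row.
    -- The ts bounds are epoch seconds for one (hypothetical) day-sized bucket.
    SELECT ts, value
      FROM user_stats
     WHERE row_key = 'u123|page_views|2012-05-17'
       AND ts >= 1337212800 AND ts < 1337299200;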