Hi Jean-Pierre,

I'm investigating using Cassandra for a very similar use case; maybe
we can chat and compare notes sometime. But basically, I think you
want to pull the metric name into the row key and use a simple CF
instead of an SCF. So, your example:

"my_server_1": {
       "cpu_usage": {
               {ts: 1271248215, value: 87 },
               {ts: 1271248220, value: 34 },
               {ts: 1271248225, value: 23 },
               {ts: 1271248230, value: 49 }
        },
       "ping_response": {
               {ts: 1271248201, value: 0.345 },
               {ts: 1271248211, value: 0.423 },
               {ts: 1271248221, value: 0.311 },
               {ts: 1271248232, value: 0.582 }
       }
}

becomes:

"my_server_1:cpu_usage" : {
               {ts: 1271248215, value: 87 },
               {ts: 1271248220, value: 34 },
               {ts: 1271248225, value: 23 },
               {ts: 1271248230, value: 49 }
},
"my_server_1:ping_response": {
               {ts: 1271248201, value: 0.345 },
               {ts: 1271248211, value: 0.423 },
               {ts: 1271248221, value: 0.311 },
               {ts: 1271248232, value: 0.582 }
}

This keeps your rows smaller and your row count higher (which I think
will load-balance better). It also avoids large super columns, which
you don't want because columns inside a super column are not indexed,
so accessing them can be expensive.
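
For illustration, here's a rough sketch of what reads and writes
against that layout could look like with the pycassa client. The
keyspace "Monitoring", the CF "Metrics", and the LongType comparator
are my assumptions, not anything from your schema:

# Sketch only: assumes a keyspace "Monitoring" with a standard
# (non-super) CF "Metrics" whose comparator is LongType, so columns
# sort by timestamp.
import pycassa

pool = pycassa.ConnectionPool('Monitoring', ['localhost:9160'])
metrics = pycassa.ColumnFamily(pool, 'Metrics')

def row_key(device, metric):
    # "my_server_1:cpu_usage", "my_server_1:ping_response", ...
    return '%s:%s' % (device, metric)

def record(device, metric, ts, value):
    # Column name = timestamp, column value = the sample.
    metrics.insert(row_key(device, metric), {ts: str(value)})

def last_value(device, metric):
    # A reversed slice of length 1 returns the newest column.
    row = metrics.get(row_key(device, metric),
                      column_reversed=True, column_count=1)
    return row.items()[0]

def values_between(device, metric, t1, t2):
    # Columns are sorted by the LongType comparator, so "between t1
    # and t2" is just a column slice (capped at column_count, which
    # defaults to 100).
    return metrics.get(row_key(device, metric),
                       column_start=t1, column_finish=t2)

record('my_server_1', 'cpu_usage', 1271248215, 87)
print(last_value('my_server_1', 'cpu_usage'))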

Time-based sharding will eventually be necessary if you plan to keep
your data forever, because without it your rows will get so big that
they don't fit in memory and crash Cassandra during a compaction. But
realistically, Cassandra can support A LOT of columns and pretty big
rows. Suppose you sample your stats every minute and use
"device-id:metric-name" as the row key. Google calculator says there
are ~526k minutes in a year, so if you keep high-resolution data
forever you would only have about half a million columns per row
after 1 year. Assuming 128 bytes per data point (which seems way high
for a (long, double, long) 3-tuple), that's only ~64MB of data per
row. If you thin out older, less relevant data, you could last a lot
longer before you have to split rows. Furthermore, splitting old data
off into another row is easy, because you know the old data is not
being modified at the time of the split, so you don't have to worry
about the read-modify-write (RMW) problem or external locking of any
kind. So I would start without time-based sharding instead of
over-engineering for it; leaving it out makes everything else much
simpler.
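
Just to make the back-of-envelope numbers concrete, here's the same
arithmetic in a few lines of Python (the 128-bytes-per-point figure
is the same pessimistic guess as above):

# Rough per-row sizing for minute-resolution samples kept forever.
MINUTES_PER_YEAR = 365 * 24 * 60      # 525,600, i.e. ~526k
BYTES_PER_POINT = 128                 # pessimistic guess incl. overhead

row_bytes_per_year = MINUTES_PER_YEAR * BYTES_PER_POINT
print("columns per row after 1 year: %d" % MINUTES_PER_YEAR)
print("row size after 1 year: %.1f MB"
      % (row_bytes_per_year / 1024.0 / 1024.0))
# -> roughly half a million columns and ~64 MB per row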

-- Ilya

P.S. Credit for the above viewpoint goes to Ryan King, who made this
argument to me in a discussion we had recently about this exact
problem.

2010/4/14 Ted Zlatanov <t...@lifelogs.com>:
> On Wed, 14 Apr 2010 15:02:29 +0200 "Jean-Pierre Bergamin" <ja...@ractive.ch> 
> wrote:
>
> JB> The metrics are stored together with a timestamp. The queries we want to
> JB> perform are:
> JB>  * The last value of a specific metric of a device
> JB>  * The values of a specific metric of a device between two timestamps
> JB> t1 and t2
>
> Make your key "devicename-metricname-YYYYMMDD-HHMM" (with whatever time
> sharding makes sense to you; I use UTC by-hours and by-day in my
> environment).  Then your supercolumn is the collection time as a
> LongType and your columns inside the supercolumn can express the metric
> in detail (collector agent, detailed breakdown, etc.).
>
> If you want your clients to discover the available metrics, you may need
> to keep an external index.  But from your spec that doesn't seem necessary.
>
> Ted