Re: Time-series data model

Ted Zlatanov Thu, 15 Apr 2010 07:04:45 -0700

On Thu, 15 Apr 2010 11:27:47 +0200 Jean-Pierre Bergamin <ja...@ractive.ch> 
wrote:

JB> On 14.04.2010 15:22, Ted Zlatanov wrote:
>> On Wed, 14 Apr 2010 15:02:29 +0200 "Jean-Pierre Bergamin"<ja...@ractive.ch>  
>> wrote:
>> 
JB> The metrics are stored together with a timestamp. The queries we want to
JB> perform are:
JB> * The last value of a specific metric of a device
JB> * The values of a specific metric of a device between two timestamps t1 and
JB> t2
>> 
>> Make your key "devicename-metricname-YYYYMMDD-HHMM" (with whatever time
>> sharding makes sense to you; I use UTC by-hours and by-day in my
>> environment).  Then your supercolumn is the collection time as a
>> LongType and your columns inside the supercolumn can express the metric
>> in detail (collector agent, detailed breakdown, etc.).
>> 
JB> Just for my understanding. What is "time sharding"? I couldn't find an
JB> explanation anywhere. Do you mean that the time-series data is rolled
JB> up in 5 minutes, 1 hour, 1 day etc. slices?

Yes.  The usual meaning of "shard" in the RDBMS world is to segment your
database by some criteria, e.g. US vs. Europe in Amazon AWS because
their data centers are laid out so.  I was taking a linguistic shortcut
to mean "break down your rows by some convenient criteria."  You can
actually set up your Partitioner in Cassandra to literally shard your
keyspace rows based on the key, but I just meant "slice" in my note.
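
As an illustration of slicing row keys by time, here is a minimal Python
sketch.  The function name row_key, the "by" parameter, and the exact
hour/day formats are assumptions made for this example; only the
"devicename-metricname-YYYYMMDD-HHMM" layout comes from the thread.

```python
from datetime import datetime, timezone

def row_key(device, metric, ts, by="hour"):
    """Build a time-sliced row key from an epoch timestamp (UTC).

    by="hour" gives devicename-metricname-YYYYMMDD-HH00,
    by="day"  gives devicename-metricname-YYYYMMDD.
    Both formats are illustrative choices, not a fixed convention.
    """
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    if by == "hour":
        stamp = t.strftime("%Y%m%d-%H00")
    elif by == "day":
        stamp = t.strftime("%Y%m%d")
    else:
        raise ValueError("unsupported slice size: %r" % (by,))
    return "%s-%s-%s" % (device, metric, stamp)
```

Every measurement collected within the same hour (or day) then lands in
the same row, with the collection time as the supercolumn name.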

JB> So this would be defined as:
JB> <ColumnFamily Name="measurements" ColumnType="Super"
JB> CompareWith="LongType"  CompareSubcolumnsWith="UTF8Type" />

JB> So when I want to read all values of one metric between two timestamps
JB> t0 and t1, I'd have to read the supercolumns that match a key range
JB> (device1:metric1:t0 - device1:metric1:t1) and then all the
JB> supercolumns for this key?

Yes.  This is a single multiget if you can construct the key range
explicitly.  Cassandra loads a lot of this in memory already and filters
it after the fact, which is why it pays to slice your keys and to stitch
them together on the client side if you have to go across a time
boundary.  You'll also get better key load balancing with deeper slicing
if you use the randomizing partitioner.
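
A rough Python sketch of the client-side part, assuming hourly slices:
list the row keys that cover t0..t1, fetch them in one multiget, then
stitch the results together.  The names hourly_keys and stitch, and the
result shape {row key: {timestamp: columns}}, are assumptions for this
example; the multiget call itself is left out because it depends on the
client library.

```python
from datetime import datetime, timedelta, timezone

def hourly_keys(device, metric, t0, t1):
    """Return the hourly row keys covering epoch seconds t0..t1 inclusive."""
    cursor = datetime.fromtimestamp(t0, tz=timezone.utc).replace(
        minute=0, second=0, microsecond=0)
    end = datetime.fromtimestamp(t1, tz=timezone.utc)
    keys = []
    while cursor <= end:
        keys.append("%s-%s-%s" % (device, metric,
                                  cursor.strftime("%Y%m%d-%H00")))
        cursor += timedelta(hours=1)
    return keys

def stitch(rows, t0, t1):
    """Merge per-row results into one time-ordered list trimmed to [t0, t1].

    rows is assumed to map row key -> {timestamp: columns}.
    """
    merged = []
    for supercolumns in rows.values():
        for ts, columns in supercolumns.items():
            if t0 <= ts <= t1:
                merged.append((ts, columns))
    merged.sort(key=lambda pair: pair[0])
    return merged
```

The trimming in stitch matters because the first and last rows usually
hold values from outside the requested range.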

In the result set, you'll get each matching supercolumn with all the
columns inside it.  You may have to page through supercolumns.
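
Paging can be sketched in Python as below.  Here get_slice(start, count)
is a hypothetical callable standing in for whatever slice call your
client offers; it is assumed to return up to count (name, columns) pairs
in order, beginning at start inclusive (or at the first supercolumn when
start is None).

```python
def page_supercolumns(get_slice, page_size=100):
    """Yield every (name, columns) pair of a row, one page at a time.

    Each page after the first starts at the last name already seen, so
    its first entry is a repeat and is dropped.
    """
    if page_size < 2:
        raise ValueError("page_size must be at least 2")
    start = None
    while True:
        page = get_slice(start, page_size)
        fresh = page if start is None else page[1:]
        for name, columns in fresh:
            yield name, columns
        if len(page) < page_size:
            return  # a short page means the row is exhausted
        start = page[-1][0]
```

A page size of at least 2 is required because one slot in every later
page is taken up by the repeated entry.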

Ted
