> Anyway, I want to store some (a lot of) Time Series data in Cassandra
> and would like to check if my assumptions are correct so far. So if
> someone with operational experience could confirm these I'd really
> appreciate it.
>
> Basically the structure I'm going for right now looks like this:
>
> One CF with LongType column keys, each representing a day (e.g.
> 20100612, 20100613, ...). Each value is a simple Time Series which is
> just a list of 24 Integers (1 Counter for every Hour) packed into
> 96 bytes (24 x 4 bytes).
>
> Then I have a lot of rows, each of which accumulates one column per
> day. Put in Web Analytics terms, I might count the number of views a
> page gets:
>
> row:"/page/1" => cols:[20100612 => [12,34,...], 20100613 => [34,93,...], ...]
> row:"/page/2" => cols:[20100612 => [1,...], ...]

And the per-hour counts are stored as JSON?
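
If it is the packed 24 x 4 byte layout you describe rather than JSON, the
encoding boils down to something like this sketch (Python; unsigned 32-bit
big-endian ints are my assumption - Cassandra just sees an opaque byte
blob either way):

  import struct

  # 24 hourly counters -> 96 bytes (24 x 4-byte unsigned ints)
  def pack_day(counts):
      assert len(counts) == 24
      return struct.pack(">24I", *counts)

  # 96 bytes -> 24 hourly counters
  def unpack_day(blob):
      return list(struct.unpack(">24I", blob))

That keeps every value at exactly 96 bytes, whereas JSON would be larger
and need parsing on every read.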

> Over a couple of years I would collect "millions" of rows, each with
> "hundreds" of columns.
>
> So, Assumption #1:
>
> Only the row key decides where the data lives (via consistent
> hashing)? So all columns of a row live on the same node(s), which in
> turn makes querying for slices of columns fast. I really need fast
> queries (it is fast in my tests, but I'm only working on a very small
> subset).
>
> Assumption #2:
>
> Basically the only query for this CF will always look like "get <some
> date range> of data for <row>". I can actually just get a slice of
> columns using 'start' and 'count', and this would perform just as fast
> as (or faster than) building my list of keys on the client and doing a
> multiget?
>
> Beware SQL! Translated to SQL (since this is what my brain does all the time):
> SELECT data FROM time_series WHERE key = '/page/1' ORDER BY day DESC LIMIT 90;
> vs
> SELECT data FROM time_series WHERE key = '/page/1' AND day IN
> ('2010-06-13', '2010-06-12', ...);
> vs
> memcache.get(['20100613:/page/1', '20100612:/page/1', ...])

I thought your row key is "/page/1" and the date is in the column? So
you want this?

 cassandra.get("/page/1", Slice("20100612"..."20100613"))
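
In pycassa terms (my assumption - any Thrift client will do, and the
connection setup depends on the client version) the slice vs. "build the
key list on the client" comparison would look roughly like:

  import pycassa

  pool = pycassa.ConnectionPool('Stats', ['localhost:9160'])  # placeholder keyspace/host
  ts = pycassa.ColumnFamily(pool, 'TimeSeries')

  # contiguous slice, newest first - the ORDER BY day DESC LIMIT 90 analogue
  last_90 = ts.get('/page/1', column_reversed=True, column_count=90)

  # explicit day range (start > finish because the slice is reversed)
  june = ts.get('/page/1', column_start=20100613, column_finish=20100612,
                column_reversed=True)

  # the IN (...) analogue: name the day columns explicitly
  picked = ts.get('/page/1', columns=[20100613, 20100612])

All three hit a single row, so the slice versions should be at least as
fast and save you building the column list yourself.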

> Assumption #3:
>
> Since the data grows at a fixed rate per row and only the number of
> rows varies, it should be simple enough to predict storage
> requirements. Rows are "equally" distributed on the cluster (using
> RandomPartitioner), and should a node reach its capacity limit, the
> cluster will migrate rows to new nodes, making it easy to scale out.
> That's the point, right? :P

I doubt your data will grow at a fixed rate per row (unless you always
have the same hit pattern for your pages). But you should be able to
calculate the maximum storage requirement. That said - I am
wondering... where are you aggregating the counts per hour?
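
For the upper bound, a back-of-the-envelope sketch (all numbers below
are placeholders to plug your own estimates into; the per-column
overhead is a rough guess on top of the 96-byte value):

  # rough capacity estimate - adjust the assumptions
  rows = 5 * 10**6          # "millions" of rows (pages)
  days = 3 * 365            # a couple of years of daily columns
  value_bytes = 96          # 24 counters x 4 bytes
  overhead_bytes = 23       # guess: column name + timestamp + bookkeeping

  per_row = days * (value_bytes + overhead_bytes)
  total = rows * per_row
  print("per row: %.0f KB" % (per_row / 1024.0))
  print("total:   %.0f GB before replication" % (total / 1024.0 ** 3))

Multiply by your replication factor and leave headroom for compaction.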

> Assumption #4:
>
> I might update the current day's data multiple times until the day
> passes and the data becomes immutable. It is OK for clients to see old
> data, but the data must be "correct" at some point (eventually
> consistent, ha!). This seems to be solved; it's just something the SQL
> Devil on my shoulder keeps bugging me about.

So you want to increment those counters per hit? I don't think there
is an atomic increment semantic in Cassandra yet. (Someone else to
confirm?)
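
If that is the case, the aggregation has to happen on the client as a
read-modify-write of the packed day value - roughly like this sketch
(assuming a pycassa ColumnFamily and the 24 x 4-byte packing from above;
NOT safe with concurrent writers, so you'd want a single aggregating
process per row, e.g. hits funnelled through one queue):

  import struct

  def bump_hour(cf, page, day_key, hour, delta=1):
      # read the current packed value, or start a fresh day if the
      # column is missing
      try:
          blob = cf.get(page, columns=[day_key])[day_key]
          counts = list(struct.unpack(">24I", blob))
      except Exception:
          counts = [0] * 24
      counts[hour] += delta
      # write the whole 96-byte value back (last write wins)
      cf.insert(page, {day_key: struct.pack(">24I", *counts)})

  # e.g. bump_hour(ts, '/page/1', 20100613, hour=14)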

> I think I "got" it and will get my hands dirty soon, just wanted to
> squash my last doubts. I've done this on Riak too, but I wasn't too
> happy with it.

Just wondering what felt wrong about Riak.

> Cassandra feels "right" although it took some Jedi Mind
> Tricks to grasp SuperColumns.

<rant>
TBH, while we are using super columns, they somehow feel wrong to me.
I would be happier if we could move what we do with super columns into
the row key space. But in our case that does not seem to be so easy.
</rant>
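
What I mean by "the row key space": fold the would-be super column name
into the row key and keep plain columns, e.g. (hypothetical helper):

  # super column layout:  row "/page/1", sc "201006", col 20100613
  # flattened layout:     row "/page/1:201006",       col 20100613
  def flat_key(page, bucket):
      return "%s:%s" % (page, bucket)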

cheers
--
Torsten
