David -- thanks for the thoughts.

In re: your question
> Does the random partitioner support what you need?

I guess my answer is "I'm not sure yet." My initial thought was that
we'd use an OrderPreservingPartitioner so that we could use range
scans, and so that rows for a given entity would be co-located (if I'm
understanding Cassandra's storage architecture properly).  But that
may be a naive approach.
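To make that concrete, here's roughly the key layout I'm picturing
(just a sketch -- the zero-padding and timestamp format are assumptions
on my part, nothing we've settled on):

from datetime import datetime

def row_key(entity_id, ts):
    # Zero-padded entity id, then a sortable timestamp, so keys order
    # first by entity and then by time under an order-preserving
    # partitioner.
    return "%08d:%s" % (entity_id, ts.strftime("%Y%m%dT%H%M%S"))

# A range scan over one entity's day would then be bounded by keys like:
start = row_key(42, datetime(2010, 5, 4))
end = row_key(42, datetime(2010, 5, 5))

The idea being that a range scan could pull back one entity's time
window without touching other entities' rows.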

In our core data set, we have maybe 20,000 entities about which we are
storing time-series data (and it's fairly well distributed across these
entities).  It occurs to me it's also possible to store an entity per
row, with the time-series data as (or in?) super columns (and maybe it
would make sense to break those out into column families by date range).
I'd have to think through a little more what that might mean for our
secondary indexing needs.
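For that entity-per-row variant, the shape I have in mind is something
like this (again, just a sketch -- the day/time formats and the
packed-record placeholder are made up for illustration):

from datetime import datetime

def day_bucket(ts):
    # One super column per day; sub-columns hold the individual readings.
    return ts.strftime("%Y%m%d")

def reading_column(ts):
    # Sub-column name is the time of day, so readings sort chronologically.
    return ts.strftime("%H%M%S")

# Roughly what one entity's row would look like:
row = {
    day_bucket(datetime(2010, 5, 5)): {
        reading_column(datetime(2010, 5, 5, 4, 50, 0)): "<packed ~150-byte record>",
    }
}

Whether the date bucketing lives in super columns like this, or in
separate column families per date range, is the part I still need to
think through.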

Thanks,

dwh



On Wed, May 5, 2010 at 1:16 AM, David Strauss <da...@fourkitchens.com> wrote:
> On 2010-05-05 04:50, Denis Haskin wrote:
>> I've been reading everything I can get my hands on about Cassandra and
>> it sounds like a possibly very good framework for our data needs; I'm
>> about to take the plunge and do some prototyping, but I thought I'd
>> see if I can get a reality check here on whether it makes sense.
>>
>> Our schema should be fairly simple; we may only keep our original data
>> in Cassandra, and the rollups and analyzed results in a relational db
>> (although this is still open for discussion).
>
> This is what we do on some projects. This is a particularly nice
> strategy if the raw : aggregated ratio is really high or the raw data is
> bursty or highly volatile.
>
> Consider Hadoop integration for your aggregation needs.
>
>> We have fairly small records: 120-150 bytes, in maybe 18 columns.
>> Data is additive only; we would rarely, if ever, be deleting data.
>
> Cassandra loves you.
>
>> Our core data set will accumulate at somewhere between 14 and 27
>> million rows per day; we'll be starting with about a year and a half
>> of data (7.5 - 15 billion rows) and eventually would like to keep 5
>> years online (25 to 50 billion rows).  (So that's maybe 1.3TB or so
>> per year, data only.  Not sure about the overhead yet.)
>>
>> Ideally we'd like to also have a cluster with our complete data set,
>> which is maybe 38 billion rows per year (we could live with less than
>> 5 years of that).
>>
>> I haven't really thought through what the schema's going to be; our
>> primary key is an entity's ID plus a timestamp.  But there's 2 or 3
>> other retrieval paths we'll need to support as well.
>
> Generally, you do multiple retrieval paths through denormalization in
> Cassandra.
>
>> Thoughts?  Pitfalls?  Gotchas? Are we completely whacked?
>
> Does the random partitioner support what you need?
>
> --
> David Strauss
>   | da...@fourkitchens.com
> Four Kitchens
>   | http://fourkitchens.com
>   | +1 512 454 6659 [office]
>   | +1 512 870 8453 [direct]
>
>
