I have a data model question. I am trying to figure out how to model the
data to achieve the best data locality for analytic purposes. Our
application processes sequences. Each sequence has a unique key in the
format [seq_id]_[seq_type]. For any given seq_id, there can be an unlimited
number of seq_types. The typical read loads a subset of the sequences with
the same seq_id. Naturally, I would like all the sequences with the same
seq_id to be co-located on the same node(s).


However, I can't simply create one partition per seq_id and use seq_id as my
partition key. That's because:


1. There could be thousands of seq_types, or even more, for each seq_id.
It's not feasible to include all the seq_types in one table.

2. Each seq_id might have a different set of seq_types.

3. Each application only needs to access a subset of the seq_types for a
seq_id. Based on CASSANDRA-5762, selecting part of a row loads the whole
row. I would prefer to touch only the data that's needed.


Given the above, I think I should use one partition per [seq_id]_[seq_type].
But how can I achieve data locality on seq_id? One possible approach is
to override IPartitioner so that only part of the field (say 64 bytes) is
used to compute the token (for location) while the whole field is still
used as the partition key (for lookup). But before heading in that
direction, I would like to see if there are better options out there. Maybe
any new or upcoming features in C* 3.0?
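
To make the partitioner idea concrete, here is a minimal sketch in Python.
It is not Cassandra code and does not use the IPartitioner interface; it
only illustrates the mapping I have in mind. The names split_key and
token_for are hypothetical, MD5 is just a stand-in hash, and I assume that
seq_id itself contains no underscore, so the key is split at the first "_"
rather than at a fixed 64-byte prefix.

```python
import hashlib
from typing import Tuple

SEPARATOR = "_"  # assumption: seq_id never contains this character


def split_key(partition_key: str) -> Tuple[str, str]:
    """Split '[seq_id]_[seq_type]' at the first separator."""
    seq_id, _, seq_type = partition_key.partition(SEPARATOR)
    return seq_id, seq_type


def token_for(partition_key: str) -> int:
    """Derive the placement token from the seq_id part only.

    The full key would still identify the partition for lookups; only
    the token (and therefore the replica set) ignores seq_type.
    """
    seq_id, _ = split_key(partition_key)
    digest = hashlib.md5(seq_id.encode("utf-8")).digest()
    # Use the first 8 bytes as a signed 64-bit token (illustrative only).
    return int.from_bytes(digest[:8], "big", signed=True)
```

With this mapping, token_for("42_typeA") and token_for("42_typeB") are
equal, so both partitions would land on the same replicas, while each
remains a separate partition that can be read on its own.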


Thanks.
