I have a data model question. I am trying to figure out how to model our data to get the best data locality for analytic workloads. Our application processes sequences. Each sequence has a unique key in the format [seq_id]_[seq_type]. For any given seq_id, there can be an unlimited number of seq_types. The typical read loads a subset of the sequences that share the same seq_id. Naturally, I would like all the sequences with the same seq_id to be co-located on the same node(s).
However, I can't simply create one partition per seq_id and use seq_id as my partition key, because:
1. There could be thousands or even more seq_types for each seq_id. It's not feasible to include all the seq_types in one table.
2. Each seq_id might have a different set of seq_types.
3. Each application only needs to access a subset of the seq_types for a seq_id. Based on CASSANDRA-5762, selecting part of a row still loads the whole row. I would prefer to touch only the data that's needed.
Given the above, I think I should use one partition per [seq_id]_[seq_type]. But then how can I achieve data locality on seq_id? One possible approach is to override IPartitioner so that only part of the field (say, the first 64 bytes) is used to compute the token (for placement), while the whole field is still used as the partition key (for lookup). Before heading in that direction, I would like to know whether there are better options out there. Maybe any new or upcoming features in C* 3.0? Thanks.
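P.S. To make the partitioner idea concrete, here is a rough sketch of what I have in mind. It is plain Python, not Cassandra code: the names `split_key`, `token_for`, and `node_for` are made up for illustration, the MD5-based token and the modulo placement are simplifications of how a real ring assigns token ranges, and I am assuming seq_id never contains an underscore (a fixed-length prefix would work the same way).

```python
import hashlib


def split_key(key):
    """Split '[seq_id]_[seq_type]' at the first underscore.

    Assumption: seq_id itself contains no underscore.
    """
    seq_id, _, seq_type = key.partition("_")
    return seq_id, seq_type


def token_for(key):
    """Hypothetical token: hash only the seq_id part of the key."""
    seq_id, _ = split_key(key)
    digest = hashlib.md5(seq_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a signed 64-bit token.
    return int.from_bytes(digest[:8], "big", signed=True)


def node_for(key, nodes):
    """Simplified placement: token modulo node count (not a real ring)."""
    return nodes[token_for(key) % len(nodes)]


nodes = ["node-a", "node-b", "node-c"]

# Lookup still uses the full key, so each sequence stays its own partition.
store = {"42_raw": "...", "42_filtered": "...", "43_raw": "..."}

# Every key with seq_id 42 maps to the same node, whatever its seq_type.
same_node = {node_for(k, nodes) for k in store if split_key(k)[0] == "42"}
```

Here `same_node` ends up with a single entry, which is the locality I am after: placement depends only on seq_id, while reads and writes still address one [seq_id]_[seq_type] partition at a time.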