It depends on the size of your data, but if your data is reasonably small, there should be no trouble storing thousands of records under the same partition key. So a data model using PRIMARY KEY ((seq_id), seq_type) ought to work fine.
If the data size per partition exceeds the threshold at which the costs of wide partitions (more expensive repairs, GC pressure, unbalanced load, and so on) outweigh the benefits, you can subpartition in a manner consistent with your workload, with something like PRIMARY KEY ((seq_id, subpartition), seq_type).

For example, if the seq_types for a given seq_id can be processed in any order, and you need to be able to locate specific records for a known seq_id/seq_type pair, you can compute subpartition deterministically from the seq_type. Or, if you only ever need to read *all* values for a given seq_id and the processing order is not important, just generate a random value for subpartition at write time, as long as you know all possible values for subpartition.

If the seq_types for a given seq_id must always be processed in order based on seq_type, then your subpartition calculation needs to reflect that and place adjacent seq_types in the same partition. As a contrived example, if seq_type were an incrementing integer, your subpartition could be seq_type / 100.

On Fri Dec 05 2014 at 7:34:38 PM Kai Wang <dep...@gmail.com> wrote:

> I have a data model question. I am trying to figure out how to model the
> data to achieve the best data locality for analytic purpose. Our
> application processes sequences. Each sequence has a unique key in the
> format of [seq_id]_[seq_type]. For any given seq_id, there are unlimited
> number of seq_types. The typical read is to load a subset of sequences with
> the same seq_id. Naturally I would like to have all the sequences with the
> same seq_id to co-locate on the same node(s).
>
> However I can't simply create one partition per seq_id and use seq_id as
> my partition key. That's because:
>
> 1. there could be thousands or even more seq_types for each seq_id. It's
> not feasible to include all the seq_types into one table.
>
> 2. each seq_id might have different sets of seq_types.
>
> 3. each application only needs to access a subset of seq_types for a
> seq_id. Based on CASSANDRA-5762, select partial row loads the whole row. I
> prefer only touching the data that's needed.
>
> As per above, I think I should use one partition per [seq_id]_[seq_type].
> But how can I achieve the data locality on seq_id? One possible approach is
> to override IPartitioner so that I just use part of the field (say 64
> bytes) to get the token (for location) while still using the whole field as
> partition key (for look up). But before heading that direction, I would
> like to see if there are better options out there. Maybe any new or
> upcoming features in C* 3.0?
>
> Thanks.
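
To make the three subpartition strategies from the reply concrete, here is a minimal client-side sketch in Python. It only shows how the value of the subpartition column might be computed before a write or read; it does not talk to Cassandra. The bucket count of 16, the choice of MD5 as the hash, and all function names are assumptions made for illustration. Only the range width of 100 comes from the "seq_type / 100" example.

```python
import hashlib
import random
from typing import List

NUM_SUBPARTITIONS = 16  # hypothetical bucket count; tune to your partition sizes
RANGE_WIDTH = 100       # taken from the "seq_type / 100" example in the reply


def hashed_subpartition(seq_type: str, buckets: int = NUM_SUBPARTITIONS) -> int:
    """Deterministic bucket for point lookups of a known seq_id/seq_type pair.

    A stable hash is used because Python's built-in hash() of a str is
    randomized per process and would not give the same bucket across runs.
    """
    digest = hashlib.md5(seq_type.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def random_subpartition(buckets: int = NUM_SUBPARTITIONS) -> int:
    """Random bucket chosen at write time; only suits read-everything workloads."""
    return random.randrange(buckets)


def all_subpartitions(buckets: int = NUM_SUBPARTITIONS) -> List[int]:
    """Every possible bucket, so a reader can query each one for a given seq_id."""
    return list(range(buckets))


def range_subpartition(seq_type: int, width: int = RANGE_WIDTH) -> int:
    """Order-preserving bucket: adjacent integer seq_types share a partition."""
    return seq_type // width
```

With the hashed variant, a reader recomputes the bucket from the seq_type and queries exactly one partition. With the random variant, a reader has to query every value returned by all_subpartitions() for the seq_id and merge the results. With the range variant, seq_types 0 through 99 land in bucket 0, 100 through 199 in bucket 1, and so on, so an in-order scan walks the buckets in ascending order.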