On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens <migh...@gmail.com> wrote:
> It depends on the size of your data, but if your data is reasonably small, there
> should be no trouble including thousands of records on the same partition key. So
> a data model using PRIMARY KEY ((seq_id), seq_type) ought to work fine.
>
> If the data size per partition exceeds some threshold that represents the right
> tradeoff of increasing repair cost, gc pressure, threatening unbalanced loads, and
> other issues that come with wide partitions, then you can subpartition via some
> means in a manner consistent with your workload, with something like
> PRIMARY KEY ((seq_id, subpartition), seq_type).
>
> For example, if seq_type can be processed for a given seq_id in any order, and you
> need to be able to locate specific records for a known seq_id/seq_type pair, the
> subpartition can be computed deterministically. Or if you only ever need to read
> *all* values for a given seq_id, and the processing order is not important, just
> randomly generate a value for subpartition at write time, as long as you know all
> possible values for subpartition.
>
> If the values for the seq_types for a given seq_id must always be processed in
> order based on seq_type, then your subpartition calculation would need to reflect
> that and place adjacent seq_types in the same partition. As a contrived example,
> if seq_type were an incrementing integer, your subpartition could be
> seq_type / 100.
>
> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang <dep...@gmail.com> wrote:
>
>> I have a data model question. I am trying to figure out how to model the data to
>> achieve the best data locality for analytic purposes. Our application processes
>> sequences. Each sequence has a unique key in the format of [seq_id]_[seq_type].
>> For any given seq_id, there is an unlimited number of seq_types. The typical
>> read is to load a subset of sequences with the same seq_id. Naturally I would
>> like to have all the sequences with the same seq_id co-locate on the same
>> node(s).
>>
>> However I can't simply create one partition per seq_id and use seq_id as my
>> partition key. That's because:
>>
>> 1. there could be thousands or even more seq_types for each seq_id. It's not
>> feasible to include all the seq_types in one table.
>>
>> 2. each seq_id might have a different set of seq_types.
>>
>> 3. each application only needs to access a subset of seq_types for a seq_id.
>> Based on CASSANDRA-5762, selecting part of a row loads the whole row. I prefer
>> only touching the data that's needed.
>>
>> As per the above, I think I should use one partition per [seq_id]_[seq_type].
>> But how can I achieve data locality on seq_id? One possible approach is to
>> override IPartitioner so that I use only part of the field (say 64 bytes) to
>> compute the token (for location) while still using the whole field as the
>> partition key (for lookup). But before heading in that direction, I would like
>> to see if there are better options out there. Maybe some new or upcoming
>> feature in C* 3.0?
>>
>> Thanks.
>

Thanks, Eric. Those sequences are not fixed. All sequences with the same seq_id
tend to grow at the same rate. If it's one partition per seq_id, the size will
most likely exceed the threshold quickly. Also, new seq_types can be added and
old seq_types can be deleted, which means I would often need to ALTER TABLE to
add and drop columns. I am not sure that's good practice from an operational
point of view.

I thought about your subpartition idea. If there were only a few applications,
each using its own subset of seq_types, I could easily create one table per
application, since I can compute the subpartition deterministically as you said.
But in my case, data scientists need to easily write new applications using any
combination of seq_types for a seq_id. So I want the data model to be flexible
enough to support applications using any set of seq_types without creating new
tables, duplicating all the data, etc.

-Kai
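For what it's worth, Eric's two subpartitioning schemes can be sketched in a few
lines. This is a minimal sketch, not anything from a Cassandra driver: the bucket
count `NUM_SUBPARTITIONS` and the function names are illustrative assumptions, and
it assumes seq_type is a string in the hash-based case and an incrementing integer
in the order-preserving case.

```python
import hashlib

# All possible subpartition values must be known up front (0..NUM_SUBPARTITIONS-1)
# so that a reader can fan out over every (seq_id, subpartition) pair.
NUM_SUBPARTITIONS = 16

def subpartition_for_lookup(seq_type: str) -> int:
    """Deterministic scheme: a known seq_id/seq_type pair always maps to the
    same subpartition, so a specific record can be located directly."""
    digest = hashlib.md5(seq_type.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SUBPARTITIONS

def subpartition_for_ordered(seq_type: int, bucket_size: int = 100) -> int:
    """Order-preserving scheme (Eric's contrived example): adjacent integer
    seq_types land in the same subpartition, so in-order processing only
    crosses a partition boundary every bucket_size values."""
    return seq_type // bucket_size
```

With the hash-based scheme, reading *all* sequences for a seq_id means issuing one
query per subpartition, i.e. against partitions (seq_id, 0) through (seq_id, 15);
that fan-out is the price paid for keeping any single partition narrow.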