On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens <migh...@gmail.com> wrote:

> It depends on the size of your data, but if your data is reasonably small,
> there should be no trouble including thousands of records on the same
> partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type)
> ought to work fine.
>
> If the data size per partition exceeds some threshold that represents the
> right tradeoff of increasing repair cost, gc pressure, threatening
> unbalanced loads, and other issues that come with wide partitions, then you
> can subpartition via some means in a manner consistent with your work load,
> with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>
> For example, if seq_type can be processed for a given seq_id in any order,
> and you need to be able to locate specific records for a known
> seq_id/seq_type pair, you can compute the subpartition
> deterministically.  Or if you only ever need to read *all* values for a
> given seq_id, and the processing order is not important, just randomly
> generate a value for subpartition at write time, as long as you know all
> possible values for subpartition.
>
> If the values for the seq_types for a given seq_id must always be
> processed in order based on seq_type, then your subpartition calculation
> would need to reflect that and place adjacent seq_types in the same
> partition.  As a contrived example, say seq_type was an incrementing
> integer, your subpartition could be seq_type / 100.
>
> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang <dep...@gmail.com> wrote:
>
>> I have a data model question. I am trying to figure out how to model the
>> data to achieve the best data locality for analytic purpose. Our
>> application processes sequences. Each sequence has a unique key in the
>> format of [seq_id]_[seq_type]. For any given seq_id, there is an unlimited
>> number of seq_types. The typical read is to load a subset of sequences with
>> the same seq_id. Naturally I would like to have all the sequences with the
>> same seq_id to co-locate on the same node(s).
>>
>>
>> However I can't simply create one partition per seq_id and use seq_id as
>> my partition key. That's because:
>>
>>
>> 1. there could be thousands or even more seq_types for each seq_id. It's
>> not feasible to include all the seq_types in one table.
>>
>> 2. each seq_id might have different sets of seq_types.
>>
>> 3. each application only needs to access a subset of seq_types for a
>> seq_id. Based on CASSANDRA-5762, selecting part of a row still loads the
>> whole row. I prefer only touching the data that's needed.
>>
>>
>> As per above, I think I should use one partition per [seq_id]_[seq_type].
>> But how can I achieve data locality on seq_id? One possible approach is
>> to override IPartitioner so that I just use part of the field (say 64
>> bytes) to get the token (for location) while still using the whole field as
>> partition key (for lookup). But before heading that direction, I would
>> like to see if there are better options out there. Maybe any new or
>> upcoming features in C* 3.0?
>>
>>
>> Thanks.
>>
>
Thanks, Eric.

Those sequences are not fixed. All sequences with the same seq_id tend to
grow at the same rate. If it's one partition per seq_id, the size will most
likely exceed the threshold quickly. Also new seq_types can be added and
old seq_types can be deleted. This means I often need to ALTER TABLE to add
and drop columns. I am not sure if this is a good practice from an
operational point of view.

I thought about your subpartition idea. If there are only a few
applications and each one of them uses a subset of seq_types, I can easily
create one table per application since I can compute the subpartition
deterministically as you said. But in my case data scientists need to
easily write new applications using any combination of seq_types of a
seq_id. So I want the data model to be flexible enough to support
applications using any different set of seq_types without creating new
tables, duplicating all the data, etc.
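
For reference, here is roughly how I understand the subpartition computations
you described, as a client-side Python sketch (the bucket count and function
names are mine, just for illustration):

```python
import hashlib
import random

# Illustrative bucket count; tune so each partition stays under the size threshold.
NUM_SUBPARTITIONS = 16

def deterministic_subpartition(seq_type: str) -> int:
    """Stable bucket derived from seq_type, so a known seq_id/seq_type
    pair can be located without scanning every subpartition."""
    digest = hashlib.md5(seq_type.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SUBPARTITIONS

def random_subpartition() -> int:
    """Random bucket at write time; fine when reads always fetch *all*
    subpartitions of a seq_id and order does not matter."""
    return random.randrange(NUM_SUBPARTITIONS)

def ordered_subpartition(seq_type: int, bucket_size: int = 100) -> int:
    """The contrived example: adjacent integer seq_types share a
    partition, so in-order processing crosses partitions only at
    bucket boundaries."""
    return seq_type // bucket_size
```

The table would then be keyed as PRIMARY KEY ((seq_id, subpartition),
seq_type), and a read-everything query would loop over all possible
subpartition values for the seq_id.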

-Kai
