Re: How to model data to achieve specific data locality

Eric Stevens Sat, 06 Dec 2014 08:19:35 -0800

It depends on the size of your data, but if your data is reasonably small,
there should be no trouble including thousands of records on the same
partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type)
ought to work fine.

If the data size per partition exceeds whatever threshold you judge to be
the right tradeoff against rising repair cost, GC pressure, the risk of
unbalanced load, and the other issues that come with wide partitions, then
you can subpartition in a manner consistent with your workload, with
something like PRIMARY KEY ((seq_id, subpartition), seq_type).

For example, if seq_type can be processed for a given seq_id in any order,
and you need to be able to locate specific records for a known
seq_id/seq_type pair, subpartition can be computed
deterministically.  Or if you only ever need to read *all* values for a
given seq_id, and the processing order is not important, just randomly
generate a value for subpartition at write time, as long as you can know
all possible values for subpartition.
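
A minimal sketch of both options in Python, in case it helps.  The bucket
count of 16, the CRC32 hash, and all function names are my own assumptions
for illustration, not anything Cassandra requires; any stable hash would do.

```python
import random
import zlib

# Hypothetical fixed bucket count; it must be known to readers and writers alike.
NUM_SUBPARTITIONS = 16


def deterministic_subpartition(seq_type: str, buckets: int = NUM_SUBPARTITIONS) -> int:
    """Derive the subpartition from seq_type so a known seq_id/seq_type
    pair can always be located again."""
    # crc32 is stable across processes, unlike Python's built-in hash() on str.
    return zlib.crc32(seq_type.encode("utf-8")) % buckets


def random_subpartition(buckets: int = NUM_SUBPARTITIONS) -> int:
    """Pick a subpartition at write time; only suitable when reads always
    fetch every subpartition for a seq_id."""
    return random.randrange(buckets)


def all_subpartitions(buckets: int = NUM_SUBPARTITIONS) -> range:
    """Every possible subpartition value, for reading all data of a seq_id."""
    return range(buckets)
```

With the random variant, a full read of one seq_id means issuing one query
per value in all_subpartitions() and merging the results client side.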

If the values for the seq_types for a given seq_id must always be processed
in order based on seq_type, then your subpartition calculation would need
to reflect that and place adjacent seq_types in the same partition.  As a
contrived example, if seq_type were an incrementing integer, your
subpartition could be seq_type / 100 (integer division).
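
As a rough sketch of that contrived example in Python (the function names
are hypothetical, and I am assuming seq_type is a non-negative integer):

```python
# Width of each bucket, taken from the "seq_type / 100" example.
SEQ_TYPES_PER_SUBPARTITION = 100


def range_subpartition(seq_type: int, width: int = SEQ_TYPES_PER_SUBPARTITION) -> int:
    """Adjacent seq_types share a subpartition: 0-99 -> 0, 100-199 -> 1, ..."""
    # Assumes seq_type >= 0; floor division keeps neighbours together.
    return seq_type // width


def subpartitions_for_range(first: int, last: int,
                            width: int = SEQ_TYPES_PER_SUBPARTITION) -> range:
    """Subpartitions to query, in order, to cover seq_types first..last inclusive."""
    return range(first // width, last // width + 1)
```

Reading seq_types 95 through 205 in order would then touch subpartitions 0,
1 and 2, one after the other.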

On Fri Dec 05 2014 at 7:34:38 PM Kai Wang <dep...@gmail.com> wrote:

> I have a data model question. I am trying to figure out how to model the
> data to achieve the best data locality for analytic purpose. Our
> application processes sequences. Each sequence has a unique key in the
> format of [seq_id]_[seq_type]. For any given seq_id, there is an unlimited
> number of seq_types. The typical read is to load a subset of sequences with
> the same seq_id. Naturally I would like to have all the sequences with the
> same seq_id to co-locate on the same node(s).
>
>
> However I can't simply create one partition per seq_id and use seq_id as
> my partition key. That's because:
>
>
> 1. there could be thousands or even more seq_types for each seq_id. It's
> not feasible to include all the seq_types into one table.
>
> 2. each seq_id might have different sets of seq_types.
>
> 3. each application only needs to access a subset of seq_types for a
> seq_id. Based on CASSANDRA-5762, select partial row loads the whole row. I
> prefer only touching the data that's needed.
>
>
> As per above, I think I should use one partition per [seq_id]_[seq_type].
> But how can I achieve the data locality on seq_id? One possible approach is
> to override IPartitioner so that I just use part of the field (say 64
> bytes) to get the token (for location) while still using the whole field as
> partition key (for look up). But before heading that direction, I would
> like to see if there are better options out there. Maybe any new or
> upcoming features in C* 3.0?
>
>
> Thanks.
>
