Re: How to model data to achieve specific data locality

Eric Stevens Sun, 07 Dec 2014 07:15:06 -0800

> Also new seq_types can be added and old seq_types can be deleted. This
> means I often need to ALTER TABLE to add and drop columns.


Kai, unless I'm misunderstanding something, I don't see why you need to
alter the table to add a new seq type.  From a data model perspective,
these are just new values in a row.

If you do have columns which are specific to particular seq_types, data
modeling does become a little more challenging.  In that case you may get
some advantage from using collections (especially map) to store data which
applies to only a few seq types.  Alternatively, you could define a schema
which includes the set of all possible columns (that's when you get into
ALTERs whenever a new column comes or goes).

> All sequences with the same seq_id tend to grow at the same rate.

Note that it is an anti-pattern in Cassandra to append to the same row
indefinitely.  I think you understand this, given your original
question.  But please note that a subpartitioning strategy which reuses
subpartitions will result in degraded read performance after a while.
You'll need to rotate subpartitions by something that doesn't repeat in
order to keep the data for a given partition key grouped into just a few
sstables.  A typical pattern there is to use some kind of time bucket
(hour, day, week, etc., depending on your write volume).
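
To illustrate the time bucket idea, here is a minimal Python sketch of deriving
a non-repeating bucket label at write time.  The function name time_bucket and
the label formats are hypothetical, not anything Cassandra provides; the
assumption is that the label becomes part of the partition key, e.g.
PRIMARY KEY ((seq_id, bucket), seq_type).

```python
from datetime import datetime, timezone


def time_bucket(ts: datetime, granularity: str = "day") -> str:
    """Return a bucket label that never repeats as time moves forward.

    Hypothetical helper: expects a timezone-aware datetime and
    normalizes it to UTC so every writer agrees on the bucket.
    """
    ts = ts.astimezone(timezone.utc)
    if granularity == "hour":
        return ts.strftime("%Y%m%d%H")
    if granularity == "day":
        return ts.strftime("%Y%m%d")
    if granularity == "week":
        # ISO year and week number, so buckets stay unique across years.
        iso_year, iso_week, _ = ts.isocalendar()
        return f"{iso_year}W{iso_week:02d}"
    raise ValueError(f"unsupported granularity: {granularity}")


# The bucket is computed by the writer and stored with the row.
written_at = datetime(2014, 12, 7, 15, 15, 6, tzinfo=timezone.utc)
bucket = time_bucket(written_at, "day")
```

A reader that wants everything for a seq_id over a time range would enumerate
the bucket labels covering that range and query each (seq_id, bucket) pair.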

I do note that your original question was about preserving data locality -
and having a consistent locality for a given seq_id - for best offline
analytics.  If you want to pursue this, you could also include a blob
value in your partition key, whose value is calculated to force a ring
collision with this record's sibling data.  With Cassandra's default
partitioner of murmur3, that's probably pretty challenging.  murmur3 isn't
designed to be cryptographically strong (it makes no effort to make
collisions difficult to force), but it is meant to have good distribution,
so it may still be computationally expensive to force a collision (I'm not
that familiar with its internal workings).  With ByteOrderedPartitioner a
ring collision would be a lot easier to force, but then you need to work
on a good ring balancing strategy to distribute your data evenly over the
ring.

On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan <doanduy...@gmail.com> wrote:

> "Those sequences are not fixed. All sequences with the same seq_id tend
> to grow at the same rate. If it's one partition per seq_id, the size will
> most likely exceed the threshold quickly"
>
> --> Then use bucketing to avoid too wide partitions
>
> "Also new seq_types can be added and old seq_types can be deleted. This
> means I often need to ALTER TABLE to add and drop columns. I am not sure if
> this is a good practice from operation point of view."
>
>  --> I don't understand why altering table is necessary to add seq_types.
> If "seq_types" is defined as your clustering column, you can have many of
> them using the same table structure ...
>
>
>
>
>
> On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang <dep...@gmail.com> wrote:
>
>> On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens <migh...@gmail.com> wrote:
>>
>>> It depends on the size of your data, but if your data is reasonably
>>> small, there should be no trouble including thousands of records on the
>>> same partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type)
>>> ought to work fine.
>>>
>>> If the data size per partition exceeds some threshold that represents
>>> the right tradeoff of increasing repair cost, gc pressure, threatening
>>> unbalanced loads, and other issues that come with wide partitions, then you
>>> can subpartition via some means in a manner consistent with your work load,
>>> with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>>>
>>> For example, if seq_type can be processed for a given seq_id in any
>>> order, and you need to be able to locate specific records for a known
>>> seq_id/seq_type pair, you can compute subpartition
>>> deterministically.  Or if you only ever need to read *all* values for a
>>> given seq_id, and the processing order is not important, just randomly
>>> generate a value for subpartition at write time, as long as you can know
>>> all possible values for subpartition.
>>>
>>> If the values for the seq_types for a given seq_id must always be
>>> processed in order based on seq_type, then your subpartition calculation
>>> would need to reflect that and place adjacent seq_types in the same
>>> partition.  As a contrived example, say seq_type was an incrementing
>>> integer, your subpartition could be seq_type / 100.
>>>
>>> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang <dep...@gmail.com> wrote:
>>>
>>>> I have a data model question. I am trying to figure out how to model
>>>> the data to achieve the best data locality for analytic purpose. Our
>>>> application processes sequences. Each sequence has a unique key in the
>>>> format of [seq_id]_[seq_type]. For any given seq_id, there is an unlimited
>>>> number of seq_types. The typical read is to load a subset of sequences with
>>>> the same seq_id. Naturally I would like to have all the sequences with the
>>>> same seq_id to co-locate on the same node(s).
>>>>
>>>>
>>>> However I can't simply create one partition per seq_id and use seq_id
>>>> as my partition key. That's because:
>>>>
>>>>
>>>> 1. there could be thousands or even more seq_types for each seq_id.
>>>> It's not feasible to include all the seq_types into one table.
>>>>
>>>> 2. each seq_id might have different sets of seq_types.
>>>>
>>>> 3. each application only needs to access a subset of seq_types for a
>>>> seq_id. Based on CASSANDRA-5762, select partial row loads the whole row. I
>>>> prefer only touching the data that's needed.
>>>>
>>>>
>>>> As per above, I think I should use one partition per
>>>> [seq_id]_[seq_type]. But how can I achieve the data locality on seq_id? One
>>>> possible approach is to override IPartitioner so that I just use part of
>>>> the field (say 64 bytes) to get the token (for location) while still using
>>>> the whole field as partition key (for look up). But before heading that
>>>> direction, I would like to see if there are better options out there. Maybe
>>>> any new or upcoming features in C* 3.0?
>>>>
>>>>
>>>> Thanks.
>>>>
>>>
>> Thanks, Eric.
>>
>> Those sequences are not fixed. All sequences with the same seq_id tend to
>> grow at the same rate. If it's one partition per seq_id, the size will most
>> likely exceed the threshold quickly. Also new seq_types can be added and
>> old seq_types can be deleted. This means I often need to ALTER TABLE to add
>> and drop columns. I am not sure if this is a good practice from operation
>> point of view.
>>
>> I thought about your subpartition idea. If there are only a few
>> applications and each one of them uses a subset of seq_types, I can easily
>> create one table per application since I can compute the subpartition
>> deterministically as you said. But in my case data scientists need to
>> easily write new applications using any combination of seq_types of a
>> seq_id. So I want the data model to be flexible enough to support
>> applications using any different set of seq_types without creating new
>> tables, duplicating all the data, etc.
>>
>> -Kai
>>
>>
>>
>
