Thanks for the help. I wasn't clear on how clustering columns work. Coming from a Thrift background, it took me a while to understand how a clustering column affects partition storage on disk. Now I believe using seq_type as the first clustering column solves my problem. As for partition size, I will start with an assumed bucket size; if a partition exceeds the threshold, I may need to re-bucket using a smaller bucket size.
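To make that concrete, here is a minimal CQL sketch of the model described above. The table and column names (sequences, bucket, value) are placeholders assumed for illustration, not the actual schema:

    -- Sketch only: (seq_id, bucket) as the composite partition key caps
    -- partition size, while seq_type as the first clustering column keeps
    -- all sequences of a given seq_id/bucket stored contiguously on disk.
    CREATE TABLE sequences (
        seq_id   text,
        bucket   int,      -- assumed bucketing column; re-bucket if partitions grow too large
        seq_type text,
        value    blob,     -- placeholder for the actual sequence payload
        PRIMARY KEY ((seq_id, bucket), seq_type)
    );

    -- Loading a subset of seq_types for one seq_id is a single-partition read per bucket:
    SELECT seq_type, value
    FROM sequences
    WHERE seq_id = 'seq-123' AND bucket = 0 AND seq_type IN ('type-a', 'type-b');

Nothing else about the layout needs to change if the bucket scheme is later tightened; only the client-side bucket calculation does.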
On another thread Eric mentioned that the optimal partition size is around 100 KB ~ 1 MB. I will use that as the starting point for designing my bucket strategy.

On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> It would be helpful to look at some specific examples of sequences, showing how they grow. I suspect that the term “sequence” is being overloaded in some subtly misleading way here.
>
> Besides, we’ve already answered the headline question – data locality is achieved by having a common partition key. So, we need some clarity as to what question we are really focusing on.
>
> And, of course, we should be asking the “Cassandra Data Modeling 101” question of what you want your queries to look like – how exactly do you want to access your data. Only after we have a handle on how you need to read your data can we decide how it should be stored.
>
> My immediate question to get things back on track: when you say “The typical read is to load a subset of sequences with the same seq_id”, what type of “subset” are you talking about? Again, a few explicit and concise example queries (in some concise, easy-to-read pseudo language or even plain English, but not belabored with full CQL syntax) would be very helpful. I mean, Cassandra has no “subset” concept, nor a “load subset” command, so what are we really talking about?
>
> Also, I presume we are talking CQL, but some of the references seem more Thrift/slice oriented.
>
> -- Jack Krupansky
>
> *From:* Eric Stevens <migh...@gmail.com>
> *Sent:* Sunday, December 7, 2014 10:12 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: How to model data to achieve specific data locality
>
> > Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns.
>
> Kai, unless I'm misunderstanding something, I don't see why you need to alter the table to add a new seq_type. From a data model perspective, these are just new values in a row.
>
> If you do have columns which are specific to particular seq_types, data modeling does become a little more challenging. In that case you may get some advantage from using collections (especially map) to store data which applies to only a few seq_types. Or from defining a schema which includes the set of all possible columns (that's when you're getting into ALTERs when a new column comes or goes).
>
> > All sequences with the same seq_id tend to grow at the same rate.
>
> Note that it is an anti-pattern in Cassandra to append to the same row indefinitely. I think you understand this because of your original question. But please note that a sub-partitioning strategy which reuses subpartitions will result in degraded read performance after a while. You'll need to rotate subpartitions by something that doesn't repeat in order to keep the data for a given partition key grouped into just a few SSTables. A typical pattern there is to use some kind of time bucket (hour, day, week, etc., depending on your write volume).
>
> I do note that your original question was about preserving data locality - and having a consistent locality for a given seq_id - for best offline analytics. If you want to work toward this, you can certainly also include a blob value in your partitioning key, whose value is calculated to force a ring collision with this record's sibling data.
> With Cassandra's default partitioner, murmur3, that's probably pretty challenging - murmur3 isn't designed to be cryptographically strong (it doesn't work to make it difficult to force a collision), but it is meant to have good distribution (it may still be computationally expensive to force a collision - I'm not that familiar with its internal workings). In this case, ByteOrderedPartitioner would be a lot easier to force a ring collision on, but then you need to work out a good ring-balancing strategy to distribute your data evenly over the ring.
>
> On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan <doanduy...@gmail.com> wrote:
>
>> "Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly"
>>
>> --> Then use bucketing to avoid too-wide partitions.
>>
>> "Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from operation point of view."
>>
>> --> I don't understand why altering the table is necessary to add seq_types. If "seq_type" is defined as your clustering column, you can have many of them using the same table structure ...
>>
>> On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang <dep...@gmail.com> wrote:
>>
>>> On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens <migh...@gmail.com> wrote:
>>>
>>>> It depends on the size of your data, but if your data is reasonably small, there should be no trouble including thousands of records on the same partition key. So a data model using PRIMARY KEY ((seq_id), seq_type) ought to work fine.
>>>>
>>>> If the data size per partition exceeds some threshold that represents the right tradeoff of increasing repair cost, GC pressure, threatening unbalanced loads, and other issues that come with wide partitions, then you can subpartition via some means in a manner consistent with your workload, with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>>>>
>>>> For example, if seq_type can be processed for a given seq_id in any order, and you need to be able to locate specific records for a known seq_id/seq_type pair, you can compute the subpartition deterministically. Or, if you only ever need to read *all* values for a given seq_id and the processing order is not important, just randomly generate a value for the subpartition at write time, as long as you know all possible values for it.
>>>>
>>>> If the values for the seq_types for a given seq_id must always be processed in order based on seq_type, then your subpartition calculation would need to reflect that and place adjacent seq_types in the same partition. As a contrived example, if seq_type were an incrementing integer, your subpartition could be seq_type / 100.
>>>>
>>>> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang <dep...@gmail.com> wrote:
>>>>
>>>>> I have a data model question. I am trying to figure out how to model the data to achieve the best data locality for analytic purposes. Our application processes sequences. Each sequence has a unique key in the format of [seq_id]_[seq_type]. For any given seq_id, there is an unlimited number of seq_types. The typical read is to load a subset of sequences with the same seq_id.
>>>>> Naturally I would like all the sequences with the same seq_id to co-locate on the same node(s).
>>>>>
>>>>> However I can't simply create one partition per seq_id and use seq_id as my partition key. That's because:
>>>>>
>>>>> 1. There could be thousands or even more seq_types for each seq_id. It's not feasible to include all the seq_types in one table.
>>>>> 2. Each seq_id might have a different set of seq_types.
>>>>> 3. Each application only needs to access a subset of seq_types for a seq_id. Based on CASSANDRA-5762, selecting part of a row loads the whole row. I prefer only touching the data that's needed.
>>>>>
>>>>> As per the above, I think I should use one partition per [seq_id]_[seq_type]. But how can I achieve data locality on seq_id? One possible approach is to override IPartitioner so that I use only part of the field (say 64 bytes) to get the token (for location) while still using the whole field as the partition key (for lookup). But before heading in that direction, I would like to see if there are better options out there. Maybe any new or upcoming features in C* 3.0?
>>>>>
>>>>> Thanks.
>>>
>>> Thanks, Eric.
>>>
>>> Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly. Also, new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from an operations point of view.
>>>
>>> I thought about your subpartition idea. If there were only a few applications and each of them used a subset of seq_types, I could easily create one table per application, since I can compute the subpartition deterministically as you said. But in my case data scientists need to easily write new applications using any combination of seq_types for a seq_id. So I want the data model to be flexible enough to support applications using any different set of seq_types without creating new tables, duplicating all the data, etc.
>>>
>>> -Kai
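As a footnote to Eric's suggestion above about using collections for seq_type-specific data, here is a rough sketch of how a map could absorb per-seq_type attributes so that adding or dropping a seq_type never requires an ALTER TABLE. The table and column names (sequences_by_id, attrs, value) and the example values are assumptions for illustration only:

    -- Sketch, not the actual schema: seq_type-specific attributes live in
    -- a map instead of dedicated columns, so the schema stays stable.
    CREATE TABLE sequences_by_id (
        seq_id   text,
        bucket   int,                -- same assumed bucketing column as above
        seq_type text,
        value    blob,               -- payload common to every seq_type
        attrs    map<text, text>,    -- per-seq_type attributes; no schema change to add new ones
        PRIMARY KEY ((seq_id, bucket), seq_type)
    );

    -- Introducing a brand-new seq_type is just an INSERT, not a schema change:
    INSERT INTO sequences_by_id (seq_id, bucket, seq_type, value, attrs)
    VALUES ('seq-123', 0, 'new-type', 0xcafe, {'window': '60s', 'source': 'feed-7'});

If some attributes turn out to be large or frequently updated, promoting them to real columns (accepting the occasional ALTER) may read and compact better; the map is simply the lowest-friction way to keep the schema stable while seq_types come and go.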