Some of the sequences grow so fast that sub-partitioning is inevitable. I may need to try different bucket sizes to find the optimal throughput. Thank you all for the advice.
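[Editor's note: for anyone following along, here is a minimal CQL sketch of the bucketed (sub-partitioned) model being discussed. The table name, column names, value type, and the hashed-bucket choice are illustrative assumptions only, not a schema anyone in the thread actually posted.]

CREATE TABLE sequences_bucketed (
    seq_id    text,
    bucket    int,     -- sub-partition; e.g. hash(seq_type) % N, or a time bucket as Eric suggests (assumption)
    seq_type  text,
    seq_value blob,
    PRIMARY KEY ((seq_id, bucket), seq_type)
);

-- With a fixed bucket count N, loading everything for one seq_id touches at
-- most N partitions; e.g. for N = 8:
-- SELECT * FROM sequences_bucketed WHERE seq_id = 'abc' AND bucket IN (0, 1, 2, 3, 4, 5, 6, 7);

Smaller buckets keep partitions narrow (cheaper repairs, less GC pressure) at the cost of touching more partitions per read, which is the throughput tradeoff mentioned above.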
On Mon, Dec 8, 2014 at 9:55 AM, Eric Stevens <migh...@gmail.com> wrote:

> The upper bound for the data size of a single column is 2GB, and the upper bound for the number of columns in a row (partition) is 2 billion. So if you wanted to create the largest possible row, you probably can't afford enough disks to hold it. http://wiki.apache.org/cassandra/CassandraLimitations
>
> Practically speaking you start running into trouble *way* before you reach those thresholds though. Large columns and large numbers of columns create GC pressure in your cluster, and since all data for a given row reside on the same primary and replicas, this tends to lead to hot spotting. Repair happens for entire rows, so large rows increase the cost of repairs, including GC pressure during the repair. And rows of this size are often arrived at by appending to the same row repeatedly, which will cause the data for that row to be scattered across a large number of SSTables, which will hurt read performance. Also, depending on your interface, you'll find you start hitting limits that you have to increase, each with their own implications (e.g., maximum Thrift message sizes and so forth). The right maximum practical size for a row definitely depends on your read and write patterns, as well as your hardware and network. More memory, SSDs, larger SSTables, and faster networks will all raise the ceiling for where large rows start to become painful.
>
> @Kai, if you're familiar with the Thrift paradigm, the partition key equates to a Thrift row key, and the clustering key equates to the first part of a composite column name. CQL PRIMARY KEY ((a,b), c, d) equates to Thrift where the row key is ['a:b'] and all columns begin with ['c:d:']. Recommended reading: http://www.datastax.com/dev/blog/thrift-to-cql3
>
> Whatever your partition key, if you need to sub-partition to maintain reasonable row sizes, then the only way to preserve data locality for related records is probably to switch to the byte ordered partitioner, and compute a blob or long column as part of your partition key that is meant to cause the PK to map to the same token. Just be aware that the byte ordered partitioner comes with a number of caveats, and you'll become responsible for maintaining good data load distributions in your cluster. But the benefits from being able to tune locality may be worth it.
>
> On Sun Dec 07 2014 at 3:12:11 PM Jonathan Haddad <j...@jonhaddad.com> wrote:
>
>> I think he mentioned 100MB as the max size - planning for 1 MB might make your data model difficult to work with.
>>
>> On Sun Dec 07 2014 at 12:07:47 PM Kai Wang <dep...@gmail.com> wrote:
>>
>>> Thanks for the help. I wasn't clear on how clustering columns work. Coming from Thrift experience, it took me a while to understand how clustering columns impact partition storage on disk. Now I believe using seq_type as the first clustering column solves my problem. As for partition size, I will start with some bucket assumption. If the partition size exceeds the threshold I may need to re-bucket using a smaller bucket size.
>>>
>>> On another thread Eric mentions the optimal partition size should be around 100 KB ~ 1 MB. I will use that as the starting point to design my bucket strategy.
>>>
>>> On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>>>
>>>> It would be helpful to look at some specific examples of sequences, showing how they grow.
>>>> I suspect that the term “sequence” is being overloaded in some subtly misleading way here.
>>>>
>>>> Besides, we’ve already answered the headline question – data locality is achieved by having a common partition key. So, we need some clarity as to what question we are really focusing on.
>>>>
>>>> And, of course, we should be asking the “Cassandra Data Modeling 101” question of what you want your queries to look like, how exactly you want to access your data. Only after we have a handle on how you need to read your data can we decide how it should be stored.
>>>>
>>>> My immediate question to get things back on track: When you say “The typical read is to load a subset of sequences with the same seq_id”, what type of “subset” are you talking about? Again, a few explicit and concise example queries (in some concise, easy-to-read pseudo language or even plain English, but not belabored with full CQL syntax) would be very helpful. I mean, Cassandra has no “subset” concept, nor a “load subset” command, so what are we really talking about?
>>>>
>>>> Also, I presume we are talking CQL, but some of the references seem more Thrift/slice oriented.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> *From:* Eric Stevens <migh...@gmail.com>
>>>> *Sent:* Sunday, December 7, 2014 10:12 AM
>>>> *To:* user@cassandra.apache.org
>>>> *Subject:* Re: How to model data to achieve specific data locality
>>>>
>>>> > Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns.
>>>>
>>>> Kai, unless I'm misunderstanding something, I don't see why you need to alter the table to add a new seq type. From a data model perspective, these are just new values in a row.
>>>>
>>>> If you do have columns which are specific to particular seq_types, data modeling does become a little more challenging. In that case you may get some advantage from using collections (especially map) to store data which applies to only a few seq types. Or defining a schema which includes the set of all possible columns (that's when you're getting into ALTERs when a new column comes or goes).
>>>>
>>>> > All sequences with the same seq_id tend to grow at the same rate.
>>>>
>>>> Note that it is an anti-pattern in Cassandra to append to the same row indefinitely. I think you understand this because of your original question. But please note that a sub-partitioning strategy which reuses subpartitions will result in degraded read performance after a while. You'll need to rotate sub-partitions by something that doesn't repeat in order to keep the data for a given partition key grouped into just a few sstables. A typical pattern there is to use some kind of time bucket (hour, day, week, etc., depending on your write volume).
>>>>
>>>> I do note that your original question was about preserving data locality - and having a consistent locality for a given seq_id - for best offline analytics. If you want to work toward this, you can certainly also include a blob value in your partitioning key, whose value is calculated to force a ring collision with this record's sibling data.
>>>> With Cassandra's default partitioner of murmur3, that's probably pretty challenging - murmur3 isn't designed to be cryptographically strong (it isn't built to make forcing a collision difficult), but it is meant to have good distribution (it may still be computationally expensive to force a collision - I'm not that familiar with its internal workings). In this case, ByteOrderedPartitioner would be a lot easier to force a ring collision on, but then you need to work on a good ring balancing strategy to distribute your data evenly over the ring.
>>>>
>>>> On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan <doanduy...@gmail.com> wrote:
>>>>
>>>>> "Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly"
>>>>>
>>>>> --> Then use bucketing to avoid too-wide partitions
>>>>>
>>>>> "Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from an operational point of view."
>>>>>
>>>>> --> I don't understand why altering the table is necessary to add seq_types. If "seq_types" is defined as your clustering column, you can have many of them using the same table structure ...
>>>>>
>>>>> On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang <dep...@gmail.com> wrote:
>>>>>
>>>>>> On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens <migh...@gmail.com> wrote:
>>>>>>
>>>>>>> It depends on the size of your data, but if your data is reasonably small, there should be no trouble including thousands of records on the same partition key. So a data model using PRIMARY KEY ((seq_id), seq_type) ought to work fine.
>>>>>>>
>>>>>>> If the data size per partition exceeds some threshold that represents the right tradeoff of increasing repair cost, GC pressure, threatening unbalanced loads, and other issues that come with wide partitions, then you can subpartition via some means in a manner consistent with your workload, with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>>>>>>>
>>>>>>> For example, if seq_type can be processed for a given seq_id in any order, and you need to be able to locate specific records for a known seq_id/seq_type pair, you can compute the subpartition deterministically. Or if you only ever need to read *all* values for a given seq_id, and the processing order is not important, just randomly generate a value for subpartition at write time, as long as you can know all possible values for subpartition.
>>>>>>>
>>>>>>> If the values for the seq_types for a given seq_id must always be processed in order based on seq_type, then your subpartition calculation would need to reflect that and place adjacent seq_types in the same partition. As a contrived example, if seq_type were an incrementing integer, your subpartition could be seq_type / 100.
>>>>>>>
>>>>>>> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang <dep...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have a data model question. I am trying to figure out how to model the data to achieve the best data locality for analytic purposes. Our application processes sequences.
>>>>>>>> Each sequence has a unique key in the format of [seq_id]_[seq_type]. For any given seq_id, there is an unlimited number of seq_types. The typical read is to load a subset of sequences with the same seq_id. Naturally I would like all the sequences with the same seq_id to be co-located on the same node(s).
>>>>>>>>
>>>>>>>> However I can't simply create one partition per seq_id and use seq_id as my partition key. That's because:
>>>>>>>>
>>>>>>>> 1. there could be thousands or even more seq_types for each seq_id. It's not feasible to include all the seq_types in one table.
>>>>>>>>
>>>>>>>> 2. each seq_id might have a different set of seq_types.
>>>>>>>>
>>>>>>>> 3. each application only needs to access a subset of seq_types for a seq_id. Based on CASSANDRA-5762, selecting part of a row loads the whole row. I prefer only touching the data that's needed.
>>>>>>>>
>>>>>>>> As per the above, I think I should use one partition per [seq_id]_[seq_type]. But how can I achieve data locality on seq_id? One possible approach is to override IPartitioner so that I just use part of the field (say 64 bytes) to get the token (for location) while still using the whole field as the partition key (for lookup). But before heading in that direction, I would like to see if there are better options out there. Maybe any new or upcoming features in C* 3.0?
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>
>>>>>> Thanks, Eric.
>>>>>>
>>>>>> Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly. Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from an operational point of view.
>>>>>>
>>>>>> I thought about your subpartition idea. If there are only a few applications and each one of them uses a subset of seq_types, I can easily create one table per application since I can compute the subpartition deterministically as you said. But in my case data scientists need to easily write new applications using any combination of seq_types of a seq_id. So I want the data model to be flexible enough to support applications using any different set of seq_types without creating new tables, duplicating all the data, etc.
>>>>>>
>>>>>> -Kai
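[Editor's note: to make the simplest option in this thread concrete, a minimal sketch of Eric's PRIMARY KEY ((seq_id), seq_type) suggestion, together with the kind of "subset" read Jack asked about. The table name, column names, and value type are hypothetical placeholders.]

CREATE TABLE sequences_by_id (
    seq_id    text,
    seq_type  text,
    seq_value blob,
    PRIMARY KEY ((seq_id), seq_type)
);

-- All rows sharing a seq_id live in one partition, hence on the same replicas
-- (data locality), and an application can restrict on the clustering column to
-- fetch only the seq_types it needs:
SELECT seq_type, seq_value
FROM sequences_by_id
WHERE seq_id = 'abc'
  AND seq_type IN ('type_1', 'type_7');

Once partitions built this way grow past a comfortable size, the same idea can be split with a bucket column in the partition key, as in the earlier sketch, at the cost of querying one partition per bucket.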