Some of the sequences grow so fast that sub-partitioning is inevitable. I may need to try different bucket sizes to find the optimal throughput. Thank you all for the advice.
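[Editor's note: for anyone following along, here is a minimal CQL sketch of the bucketed (sub-partitioned) model being discussed. The table name, column names, value type, and the hashed-bucket choice are illustrative assumptions only, not a schema anyone in the thread actually posted.]

CREATE TABLE sequences_bucketed (
    seq_id    text,
    bucket    int,     -- sub-partition; e.g. hash(seq_type) % N, or a time bucket as Eric suggests (assumption)
    seq_type  text,
    seq_value blob,
    PRIMARY KEY ((seq_id, bucket), seq_type)
);

-- With a fixed bucket count N, loading everything for one seq_id touches at
-- most N partitions; e.g. for N = 8:
-- SELECT * FROM sequences_bucketed WHERE seq_id = 'abc' AND bucket IN (0, 1, 2, 3, 4, 5, 6, 7);

Smaller buckets keep partitions narrow (cheaper repairs, less GC pressure) at the cost of touching more partitions per read, which is the throughput tradeoff mentioned above.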
On Mon, Dec 8, 2014 at 9:55 AM, Eric Stevens <migh...@gmail.com> wrote:

> The upper bound for the data size of a single column is 2GB, and the upper bound for the number of columns in a row (partition) is 2 billion. So if you wanted to create the largest possible row, you probably can't afford enough disks to hold it. http://wiki.apache.org/cassandra/CassandraLimitations
>
> Practically speaking you start running into trouble *way* before you reach those thresholds though. Large columns and large numbers of columns create GC pressure in your cluster, and since all data for a given row reside on the same primary and replicas, this tends to lead to hot spotting. Repair happens for entire rows, so large rows increase the cost of repairs, including GC pressure during the repair. And rows of this size are often arrived at by appending to the same row repeatedly, which will cause the data for that row to be scattered across a large number of SSTables, which will hurt read performance. Also, depending on your interface, you'll find you start hitting limits that you have to increase, each with their own implications (e.g., maximum Thrift message sizes and so forth). The right maximum practical size for a row definitely depends on your read and write patterns, as well as your hardware and network. More memory, SSDs, larger SSTables, and faster networks will all raise the ceiling for where large rows start to become painful.
>
> @Kai, if you're familiar with the Thrift paradigm, the partition key equates to a Thrift row key, and the clustering key equates to the first part of a composite column name. CQL PRIMARY KEY ((a,b), c, d) equates to Thrift where the row key is ['a:b'] and all columns begin with ['c:d:']. Recommended reading: http://www.datastax.com/dev/blog/thrift-to-cql3
>
> Whatever your partition key, if you need to sub-partition to maintain reasonable row sizes, then the only way to preserve data locality for related records is probably to switch to the byte ordered partitioner, and compute a blob or long column as part of your partition key that is meant to cause the PK to map to the same token. Just be aware that the byte ordered partitioner comes with a number of caveats, and you'll become responsible for maintaining good data load distributions in your cluster. But the benefits from being able to tune locality may be worth it.
>
> On Sun Dec 07 2014 at 3:12:11 PM Jonathan Haddad <j...@jonhaddad.com> wrote:
>
>> I think he mentioned 100MB as the max size - planning for 1 MB might make your data model difficult to work with.
>>
>> On Sun Dec 07 2014 at 12:07:47 PM Kai Wang <dep...@gmail.com> wrote:
>>
>>> Thanks for the help. I wasn't clear on how clustering columns work. Coming from Thrift experience, it took me a while to understand how clustering columns impact partition storage on disk. Now I believe using seq_type as the first clustering column solves my problem. As for partition size, I will start with some bucket assumption. If the partition size exceeds the threshold I may need to re-bucket using a smaller bucket size.
>>>
>>> On another thread Eric mentions the optimal partition size should be around 100 KB ~ 1 MB. I will use that as the starting point to design my bucket strategy.
>>>
>>> On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>>>
>>>> It would be helpful to look at some specific examples of sequences, showing how they grow.
>>>> I suspect that the term “sequence” is being overloaded in some subtly misleading way here.
>>>>
>>>> Besides, we’ve already answered the headline question – data locality is achieved by having a common partition key. So, we need some clarity as to what question we are really focusing on.
>>>>
>>>> And, of course, we should be asking the “Cassandra Data Modeling 101” question of what you want your queries to look like, how exactly you want to access your data. Only after we have a handle on how you need to read your data can we decide how it should be stored.
>>>>
>>>> My immediate question to get things back on track: When you say “The typical read is to load a subset of sequences with the same seq_id”, what type of “subset” are you talking about? Again, a few explicit and concise example queries (in some concise, easy-to-read pseudo language or even plain English, but not belabored with full CQL syntax) would be very helpful. I mean, Cassandra has no “subset” concept, nor a “load subset” command, so what are we really talking about?
>>>>
>>>> Also, I presume we are talking CQL, but some of the references seem more Thrift/slice oriented.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> *From:* Eric Stevens <migh...@gmail.com>
>>>> *Sent:* Sunday, December 7, 2014 10:12 AM
>>>> *To:* user@cassandra.apache.org
>>>> *Subject:* Re: How to model data to achieve specific data locality
>>>>
>>>> > Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns.
>>>>
>>>> Kai, unless I'm misunderstanding something, I don't see why you need to alter the table to add a new seq type. From a data model perspective, these are just new values in a row.
>>>>
>>>> If you do have columns which are specific to particular seq_types, data modeling does become a little more challenging. In that case you may get some advantage from using collections (especially map) to store data which applies to only a few seq types. Or defining a schema which includes the set of all possible columns (that's when you're getting into ALTERs when a new column comes or goes).
>>>>
>>>> > All sequences with the same seq_id tend to grow at the same rate.
>>>>
>>>> Note that it is an anti-pattern in Cassandra to append to the same row indefinitely. I think you understand this because of your original question. But please note that a sub-partitioning strategy which reuses subpartitions will result in degraded read performance after a while. You'll need to rotate sub-partitions by something that doesn't repeat in order to keep the data for a given partition key grouped into just a few sstables. A typical pattern there is to use some kind of time bucket (hour, day, week, etc., depending on your write volume).
>>>>
>>>> I do note that your original question was about preserving data locality - and having a consistent locality for a given seq_id - for best offline analytics. If you want to work toward this, you can certainly also include a blob value in your partitioning key, whose value is calculated to force a ring collision with this record's sibling data.
>>>> With Cassandra's default partitioner of murmur3, that's probably pretty challenging - murmur3 isn't designed to be cryptographically strong (it isn't built to make forcing a collision difficult), but it is meant to have good distribution (it may still be computationally expensive to force a collision - I'm not that familiar with its internal workings). In this case, ByteOrderedPartitioner would be a lot easier to force a ring collision on, but then you need to work on a good ring balancing strategy to distribute your data evenly over the ring.
>>>>
>>>> On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan <doanduy...@gmail.com> wrote:
>>>>
>>>>> "Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly"
>>>>>
>>>>> --> Then use bucketing to avoid too-wide partitions
>>>>>
>>>>> "Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from an operational point of view."
>>>>>
>>>>> --> I don't understand why altering the table is necessary to add seq_types. If "seq_types" is defined as your clustering column, you can have many of them using the same table structure ...
>>>>>
>>>>> On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang <dep...@gmail.com> wrote:
>>>>>
>>>>>> On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens <migh...@gmail.com> wrote:
>>>>>>
>>>>>>> It depends on the size of your data, but if your data is reasonably small, there should be no trouble including thousands of records on the same partition key. So a data model using PRIMARY KEY ((seq_id), seq_type) ought to work fine.
>>>>>>>
>>>>>>> If the data size per partition exceeds some threshold that represents the right tradeoff of increasing repair cost, GC pressure, threatening unbalanced loads, and other issues that come with wide partitions, then you can subpartition via some means in a manner consistent with your workload, with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>>>>>>>
>>>>>>> For example, if seq_type can be processed for a given seq_id in any order, and you need to be able to locate specific records for a known seq_id/seq_type pair, you can compute the subpartition deterministically. Or if you only ever need to read *all* values for a given seq_id, and the processing order is not important, just randomly generate a value for subpartition at write time, as long as you can know all possible values for subpartition.
>>>>>>>
>>>>>>> If the values for the seq_types for a given seq_id must always be processed in order based on seq_type, then your subpartition calculation would need to reflect that and place adjacent seq_types in the same partition. As a contrived example, if seq_type were an incrementing integer, your subpartition could be seq_type / 100.
>>>>>>>
>>>>>>> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang <dep...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have a data model question. I am trying to figure out how to model the data to achieve the best data locality for analytic purposes. Our application processes sequences.
>>>>>>>> Each sequence has a unique key in the format of [seq_id]_[seq_type]. For any given seq_id, there is an unlimited number of seq_types. The typical read is to load a subset of sequences with the same seq_id. Naturally I would like all the sequences with the same seq_id to be co-located on the same node(s).
>>>>>>>>
>>>>>>>> However I can't simply create one partition per seq_id and use seq_id as my partition key. That's because:
>>>>>>>>
>>>>>>>> 1. there could be thousands or even more seq_types for each seq_id. It's not feasible to include all the seq_types in one table.
>>>>>>>>
>>>>>>>> 2. each seq_id might have a different set of seq_types.
>>>>>>>>
>>>>>>>> 3. each application only needs to access a subset of seq_types for a seq_id. Based on CASSANDRA-5762, selecting part of a row loads the whole row. I prefer only touching the data that's needed.
>>>>>>>>
>>>>>>>> As per the above, I think I should use one partition per [seq_id]_[seq_type]. But how can I achieve data locality on seq_id? One possible approach is to override IPartitioner so that I just use part of the field (say 64 bytes) to get the token (for location) while still using the whole field as the partition key (for lookup). But before heading in that direction, I would like to see if there are better options out there. Maybe any new or upcoming features in C* 3.0?
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>
>>>>>> Thanks, Eric.
>>>>>>
>>>>>> Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly. Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from an operational point of view.
>>>>>>
>>>>>> I thought about your subpartition idea. If there are only a few applications and each one of them uses a subset of seq_types, I can easily create one table per application since I can compute the subpartition deterministically as you said. But in my case data scientists need to easily write new applications using any combination of seq_types of a seq_id. So I want the data model to be flexible enough to support applications using any different set of seq_types without creating new tables, duplicating all the data, etc.
>>>>>>
>>>>>> -Kai
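[Editor's note: to make the simplest option in this thread concrete, a minimal sketch of Eric's PRIMARY KEY ((seq_id), seq_type) suggestion, together with the kind of "subset" read Jack asked about. The table name, column names, and value type are hypothetical placeholders.]

CREATE TABLE sequences_by_id (
    seq_id    text,
    seq_type  text,
    seq_value blob,
    PRIMARY KEY ((seq_id), seq_type)
);

-- All rows sharing a seq_id live in one partition, hence on the same replicas
-- (data locality), and an application can restrict on the clustering column to
-- fetch only the seq_types it needs:
SELECT seq_type, seq_value
FROM sequences_by_id
WHERE seq_id = 'abc'
  AND seq_type IN ('type_1', 'type_7');

Once partitions built this way grow past a comfortable size, the same idea can be split with a bucket column in the partition key, as in the earlier sketch, at the cost of querying one partition per bucket.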