Thanks for the help. I wasn't clear on how clustering columns work. Coming from a Thrift background, it took me a while to understand how a clustering column affects partition storage on disk. Now I believe using seq_type as the first clustering column solves my problem. As for partition size, I will start with an assumed bucket size; if a partition exceeds the threshold, I may need to re-bucket using a smaller bucket size.
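To make that concrete, here is a minimal CQL sketch of the model described above. The table and column names (sequences, bucket, value) are placeholders assumed for illustration, not the actual schema:

    -- Sketch only: (seq_id, bucket) as the composite partition key caps
    -- partition size, while seq_type as the first clustering column keeps
    -- all sequences of a given seq_id/bucket stored contiguously on disk.
    CREATE TABLE sequences (
        seq_id   text,
        bucket   int,      -- assumed bucketing column; re-bucket if partitions grow too large
        seq_type text,
        value    blob,     -- placeholder for the actual sequence payload
        PRIMARY KEY ((seq_id, bucket), seq_type)
    );

    -- Loading a subset of seq_types for one seq_id is a single-partition read per bucket:
    SELECT seq_type, value
    FROM sequences
    WHERE seq_id = 'seq-123' AND bucket = 0 AND seq_type IN ('type-a', 'type-b');

Nothing else about the layout needs to change if the bucket scheme is later tightened; only the client-side bucket calculation does.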
On another thread Eric mentioned that the optimal partition size is around 100 KB ~ 1 MB. I will use that as the starting point for designing my bucket strategy.

On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> It would be helpful to look at some specific examples of sequences, showing how they grow. I suspect that the term “sequence” is being overloaded in some subtly misleading way here.
>
> Besides, we’ve already answered the headline question – data locality is achieved by having a common partition key. So, we need some clarity as to what question we are really focusing on.
>
> And, of course, we should be asking the “Cassandra Data Modeling 101” question of what you want your queries to look like – how exactly do you want to access your data. Only after we have a handle on how you need to read your data can we decide how it should be stored.
>
> My immediate question to get things back on track: when you say “The typical read is to load a subset of sequences with the same seq_id”, what type of “subset” are you talking about? Again, a few explicit and concise example queries (in some concise, easy-to-read pseudo language or even plain English, but not belabored with full CQL syntax) would be very helpful. I mean, Cassandra has no “subset” concept, nor a “load subset” command, so what are we really talking about?
>
> Also, I presume we are talking CQL, but some of the references seem more Thrift/slice oriented.
>
> -- Jack Krupansky
>
> *From:* Eric Stevens <migh...@gmail.com>
> *Sent:* Sunday, December 7, 2014 10:12 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: How to model data to achieve specific data locality
>
> > Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns.
>
> Kai, unless I'm misunderstanding something, I don't see why you need to alter the table to add a new seq_type. From a data model perspective, these are just new values in a row.
>
> If you do have columns which are specific to particular seq_types, data modeling does become a little more challenging. In that case you may get some advantage from using collections (especially map) to store data which applies to only a few seq_types. Or from defining a schema which includes the set of all possible columns (that's when you're getting into ALTERs when a new column comes or goes).
>
> > All sequences with the same seq_id tend to grow at the same rate.
>
> Note that it is an anti-pattern in Cassandra to append to the same row indefinitely. I think you understand this because of your original question. But please note that a sub-partitioning strategy which reuses subpartitions will result in degraded read performance after a while. You'll need to rotate subpartitions by something that doesn't repeat in order to keep the data for a given partition key grouped into just a few SSTables. A typical pattern there is to use some kind of time bucket (hour, day, week, etc., depending on your write volume).
>
> I do note that your original question was about preserving data locality - and having a consistent locality for a given seq_id - for best offline analytics. If you want to work toward this, you can certainly also include a blob value in your partitioning key, whose value is calculated to force a ring collision with this record's sibling data.
> With Cassandra's default partitioner, murmur3, that's probably pretty challenging - murmur3 isn't designed to be cryptographically strong (it doesn't work to make it difficult to force a collision), but it is meant to have good distribution (it may still be computationally expensive to force a collision - I'm not that familiar with its internal workings). In this case, ByteOrderedPartitioner would be a lot easier to force a ring collision on, but then you need to work out a good ring-balancing strategy to distribute your data evenly over the ring.
>
> On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan <doanduy...@gmail.com> wrote:
>
>> "Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly"
>>
>> --> Then use bucketing to avoid too-wide partitions.
>>
>> "Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from operation point of view."
>>
>> --> I don't understand why altering the table is necessary to add seq_types. If "seq_type" is defined as your clustering column, you can have many of them using the same table structure ...
>>
>> On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang <dep...@gmail.com> wrote:
>>
>>> On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens <migh...@gmail.com> wrote:
>>>
>>>> It depends on the size of your data, but if your data is reasonably small, there should be no trouble including thousands of records on the same partition key. So a data model using PRIMARY KEY ((seq_id), seq_type) ought to work fine.
>>>>
>>>> If the data size per partition exceeds some threshold that represents the right tradeoff of increasing repair cost, GC pressure, threatening unbalanced loads, and other issues that come with wide partitions, then you can subpartition via some means in a manner consistent with your workload, with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>>>>
>>>> For example, if seq_type can be processed for a given seq_id in any order, and you need to be able to locate specific records for a known seq_id/seq_type pair, you can compute the subpartition deterministically. Or, if you only ever need to read *all* values for a given seq_id and the processing order is not important, just randomly generate a value for the subpartition at write time, as long as you know all possible values for it.
>>>>
>>>> If the values for the seq_types for a given seq_id must always be processed in order based on seq_type, then your subpartition calculation would need to reflect that and place adjacent seq_types in the same partition. As a contrived example, if seq_type were an incrementing integer, your subpartition could be seq_type / 100.
>>>>
>>>> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang <dep...@gmail.com> wrote:
>>>>
>>>>> I have a data model question. I am trying to figure out how to model the data to achieve the best data locality for analytic purposes. Our application processes sequences. Each sequence has a unique key in the format of [seq_id]_[seq_type]. For any given seq_id, there is an unlimited number of seq_types. The typical read is to load a subset of sequences with the same seq_id.
>>>>> Naturally I would like all the sequences with the same seq_id to co-locate on the same node(s).
>>>>>
>>>>> However I can't simply create one partition per seq_id and use seq_id as my partition key. That's because:
>>>>>
>>>>> 1. There could be thousands or even more seq_types for each seq_id. It's not feasible to include all the seq_types in one table.
>>>>> 2. Each seq_id might have a different set of seq_types.
>>>>> 3. Each application only needs to access a subset of seq_types for a seq_id. Based on CASSANDRA-5762, selecting part of a row loads the whole row. I prefer only touching the data that's needed.
>>>>>
>>>>> As per the above, I think I should use one partition per [seq_id]_[seq_type]. But how can I achieve data locality on seq_id? One possible approach is to override IPartitioner so that I use only part of the field (say 64 bytes) to get the token (for location) while still using the whole field as the partition key (for lookup). But before heading in that direction, I would like to see if there are better options out there. Maybe any new or upcoming features in C* 3.0?
>>>>>
>>>>> Thanks.
>>>
>>> Thanks, Eric.
>>>
>>> Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly. Also, new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from an operations point of view.
>>>
>>> I thought about your subpartition idea. If there were only a few applications and each of them used a subset of seq_types, I could easily create one table per application, since I can compute the subpartition deterministically as you said. But in my case data scientists need to easily write new applications using any combination of seq_types for a seq_id. So I want the data model to be flexible enough to support applications using any different set of seq_types without creating new tables, duplicating all the data, etc.
>>>
>>> -Kai
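As a footnote to Eric's suggestion above about using collections for seq_type-specific data, here is a rough sketch of how a map could absorb per-seq_type attributes so that adding or dropping a seq_type never requires an ALTER TABLE. The table and column names (sequences_by_id, attrs, value) and the example values are assumptions for illustration only:

    -- Sketch, not the actual schema: seq_type-specific attributes live in
    -- a map instead of dedicated columns, so the schema stays stable.
    CREATE TABLE sequences_by_id (
        seq_id   text,
        bucket   int,                -- same assumed bucketing column as above
        seq_type text,
        value    blob,               -- payload common to every seq_type
        attrs    map<text, text>,    -- per-seq_type attributes; no schema change to add new ones
        PRIMARY KEY ((seq_id, bucket), seq_type)
    );

    -- Introducing a brand-new seq_type is just an INSERT, not a schema change:
    INSERT INTO sequences_by_id (seq_id, bucket, seq_type, value, attrs)
    VALUES ('seq-123', 0, 'new-type', 0xcafe, {'window': '60s', 'source': 'feed-7'});

If some attributes turn out to be large or frequently updated, promoting them to real columns (accepting the occasional ALTER) may read and compact better; the map is simply the lowest-friction way to keep the schema stable while seq_types come and go.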