Thanks for responding! My natural partition key is a customer id. Our customers have widely varying amounts of data. Since the vast majority of them have data that's small enough to fit in a single partition, I'd like to avoid imposing unnecessary overhead on the 99% just to avoid issues with the largest 1%.
The approach to querying across multiple partitions you describe is pretty
much what I have in mind. The trick is to avoid having to query 50 partitions
to return a few hundred or thousand rows. I agree that sequentially filling
partitions is something to avoid. That's why I'm hoping someone can suggest a
good alternative.

Jim

On Mon, Jan 4, 2016 at 8:07 PM, Clint Martin <clintlmar...@coolfiretechnologies.com> wrote:

> You should endeavor to use a repeatable method of segmenting your data.
> Swapping partitions every time you "fill one" seems like an anti-pattern
> to me, but I suppose it really depends on what your primary key is. Can
> you share some more information on this?
>
> In the past I have utilized the consistent hash method you described (add
> an artificial row key segment by taking some part of the clustering key
> modulo a fixed partition count) combined with a lazy evaluation cursor.
>
> The lazy evaluation cursor is essentially set up to query X number of
> partitions simultaneously, but to execute those queries only as needed to
> fill the page size. To perform paging you have to know the last primary
> key that was returned, so you can use that to limit the next iteration.
>
> You can trade latency for additional workload by controlling the number
> of concurrent executions you do as the iteration occurs. Or you can
> minimize the work on your cluster by querying each partition one at a
> time.
>
> Unfortunately, due to the artificial partition key segment, you cannot
> iterate or page in any particular order (at least across partitions),
> unless your hash function can also provide some ordering guarantees.
>
> It all just depends on your requirements.
>
> Clint
>
> On Jan 4, 2016 10:13 AM, "Jim Ancona" <j...@anconafamily.com> wrote:
>
>> A problem that I have run into repeatedly when doing schema design is
>> how to control partition size while still allowing for efficient
>> multi-row queries.
>>
>> We want to limit partition size to some number between 10 and 100
>> megabytes to avoid operational issues. The standard way to do that is to
>> figure out the maximum number of rows that your "natural partition key"
>> will ever need to support and then add an additional artificial
>> partition key that segments the rows sufficiently to keep the partition
>> size under the maximum. In the case of time series data, this is often
>> done by bucketing by time period, i.e. creating a new partition every
>> minute, hour or day. For non-time-series data, it's typically done with
>> something like Hash(clustering-key) mod desired-number-of-partitions.
>>
>> In my case, multi-row queries to support a REST API typically return a
>> page of results, where the page size might be anywhere from a few dozen
>> up to thousands. For query efficiency I want the average number of rows
>> per partition to be large enough that a query can be satisfied by
>> reading a small number of partitions--ideally one.
>>
>> So I want to simultaneously limit the maximum number of rows per
>> partition and yet maintain a large enough average number of rows per
>> partition to make my queries efficient. But with my data the ratio
>> between maximum and average can be very large (up to four orders of
>> magnitude).
>>
>> Here is an example:
>>
>>                        Rows per Partition    Partition Size
>> Mode                                    1              1 KB
>> Median                                500            500 KB
>> 90th percentile                     5,000              5 MB
>> 99th percentile                    50,000             50 MB
>> Maximum                         2,500,000            2.5 GB
>>
>> In this case, 99% of my data could fit in a single 50 MB partition. But
>> if I use the standard approach, I have to split my partitions into 50
>> pieces to accommodate the largest data. That means that to query the 500
>> rows for my median case, I have to read 50 partitions instead of one.
>>
>> If you try to deal with this by starting a new partition when an old one
>> fills up, you have a nasty distributed consensus problem, along with
>> read-before-write. Cassandra LWT wasn't available the last time I dealt
>> with this, but it might help with the consensus part today. But there
>> are still some nasty corner cases.
>>
>> I have some thoughts on other ways to solve this, but they all have
>> drawbacks. So I thought I'd ask here and hope that someone has a better
>> approach.
>>
>> Thanks in advance,
>>
>> Jim
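
To make the "artificial partition key segment" discussed in the thread above
concrete, here is a minimal sketch of the bucketed-table idea: hash some part
of the clustering key modulo a fixed bucket count and store the result as an
extra partition-key column. The table name, column names, and bucket count
below are hypothetical stand-ins, not taken from the thread.

# Hypothetical schema -- names and bucket count are illustrative only:
#
#   CREATE TABLE events_by_customer (
#       customer_id text,
#       bucket      int,       -- artificial partition key segment
#       event_id    timeuuid,  -- clustering key
#       payload     text,
#       PRIMARY KEY ((customer_id, bucket), event_id)
#   );

import zlib

NUM_BUCKETS = 50  # fixed, repeatable segment count (hypothetical choice)

def bucket_for(event_id: str) -> int:
    """Map a clustering key to its artificial partition-key segment.

    A stable hash (CRC32 here, rather than Python's per-process randomized
    hash()) keeps the mapping repeatable across writers and readers, which
    is what makes the segmentation queryable later.
    """
    return zlib.crc32(event_id.encode("utf-8")) % NUM_BUCKETS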
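
And a rough sketch of the lazy evaluation cursor Clint describes: walk the
buckets for one customer, run the per-bucket query only when the consumer
actually needs more rows to fill a page, and remember the last primary key
returned so the next fetch can resume from there. The fetch_page callable is
a hypothetical stand-in for whatever driver call executes the per-partition
SELECT ... WHERE customer_id = ? AND bucket = ? AND event_id > ? LIMIT ?;
it is not a real driver API, and no ordering across buckets is implied.

from itertools import islice
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical: fetch_page(customer_id, bucket, after_key, limit) returns a
# list of (event_id, row) tuples in clustering order within that one bucket.
FetchPage = Callable[[str, int, Optional[str], int], List[Tuple[str, object]]]

def lazy_rows(customer_id: str,
              fetch_page: FetchPage,
              num_buckets: int = 50,
              fetch_size: int = 100) -> Iterator[Tuple[int, str, object]]:
    """Yield rows bucket by bucket, issuing queries only as the consumer pulls."""
    for bucket in range(num_buckets):
        after_key = None
        while True:
            rows = fetch_page(customer_id, bucket, after_key, fetch_size)
            if not rows:
                break
            for event_id, row in rows:
                yield bucket, event_id, row
            after_key = rows[-1][0]       # last primary key seen in this bucket
            if len(rows) < fetch_size:    # this bucket is exhausted
                break

def one_page(customer_id: str, fetch_page: FetchPage, page_size: int) -> list:
    """Collect one page, running only as many per-bucket queries as needed."""
    return list(islice(lazy_rows(customer_id, fetch_page), page_size))

Querying the buckets one at a time, as above, minimizes cluster load; issuing
several of the per-bucket queries concurrently instead trades extra work for
lower latency, as Clint notes. It also illustrates Jim's objection: with a
fixed bucket count, a customer's rows are spread across every bucket, so
returning a few hundred rows can still mean touching most of the 50
partitions.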