Thanks for responding!

My natural partition key is a customer id. Our customers have widely
varying amounts of data. Since the vast majority of them have data that's
small enough to fit in a single partition, I'd like to avoid imposing
unnecessary overhead on the 99% just to avoid issues with the largest 1%.

The approach you describe for querying across multiple partitions is pretty
much what I have in mind. The trick is to avoid having to query 50
partitions just to return a few hundred or a few thousand rows.

I agree that sequentially filling partitions is something to avoid. That's
why I'm hoping someone can suggest a good alternative.

Jim

On Mon, Jan 4, 2016 at 8:07 PM, Clint Martin <
clintlmar...@coolfiretechnologies.com> wrote:

> You should endeavor to use a repeatable method of segmenting your data.
> Swapping partitions every time you "fill one" seems like an anti-pattern to
> me, but I suppose it really depends on what your primary key is. Can you
> share some more information on this?
>
> In the past I have utilized the consistent hash method you described (add
> an artificial row-key segment by taking some part of the clustering key
> modulo a fixed partition count) combined with a lazy-evaluation cursor.
>
> The lazy-evaluation cursor is essentially set up to query X partitions
> simultaneously, but to execute those queries only as needed to fill the
> page size. To perform paging you have to know the last primary key that
> was returned so you can use it to bound the next iteration.
>
> You can trade latency for additional workload by controlling the number
> of concurrent executions you do as the iteration proceeds, or you can
> minimize the work on your cluster by querying each partition one at a
> time.
>
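> A rough sketch of what that cursor might look like, assuming the DataStax
> Java driver and a hypothetical table keyed by ((customer_id, bucket),
> event_id); the schema and names are just illustrative:
>
> import com.datastax.driver.core.*;
> import java.util.*;
>
> // Lazy-evaluation cursor over hash-bucketed partitions. Buckets are
> // queried one at a time, and only as needed to fill the requested page.
> public class LazyBucketCursor {
>     private final Session session;
>     private final PreparedStatement firstPage;
>     private final PreparedStatement nextPage;
>     private final int numBuckets;
>
>     public LazyBucketCursor(Session session, int numBuckets) {
>         this.session = session;
>         this.numBuckets = numBuckets;
>         this.firstPage = session.prepare(
>             "SELECT bucket, event_id, payload FROM events " +
>             "WHERE customer_id = ? AND bucket = ? LIMIT ?");
>         this.nextPage = session.prepare(
>             "SELECT bucket, event_id, payload FROM events " +
>             "WHERE customer_id = ? AND bucket = ? AND event_id > ? LIMIT ?");
>     }
>
>     // Fetch one page. To resume, the caller passes the bucket and
>     // clustering key (event_id) of the last row it received.
>     public List<Row> page(String customerId, int startBucket,
>                           UUID lastEventId, int pageSize) {
>         List<Row> rows = new ArrayList<>();
>         UUID cursor = lastEventId;   // position within startBucket, if any
>         for (int b = startBucket; b < numBuckets && rows.size() < pageSize; b++) {
>             int remaining = pageSize - rows.size();
>             BoundStatement stmt = (cursor == null)
>                 ? firstPage.bind(customerId, b, remaining)
>                 : nextPage.bind(customerId, b, cursor, remaining);
>             for (Row row : session.execute(stmt)) {
>                 rows.add(row);
>             }
>             cursor = null;   // later buckets are read from their beginning
>         }
>         return rows;   // caller records the last row's bucket + event_id
>     }
> }
>
> To trade latency for load you could instead fire the per-bucket queries
> with executeAsync and cap how many are in flight at once; the loop above
> is the "one partition at a time" end of that spectrum.
>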
> Unfortunately, due to the artificial partition-key segment you cannot
> iterate or page in any particular order, at least across partitions,
> unless your hash function can also provide some ordering guarantees.
>
> It all just depends on your requirements.
>
> Clint
> On Jan 4, 2016 10:13 AM, "Jim Ancona" <j...@anconafamily.com> wrote:
>
>> A problem that I have run into repeatedly when doing schema design is how
>> to control partition size while still allowing for efficient multi-row
>> queries.
>>
>> We want to limit partition size to some number between 10 and 100
>> megabytes to avoid operational issues. The standard way to do that is to
>> figure out the maximum number of rows that your "natural partition key"
>> will ever need to support, and then add an artificial partition key
>> component that segments the rows enough to keep the partition size under
>> the maximum. For time-series data this is often done by bucketing by time
>> period, i.e. creating a new partition every minute, hour or day. For
>> non-time-series data it's typically something like Hash(clustering-key)
>> mod desired-number-of-partitions.
>>
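>> As a concrete sketch of the hash-mod approach (the table and names here
>> are hypothetical, purely for illustration):
>>
>> import java.util.UUID;
>>
>> // Hypothetical table:
>> //   CREATE TABLE events (
>> //     customer_id text, bucket int, event_id timeuuid, payload text,
>> //     PRIMARY KEY ((customer_id, bucket), event_id));
>> public final class Buckets {
>>     static final int NUM_BUCKETS = 50;   // sized for the largest partition key
>>
>>     // Artificial partition-key component, derived from the clustering key
>>     // so that every reader and writer computes the same bucket.
>>     static int bucketFor(UUID eventId) {
>>         // floorMod keeps the result non-negative for negative hash codes
>>         return Math.floorMod(eventId.hashCode(), NUM_BUCKETS);
>>     }
>> }
>>
>> Every write computes bucketFor(event_id) the same way, so the mapping is
>> repeatable, but a "give me all rows for this customer" query now has to
>> fan out across all NUM_BUCKETS partitions.
>>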
>> In my case, multi-row queries to support a REST API typically return a
>> page of results, where the page size might be anywhere from a few dozen up
>> to thousands. For query efficiency I want the average number of rows per
>> partition to be large enough that a query can be satisfied by reading a
>> small number of partitions--ideally one.
>>
>> So I want to simultaneously limit the maximum number of rows per
>> partition and yet maintain a large enough average number of rows per
>> partition to make my queries efficient. But with my data the ratio between
>> maximum and average can be very large (up to four orders of magnitude).
>>
>> Here is an example:
>>
>>
>>                    Rows per Partition    Partition Size
>> Mode                                1              1 KB
>> Median                            500            500 KB
>> 90th percentile                 5,000              5 MB
>> 99th percentile                50,000             50 MB
>> Maximum                     2,500,000            2.5 GB
>>
>> In this case, 99% of my customers' data would fit in a single partition of
>> 50 MB or less. But if I use the standard approach, I have to split every
>> partition into 50 pieces to accommodate the largest customers. That means
>> that to query the 700 rows for my median case, I have to read 50
>> partitions instead of one.
>>
>> If you try to deal with this by starting a new partition when an old one
>> fills up, you have a nasty distributed-consensus problem, along with a
>> read-before-write. Cassandra lightweight transactions (LWT) weren't
>> available the last time I dealt with this; they might help with the
>> consensus part today, but there are still some nasty corner cases.
>>
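>> For what it's worth, the LWT part might look something like this sketch,
>> with a hypothetical bucket-registry table and made-up names:
>>
>> import com.datastax.driver.core.*;
>>
>> // Hypothetical registry of buckets in use per customer:
>> //   CREATE TABLE customer_bucket (
>> //     customer_id text, bucket int, PRIMARY KEY (customer_id, bucket));
>> public class BucketAllocator {
>>     private final Session session;
>>
>>     public BucketAllocator(Session session) {
>>         this.session = session;
>>     }
>>
>>     // Try to claim the next bucket; only one concurrent writer "wins".
>>     public boolean claimNextBucket(String customerId, int nextBucket) {
>>         ResultSet rs = session.execute(
>>             "INSERT INTO customer_bucket (customer_id, bucket) " +
>>             "VALUES (?, ?) IF NOT EXISTS",
>>             customerId, nextBucket);
>>         return rs.wasApplied();   // false: someone else already claimed it
>>     }
>> }
>>
>> That only covers consensus on which buckets exist; deciding that a bucket
>> is "full" still requires a read before the write, which is where the
>> corner cases come from.
>>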
>> I have some thoughts on other ways to solve this, but they all have
>> drawbacks. So I thought I'd ask here and hope that someone has a better
>> approach.
>>
>> Thanks in advance,
>>
>> Jim
>>
>>
