Re: Data Modeling: Partition Size and Query Efficiency

Jim Ancona Tue, 05 Jan 2016 11:29:16 -0800

Hi Nate,

Yes, I've been thinking about treating customers as either "small" or
"big", where small ones have a single partition and big ones have 50 (or
whatever number I need to keep sizes reasonable). There's still the problem
of how to handle a small customer who grows too big, but that will happen
much less frequently than a customer filling a partition.
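
To make the idea concrete, here is a rough sketch of how the partition
key could be chosen. It is only an illustration under assumptions: the
names (big_customers, write_partition_key, read_partition_keys) are
hypothetical, the set of big customers is assumed to be known to the
application, and crc32 is just one stable way to spread rows across
buckets. It does not address migrating a small customer to big.

```python
import zlib

# Bucket count for "big" customers; 50 is the figure from this thread,
# not a recommendation.
BIG_BUCKETS = 50


def bucket_count(customer_id, big_customers):
    """Return how many partitions a customer's data is spread over."""
    return BIG_BUCKETS if customer_id in big_customers else 1


def write_partition_key(customer_id, row_id, big_customers):
    """Pick the (customer_id, bucket) partition key for a single row.

    crc32 is used because it is stable across processes, unlike
    Python's built-in hash() for strings.
    """
    n = bucket_count(customer_id, big_customers)
    bucket = zlib.crc32(row_id.encode("utf-8")) % n
    return (customer_id, bucket)


def read_partition_keys(customer_id, big_customers):
    """List every partition key that must be read for a customer."""
    n = bucket_count(customer_id, big_customers)
    return [(customer_id, b) for b in range(n)]
```

With this, a small customer is read with one partition query, and only
customers in big_customers pay for the 50-way fan-out.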


Jim

On Tue, Jan 5, 2016 at 12:21 PM, Nate McCall <n...@thelastpickle.com> wrote:

>
>> In this case, 99% of my data could fit in a single 50 MB partition. But
>> if I use the standard approach, I have to split my partitions into 50
>> pieces to accommodate the largest data. That means that to query the 700
>> rows for my median case, I have to read 50 partitions instead of one.
>>
>> If you try to deal with this by starting a new partition when an old one
>> fills up, you have a nasty distributed consensus problem, along with
>> read-before-write. Cassandra LWT wasn't available the last time I dealt
>> with this, but might help with the consensus part today. But there are
>> still some nasty corner cases.
>>
>> I have some thoughts on other ways to solve this, but they all have
>> drawbacks. So I thought I'd ask here and hope that someone has a better
>> approach.
>>
>>
> Hi Jim - good to see you around again.
>
> If you can segment this upstream by customer/account/whatever, handling
> the outliers as an entirely different code path (potentially different
> cluster as the workload will be quite different at that point and have
> different tuning requirements) would be your best bet. Then a
> read-before-write makes sense given it is happening on such a small number
> of API queries.
>
>
> --
> -----------------
> Nate McCall
> Austin, TX
> @zznate
>
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
