On Tue, Jan 5, 2016 at 4:56 PM, Clint Martin <clintlmar...@coolfiretechnologies.com> wrote:
> What sort of data is your clustering key composed of? That might help some
> in determining a way to achieve what you're looking for.

Just a UUID that acts as an object identifier.

> Clint
>
> On Jan 5, 2016 2:28 PM, "Jim Ancona" <j...@anconafamily.com> wrote:
>
>> Hi Nate,
>>
>> Yes, I've been thinking about treating customers as either small or big,
>> where "small" ones have a single partition and big ones have 50 (or
>> whatever number I need to keep sizes reasonable). There's still the
>> problem of how to handle a small customer who becomes too big, but that
>> will happen much less frequently than a customer filling a partition.
>>
>> Jim
>>
>> On Tue, Jan 5, 2016 at 12:21 PM, Nate McCall <n...@thelastpickle.com>
>> wrote:
>>
>>>> In this case, 99% of my data could fit in a single 50 MB partition.
>>>> But if I use the standard approach, I have to split my partitions into
>>>> 50 pieces to accommodate the largest data. That means that to query
>>>> the 700 rows for my median case, I have to read 50 partitions instead
>>>> of one.
>>>>
>>>> If you try to deal with this by starting a new partition when an old
>>>> one fills up, you have a nasty distributed consensus problem, along
>>>> with read-before-write. Cassandra LWT wasn't available the last time I
>>>> dealt with this, but might help with the consensus part today. But
>>>> there are still some nasty corner cases.
>>>>
>>>> I have some thoughts on other ways to solve this, but they all have
>>>> drawbacks. So I thought I'd ask here and hope that someone has a
>>>> better approach.
>>>
>>> Hi Jim - good to see you around again.
>>>
>>> If you can segment this upstream by customer/account/whatever, handling
>>> the outliers as an entirely different code path (potentially a
>>> different cluster, as the workload will be quite different at that
>>> point and have different tuning requirements) would be your best bet.
>>> Then a read-before-write makes sense, given it is happening on such a
>>> small number of API queries.
>>>
>>> --
>>> -----------------
>>> Nate McCall
>>> Austin, TX
>>> @zznate
>>>
>>> Co-Founder & Sr. Technical Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
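
[Editor's note: the bucketing scheme the thread is discussing can be sketched as below. This is an illustration, not code from any poster; the `BUCKETS_PER_CUSTOMER` mapping, the function names, and the choice of 50 buckets for "big" customers are assumptions drawn from the numbers mentioned above.]

```python
import uuid

# Hypothetical per-customer bucket counts: "small" customers keep a single
# partition; "big" ones are split into 50, per the sizes discussed above.
BUCKETS_PER_CUSTOMER = {"acme-small": 1, "bigcorp": 50}

def bucket_for(customer: str, object_id: uuid.UUID) -> int:
    """Deterministically map an object's UUID to one of the customer's
    buckets. Because the bucket is derived from the identifier itself,
    any writer or reader can recompute it with no coordination and no
    read-before-write."""
    n = BUCKETS_PER_CUSTOMER[customer]
    return object_id.int % n

# A small customer's rows all land in bucket 0, so a query reads one
# partition. A big customer's rows spread across 50 buckets, so a query
# for "all rows" must fan out to every bucket:
def buckets_to_query(customer: str) -> list[int]:
    return list(range(BUCKETS_PER_CUSTOMER[customer]))

oid = uuid.uuid4()
assert bucket_for("acme-small", oid) == 0
assert 0 <= bucket_for("bigcorp", oid) < 50
assert len(buckets_to_query("bigcorp")) == 50
```

In CQL terms this corresponds to a partition key of the form `((customer_id, bucket), object_id)`. The hard case raised in the thread (a small customer outgrowing its single partition) is not solved by this sketch: changing a customer's bucket count changes where existing rows hash, so it requires either migrating data or consulting a per-customer metadata table, which is where the LWT-based consensus mentioned above would come in.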