Re: Data Modeling: Partition Size and Query Efficiency

Jonathan Haddad Tue, 05 Jan 2016 14:53:21 -0800

You could keep a "num_buckets" value associated with the client's account,
which can be adjusted accordingly as usage increases.
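
A minimal sketch of that idea, assuming the partition key is (customer_id, bucket) and that the bucket is derived from the object's UUID. The names bucket_for and partition_keys are hypothetical, and the choice of CRC32 is just one stable hash for illustration, not anything Cassandra requires:

```python
import uuid
import zlib


def bucket_for(object_id: uuid.UUID, num_buckets: int) -> int:
    """Map an object UUID to a bucket number in [0, num_buckets)."""
    # crc32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(object_id.bytes) % num_buckets


def partition_keys(customer_id: str, num_buckets: int) -> list[tuple[str, int]]:
    """List every (customer_id, bucket) partition a full read has to visit."""
    return [(customer_id, bucket) for bucket in range(num_buckets)]


# A small customer keeps num_buckets = 1 and reads a single partition;
# a large one with num_buckets = 50 fans out to 50 partitions.
small_reads = partition_keys("small-customer", 1)
large_reads = partition_keys("large-customer", 50)
```

One caveat with this sketch: raising num_buckets changes which bucket an existing UUID maps to. A read of all buckets still finds old rows as long as num_buckets only grows, but a lookup by object id would have to check the bucket computed from the earlier count as well, or the data would need to be rewritten.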


On Tue, Jan 5, 2016 at 2:17 PM Jim Ancona <[email protected]> wrote:

> On Tue, Jan 5, 2016 at 4:56 PM, Clint Martin <
> [email protected]> wrote:
>
>> What sort of data is your clustering key composed of? That might help
>> some in determining a way to achieve what you're looking for.
>>
> Just a UUID that acts as an object identifier.
>
>>
>> Clint
>> On Jan 5, 2016 2:28 PM, "Jim Ancona" <[email protected]> wrote:
>>
>>> Hi Nate,
>>>
>>> Yes, I've been thinking about treating customers as either small or big,
>>> where "small" ones have a single partition and big ones have 50 (or
>>> whatever number I need to keep sizes reasonable). There's still the problem
>>> of how to handle a small customer who becomes too big, but that will happen
>>> much less frequently than a customer filling a partition.
>>>
>>> Jim
>>>
>>> On Tue, Jan 5, 2016 at 12:21 PM, Nate McCall <[email protected]>
>>> wrote:
>>>
>>>>
>>>>> In this case, 99% of my data could fit in a single 50 MB partition.
>>>>> But if I use the standard approach, I have to split my partitions into 50
>>>>> pieces to accommodate the largest data. That means that to query the 700
>>>>> rows for my median case, I have to read 50 partitions instead of one.
>>>>>
>>>>> If you try to deal with this by starting a new partition when an old
>>>>> one fills up, you have a nasty distributed consensus problem, along with
>>>>> read-before-write. Cassandra LWT wasn't available the last time I dealt
>>>>> with this, but might help with the consensus part today. But there are
>>>>> still some nasty corner cases.
>>>>>
>>>>> I have some thoughts on other ways to solve this, but they all have
>>>>> drawbacks. So I thought I'd ask here and hope that someone has a better
>>>>> approach.
>>>>>
>>>>>
>>>> Hi Jim - good to see you around again.
>>>>
>>>> If you can segment this upstream by customer/account/whatever, handling
>>>> the outliers as an entirely different code path (potentially different
>>>> cluster as the workload will be quite different at that point and have
>>>> different tuning requirements) would be your best bet. Then a
>>>> read-before-write makes sense given it is happening on such a small number
>>>> of API queries.
>>>>
>>>>
>>>> --
>>>> -----------------
>>>> Nate McCall
>>>> Austin, TX
>>>> @zznate
>>>>
>>>> Co-Founder & Sr. Technical Consultant
>>>> Apache Cassandra Consulting
>>>> http://www.thelastpickle.com
>>>>
>>>
>>>
