On Tue, Jan 5, 2016 at 5:52 PM, Jonathan Haddad <[email protected]> wrote:
> You could keep a "num_buckets" value associated with the client's account,
> which can be adjusted accordingly as usage increases.
>

Yes, but the adjustment problem is tricky when there are multiple concurrent
writers. What happens when you change the number of buckets? Does existing
data have to be re-written into new buckets? If so, how do you make sure
that's only done once for each bucket size increase? Or perhaps I'm
misunderstanding your suggestion?

Jim

> On Tue, Jan 5, 2016 at 2:17 PM Jim Ancona <[email protected]> wrote:
>
>> On Tue, Jan 5, 2016 at 4:56 PM, Clint Martin <
>> [email protected]> wrote:
>>
>>> What sort of data is your clustering key composed of? That might help
>>> some in determining a way to achieve what you're looking for.
>>>
>> Just a UUID that acts as an object identifier.
>>
>>> Clint
>>> On Jan 5, 2016 2:28 PM, "Jim Ancona" <[email protected]> wrote:
>>>
>>>> Hi Nate,
>>>>
>>>> Yes, I've been thinking about treating customers as either small or
>>>> big, where "small" ones have a single partition and big ones have 50
>>>> (or whatever number I need to keep sizes reasonable). There's still
>>>> the problem of how to handle a small customer who becomes too big, but
>>>> that will happen much less frequently than a customer filling a
>>>> partition.
>>>>
>>>> Jim
>>>>
>>>> On Tue, Jan 5, 2016 at 12:21 PM, Nate McCall <[email protected]>
>>>> wrote:
>>>>
>>>>>> In this case, 99% of my data could fit in a single 50 MB partition.
>>>>>> But if I use the standard approach, I have to split my partitions
>>>>>> into 50 pieces to accommodate the largest data. That means that to
>>>>>> query the 700 rows for my median case, I have to read 50 partitions
>>>>>> instead of one.
>>>>>>
>>>>>> If you try to deal with this by starting a new partition when an old
>>>>>> one fills up, you have a nasty distributed consensus problem, along
>>>>>> with read-before-write. Cassandra LWT wasn't available the last time
>>>>>> I dealt with this, but might help with the consensus part today. But
>>>>>> there are still some nasty corner cases.
>>>>>>
>>>>>> I have some thoughts on other ways to solve this, but they all have
>>>>>> drawbacks. So I thought I'd ask here and hope that someone has a
>>>>>> better approach.
>>>>>>
>>>>> Hi Jim - good to see you around again.
>>>>>
>>>>> If you can segment this upstream by customer/account/whatever,
>>>>> handling the outliers as an entirely different code path (potentially
>>>>> a different cluster, as the workload will be quite different at that
>>>>> point and have different tuning requirements) would be your best bet.
>>>>> Then a read-before-write makes sense given it is happening on such a
>>>>> small number of API queries.
>>>>>
>>>>> --
>>>>> -----------------
>>>>> Nate McCall
>>>>> Austin, TX
>>>>> @zznate
>>>>>
>>>>> Co-Founder & Sr. Technical Consultant
>>>>> Apache Cassandra Consulting
>>>>> http://www.thelastpickle.com
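
A minimal sketch of the compare-and-set idea being discussed, using the LWT
support Jim mentions, via the DataStax Python driver. The table and column
names (account_meta, objects, num_buckets) are illustrative, not from the
thread:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_keyspace')

    # Assumed schema (names are illustrative):
    #   CREATE TABLE account_meta (customer_id text PRIMARY KEY,
    #                              num_buckets int);
    #   CREATE TABLE objects (customer_id text, bucket int,
    #                         object_id uuid, payload blob,
    #                         PRIMARY KEY ((customer_id, bucket), object_id));

    def grow_buckets(customer_id, old_count, new_count):
        # Conditional update: of N concurrent writers that all decide to
        # grow the bucket count, exactly one sees [applied] = true and
        # becomes responsible for re-bucketing the existing data.
        result = session.execute(
            "UPDATE account_meta SET num_buckets = %s "
            "WHERE customer_id = %s IF num_buckets = %s",
            (new_count, customer_id, old_count))
        return result.one()[0]  # first column of an LWT result is [applied]

Losers of the race see [applied] = false, re-read num_buckets, and retry, so
a given bucket-count increase is applied exactly once even with many
concurrent writers.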

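On the read side, the same per-account value keeps the median case cheap: a
customer with num_buckets = 1 is a single-partition read, and only the
outliers pay the multi-partition fan-out Jim describes. A sketch under the
same assumed schema:

    import uuid

    def bucket_for(object_id, num_buckets):
        # uuid.UUID.int is stable across processes; Python's built-in
        # hash() of a string is salted per process, so it can't be used
        # for routing writes to a bucket.
        return object_id.int % num_buckets

    def read_all(session, customer_id, num_buckets):
        # Fan out one async query per bucket; with num_buckets == 1 this
        # degenerates to a single-partition read.
        futures = [
            session.execute_async(
                "SELECT object_id, payload FROM objects "
                "WHERE customer_id = %s AND bucket = %s",
                (customer_id, b))
            for b in range(num_buckets)
        ]
        return [row for f in futures for row in f.result()]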