On Tue, Jan 5, 2016 at 5:52 PM, Jonathan Haddad <[email protected]> wrote:
> You could keep a "num_buckets" value associated with the client's account,
> which can be adjusted accordingly as usage increases.
>

Yes, but the adjustment problem is tricky when there are multiple concurrent
writers. What happens when you change the number of buckets? Does existing
data have to be re-written into new buckets? If so, how do you make sure
that's only done once for each bucket size increase? Or perhaps I'm
misunderstanding your suggestion?

Jim

> On Tue, Jan 5, 2016 at 2:17 PM Jim Ancona <[email protected]> wrote:
>
>> On Tue, Jan 5, 2016 at 4:56 PM, Clint Martin <
>> [email protected]> wrote:
>>
>>> What sort of data is your clustering key composed of? That might help
>>> some in determining a way to achieve what you're looking for.
>>>
>> Just a UUID that acts as an object identifier.
>>
>>> Clint
>>> On Jan 5, 2016 2:28 PM, "Jim Ancona" <[email protected]> wrote:
>>>
>>>> Hi Nate,
>>>>
>>>> Yes, I've been thinking about treating customers as either small or
>>>> big, where "small" ones have a single partition and big ones have 50
>>>> (or whatever number I need to keep sizes reasonable). There's still
>>>> the problem of how to handle a small customer who becomes too big, but
>>>> that will happen much less frequently than a customer filling a
>>>> partition.
>>>>
>>>> Jim
>>>>
>>>> On Tue, Jan 5, 2016 at 12:21 PM, Nate McCall <[email protected]>
>>>> wrote:
>>>>
>>>>>> In this case, 99% of my data could fit in a single 50 MB partition.
>>>>>> But if I use the standard approach, I have to split my partitions
>>>>>> into 50 pieces to accommodate the largest data. That means that to
>>>>>> query the 700 rows for my median case, I have to read 50 partitions
>>>>>> instead of one.
>>>>>>
>>>>>> If you try to deal with this by starting a new partition when an old
>>>>>> one fills up, you have a nasty distributed consensus problem, along
>>>>>> with read-before-write. Cassandra LWT wasn't available the last time
>>>>>> I dealt with this, but might help with the consensus part today. But
>>>>>> there are still some nasty corner cases.
>>>>>>
>>>>>> I have some thoughts on other ways to solve this, but they all have
>>>>>> drawbacks. So I thought I'd ask here and hope that someone has a
>>>>>> better approach.
>>>>>>
>>>>> Hi Jim - good to see you around again.
>>>>>
>>>>> If you can segment this upstream by customer/account/whatever,
>>>>> handling the outliers as an entirely different code path (potentially
>>>>> a different cluster, as the workload will be quite different at that
>>>>> point and have different tuning requirements) would be your best bet.
>>>>> Then a read-before-write makes sense given it is happening on such a
>>>>> small number of API queries.
>>>>>
>>>>> --
>>>>> -----------------
>>>>> Nate McCall
>>>>> Austin, TX
>>>>> @zznate
>>>>>
>>>>> Co-Founder & Sr. Technical Consultant
>>>>> Apache Cassandra Consulting
>>>>> http://www.thelastpickle.com
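
A minimal sketch of the compare-and-set idea being discussed, using the LWT
support Jim mentions, via the DataStax Python driver. The table and column
names (account_meta, objects, num_buckets) are illustrative, not from the
thread:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_keyspace')

    # Assumed schema (names are illustrative):
    #   CREATE TABLE account_meta (customer_id text PRIMARY KEY,
    #                              num_buckets int);
    #   CREATE TABLE objects (customer_id text, bucket int,
    #                         object_id uuid, payload blob,
    #                         PRIMARY KEY ((customer_id, bucket), object_id));

    def grow_buckets(customer_id, old_count, new_count):
        # Conditional update: of N concurrent writers that all decide to
        # grow the bucket count, exactly one sees [applied] = true and
        # becomes responsible for re-bucketing the existing data.
        result = session.execute(
            "UPDATE account_meta SET num_buckets = %s "
            "WHERE customer_id = %s IF num_buckets = %s",
            (new_count, customer_id, old_count))
        return result.one()[0]  # first column of an LWT result is [applied]

Losers of the race see [applied] = false, re-read num_buckets, and retry, so
a given bucket-count increase is applied exactly once even with many
concurrent writers.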

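On the read side, the same per-account value keeps the median case cheap: a
customer with num_buckets = 1 is a single-partition read, and only the
outliers pay the multi-partition fan-out Jim describes. A sketch under the
same assumed schema:

    import uuid

    def bucket_for(object_id, num_buckets):
        # uuid.UUID.int is stable across processes; Python's built-in
        # hash() of a string is salted per process, so it can't be used
        # for routing writes to a bucket.
        return object_id.int % num_buckets

    def read_all(session, customer_id, num_buckets):
        # Fan out one async query per bucket; with num_buckets == 1 this
        # degenerates to a single-partition read.
        futures = [
            session.execute_async(
                "SELECT object_id, payload FROM objects "
                "WHERE customer_id = %s AND bucket = %s",
                (customer_id, b))
            for b in range(num_buckets)
        ]
        return [row for f in futures for row in f.result()]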