On Mon, Jul 25, 2011 at 11:24 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
> On Mon, Jul 25, 2011 at 7:35 PM, Aaron Turner <synfina...@gmail.com> wrote:
>> On Sun, Jul 24, 2011 at 3:36 PM, aaron morton <aa...@thelastpickle.com> 
>> wrote:
>>> What's your use case ? There are people out there having good times with 
>>> counters, see
>>>
>>> http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
>>> http://www.scribd.com/doc/59830692/Cassandra-at-Twitter
>>
>> It's actually pretty similar to Twitter's click counting, but
>> apparently we have different requirements for accuracy.  It's possible
>> Rainbird does something on the front end to solve this issue; I'm
>> honestly not sure since they haven't released the code yet.
>>
>> Anyways, when you're building network aggregate graphs and fail to add
>> the +100G of traffic from one switch to your site or metro aggregate,
>> people around here notice.  And people quickly start distrusting
>> graphs which don't look "real" and either ignore them completely or
>> complain.
>>
>> Obviously, one should manage their Cassandra cluster to limit the
>> occurrence of timeouts, but frankly I don't want to be paged at 2am to
>> "fix" these kinds of problems.  If I knew "timeout" meant "failed to
>> increment counter", I could spool my changes on the client and try
>> again later, but that's not what timeout means.  Without any means to
>> recover I've actually lost a lot of reliability that I currently have
>> with my single PostgreSQL server backed data store.
>
> Just to make it very clear: *nobody* is arguing this is not a limitation.
>
> The thing is, some people find counters useful even while perfectly
> aware of that limitation and seem to be very productive with them, so
> we have added them. Truth is, if you can live with the limitations and
> keep timeouts to a bare minimum (hopefully 0), then you won't find many
> systems that are able to scale counting both in terms of number of
> counters and number of ops/s on each counter, and across
> datacenters, like Cassandra counters do. And let's recall that
> while you don't know what happened on a timeout, you at least know
> when timeouts happen, so you can compute the error margin.
>
> Again, this does not mean we don't want to fix the limitations, nor
> that we want you to wake up at 2am, and there is actually a ticket
> open for that:
> https://issues.apache.org/jira/browse/CASSANDRA-2495
> The problem is, so far, we haven't found any satisfying solution to
> that problem. If someone has a solution, please, please, share!
>
> But yes, counters in their current state don't fit everyone's needs
> and we certainly don't want to hide that.
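
As a rough illustration of the error-margin point above: since a timed-out
increment may or may not have been applied, a client can keep the
acknowledged total separate from the uncertain deltas and report a range
instead of a single value. This is only a sketch under assumptions:
`CounterTimeout`, `CounterBounds`, `tracked_increment` and the `send`
callable are hypothetical names, not part of any Cassandra client API.

```python
from dataclasses import dataclass


class CounterTimeout(Exception):
    """Hypothetical stand-in for whatever timeout error your client raises."""


@dataclass
class CounterBounds:
    confirmed: int = 0   # sum of deltas the cluster acknowledged
    maybe_up: int = 0    # positive deltas that timed out (outcome unknown)
    maybe_down: int = 0  # negative deltas that timed out (outcome unknown)

    @property
    def low(self) -> int:
        # Worst case: every uncertain decrement applied, no uncertain increment did.
        return self.confirmed + self.maybe_down

    @property
    def high(self) -> int:
        # Best case: every uncertain increment applied, no uncertain decrement did.
        return self.confirmed + self.maybe_up


def tracked_increment(bounds: CounterBounds, send, delta: int) -> bool:
    """Call send(delta) once and record whether the outcome is known.

    `send` is an assumed callable that performs the counter write and
    raises CounterTimeout on timeout. There is no retry here, because a
    retry after a timeout could apply the same delta twice.
    """
    try:
        send(delta)
    except CounterTimeout:
        if delta >= 0:
            bounds.maybe_up += delta
        else:
            bounds.maybe_down += delta
        return False
    bounds.confirmed += delta
    return True
```

With zero timeouts, `low` and `high` are equal; every timeout widens the gap
by the size of the delta that was in flight, which is the error margin for
that counter.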

I think the Cassandra community has been pretty open about the
limitations and I can see there are some uses for them in their
current state.  Probably my biggest concern is that I'm pretty new to
Cassandra and don't understand why occasionally I see timeouts even
under very low load (one single-threaded client).  Once I understood
the impacts wrt counters it went from "annoying" to "oh crap".

Anyways, as I said earlier, I understand this problem is "hard" and I
don't expect a fix in 0.8.2 :)

Mostly right now I'm just bummed because I'm pretty much back at
square one trying to create a scalable solution which meets our needs.
Not to say Cassandra won't be a part of it, but just that the
solution design has become a lot less obvious.


-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"
