On Sun, May 22, 2011 at 11:00 AM, Yang <teddyyyy...@gmail.com> wrote:
> Thanks,
>
> I did read through that PDF doc and went through the counters code in
> 0.8-rc2; I think I understand the logic in that code.
>
> In my hypothetical implementation, I am not suggesting we bypass the
> complicated logic in the counters code, since the extra module would
> still enter the increment through StorageProxy.mutate( My_counter.delta=1 ),
> so that the logical clock is still handled by the counters code.
>
> The only difference is, as you said, that Rainbird collapses many +1
> deltas. But my claim is that this "collapsing" is in fact already done
> by Cassandra, since the write always hits the memtable first, so
> collapsing in the Cassandra memtable vs. collapsing in Rainbird's
> memory takes the same time, while Rainbird introduces an extra level
> of caching. (I strongly suspect that Rainbird is vulnerable to losing
> up to 1 minute's worth of data if it dies before the writes are
> flushed to Cassandra ---- unless it implements its own commit log, but
> that would be re-implementing many of the wheels in Cassandra ....)
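The "collapsing" you describe really is just a map merge in both
places. A minimal, self-contained sketch of that merge (plain Java as
illustration only; this is not the actual memtable code):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    // Toy model of delta collapsing: whether this map lives in a
    // memtable or in Rainbird's buffer, the per-increment work is the
    // same merge into an in-memory map.
    public final class DeltaBuffer {
        private final ConcurrentHashMap<String, LongAdder> deltas =
                new ConcurrentHashMap<>();

        public void increment(String key, long delta) {
            // computeIfAbsent + LongAdder collapses many concurrent
            // +1s without locking.
            deltas.computeIfAbsent(key, k -> new LongAdder()).add(delta);
        }

        public long currentValue(String key) {
            LongAdder sum = deltas.get(key);
            return sum == null ? 0L : sum.sum();
        }
    }

So the difference is not the merge cost but where the unflushed state
lives, which brings us to durability: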
Right, Rainbird buffers for performance and can lose up to 1 minute of
data. (For concreteness, sketches of both the fan-out hook and the
minute-level buffering are at the bottom of this mail.)

> I thought at one time that the reason was probably that, from one
> given URL, Rainbird needs to create writes on many keys, so the keys
> need to go to different Cassandra nodes. But later I found that this
> can also be done in a module on the coordinator, since the client
> request first hits a coordinator instead of the data node; in fact,
> in a multi-insert case, the coordinator already sends the request to
> multiple data nodes. The extra module I am proposing simply
> translates a single insert into a multi-insert, and then Cassandra
> takes over from there.
>
> Thanks
> Yang
>
> On Sun, May 22, 2011 at 3:47 AM, aaron morton <aa...@thelastpickle.com> wrote:
>> The implementation of distributed counters is more complicated than
>> your example; there is a design doc attached to the ticket here:
>> https://issues.apache.org/jira/browse/CASSANDRA-1072
>> By collapsing some of those +1 increments together at the
>> application level, there is less work for the cluster to do. This
>> can be important when the numbers are big:
>> http://blog.twitter.com/2011/03/numbers.html
>> Cheers
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 21 May 2011, at 09:04, Yang wrote:
>>
>> (Sorry if Rainbird is not a relevant enough topic; I'd appreciate it
>> if someone could point me to a more appropriate venue in that case.)
>>
>> Rainbird buffers up 1 minute's worth of events before writing to
>> Cassandra.
>>
>> It seems that this extra layer of buffering is repetitive and could
>> be avoided: Cassandra's memtable already does buffering, whose
>> internal implementation is essentially Map.put(key, CF). I guess
>> Rainbird does something similar:
>> column_to_count = map.get(key); column_to_count++; map.put(key, column_to_count) ??
>> The "++" part is probably already done by the distributed counters
>> in Cassandra.
>>
>> Then I guess the Rainbird layer exists because it needs to parse an
>> incoming event into the various attributes it is interested in: for
>> example, from a URL we bump up the counts of FQDN, domain, path,
>> etc.; Rainbird does the transformation from url ---> 3 attrs.
>>
>> But I guess that transformation might as well be done in the
>> Cassandra JVM itself, if we could provide some hooks, so that a
>> module translates an incoming request into multiple keys and bumps
>> up their counts. That way we avoid the intermediate communication
>> from clients to Rainbird, and from Rainbird to Cassandra. Are there
>> some points I'm missing?
>>
>> Thanks
>> Yang
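To make the proposed coordinator-side hook concrete, here is a minimal
sketch of the URL-to-keys fan-out. CounterSink and the naive domain
parsing are stand-ins I made up for illustration, not Cassandra API:

    import java.net.URI;
    import java.util.List;

    // Toy fan-out module: one incoming URL event becomes several
    // ordinary counter increments (FQDN, domain, FQDN+path), each
    // still going through the normal counter write path.
    public final class UrlFanOut {
        interface CounterSink {
            void increment(String key, long delta);
        }

        private final CounterSink sink;

        UrlFanOut(CounterSink sink) {
            this.sink = sink;
        }

        void onEvent(String url) {
            URI u = URI.create(url);
            String fqdn = u.getHost();                             // "news.example.com"
            String domain = fqdn.substring(fqdn.indexOf('.') + 1); // naive: "example.com"
            for (String key : List.of(fqdn, domain, fqdn + u.getPath())) {
                sink.increment(key, 1L);
            }
        }
    }

This is exactly the shape of the module described above: one insert
in, a multi-insert out, with the counters code still owning the
logical clock.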
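And for contrast, the application-level collapsing Aaron mentioned, in
the same toy style, including the loss window. writeCounterDelta is a
stand-in for however the real write is issued (client API, or the
StorageProxy path on the server side), and I am assuming a 1-minute
flush interval like Rainbird's:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Toy Rainbird-style buffer: deltas accumulate in memory and are
    // flushed once a minute, one mutation per key instead of one per
    // event. Anything still in `pending` when the process dies is
    // lost -- the up-to-1-minute exposure discussed above.
    public final class MinuteBuffer {
        private final ConcurrentHashMap<String, Long> pending =
                new ConcurrentHashMap<>();
        private final ScheduledExecutorService flusher =
                Executors.newSingleThreadScheduledExecutor();

        public MinuteBuffer() {
            flusher.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.MINUTES);
        }

        public void increment(String key) {
            pending.merge(key, 1L, Long::sum); // collapse many +1s into one delta
        }

        private void flush() {
            for (String key : pending.keySet()) {
                Long delta = pending.remove(key); // atomic: late +1s land next round
                if (delta != null) {
                    writeCounterDelta(key, delta);
                }
            }
        }

        // Stand-in for the real counter write.
        private void writeCounterDelta(String key, long delta) {
            System.out.printf("flush %s += %d%n", key, delta);
        }
    }

Closing that window without losing the collapsing means either writing
every delta through (what the in-JVM hook above would do) or giving
the buffer its own commit log, which, as you say, re-implements a
wheel Cassandra already has.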