On Sun, May 22, 2011 at 11:00 AM, Yang <teddyyyy...@gmail.com> wrote:
> Thanks,
>
> I did read through that PDF doc and went through the counters code in
> 0.8-rc2; I think I understand the logic in that code.
>
> In my hypothetical implementation, I am not suggesting we bypass the
> complicated logic in the counters code, since the extra module would
> still enter the increment through StorageProxy.mutate( My_counter.delta=1 ),
> so that the logical clock is still handled by the counters code.
>
> The only difference is, as you said, that Rainbird collapses many +1
> deltas. But my claim is that this "collapsing" is in fact already done
> by Cassandra, since the write always hits the memtable first, so
> collapsing in the Cassandra memtable vs. collapsing in Rainbird's
> memory takes the same time, while Rainbird introduces an extra level
> of caching. (I strongly suspect that Rainbird is vulnerable to losing
> up to 1 minute's worth of data if it dies before the writes are
> flushed to Cassandra ---- unless it implements its own commit log, but
> that would be re-implementing many of the wheels in Cassandra ....)
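The "collapsing" you describe really is just a map merge in both
places. A minimal, self-contained sketch of that merge (plain Java as
illustration only; this is not the actual memtable code):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    // Toy model of delta collapsing: whether this map lives in a
    // memtable or in Rainbird's buffer, the per-increment work is the
    // same merge into an in-memory map.
    public final class DeltaBuffer {
        private final ConcurrentHashMap<String, LongAdder> deltas =
                new ConcurrentHashMap<>();

        public void increment(String key, long delta) {
            // computeIfAbsent + LongAdder collapses many concurrent
            // +1s without locking.
            deltas.computeIfAbsent(key, k -> new LongAdder()).add(delta);
        }

        public long currentValue(String key) {
            LongAdder sum = deltas.get(key);
            return sum == null ? 0L : sum.sum();
        }
    }

So the difference is not the merge cost but where the unflushed state
lives, which brings us to durability: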
Right, Rainbird buffers for performance and can lose up to 1 minute of
data. (For concreteness, sketches of both the fan-out hook and the
minute-level buffering are at the bottom of this mail.)

> I thought at one time that the reason was probably that, from one
> given URL, Rainbird needs to create writes on many keys, so the keys
> need to go to different Cassandra nodes. But later I found that this
> can also be done in a module on the coordinator, since the client
> request first hits a coordinator instead of the data node; in fact,
> in a multi-insert case, the coordinator already sends the request to
> multiple data nodes. The extra module I am proposing simply
> translates a single insert into a multi-insert, and then Cassandra
> takes over from there.
>
> Thanks
> Yang
>
> On Sun, May 22, 2011 at 3:47 AM, aaron morton <aa...@thelastpickle.com> wrote:
>> The implementation of distributed counters is more complicated than
>> your example; there is a design doc attached to the ticket here:
>> https://issues.apache.org/jira/browse/CASSANDRA-1072
>> By collapsing some of those +1 increments together at the
>> application level, there is less work for the cluster to do. This
>> can be important when the numbers are big:
>> http://blog.twitter.com/2011/03/numbers.html
>> Cheers
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 21 May 2011, at 09:04, Yang wrote:
>>
>> (Sorry if Rainbird is not a relevant enough topic; I'd appreciate it
>> if someone could point me to a more appropriate venue in that case.)
>>
>> Rainbird buffers up 1 minute's worth of events before writing to
>> Cassandra.
>>
>> It seems that this extra layer of buffering is repetitive and could
>> be avoided: Cassandra's memtable already does buffering, whose
>> internal implementation is essentially Map.put(key, CF). I guess
>> Rainbird does something similar:
>> column_to_count = map.get(key); column_to_count++; map.put(key, column_to_count) ??
>> The "++" part is probably already done by the distributed counters
>> in Cassandra.
>>
>> Then I guess the Rainbird layer exists because it needs to parse an
>> incoming event into the various attributes it is interested in: for
>> example, from a URL we bump up the counts of FQDN, domain, path,
>> etc.; Rainbird does the transformation from url ---> 3 attrs.
>>
>> But I guess that transformation might as well be done in the
>> Cassandra JVM itself, if we could provide some hooks, so that a
>> module translates an incoming request into multiple keys and bumps
>> up their counts. That way we avoid the intermediate communication
>> from clients to Rainbird, and from Rainbird to Cassandra. Are there
>> some points I'm missing?
>>
>> Thanks
>> Yang
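To make the proposed coordinator-side hook concrete, here is a minimal
sketch of the URL-to-keys fan-out. CounterSink and the naive domain
parsing are stand-ins I made up for illustration, not Cassandra API:

    import java.net.URI;
    import java.util.List;

    // Toy fan-out module: one incoming URL event becomes several
    // ordinary counter increments (FQDN, domain, FQDN+path), each
    // still going through the normal counter write path.
    public final class UrlFanOut {
        interface CounterSink {
            void increment(String key, long delta);
        }

        private final CounterSink sink;

        UrlFanOut(CounterSink sink) {
            this.sink = sink;
        }

        void onEvent(String url) {
            URI u = URI.create(url);
            String fqdn = u.getHost();                             // "news.example.com"
            String domain = fqdn.substring(fqdn.indexOf('.') + 1); // naive: "example.com"
            for (String key : List.of(fqdn, domain, fqdn + u.getPath())) {
                sink.increment(key, 1L);
            }
        }
    }

This is exactly the shape of the module described above: one insert
in, a multi-insert out, with the counters code still owning the
logical clock.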
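And for contrast, the application-level collapsing Aaron mentioned, in
the same toy style, including the loss window. writeCounterDelta is a
stand-in for however the real write is issued (client API, or the
StorageProxy path on the server side), and I am assuming a 1-minute
flush interval like Rainbird's:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Toy Rainbird-style buffer: deltas accumulate in memory and are
    // flushed once a minute, one mutation per key instead of one per
    // event. Anything still in `pending` when the process dies is
    // lost -- the up-to-1-minute exposure discussed above.
    public final class MinuteBuffer {
        private final ConcurrentHashMap<String, Long> pending =
                new ConcurrentHashMap<>();
        private final ScheduledExecutorService flusher =
                Executors.newSingleThreadScheduledExecutor();

        public MinuteBuffer() {
            flusher.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.MINUTES);
        }

        public void increment(String key) {
            pending.merge(key, 1L, Long::sum); // collapse many +1s into one delta
        }

        private void flush() {
            for (String key : pending.keySet()) {
                Long delta = pending.remove(key); // atomic: late +1s land next round
                if (delta != null) {
                    writeCounterDelta(key, delta);
                }
            }
        }

        // Stand-in for the real counter write.
        private void writeCounterDelta(String key, long delta) {
            System.out.printf("flush %s += %d%n", key, delta);
        }
    }

Closing that window without losing the collapsing means either writing
every delta through (what the in-JVM hook above would do) or giving
the buffer its own commit log, which, as you say, re-implements a
wheel Cassandra already has.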