> and would it really hurt anything to add something like "can't
> handle load" to the exception message?

Feel free to add a ticket with your experience. The event you triggered is a safety valve to stop the server from failing.
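That safety valve is essentially a count of in-flight hints checked against a fixed cap, with the write refused when the cap is exceeded. A minimal sketch of the idea -- class, method, and exception names here are illustrative, not the exact StorageProxy source:

import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the overload guard; not the exact Cassandra source.
public class HintOverloadGuard
{
    private static final AtomicInteger totalHintsInProgress = new AtomicInteger();

    // The cap quoted later in this thread: 1024 hints per available core.
    private static volatile int maxHintsInProgress =
            1024 * Runtime.getRuntime().availableProcessors();

    static void checkCanRecordHint()
    {
        // Refuse the write rather than queue hints without bound and fall over.
        if (totalHintsInProgress.get() > maxHintsInProgress)
            throw new RuntimeException(
                    "Too many in flight hints: " + totalHintsInProgress.get());
    }
}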
> - My total replication factor is 4 over two DCs -- I suppose you mean 3
> replicas in each DC?
Yes, 3 in each DC.

Right now the cluster QUORUM is 3, and you cannot achieve that in one DC. So if your code uses QUORUM it will fail if there is a partition between the two DCs. Normally people use LOCAL_QUORUM or maybe EACH_QUORUM for writes. LOCAL_QUORUM for a DC with RF 2 is 2, so if your code uses either of those you do not have any redundancy. If you have RF 3 in each of the two DCs, using LOCAL_QUORUM or EACH_QUORUM means you can handle one down node in each DC, and each DC can operate independently if needed. (There is a small worked example of this arithmetic at the bottom of this mail.)

> - Does that mean I'll have to run at least 4 nodes in each DC? (3 for RF:3
> and one additional in case one fails)
3 nodes and RF 3 is ok.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 23/01/2013, at 8:35 PM, Sergey Olefir <solf.li...@gmail.com> wrote:

> Thanks!
>
> A node writing to the log because it cannot handle load is much different
> from a node writing to the log "just because". Although the amount of
> logging is still excessive -- and would it really hurt anything to add
> something like "can't handle load" to the exception message?
>
> On the subject of RF:3 -- could you please elaborate?
> - Why is RF:3 important? (vs. e.g. 2)
> - My total replication factor is 4 over two DCs -- I suppose you mean 3
> replicas in each DC?
> - Does that mean I'll have to run at least 4 nodes in each DC? (3 for RF:3
> and one additional in case one fails)
>
> (and again -- thanks Aaron! You've been helping me A LOT on this list.)
> Best regards,
> Sergey
>
>
> aaron morton wrote
>>> Replication is configured as DC1:2,DC2:2 (i.e. every node holds the
>>> entire data).
>> I really recommend using RF 3.
>>
>> The error is the coordinator node protecting itself.
>>
>> Basically it cannot handle the volume of local writes + the writes for
>> HH (hinted handoff). The number of in-flight hints is greater than…
>>
>> private static volatile int maxHintsInProgress = 1024 *
>> Runtime.getRuntime().availableProcessors();
>>
>> You may be able to work around this by reducing max_hint_window_in_ms
>> in the yaml file, so that hints are no longer recorded once a node has
>> been down for more than, say, 1 minute.
>>
>> Anyway, I would say your test showed that the current cluster does not
>> have sufficient capacity to handle the write load with one node down and
>> HH enabled at the current level. You can either add more nodes, use nodes
>> with more cores, adjust the HH settings, or reduce the throughput.
>>
>>>> On the subject of bug report -- I probably will -- but I'll wait a bit
>>>> for
>>
>> Perhaps the excessive logging could be handled better; please add a
>> ticket when you have time.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 23/01/2013, at 2:12 PM, Rob Coli <rcoli@...> wrote:
>>
>>> On Tue, Jan 22, 2013 at 2:57 PM, Sergey Olefir <solf.lists@...> wrote:
>>>> Do you have a suggestion as to what could be a better fit for counters?
>>>> Something that can also replicate across DCs and survive link breakdown
>>>> between nodes (across DCs)? (and no, I don't need 100.00% precision
>>>> (although it would be nice obviously), I just need to be "pretty close"
>>>> for the values of "pretty")
>>>
>>> In that case, Cassandra counters are probably fine.
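Rob's "pretty close" point combines naturally with the LOCAL_QUORUM advice above: increment the counter at LOCAL_QUORUM so each DC keeps accepting writes during a partition. A minimal sketch, assuming the DataStax Java driver (2.x-era API); the keyspace and table names are illustrative, not from this thread:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class CounterIncrementExample
{
    public static void main(String[] args)
    {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks"); // illustrative keyspace

        // Assumes a counter table:
        // CREATE TABLE page_hits (page text PRIMARY KEY, hits counter)
        SimpleStatement inc = new SimpleStatement(
                "UPDATE page_hits SET hits = hits + 1 WHERE page = 'home'");

        // LOCAL_QUORUM only needs a quorum in the coordinator's DC (2 of 3
        // with RF 3), so a partition between the DCs does not fail the write.
        inc.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(inc);

        cluster.close();
    }
}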
>>>
>>>> On the subject of bug report -- I probably will -- but I'll wait a bit
>>>> for more info here, perhaps there's some configuration or something
>>>> that I just don't know about.
>>>
>>> Excepting on the ReplicateOnWrite stage seems pretty unambiguous to me,
>>> and unexpected. YMMV?
>>>
>>> =Rob
>>>
>>> --
>>> =Robert Coli
>>> AIM&GTALK - rcoli@...
>>> YAHOO - rcoli.palominob
>>> SKYPE - rcoli_palominodb
>
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/node-down-log-explosion-tp7584932p7584960.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
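To make the quorum arithmetic from earlier in this mail concrete, here is a small self-contained calculation (plain Java, no Cassandra dependencies; quorum = floor(RF/2) + 1):

public class QuorumMath
{
    // Quorum for a given replication factor: floor(rf / 2) + 1.
    static int quorum(int rf) { return rf / 2 + 1; }

    public static void main(String[] args)
    {
        // Current setup: DC1:2, DC2:2 (total RF 4).
        System.out.println("QUORUM (RF 4 total): " + quorum(4)); // 3 -- cannot be met inside one DC
        System.out.println("LOCAL_QUORUM (RF 2): " + quorum(2)); // 2 -- every replica needed, no redundancy

        // Recommended setup: DC1:3, DC2:3.
        System.out.println("LOCAL_QUORUM (RF 3): " + quorum(3)); // 2 -- tolerates one down node per DC
    }
}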