Do you have a copy of the specific stack trace? Given the version and CL behavior, one thing you may be experiencing is: https://issues.apache.org/jira/browse/CASSANDRA-4578
On Mon, Jul 22, 2013 at 7:15 AM, cbert...@libero.it <cbert...@libero.it> wrote:
> Hi Aaron, thanks for your help.
>
>>If you have more than 500Million rows you may want to check the
>>bloom_filter_fp_chance, the old default was 0.000744 and the new (post 1.)
>>number is 0.01 for size tiered.
>
> I really don't think I have more than 500 million rows ... any smart way to
> count the number of rows inside the ks?
>
>>> Now a question -- why with 2 nodes offline all my application stop providing
>>> the service, even when a Consistency Level One read is invoked?
>
>>What error did the client get and what client are you using?
>>It also depends on if/how the node fails. The later versions try to shut down
>>when there is an OOM, not sure what 1.0 does.
>
> The exception was a TTransportException -- I am using Pelops client.
>
>>If the node went into a zombie state the clients may have been timing out.
>>They should then move on to another node.
>>If it had started shutting down the client should have gotten some immediate
>>errors.
>
> It didn't shut down, it was more like in a zombie state.
> One more question: I'm experiencing some wrong counters (which are very
> important in my platform since they are used to keep user-points and generate
> the TopX users) -- could it be related to this problem? The problem is that
> for some users (not all) the counter column increased its value.
>
> After such a crash in 1.0 is there any best-practice to follow? (nodetool or
> something?)
>
> Cheers,
> Carlo
>
>>
>>Cheers
>>
>>-----------------
>>Aaron Morton
>>Cassandra Consultant
>>New Zealand
>>
>>@aaronmorton
>>http://www.thelastpickle.com
>>
>>On 19/07/2013, at 5:02 PM, cbert...@libero.it wrote:
>>
>>> Hi all,
>>> I'm experiencing some problems after 3 years of cassandra in production (from
>>> 0.6 to 1.0.6) -- for 2 times in 3 weeks 2 nodes crashed with OutOfMemory
>>> Exception.
>>> In the log I can read the warn about the few heap available ... now I'm
>>> increasing a little bit my RAM, my Java Heap (1/4 of the RAM) and reducing the
>>> size of rows and memtables thresholds. Other tips?
>>>
>>> Now a question -- why with 2 nodes offline all my application stop providing
>>> the service, even when a Consistency Level One read is invoked?
>>> I'd expected this behaviour:
>>>
>>> CL1 operations keep working
>>> more than 80% of CLQ operations working (nodes offline were 2 and 5 in a
>>> clockwise key distribution only writes to fifth node should impact to node 2)
>>> most of all CLALL operations (that I don't use) failing
>>>
>>> The situation instead was that I had ALL services stop responding throwing a
>>> TTransportException ...
>>>
>>> Thanks in advance
>>>
>>> Carlo
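
For reference, the expectation in the original message (CL ONE keeps working, most QUORUM operations keep working, ALL mostly fails) can be sanity-checked with a rough sketch like the one below. The thread does not state the cluster size or replication factor, so the 6-node ring and RF=3 used here are purely hypothetical, as is the `availability` helper. It also assumes evenly sized token ranges, SimpleStrategy-style placement (primary plus the next RF-1 nodes clockwise), and that clients can reach a live coordinator. It says nothing about the zombie-node or TTransportException behaviour discussed above.

```python
def availability(num_nodes, rf, down):
    """Fraction of token ranges that can still satisfy each consistency level.

    Hypothetical helper: assumes evenly sized ranges and SimpleStrategy-style
    placement (the primary node plus the next rf - 1 nodes clockwise).
    """
    required = {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}
    satisfied = {cl: 0 for cl in required}
    for primary in range(num_nodes):
        replicas = [(primary + i) % num_nodes for i in range(rf)]
        live = sum(1 for node in replicas if node not in down)
        for cl, need in required.items():
            if live >= need:
                satisfied[cl] += 1
    return {cl: count / num_nodes for cl, count in satisfied.items()}


if __name__ == "__main__":
    # Assumed topology: 6 nodes, RF=3, 2nd and 5th nodes down (0-based: 1 and 4).
    for cl, fraction in availability(6, 3, {1, 4}).items():
        print(f"{cl}: {fraction:.0%} of ranges available")
```

Under these assumed numbers every range loses exactly one replica, so ONE and QUORUM remain fully available and ALL fails everywhere. A total outage at CL ONE is therefore not what replica placement alone would predict, which is why the client error and stack trace matter.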