@Aaron "You can try to reset the cluster ring state by doing a rolling restart passing -Dcassandra.load_ring_state=false as a JVM param in cassandra-env.sh"
Now my node can't restart properly. It stops while restarting and the last logged message is:

INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,813 SSTableReader.java (line 169) Opening /raid0/cassandra/data/system/LocationInfo/system-LocationInfo-hf-70 (621 bytes)
INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,819 SSTableReader.java (line 169) Opening /raid0/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hf-465 (66 bytes)

Should I $ rm /raid0/cassandra/data/system/HintsColumnFamily/* ?

@Dean "You should really be testing this stuff in QA"

We have no such environment. It is expensive and we can't afford it for now.

"We had the exact same issue from 1.1.4 to 1.2.2."

Well, I think you could have warned us. I thought it was safe to upgrade because I saw that you and 2 more people did it with no major issues...

2013/3/14 Hiller, Dean <dean.hil...@nrel.gov>

> You should really be testing this stuff in QA. We had the exact same
> issue from 1.1.4 to 1.2.2. In QA, we decided we could take an outage, so we
> tested taking every node down, upgrading every node and bringing the
> cluster back online. This worked perfectly, so we rolled it into
> production... production took 45 minutes to start for us (especially one node
> under pressure)... that was only initially though... now everything seems fine.
> Another option in QA was that we could have tested upgrading to 1.1.9 first and
> then to 1.2.2. I have no idea if that would work, but I am sure they test
> closer release scenarios on upgrading more so than the big jump releases.
>
> Aaron, it would be really neat if some releases were tagged with LT (long
> term) or something, so upgrades are tested from LT to LT releases and we know
> we can always safely first upgrade to an LT release and then upgrade to
> another LT release from that one... just a thought. This would also get more
> people using/testing the same upgrade paths, which would help everyone.
>
> Dean
>
> From: Alain RODRIGUEZ <arodr...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Thursday, March 14, 2013 5:31 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Failed migration from 1.1.6 to 1.2.2
>
> We have it set to 0.0.0.0 but anyway, as said before, I don't think our
> problem comes from this bug.
>
>
> 2013/3/14 Michal Michalski <mich...@opera.com>
>
> It will happen if your rpc_address is set to 0.0.0.0.
>
> Oops, that's not what I meant ;-)
> It will happen if your rpc_address is set to an IP that is not defined in
> your cluster's config (e.g. in cassandra-topology.properties for
> PropertyFileSnitch).
>
> M.
>
> On 14.03.2013 13:03, Alain RODRIGUEZ wrote:
> Thanks for this pointer, but I don't think this is the source of our
> problem, since we use 1 data center and Ec2Snitch.
>
>
> 2013/3/14 Jean-Armel Luce <jaluc...@gmail.com>
>
> Hi Alain,
>
> Maybe it is due to https://issues.apache.org/jira/browse/CASSANDRA-5299
>
> A patch is provided with this ticket.
>
> Regards.
>
> Jean Armel
>
>
> 2013/3/14 Alain RODRIGUEZ <arodr...@gmail.com>
>
> Hi,
>
> We just tried to migrate our production cluster from C* 1.1.6 to 1.2.2.
>
> This has been a disaster. I just switched one node to 1.2.2, updated its
> configuration (cassandra.yaml / cassandra-env.sh) and restarted it.
>
> It resulted in errors on all 5 remaining 1.1.6 nodes:
>
> ERROR [RequestResponseStage:2] 2013-03-14 09:53:25,750
> AbstractCassandraDaemon.java (line 135) Exception in thread
> Thread[RequestResponseStage:2,5,main]
> java.io.IOError: java.io.EOFException
> at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
> at org.apache.cassandra.service.ReadCallback.response(ReadCallback.java:155)
> at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:45)
> at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.EOFException
> at java.io.DataInputStream.readFully(DataInputStream.java:180)
> at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:100)
> at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:81)
> at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)
> ... 6 more
>
> I got this error a lot of times, and my entire cluster wasn't reachable by our
> 4 clients (phpCassa, Hector, Cassie, Helenus).
>
> I decommissioned the 1.2.2 node to get our cluster answering queries. It
> worked.
>
> Then I tried to replace this node with a new C* 1.1.6 one using the same
> token as the previously decommissioned node. The node joined the ring and,
> before getting any data, switched to normal status.
>
> On all the other nodes I had:
>
> ERROR [MutationStage:8] 2013-03-14 10:21:01,288
> AbstractCassandraDaemon.java (line 135) Exception in thread
> Thread[MutationStage:8,5,main]
> java.lang.AssertionError
> at org.apache.cassandra.locator.TokenMetadata.getToken(TokenMetadata.java:304)
> at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:371)
> at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
>
> So I decommissioned this new 1.1.6 node, and we are now running with 5
> servers, not balanced along the ring, without any possibility of adding
> nodes, nor of upgrading the C* version.
>
> We are quite desperate over here.
>
> If someone has any idea of what could have happened and how to stabilize the
> cluster, it would be very appreciated.
>
> It's quite an emergency since we can't add nodes and are under heavy load.
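P.S. In case it helps anyone advise on the replacement attempt quoted above, this is roughly the sequence we followed. It is only a sketch: the token value and the service command below are placeholders for illustration, not our real ones, and we pinned the token with the standard initial_token setting:

    # 1. Remove the broken 1.2.2 node from the ring (run against that node):
    nodetool -h <old-node-ip> decommission

    # 2. On the fresh 1.1.6 replacement node, pin it to the old node's token
    #    in cassandra.yaml (placeholder token shown here):
    initial_token: 85070591730234615865843651857942052864

    # 3. Start Cassandra on the replacement node (however you normally start it)
    #    and watch it join:
    sudo service cassandra start
    nodetool ring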