"I feel your pain" => I is quite big :). This morning I had 6 nodes, tried to up one, have it decommissioned, try to replace it by an other in 1.1.6, failed and now try to restart one of the 5 remaining nodes and it fails restarting... Now I have 4 nodes instead of 6 and I am taking heavy load...
"Did you try restoring the snapshots you took and downgrading to 1.1.6" Snapshot are from a few hours ago and will be used as last solution. all my 5 nodes are currently 1.1.6. "We are barely trying to make our cluster stay up" I wish you good luck with that. 2013/3/14 Hiller, Dean <dean.hil...@nrel.gov> > Did you try restoring the snapshots you took and downgrading to 1.1.6 > temporarily to get the node back online? That typically works fine. I > feel your pain. We are still waiting on 12 more nodes and until then we > are barely trying to make our cluster stay up and it is pretty much nearly > maxed out(LCS gave us some room but only a little)….I calculated out > changing interval_indexing could give us 3G more room as well which would > be huge but have not figured out in QA how to make the change from 128 to > another number. > > Ccm – cool, nice project….I will have to try that one sometime as well. > > Later, > Dean > > From: Alain RODRIGUEZ <arodr...@gmail.com<mailto:arodr...@gmail.com>> > Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" < > user@cassandra.apache.org<mailto:user@cassandra.apache.org>> > Date: Thursday, March 14, 2013 8:09 AM > To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" < > user@cassandra.apache.org<mailto:user@cassandra.apache.org>> > Subject: Re: Failed migration from 1.1.6 to 1.2.2 > > @Dean > > "It is expensive?" > > I was talking about a full time QA environment equal or similar to a prod > env. > > I didn't thought about using a temp QA, and you are right I should have. > > "And sorry for not providing the detail on the rolling restart not > working….my bad" > > No problem, my point was just to remember you that other member of the > community can use this kind of information. > > "but also I think people on the list assume you are going to do some basic > testing if at least to get comfortable with the process" > > I did, but on a local machine. 
> That's the hardware I had, so I just tested
> it on one machine and made sure the clients were compatible... But I wasn't
> aware of ccm. I will use it next time for sure :-).
>
> @Michal
>
> Thanks for pointing out ccm.
>
> "on my workstation with a < 0.01% sample of production"
>
> Is there a simple way of getting that?
>
> @all
>
> Any idea why my node is not restarting now?
>
> Same result with or without -Dcassandra.load_ring_state=false.
>
> Last log lines before the C* process ends:
>
> INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,813 SSTableReader.java (line 169) Opening /raid0/cassandra/data/system/LocationInfo/system-LocationInfo-hf-70 (621 bytes)
> INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,819 SSTableReader.java (line 169) Opening /raid0/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hf-465 (66 bytes)
>
> Should I $rm /raid0/cassandra/data/system/HintsColumnFamily/* ?
>
>
> 2013/3/14 Hiller, Dean <dean.hil...@nrel.gov>
> It is expensive?……personally, sorry, I don't really buy that since I spent
> less than 400 bucks on 100 servers at Amazon to play with for 1 or 2 hours
> or maybe it was 8 hours…I can't remember AND you can use small instances
> for a test like this. You can write EC2 scripts to start up a QA system for
> your needs very easily. Now, if your company is not allowing Amazon, that
> is a different story and it is expensive. We have the same issue as
> you….lack of time though we did get some VM's and put roughly 10MB in each
> to test out an upgrade.
>
> So a basic QA test equipment wise would cost only about 50 bucks and be
> well worth the testing….the time effort would cost a bit more but usually
> companies are already paying the salaries and that was already budgeted for.
>
> And sorry for not providing the detail on the rolling restart not
> working….my bad, but also I think people on the list assume you are going
> to do some basic testing if at least to get comfortable with the process.
>
> Dean
>
> From: Alain RODRIGUEZ <arodr...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Thursday, March 14, 2013 7:41 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Failed migration from 1.1.6 to 1.2.2
>
> @Aaron
>
> "You can try to reset the cluster ring state by doing a rolling restart
> passing -Dcassandra.load_ring_state=false as a JVM param in
> cassandra-env.sh"
>
> Now my node can't restart properly. It stops while starting up, and the
> last logged messages are:
>
> INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,813 SSTableReader.java (line 169) Opening /raid0/cassandra/data/system/LocationInfo/system-LocationInfo-hf-70 (621 bytes)
> INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,819 SSTableReader.java (line 169) Opening /raid0/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hf-465 (66 bytes)
>
> Should I $rm /raid0/cassandra/data/system/HintsColumnFamily/* ?
>
> @Dean
>
> "You should really be testing this stuff in QA"
>
> We have no such environment. It is expensive, we can't afford this for now.
>
> "We had the exact same issue from 1.1.4 to 1.2.2."
>
> Well, I think you could have warned us.
> I thought it was safe upgrading
> because I saw that you and 2 more people did it with no major issues...
>
>
> 2013/3/14 Hiller, Dean <dean.hil...@nrel.gov>
> You should really be testing this stuff in QA. We had the exact same
> issue from 1.1.4 to 1.2.2. In QA, we decided we could take an outage so we
> tested taking every node down, upgrading every node and bringing the
> cluster back online. This worked perfectly so we rolled it into
> production….production took 45 minutes to start for us (especially one node
> under pressure)….that was only initially though…now everything seems fine.
> Another option in QA was we could have tested upgrading to 1.1.9 first
> then to 1.2.2. I have no idea if it will work but I am sure they test
> closer release scenarios on upgrading more so than the big jump releases
>
> Aaron, it would be really neat if some releases were tagged with LT (long
> term) or something so upgrades are tested from LT to LT releases so we know
> we can always safely first upgrade to an LT release and then upgrade to
> another LT release from that one…just a thought. This would also get more
> people using/testing the same upgrade paths which would help everyone.
>
> Dean
>
> From: Alain RODRIGUEZ <arodr...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Thursday, March 14, 2013 5:31 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Failed migration from 1.1.6 to 1.2.2
>
> We have it set to 0.0.0.0 but anyway, as said before, I don't think our
> problem comes from this bug.
>
>
> 2013/3/14 Michal Michalski <mich...@opera.com>
>
> It will happen if your rpc_address is set to 0.0.0.0.
>
> Oops, it's not what I meant ;-)
> It will happen if your rpc_address is set to an IP that is not defined in
> your cluster's config (e.g. in cassandra-topology.properties for
> PropertyFileSnitch)
>
>
> M.
>
>
> M.
>
> On 14.03.2013 at 13:03, Alain RODRIGUEZ wrote:
> Thanks for this pointer but I don't think this is the source of our
> problem
> since we use 1 data center and Ec2Snitch.
>
>
>
> 2013/3/14 Jean-Armel Luce <jaluc...@gmail.com>
>
> Hi Alain,
>
> Maybe it is due to https://issues.apache.org/jira/browse/CASSANDRA-5299
>
> A patch is provided with this ticket.
>
> Regards.
>
> Jean Armel
>
>
> 2013/3/14 Alain RODRIGUEZ <arodr...@gmail.com>
>
> Hi
>
> We just tried to migrate our production cluster from C* 1.1.6 to 1.2.2.
>
> This has been a disaster. I just switched one node to 1.2.2, updated its
> configuration (cassandra.yaml / cassandra-env.sh) and restarted it.
>
> It resulted in errors on all 5 remaining 1.1.6 nodes:
>
> ERROR [RequestResponseStage:2] 2013-03-14 09:53:25,750 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[RequestResponseStage:2,5,main]
> java.io.IOError: java.io.EOFException
>     at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
>     at org.apache.cassandra.service.ReadCallback.response(ReadCallback.java:155)
>     at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:45)
>     at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>     at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:100)
>     at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:81)
>     at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)
>     ... 6 more
>
> I got this many times, and my entire cluster wasn't reachable by
> our 4 clients (phpCassa, Hector, Cassie, Helenus)
>
> I decommissioned the 1.2.2 node to get our cluster answering
> queries. It worked.
>
> Then I tried to replace this node with a new C* 1.1.6 one with the same
> token as the decommissioned node. The node joined the ring and
> switched to normal status before getting any data.
>
> On all the other nodes I had:
>
> ERROR [MutationStage:8] 2013-03-14 10:21:01,288 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[MutationStage:8,5,main]
> java.lang.AssertionError
>     at org.apache.cassandra.locator.TokenMetadata.getToken(TokenMetadata.java:304)
>     at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:371)
>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
>
> So I decommissioned this new 1.1.6 node and we are now running with 5
> servers, not balanced along the ring, without any possibility of adding
> nodes or upgrading the C* version.
>
> We are quite desperate over here.
>
> If someone has any idea of what could have happened and how to stabilize the
> cluster, it would be much appreciated.
>
> It's quite an emergency since we can't add nodes and are under heavy
> load.