Re: Failed migration from 1.1.6 to 1.2.2

Alain RODRIGUEZ Thu, 14 Mar 2013 08:41:38 -0700

"I feel your pain" => I is quite big :). This morning I had 6 nodes, tried
to upgrade one, had to decommission it, tried to replace it with another on
1.1.6, which failed, and now I am trying to restart one of the 5 remaining
nodes and it fails to restart... Now I have 4 nodes instead of 6 and I am
under heavy load...


"Did you try restoring the snapshots you took and downgrading to 1.1.6"

The snapshots are from a few hours ago and will be used as a last resort.
All 5 of my nodes are currently on 1.1.6.

"We are barely trying to make our cluster stay up"

I wish you good luck with that.


2013/3/14 Hiller, Dean <dean.hil...@nrel.gov>

> Did you try restoring the snapshots you took and downgrading to 1.1.6
> temporarily to get the node back online?  That typically works fine.  I
> feel your pain.  We are still waiting on 12 more nodes and until then we
> are barely trying to make our cluster stay up and it is pretty much nearly
> maxed out(LCS gave us some room but only a little)….I calculated out
> changing interval_indexing could give us 3G more room as well which would
> be huge but have not figured out in QA how to make the change from 128 to
> another number.
>
> Ccm – cool, nice project….I will have to try that one sometime as well.
>
> Later,
> Dean
>
> From: Alain RODRIGUEZ <arodr...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Thursday, March 14, 2013 8:09 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Failed migration from 1.1.6 to 1.2.2
>
> @Dean
>
> "It is expensive?"
>
> I was talking about a full time QA environment equal or similar to a prod
> env.
>
> I didn't think about using a temporary QA environment, and you are right,
> I should have.
>
> "And sorry for not providing the detail on the rolling restart not
> working….my bad"
>
> No problem, my point was just to remind you that other members of the
> community can use this kind of information.
>
> "but also I think people on the list assume you are going to do some basic
> testing if at least to get comfortable with the process"
>
> I did, but on a local machine. That's the hardware I had, so I just tested
> it on one machine and made sure the clients were compatible... But I wasn't
> aware of ccm. I will use it next time for sure :-).
>
> @Michal
>
> Thanks for the pointer to ccm.
>
> "on my workstation with a < 0.01% sample of production"
>
> Is there a simple way of getting that?
>
> @all
>
> Any idea why my node is not restarting now?
>
> Same result with or without -Dcassandra.load_ring_state=false.
>
> Last log lines before the C* process ends:
>
> INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,813 SSTableReader.java (line 169) Opening /raid0/cassandra/data/system/LocationInfo/system-LocationInfo-hf-70 (621 bytes)
> INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,819 SSTableReader.java (line 169) Opening /raid0/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hf-465 (66 bytes)
>
> Should I $rm /raid0/cassandra/data/system/HintsColumnFamily/* ?
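
A minimal sketch, assuming the log format shown in the excerpt above, for pulling out the last SSTable that was opened before the process ended. The function name and the regular expression are illustrative and are not part of Cassandra. The last file opened is only a hint about where startup stopped; it does not prove that this file is the cause.

```python
import re

# Matches the "Opening <path> (<n> bytes)" tail of an SSTableReader INFO line.
# The pattern is an assumption based only on the two log lines quoted above.
OPENING = re.compile(r"Opening\s+(?P<path>\S+)\s+\((?P<size>\d+)\s+bytes\)")


def last_opened_sstable(log_text):
    """Return (path, size_in_bytes) of the last SSTable opened, or None."""
    # Collapse all whitespace so entries wrapped across lines are rejoined.
    flat = " ".join(log_text.split())
    matches = list(OPENING.finditer(flat))
    if not matches:
        return None
    last = matches[-1]
    return last.group("path"), int(last.group("size"))


if __name__ == "__main__":
    sample = """
    INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,813 SSTableReader.java (line
    169) Opening
    /raid0/cassandra/data/system/LocationInfo/system-LocationInfo-hf-70 (621
    bytes)
    INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,819 SSTableReader.java (line
    169) Opening
    /raid0/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hf-465
    (66 bytes)
    """
    print(last_opened_sstable(sample))
```

Fed the tail of a node's log, this reports which data file the node reached last, which may be worth checking before deleting anything under the system keyspace.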
>
> 2013/3/14 Hiller, Dean <dean.hil...@nrel.gov>
> It is expensive?……personally, sorry, I don't really buy that since I spent
> less than 400 bucks on 100 servers at amazon to play with for 1 or 2 hours
> or maybe it was 8 hours…I can't remember AND you can use small instances
> for a test like this.  You can write EC2 scripts to startup a QA system for
> your needs very easily.  Now, if your company is not allowing amazon, that
> is a different story and it is expensive.  We have the same issue as
> you….lack of time though we did get some VM's and put roughly 10MB in each
> to test out an upgrade.
>
> So a basic QA test equipment wise would cost only about 50 bucks and be
> well worth the testing….the time effort would cost a bit more but usually
> companies are already paying the salaries and that was already budgeted for.
>
> And sorry for not providing the detail on the rolling restart not
> working….my bad, but also I think people on the list assume you are going
> to do some basic testing if at least to get comfortable with the process.
>
> Dean
>
> From: Alain RODRIGUEZ <arodr...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Thursday, March 14, 2013 7:41 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Failed migration from 1.1.6 to 1.2.2
>
> @Aaron
>
> "You can try to reset the cluster ring state by doing a rolling restart
> passing -Dcassandra.load_ring_state=false as a JVM param in
> cassandra-env.sh"
>
> Now my node can't restart properly. It stops during startup and the last
> logged message is:
>
> INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,813 SSTableReader.java (line 169) Opening /raid0/cassandra/data/system/LocationInfo/system-LocationInfo-hf-70 (621 bytes)
> INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,819 SSTableReader.java (line 169) Opening /raid0/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hf-465 (66 bytes)
>
> Should I $rm /raid0/cassandra/data/system/HintsColumnFamily/* ?
>
> @Dean
>
> "You should really be testing this stuff in QA"
>
> We have no such environment. It is expensive, we can't afford this for now.
>
> "We had the exact same issue from 1.1.4 to 1.2.2."
>
> Well, I think you could have warned us. I thought it was safe upgrading
> because I saw that you and 2 more people did it with no major issues...
>
>
> 2013/3/14 Hiller, Dean <dean.hil...@nrel.gov>
> You should really be testing this stuff in QA.  We had the exact same
> issue from 1.1.4 to 1.2.2.  In QA, we decided we could take an outage so we
> tested taking every node down, upgrading every node and bringing the
> cluster back online.  This worked perfectly so we rolled it into
> production….production took 45 minutes to start for us(especially one node
> under pressure)….that was only initially though…now everything seems fine.
>  Another option in QA was we could have tested upgrading to 1.1.9 first
> then to 1.2.2.  I have no idea if it will work but I am sure they test
> closer release scenarios on upgrading more so than the big jump releases
>
> Aaron, it would be really neat if some releases were tagged with LT(long
> term) or something so upgrades are tested from LT to LT releases so we know
> we can always safely first upgrade to an LT release and then upgrade to
> another LT release from that one…just a thought. This would also get more
> people using/testing the same upgrade paths which would help everyone.
>
> Dean
>
> From: Alain RODRIGUEZ <arodr...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Thursday, March 14, 2013 5:31 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Failed migration from 1.1.6 to 1.2.2
>
> We have it set to 0.0.0.0 but anyway, as I said before, I don't think our
> problem comes from this bug.
>
>
> 2013/3/14 Michal Michalski <mich...@opera.com>
>
> It will happen if your rpc_address is set to 0.0.0.0.
>
> Oops, that's not what I meant ;-)
> It will happen if your rpc_address is set to an IP that is not defined in
> your cluster's config (e.g. in cassandra-topology.properties for
> PropertyFileSnitch)
>
>
> M.
>
>
> M.
>
> On 14.03.2013 13:03, Alain RODRIGUEZ wrote:
> Thanks for this pointer but I don't think this is the source of our problem
> since we use 1 data center and Ec2Snitch.
>
>
>
> 2013/3/14 Jean-Armel Luce <jaluc...@gmail.com>
>
> Hi Alain,
>
> Maybe it is due to https://issues.apache.org/jira/browse/CASSANDRA-5299
>
> A patch is provided with this ticket.
>
> Regards.
>
> Jean Armel
>
>
> 2013/3/14 Alain RODRIGUEZ <arodr...@gmail.com>
>
> Hi
>
> We just tried to migrate our production cluster from C* 1.1.6 to 1.2.2.
>
> This has been a disaster. I just switched one node to 1.2.2, updated its
> configuration (cassandra.yaml / cassandra-env.sh) and restarted it.
>
> It resulted in errors on all 5 remaining 1.1.6 nodes:
>
> ERROR [RequestResponseStage:2] 2013-03-14 09:53:25,750 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[RequestResponseStage:2,5,main]
> java.io.IOError: java.io.EOFException
>          at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
>          at org.apache.cassandra.service.ReadCallback.response(ReadCallback.java:155)
>          at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:45)
>          at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
>          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>          at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.EOFException
>          at java.io.DataInputStream.readFully(DataInputStream.java:180)
>          at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:100)
>          at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:81)
>          at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)
>          ... 6 more
>
> I got this error many times, and my entire cluster was unreachable by our
> 4 clients (phpCassa, Hector, Cassie, Helenus).
>
> I decommissioned the 1.2.2 node to get our cluster answering queries again.
> It worked.
>
> Then I tried to replace this node with a new C* 1.1.6 one using the same
> token as the decommissioned node. The node joined the ring and switched to
> normal status before getting any data.
>
> On all the other nodes I had:
>
> ERROR [MutationStage:8] 2013-03-14 10:21:01,288 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[MutationStage:8,5,main]
> java.lang.AssertionError
>          at org.apache.cassandra.locator.TokenMetadata.getToken(TokenMetadata.java:304)
>          at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:371)
>          at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>          at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>          at java.lang.Thread.run(Thread.java:662)
>
> So I decommissioned this new 1.1.6 node and we are now running with 5
> servers, not balanced along the ring, without any possibility of adding
> nodes or upgrading the C* version.
>
> We are quite desperate over here.
>
> If anyone has any idea of what could have happened and how to stabilize the
> cluster, it would be much appreciated.
>
> It's quite an emergency since we can't add nodes and are under heavy
> load.
