Re: Failed migration from 1.1.6 to 1.2.2

Hiller, Dean Thu, 14 Mar 2013 08:32:46 -0700

Did you try restoring the snapshots you took and downgrading to 1.1.6 
temporarily to get the node back online?  That typically works fine.  I feel 
your pain.  We are still waiting on 12 more nodes, and until then we are barely 
keeping our cluster up; it is nearly maxed out (LCS gave us some room, but only 
a little).  I calculated that changing index_interval could give us 3G more 
room as well, which would be huge, but I have not figured out in QA how to make 
the change from 128 to another number.
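
A rough way to sanity-check an estimate like this, assuming the memory held by index samples scales inversely with the sampling interval (one sampled key per interval). The function name and the 6 GiB starting figure are hypothetical, picked only so the arithmetic is easy to follow; they are not measurements from this cluster.

```python
GIB = 1024 ** 3


def index_sample_memory(current_bytes, current_interval, new_interval):
    """Estimate index-sample memory after changing the sampling interval.

    Assumption: one key is sampled per interval, so the memory used is
    inversely proportional to the interval.
    """
    if current_interval <= 0 or new_interval <= 0:
        raise ValueError("intervals must be positive")
    return current_bytes * current_interval / new_interval


# Hypothetical starting point: 6 GiB of index samples at the default of 128.
current = 6 * GIB
after = index_sample_memory(current, current_interval=128, new_interval=256)
saved = current - after
print(f"estimated saving: {saved / GIB:.1f} GiB")
```

Doubling the interval halves the estimate, at the cost of more index scanning per read, so the real trade-off still needs to be measured in QA.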

ccm: cool, nice project.  I will have to try that one sometime as well.

Later,
Dean

From: Alain RODRIGUEZ <arodr...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, March 14, 2013 8:09 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Failed migration from 1.1.6 to 1.2.2

@Dean

"It is expensive?"

I was talking about a full time QA environment equal or similar to a prod env.

I didn't think about using a temporary QA environment, and you are right, I 
should have.

"And sorry for not providing the detail on the rolling restart not working….my 
bad"

No problem, my point was just to remind you that other members of the 
community can use this kind of information.

"but also I think people on the list assume you are going to do some basic 
testing if at least to get comfortable with the process"

I did, but on a local machine. That's the hardware I had, so I just tested it 
on one machine and made sure the clients were compatible... But I wasn't aware 
of ccm. I will use it next time for sure :-).

@Michal

Thanks for the pointer to ccm.

"on my workstation with a < 0.01% sample of production"

Is there a simple way of getting that ?

@all

Any idea why my node is not restarting now?

Same result with or without -Dcassandra.load_ring_state=false.

Last log lines before the C* process ends:

INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,813 SSTableReader.java (line 169) Opening /raid0/cassandra/data/system/LocationInfo/system-LocationInfo-hf-70 (621 bytes)
INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,819 SSTableReader.java (line 169) Opening /raid0/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hf-465 (66 bytes)

Should I $rm /raid0/cassandra/data/system/HintsColumnFamily/* ?

2013/3/14 Hiller, Dean <dean.hil...@nrel.gov>
It is expensive?  Personally, sorry, I don't really buy that, since I spent 
less than 400 bucks on 100 servers at Amazon to play with for 1 or 2 hours, or 
maybe it was 8 hours, I can't remember, AND you can use small instances for a 
test like this.  You can write EC2 scripts to start up a QA system for your 
needs very easily.  Now, if your company is not allowing Amazon, that is a 
different story, and it is expensive.  We have the same issue as you, lack of 
time, though we did get some VMs and put roughly 10MB in each to test out an 
upgrade.

So a basic QA test would cost only about 50 bucks equipment-wise and be well 
worth it.  The time and effort would cost a bit more, but usually companies are 
already paying the salaries, so that is already budgeted for.

And sorry for not providing the detail on the rolling restart not working….my 
bad, but also I think people on the list assume you are going to do some basic 
testing if at least to get comfortable with the process.

Dean

From: Alain RODRIGUEZ <arodr...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, March 14, 2013 7:41 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Failed migration from 1.1.6 to 1.2.2

@Aaron

"You can try to reset the cluster ring state by doing a rolling restart passing 
-Dcassandra.load_ring_state=false as a JVM param in cassandra-env.sh"

Now my node can't restart properly. It stops during startup, and the last 
logged messages are:

INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,813 SSTableReader.java (line 169) Opening /raid0/cassandra/data/system/LocationInfo/system-LocationInfo-hf-70 (621 bytes)
INFO [SSTableBatchOpen:1] 2013-03-14 14:36:09,819 SSTableReader.java (line 169) Opening /raid0/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hf-465 (66 bytes)

Should I $rm /raid0/cassandra/data/system/HintsColumnFamily/* ?

@Dean

"You should really be testing this stuff in QA"

We have no such environment. It is expensive, we can't afford this for now.

"We had the exact same issue from 1.1.4 to 1.2.2."

Well, I think you could have warned us. I thought it was safe to upgrade 
because I saw that you and 2 more people did it with no major issues...

2013/3/14 Hiller, Dean <dean.hil...@nrel.gov>
You should really be testing this stuff in QA.  We had the exact same issue 
from 1.1.4 to 1.2.2.  In QA, we decided we could take an outage, so we tested 
taking every node down, upgrading every node and bringing the cluster back 
online.  This worked perfectly, so we rolled it into production.  Production 
took 45 minutes to start for us (especially one node under pressure), but that 
was only initially; now everything seems fine.  Another option in QA would have 
been to test upgrading to 1.1.9 first and then to 1.2.2.  I have no idea if it 
will work, but I am sure they test upgrades between closer releases more than 
the big-jump releases.

Aaron, it would be really neat if some releases were tagged LT (long term) or 
something similar, with upgrades tested from one LT release to the next, so we 
know we can always safely upgrade to an LT release first and then to another LT 
release from that one.  Just a thought.  This would also get more people 
using and testing the same upgrade paths, which would help everyone.

Dean

From: Alain RODRIGUEZ <arodr...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, March 14, 2013 5:31 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Failed migration from 1.1.6 to 1.2.2

We have it set to 0.0.0.0, but anyway, as I said before, I don't think our 
problem comes from this bug.

2013/3/14 Michal Michalski <mich...@opera.com>

It will happen if your rpc_address is set to 0.0.0.0.

Oops, it's not what I meant ;-)
It will happen if your rpc_address is set to an IP that is not defined in your 
cluster's config (e.g. in cassandra-topology.properties for PropertyFileSnitch).
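
A minimal sketch of that check, assuming the usual `address=datacenter:rack` line format of cassandra-topology.properties. The helper names and the sample addresses are hypothetical, and the `default` fallback entry is skipped so that only explicitly listed nodes count as defined.

```python
def parse_topology(text):
    """Return {address: (datacenter, rack)} for explicitly listed nodes."""
    nodes = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        key = key.strip()
        if key == "default":
            continue  # fallback entry, not an explicit node definition
        datacenter, _, rack = value.strip().partition(":")
        nodes[key] = (datacenter, rack)
    return nodes


def is_defined(address, nodes):
    """True if the address is explicitly listed in the topology."""
    return address in nodes


# Hypothetical file contents, for illustration only.
SAMPLE = """
# address=datacenter:rack
10.0.0.1=DC1:RAC1
10.0.0.2=DC1:RAC2
default=DC1:RAC1
"""

nodes = parse_topology(SAMPLE)
for candidate in ("10.0.0.1", "0.0.0.0"):
    print(candidate, "defined" if is_defined(candidate, nodes) else "NOT defined")
```

This only reads the properties file; it does not apply to snitches such as Ec2Snitch that do not use it.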

M.

M.

On 14.03.2013 13:03, Alain RODRIGUEZ wrote:
Thanks for this pointer, but I don't think this is the source of our
problem since we use 1 data center and Ec2Snitch.

2013/3/14 Jean-Armel Luce <jaluc...@gmail.com>

Hi Alain,

Maybe it is due to https://issues.apache.org/jira/browse/CASSANDRA-5299

A patch is provided with this ticket.

Regards.

Jean Armel

2013/3/14 Alain RODRIGUEZ <arodr...@gmail.com>

Hi

We just tried to migrate our production cluster from C* 1.1.6 to 1.2.2.

This has been a disaster. I just switched one node to 1.2.2, updated its
configuration (cassandra.yaml / cassandra-env.sh) and restarted it.

It resulted in errors on all 5 remaining 1.1.6 nodes:

ERROR [RequestResponseStage:2] 2013-03-14 09:53:25,750 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[RequestResponseStage:2,5,main]
java.io.IOError: java.io.EOFException
         at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
         at org.apache.cassandra.service.ReadCallback.response(ReadCallback.java:155)
         at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:45)
         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
         at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException
         at java.io.DataInputStream.readFully(DataInputStream.java:180)
         at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:100)
         at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:81)
         at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)
         ... 6 more

I got this many times, and my entire cluster wasn't reachable by our
4 clients (phpCassa, Hector, Cassie, Helenus).

I decommissioned the 1.2.2 node to get our cluster answering
queries. It worked.

Then I tried to replace this node with a new C* 1.1.6 one with the same
token as the decommissioned node. The node joined the ring and switched to
normal status before getting any data.

On all the other nodes I had:

ERROR [MutationStage:8] 2013-03-14 10:21:01,288 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[MutationStage:8,5,main]
java.lang.AssertionError
         at org.apache.cassandra.locator.TokenMetadata.getToken(TokenMetadata.java:304)
         at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:371)
         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
         at java.lang.Thread.run(Thread.java:662)

So I decommissioned this new 1.1.6 node and we are now running with 5
servers, not balanced along the ring, without any possibility of adding
nodes or upgrading the C* version.
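
To make "not balanced along the ring" concrete, here is a minimal sketch, assuming RandomPartitioner (token space 0 to 2**127) and one token per node. The thread does not state the partitioner or the actual tokens, so the evenly spaced starting layout and the helper names are assumptions for illustration only.

```python
RING_SIZE = 2 ** 127  # RandomPartitioner token space (assumed partitioner)


def balanced_tokens(node_count):
    """Evenly spaced initial tokens for node_count nodes."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


def ownership(tokens):
    """Map each token to the fraction of the ring its node owns.

    A node owns the range from the previous token (exclusive) up to its
    own token (inclusive), wrapping around the ring.
    """
    ordered = sorted(tokens)
    shares = {}
    for idx, token in enumerate(ordered):
        previous = ordered[idx - 1]  # idx 0 wraps to the last token
        span = (token - previous) % RING_SIZE or RING_SIZE
        shares[token] = span / RING_SIZE
    return shares


# Six evenly spaced nodes, then the third one is decommissioned.
six = balanced_tokens(6)
five = six[:2] + six[3:]
for token, share in ownership(five).items():
    print(f"{token:>40d}  {share:.1%}")
```

Under these assumptions, the node that follows the removed token ends up owning about a third of the ring while the others stay at about a sixth, which is the kind of imbalance described here.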

We are quite desperate over here.

If someone has any idea of what could have happened and how to stabilize the
cluster, it would be very much appreciated.

It's quite an emergency since we can't add nodes and are under heavy load.
