Hi Jeff, thanks for your answer.
- Regarding the drain, I will proceed as you indicate: run a flush and then shut down the node.
- Regarding the Cassandra updates: I want to upgrade the cluster because we are having problems with timeouts (all the nodes become unresponsive) when compactions are running. In another question I asked on this list to find a solution to that problem, someone recommended two things: add new nodes to the cluster (I have already ordered two new nodes) and upgrade the version of Cassandra, because 2.0.17 is very old and newer versions, especially 2.1, brought a lot of performance improvements (performance is probably the problem we are facing).
- We have a demo environment, but unfortunately the test cluster is not the same size, and we cannot replicate the data from the live cluster onto the test cluster because of the size of the live data. Anyway, in our case, surprises will not cost millions of dollars and certainly not my job. We are also a small company, so even if we upgrade the test cluster, probably nobody (I mean real users) will exercise the application that uses the cluster. This means we would probably not detect the bugs even in the test cluster.

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com]
Sent: woensdag 6 april 2016 19:13
To: user@cassandra.apache.org
Subject: Re: nodetool drain running for days

Drain should not run for days – if it were me, I’d be:

* Checking for ‘DRAINED’ in the server logs
* Running ‘nodetool flush’ just to explicitly flush the commitlog/memtables (generally useful before doing drain, too, it can be somewhat race-y)
* Explicitly killing cassandra following the flush – drain should simply be a flush + shutdown everything, so it should take on the order of seconds, not days.

For your question about 3.0: historically, Cassandra has had some bugs in new major versions:

- Hints were broken from 1.0.0 to 1.0.3 - https://issues.apache.org/jira/browse/CASSANDRA-3466
- Hints were broken again from 1.1.0 to 1.1.6 - https://issues.apache.org/jira/browse/CASSANDRA-4772
- There was a corruption bug in 2.0 until 2.0.8 - https://issues.apache.org/jira/browse/CASSANDRA-6285
- There were a number of rough edges in 2.1, including a memory leak fixed in 2.1.7 - https://issues.apache.org/jira/browse/CASSANDRA-9549
- Compaction kept stopping in 2.2.0 until 2.2.2 - https://issues.apache.org/jira/browse/CASSANDRA-10270

Because of this history of “bugs in new versions", many operators choose to hold off on going to new versions until they’re “better tested”. The catch-22 is obvious here: if nobody uses it, nobody tests it in the real world to find the bugs not discovered in automated testing.

The Datastax folks did some awesome work for 3.0 to extend the unit and distributed tests – they’re MUCH better than they were in 2.2, so hopefully there are fewer surprise bugs in 3+, but there’s bound to be a few. The apache team has also changed the release cycle to release more frequently, so that there’s less new code in each release (see http://www.planetcassandra.org/blog/cassandra-2-2-3-0-and-beyond/ ).

If you’ve got a lab/demo/stage/test environment that can tolerate some outages, I definitely encourage you to upgrade there, first. If a few surprise issues will cost your company millions of dollars, or will cost you your job, let someone else upgrade and be the guinea pig, and don’t upgrade until you’re compelled to do so because of a bug fix you need, or a feature that won’t be in the version you’re running.
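Concretely, the checks and the flush-then-stop sequence described above amount to something like the following on a stuck node. This is only a rough sketch: the log path and the service name are assumptions that depend on how Cassandra was installed, so adjust them to your environment.

    # Check whether the drain actually finished (log path is an assumption; adjust to your install)
    grep -i 'DRAINED' /var/log/cassandra/system.log

    # Explicitly flush memtables and the commitlog before shutting down
    nodetool flush

    # If the drain still hangs, stop the process anyway; the flush has already persisted the data
    sudo service cassandra stop
    # or, if there is no service script:
    sudo kill $(pgrep -f CassandraDaemon)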
From: Paco Trujillo
Reply-To: "user@cassandra.apache.org"
Date: Tuesday, April 5, 2016 at 11:12 PM
To: "user@cassandra.apache.org"
Subject: nodetool drain running for days

We are having performance problems with our cluster regarding timeouts when repairs or massive deletes are running. One piece of advice I received was to update our Cassandra version from 2.0.17 to 2.2. I am draining one of the nodes to start the upgrade, and the drain has now been running for two days. In the logs I only see lines like these from time to time:

INFO [ScheduledTasks:1] 2016-04-06 08:17:10,987 ColumnFamilyStore.java (line 808) Enqueuing flush of Memtable-sstable_activity@1382334976(15653/226669 serialized/live bytes, 6023 ops)
INFO [FlushWriter:1468] 2016-04-06 08:17:10,988 Memtable.java (line 362) Writing Memtable-sstable_activity@1382334976(15653/226669 serialized/live bytes, 6023 ops)
INFO [ScheduledTasks:1] 2016-04-06 08:17:11,004 ColumnFamilyStore.java (line 808) Enqueuing flush of Memtable-compaction_history@1425848386(1599/15990 serialized/live bytes, 51 ops)
INFO [FlushWriter:1468] 2016-04-06 08:17:11,012 Memtable.java (line 402) Completed flushing /var/lib/cassandra/data/system/sstable_activity/system-sstable_activity-jb-4826-Data.db (6348 bytes) for commitlog position ReplayPosition(segmentId=1458540068021, position=1198022)
INFO [FlushWriter:1468] 2016-04-06 08:17:11,012 Memtable.java (line 362) Writing Memtable-compaction_history@1425848386(1599/15990 serialized/live bytes, 51 ops)
INFO [FlushWriter:1468] 2016-04-06 08:17:11,039 Memtable.java (line 402) Completed flushing /var/lib/cassandra/data/system/compaction_history/system-compaction_history-jb-3491-Data.db (730 bytes) for commitlog position ReplayPosition(segmentId=1458540068021, position=1202850)

Should I wait, or just stop the node and start the migration? Another question: I have checked the changes in 3.0 and I do not see any incompatibilities with the features we are using at the moment or with our current hardware (apart from the Java version). Probably more people have asked this, but is there some important reason not to upgrade the cluster?