Update: I am still experiencing the above issues, but not consistently. I was able to run repair on this keyspace from node 2 and from node 4, but now a different keyspace hangs on those nodes, and I still cannot run repair on node 1. The behavior seems random. I changed logging to debug level, but still nothing is logged. Again, any help will be appreciated.
Tamar

On Mon, Dec 2, 2013 at 11:53 AM, Tamar Rosen <ta...@correlor.com> wrote:
> Hi,
>
> On AWS, we had a 2 node cluster with RF 2.
> We added 2 more nodes, then changed RF to 3 on all our keyspaces.
> Next step was to run nodetool repair, node by node.
> (In the meantime, we found that we must use CL quorum, which is affecting
> our application's performance).
> Started with node 1, which is one of the old nodes.
> Ran:
> nodetool repair -pr
>
> It seemed to be progressing fine, running keyspace by keyspace, for about
> an hour, but then it hung. The last messages in the output are:
>
> [2013-12-01 11:18:24,577] Repair command #4 finished
> [2013-12-01 11:18:24,594] Starting repair command #5, repairing 230 ranges for keyspace correlor_customer_766
>
> It stayed like this for almost 24 hours. Then we read about the
> possibility of this being related to not upgrading
> sstables<http://comments.gmane.org/gmane.comp.db.cassandra.user/31939>,
> so we killed the process. We were not sure whether we had run upgrade
> sstables (we upgraded from 1.2.4 a couple of months ago)
>
> So:
> Ran upgradesstables on a specific table in the keyspace that repair got
> stuck on. (this was fast)
> nodetool upgradesstables correlor_customer_766 users
> Ran repair on that same table.
> nodetool repair correlor_customer_766 users -pr
>
> This is again hanging.
> The first and only output from this process is:
> [2013-12-02 08:22:41,221] Starting repair command #6, repairing 230 ranges for keyspace correlor_customer_766
>
> Nothing else happened for more than an hour.
>
> Any help and advice will be greatly appreciated.
>
> Tamar Rosen
>
> correlor.com