I keep losing my keyspace when upgrading from Cassandra 1.1.1 to 1.1.5
I have the same cassandra keyspace on all our staging systems:
development: a 3-node cluster
integration: a 3-node cluster
QS: a 2-node cluster
(production will be a 4-node cluster, which is not yet active)
All clusters were running Cassandra 1.1.1. Before going into production I
wanted to upgrade to the latest production release of Cassandra.
In all cases my keyspace disappeared when I started the cluster with Cassandra
1.1.5.
On the development system I didn't realize at first what was happening. I was
just surprised that nodetool showed very little data. On integration I saw the
problem quickly, but could not recover the data. I re-installed the Cassandra
cluster from scratch and populated it with our test data, so our developers
could work.
I am currently using the QS system to reproduce the problem and to find out
what I am doing wrong, and how I can avoid losing production data once we are
live.
Basically, I did the following:
1. create a snapshot on every node
2. create a tar.gz of my data directory, just to be safe
3. shut down and restart Cassandra 1.1.1 (just to confirm that the restart
itself is not what causes the problem)
4. verify that the keyspace is still known and the data is present
5. shut down Cassandra 1.1.1
6. copy the config to Cassandra 1.1.5 (diffing my cassandra.yaml against the
new one first, to see whether anything important has changed)
7. start Cassandra 1.1.5
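For reference, steps 1 and 2 could be scripted roughly as follows. This is only a sketch of the backup part, not the exact commands used here: the host name, the snapshot tag, and the paths in `main()` are placeholders, and the helper names are made up for illustration.

```python
import subprocess
import tarfile
from pathlib import Path


def snapshot_command(host, tag):
    """Step 1: build the nodetool call that snapshots all keyspaces on one node."""
    return ["nodetool", "-h", host, "snapshot", "-t", tag]


def archive_data_dir(data_dir, archive):
    """Step 2: pack the data directory into a tar.gz.

    Returns the number of regular files in the archive, so the result
    can be compared against the directory as a quick sanity check.
    """
    data_dir = Path(data_dir)
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(data_dir, arcname=data_dir.name)
    with tarfile.open(archive, "r:gz") as tar:
        return sum(1 for member in tar.getmembers() if member.isfile())


def main():
    # Placeholder host, tag and paths (assumptions); adjust per node.
    subprocess.run(snapshot_command("localhost", "pre-upgrade"), check=True)
    count = archive_data_dir("/var/lib/cassandra/data",
                             "/tmp/cassandra-data-pre-upgrade.tar.gz")
    print(f"archived {count} files")
```

Calling `main()` on each node before step 3 would leave both a snapshot and a tarball to fall back on. The archive should be written outside the data directory so it does not end up inside itself.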
In the log file, after the "Replaying ..." messages I find the following:
INFO [main] 2012-09-19 15:15:50,323 CommitLogReplayer.java (line 103) Skipped 759 mutations from unknown (probably removed) CF with id 1187
INFO [main] 2012-09-19 15:15:50,323 CommitLogReplayer.java (line 103) Skipped 606 mutations from unknown (probably removed) CF with id 1186
INFO [main] 2012-09-19 15:15:50,323 CommitLogReplayer.java (line 103) Skipped 53 mutations from unknown (probably removed) CF with id 1185
INFO [main] 2012-09-19 15:15:50,323 CommitLogReplayer.java (line 103) Skipped 1945 mutations from unknown (probably removed) CF with id 1184
INFO [main] 2012-09-19 15:15:50,323 CommitLogReplayer.java (line 103) Skipped 1945 mutations from unknown (probably removed) CF with id 1191
INFO [main] 2012-09-19 15:15:50,323 CommitLogReplayer.java (line 103) Skipped 7506 mutations from unknown (probably removed) CF with id 1190
INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 88 mutations from unknown (probably removed) CF with id 1189
INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 87 mutations from unknown (probably removed) CF with id 1188
INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 354 mutations from unknown (probably removed) CF with id 1195
INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 87 mutations from unknown (probably removed) CF with id 1194
INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 45 mutations from unknown (probably removed) CF with id 1192
INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 82 mutations from unknown (probably removed) CF with id 1197
INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 46386 mutations from unknown (probably removed) CF with id 1177
INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 69 mutations from unknown (probably removed) CF with id 1178
INFO [main] 2012-09-19 15:15:50,325 CommitLogReplayer.java (line 103) Skipped 73 mutations from unknown (probably removed) CF with id 1179
INFO [main] 2012-09-19 15:15:50,325 CommitLogReplayer.java (line 103) Skipped 88 mutations from unknown (probably removed) CF with id 1181
INFO [main] 2012-09-19 15:15:50,325 CommitLogReplayer.java (line 103) Skipped 46386 mutations from unknown (probably removed) CF with id 1182
INFO [main] 2012-09-19 15:15:50,325 CommitLogReplayer.java (line 103) Skipped 7506 mutations from unknown (probably removed) CF with id 1183
INFO [main] 2012-09-19 15:15:50,325 CommitLog.java (line 131) Log replay complete, 0 replayed mutations
This is the first obvious indication something is wrong. Going further up in
the log file I discover that the SSTableReader logs only system keyspace files.
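To compare nodes, the "Skipped ..." lines can be tallied per column family id with a small script. This is a rough sketch that assumes the message format shown above; the function name and the idea of reading the whole log into one string are my own choices, not anything Cassandra provides.

```python
import re
from collections import Counter

# Matches the CommitLogReplayer message quoted above; \s+ also tolerates
# lines that were wrapped by a mail client.
SKIPPED = re.compile(
    r"Skipped\s+(\d+)\s+mutations\s+from\s+unknown\s+"
    r"\(probably\s+removed\)\s+CF\s+with\s+id\s+(\d+)"
)


def skipped_mutations(log_text):
    """Return {cf_id: number of skipped mutations} for the given log text."""
    totals = Counter()
    for count, cf_id in SKIPPED.findall(log_text):
        totals[int(cf_id)] += int(count)
    return dict(totals)
```

Feeding it the contents of the system log from each node (for example via `open(path).read()`) gives one dictionary per node. A non-empty result means the commit log holds writes for column families the node no longer has a schema for.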
Currently my cluster is in the following state:
node 1 runs Cassandra 1.1.5, and doesn't know my keyspace
node 2 runs Cassandra 1.1.1, and still knows my keyspace.
nodetool ring confirms this: node 1 has a load of 29 KB, node 2 of roughly 1 GB.
The cluster itself is still intact, i.e. nodetool ring shows both nodes.
I tried nodetool resetlocalschema and nodetool repair, but neither changed
anything.
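One more check that might help narrow this down is whether the SSTables of the missing keyspace are still on disk on node 1. The sketch below counts `*-Data.db` files per keyspace directory. It assumes one directory per keyspace directly under the data directory and skips anything inside a `snapshots` subdirectory; the function name and the path in the usage note are assumptions for illustration.

```python
from pathlib import Path


def sstables_per_keyspace(data_dir):
    """Return {keyspace directory name: number of live *-Data.db files}."""
    counts = {}
    for ks_dir in sorted(Path(data_dir).iterdir()):
        if not ks_dir.is_dir():
            continue
        counts[ks_dir.name] = sum(
            1
            for path in ks_dir.rglob("*-Data.db")
            # Ignore copies that live under a snapshots directory.
            if "snapshots" not in path.relative_to(ks_dir).parts
        )
    return counts
```

For example, `sstables_per_keyspace("/var/lib/cassandra/data")` (path assumed) on node 1 would show whether the keyspace directory still holds data files even though the node does not list the keyspace.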
Any idea what I have been doing wrong (the preferred solution), or whether I
stumbled over a cassandra bug (not so nice)?
TIA, Thomas