Hi,

We are migrating from a 0.8.8 ring to a 1.1.2 ring and we are noticing
missing data post-migration. We use pre-built/configured AMIs so our
preferred route is to leave our existing production 0.8.8 untouched and
bring up a parallel 1.1.2 ring and migrate data into it. Data is written to
the rings via batch processes so we can easily ensure that both the
existing and new rings will have the same data post-migration.

The ring we are migrating from is:

  * 12 nodes
  * single data-center, 3 AZs
  * 0.8.8

The ring we are migrating to is identical except that it runs 1.1.2.

The steps we are taking are:

1. Bring up a 1.1.2 ring in the same AZ/data center configuration with
tokens matching the corresponding nodes in the 0.8.8 ring.
2. Create the same keyspace on 1.1.2.
3. Create each CF in the keyspace on 1.1.2.
4. Flush each node of the 0.8.8 ring.
5. Rsync each non-compacted sstable from 0.8.8 to the corresponding node in
1.1.2.
6. Move each 0.8.8 sstable into the 1.1.2 directory structure by renaming
the file to the /cassandra/data/<keyspace>/<cf>/<keyspace>-<cf>... format.
For example, for the keyspace "Metrics" and CF "epochs_60" we get:
"/cassandra/data/Metrics/epochs_60/Metrics-epochs_60-g-941-Data.db".
7. On each 1.1.2 node run `nodetool -h localhost refresh Metrics <CF>` for
each CF in the keyspace. We notice that storage load jumps accordingly.
8. On each 1.1.2 node run `nodetool -h localhost upgradesstables`. This
takes a while but appears to correctly rewrite each sstable in the new 1.1.x
format. Storage load drops as sstables are compressed.
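
The rename in step 6 follows roughly the mapping sketched below. This is a
minimal sketch, not the exact script used: the helper name
`new_sstable_path` and the regular expression are illustrative, and it
assumes the 0.8.8 files are named `<cf>-<version>-<generation>-<component>.db`
so that prefixing the keyspace name yields the 1.1.2 file name.

```python
import re
from pathlib import PurePosixPath

# Assumed 0.8.x file name layout: <cf>-<version>-<generation>-<component>.db
# (hypothetical pattern for illustration; temporary "-tmp-" files and
# secondary-index files deliberately do not match).
OLD_NAME = re.compile(
    r"^(?P<cf>\w+)-(?P<version>[a-z]+)-(?P<gen>\d+)-(?P<component>\w+)\.db$"
)


def new_sstable_path(data_root, keyspace, old_name):
    """Map a 0.8.x sstable file name to its 1.1.x path under data_root."""
    match = OLD_NAME.match(old_name)
    if match is None:
        raise ValueError(f"unrecognized sstable file name: {old_name}")
    cf = match.group("cf")
    # 1.1.x layout: <data_root>/<keyspace>/<cf>/<keyspace>-<original name>
    return PurePosixPath(data_root) / keyspace / cf / f"{keyspace}-{old_name}"


print(new_sstable_path("/cassandra/data", "Metrics", "epochs_60-g-941-Data.db"))
# → /cassandra/data/Metrics/epochs_60/Metrics-epochs_60-g-941-Data.db
```

Rejecting names that do not match, rather than guessing, keeps partially
written files out of the new data directory.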

After these steps we run a script that validates data on the new ring. What
we've noticed is that large portions of the data that were on the 0.8.8 ring
are not available on the 1.1.2 ring. We've tried reading at both QUORUM and
ONE, but the data is missing in both cases.
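
To make "missing" concrete, the comparison amounts to something like the
sketch below. It is not the actual validation script: `find_discrepancies`,
`read_old` and `read_new` are hypothetical names, and the two readers are
assumed to be callables that take a row key and return a dict of column
name to value (empty when the row is absent), however the client library in
use provides that.

```python
def find_discrepancies(keys, read_old, read_new):
    """Split keys into rows missing from the new ring and rows that differ.

    read_old / read_new are hypothetical callables: key -> dict of columns,
    returning an empty dict when the row does not exist.
    """
    missing, different = [], []
    for key in keys:
        old_row = read_old(key)
        new_row = read_new(key)
        if old_row and not new_row:
            missing.append(key)      # present on 0.8.8, absent on 1.1.2
        elif old_row != new_row:
            different.append(key)    # present on both, contents differ
    return missing, different


# Usage with in-memory stand-ins for the two rings:
old = {"a": {"c1": 1}, "b": {"c1": 2}, "c": {"c1": 3}}
new = {"a": {"c1": 1}, "c": {"c1": 4}}
missing, different = find_discrepancies(
    old, lambda k: old.get(k, {}), lambda k: new.get(k, {})
)
print(missing, different)  # → ['b'] ['c']
```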

We have fewer than 143 million row keys in the CFs we're testing and none
of the *-Filter.db files are > 10MB, so I don't believe this is our
problem: https://issues.apache.org/jira/browse/CASSANDRA-3820
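
The filter-size check can be reproduced with something like the following.
This is only a sketch: `large_filter_files` is an illustrative name, the
10MB threshold is taken as 10 * 1024 * 1024 bytes, and the data directory
passed in is whatever path the node actually uses.

```python
from pathlib import Path

TEN_MB = 10 * 1024 * 1024  # assumed interpretation of the 10MB threshold


def large_filter_files(data_dir, threshold=TEN_MB):
    """Return (path, size_in_bytes) for each *-Filter.db file above threshold."""
    hits = []
    # rglob walks subdirectories, so both the flat 0.8.x layout and the
    # per-CF 1.1.x layout are covered.
    for path in sorted(Path(data_dir).rglob("*-Filter.db")):
        size = path.stat().st_size
        if size > threshold:
            hits.append((path, size))
    return hits
```

An empty result from `large_filter_files("/cassandra/data")` on every node
corresponds to the observation above that no filter file exceeds 10MB.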

Is there anything else we should test or verify? Are the steps above
correct for this type of upgrade? Is this type of upgrade/migration supported?

We have also tried running a repair across the cluster after step #8. While
it took a few retries due to
https://issues.apache.org/jira/browse/CASSANDRA-4456, we still had missing
data afterwards.

Any assistance would be appreciated.


Thanks!

Mike

-- 

  Mike Heffner <m...@librato.com>
  Librato, Inc.
