Hi,

We are migrating from a 0.8.8 ring to a 1.1.2 ring and we are noticing missing data post-migration. We use pre-built/configured AMIs, so our preferred route is to leave our existing production 0.8.8 ring untouched, bring up a parallel 1.1.2 ring, and migrate data into it. Data is written to the rings via batch processes, so we can easily ensure that both the existing and new rings will have the same data post-migration.

The ring we are migrating from is:

* 12 nodes
* single data-center, 3 AZs
* 0.8.8

The ring we are migrating to is the same except 1.1.2.

The steps we are taking are:

1. Bring up a 1.1.2 ring in the same AZ/data center configuration with tokens matching the corresponding nodes in the 0.8.8 ring.
2. Create the same keyspace on 1.1.2.
3. Create each CF in the keyspace on 1.1.2.
4. Flush each node of the 0.8.8 ring.
5. Rsync each non-compacted sstable from 0.8.8 to the corresponding node in 1.1.2.
6. Move each 0.8.8 sstable into the 1.1.2 directory structure by renaming the file to the /cassandra/data/<keyspace>/<cf>/<keyspace>-<cf>... format. For example, for the keyspace "Metrics" and CF "epochs_60" we get: "cassandra/data/Metrics/epochs_60/Metrics-epochs_60-g-941-Data.db".
7. On each 1.1.2 node run `nodetool -h localhost refresh Metrics <CF>` for each CF in the keyspace. We notice that storage load jumps accordingly.
8. On each 1.1.2 node run `nodetool -h localhost upgradesstables`. This takes a while but appears to correctly rewrite each sstable in the new 1.1.x format. Storage load drops as sstables are compressed.

After these steps we run a script that validates data on the new ring. What we've noticed is that large portions of the data that were on the 0.8.8 ring are not available on the 1.1.2 ring. We've tried reading at both QUORUM and ONE, but the data appears to be missing in both cases.

We have fewer than 143 million row keys in the CFs we're testing and none of the *-Filter.db files are > 10MB, so I don't believe this is our problem:

https://issues.apache.org/jira/browse/CASSANDRA-3820

Is there anything else we should test or verify? Are the steps above correct for this type of upgrade? Is this type of upgrade/migration supported?

We have also tried running a repair across the cluster after step #8. While it took a few retries due to https://issues.apache.org/jira/browse/CASSANDRA-4456, we still had missing data afterwards.

Any assistance would be appreciated.

Thanks!

Mike

--
Mike Heffner <m...@librato.com>
Librato, Inc.
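
For reference, the rename in step 6 can be sketched roughly as below. This is a minimal illustration in Python, not the script actually used for the migration. It assumes the 0.8.8 files sit in one flat per-keyspace directory and are named <cf>-<version>-<generation>-<component> (for example epochs_60-g-941-Data.db), and that a compacted sstable is flagged by a sibling <cf>-<version>-<generation>-Compacted marker file. The function names (new_sstable_path, plan_renames, apply_renames) are hypothetical.

```python
import re
import shutil
from pathlib import Path

# Assumed 0.8.x file name layout: <cf>-<version>-<generation>-<component>
# e.g. epochs_60-g-941-Data.db. Temporary files such as
# epochs_60-tmp-g-941-Data.db deliberately do not match.
OLD_NAME = re.compile(
    r"(?P<cf>[A-Za-z0-9_]+)-(?P<version>[a-z]+)-(?P<generation>\d+)"
    r"-(?P<component>[A-Za-z]+\.[A-Za-z0-9]+)"
)


def new_sstable_path(data_root, keyspace, old_name):
    """Map a 0.8-style file name to <data_root>/<keyspace>/<cf>/<keyspace>-<old_name>.

    Returns None when the name does not look like an sstable component.
    """
    m = OLD_NAME.fullmatch(old_name)
    if m is None:
        return None
    return Path(data_root) / keyspace / m["cf"] / f"{keyspace}-{old_name}"


def plan_renames(src_dir, data_root, keyspace):
    """Build (source, destination) pairs without touching the filesystem."""
    src_dir = Path(src_dir)
    names = sorted(p.name for p in src_dir.iterdir() if p.is_file())
    # Assumption: "<cf>-<version>-<generation>-Compacted" marks a compacted sstable.
    compacted = {n[: -len("-Compacted")] for n in names if n.endswith("-Compacted")}
    plan = []
    for name in names:
        m = OLD_NAME.fullmatch(name)
        if m is None:
            continue  # not an sstable component (marker, tmp file, etc.)
        prefix = f"{m['cf']}-{m['version']}-{m['generation']}"
        if prefix in compacted:
            continue  # skip every component of a compacted sstable
        plan.append((src_dir / name, new_sstable_path(data_root, keyspace, name)))
    return plan


def apply_renames(plan):
    """Move each file into its per-CF directory, creating directories as needed."""
    for src, dst in plan:
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
```

Printing the result of plan_renames before calling apply_renames gives a dry run, which makes it easy to confirm that every component of an sstable (Data, Index, Filter, Statistics) is moved together and lands under the expected CF directory.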