Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5

Daniel Doubleday Thu, 05 May 2011 06:00:54 -0700

Don't know if this helps you, but since we had the same SSTable corruption I 
was looking into that very code the other day:


If you can afford to drop these rows and are able to recognize them, the 
easiest way would be to patch:

SSTableScanner:162

public IColumnIterator next()
{
    try
    {
        if (row != null)
            file.seek(finishedAt);
        assert !file.isEOF();

        DecoratedKey key = SSTableReader.decodeKey(sstable.partitioner,
                                                   sstable.descriptor,
                                                   ByteBufferUtil.readWithShortLength(file));
        long dataSize = SSTableReader.readRowSize(file, sstable.descriptor);
        long dataStart = file.getFilePointer();
        finishedAt = dataStart + dataSize;

        if (filter == null)
        {
            row = new SSTableIdentityIterator(sstable, file, key, dataStart, dataSize);
            return row;
        }
        else
        {
            return row = filter.getSSTableColumnIterator(sstable, file, key);
        }
    }
    catch (IOException e)
    {
        throw new RuntimeException(SSTableScanner.this + " failed to provide next columns from " + this, e);
    }
}

The string key is new String(ByteBufferUtil.getArray(key.key), "UTF-8").
If you find one that you don't like, just skip it.

This way compaction goes through, but obviously you'll lose data.
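
To make the "recognize them" part a bit more concrete, here is a rough, 
standalone sketch of a key check. It is not Cassandra code: KeyCheck, 
decodeUtf8OrNull and shouldSkip are hypothetical names, and the skip rule 
(key bytes are not well-formed UTF-8) is only an assumed example of a key 
you "don't like". Your own criterion may well be different. It uses a strict 
CharsetDecoder because new String(bytes, "UTF-8") typically substitutes 
replacement characters for bad input instead of failing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of Cassandra.
final class KeyCheck {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    /** Strictly decodes the key bytes as UTF-8; returns null if they are malformed. */
    static String decodeUtf8OrNull(ByteBuffer key) {
        CharsetDecoder decoder = UTF8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // duplicate() so the caller's buffer position is left untouched
            return decoder.decode(key.duplicate()).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /** Assumed skip rule: drop rows whose key is not well-formed UTF-8. */
    static boolean shouldSkip(ByteBuffer key) {
        return decodeUtf8OrNull(key) == null;
    }

    public static void main(String[] args) {
        ByteBuffer good = ByteBuffer.wrap("foo".getBytes(UTF8));
        ByteBuffer bad = ByteBuffer.wrap(new byte[] { (byte) 0xC3, (byte) 0x28 });
        System.out.println("skip good key: " + shouldSkip(good));
        System.out.println("skip bad key:  " + shouldSkip(bad));
    }
}
```

In the patched next() you would run a check like this on the key bytes 
right after the key has been decoded, and move on to the following row 
instead of returning an iterator when it says skip.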

On May 5, 2011, at 1:12 PM, Henrik Schröder wrote:

> Yeah, I've seen that one, and I'm guessing that it's the root cause of my 
> problems, something something encoding error, but that doesn't really help 
> me. :-)
> 
> However, I've done all my tests with 0.7.5, I'm gonna try them again with 
> 0.7.4, just to see how that version reacts.
> 
> 
> /Henrik
> 
> On Wed, May 4, 2011 at 18:53, Daniel Doubleday <daniel.double...@gmx.net> 
> wrote:
> This is a bit of a wild guess but Windows and encoding and 0.7.5 sounds like
> 
> https://issues.apache.org/jira/browse/CASSANDRA-2367
> 
>  
> On May 3, 2011, at 5:15 PM, Henrik Schröder wrote:
> 
>> Hey everyone,
>> 
>> We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7, 
>> just to make sure that the change in how keys are encoded wouldn't cause us 
>> any data loss. Unfortunately it seems that rows stored under a unicode key 
>> couldn't be retrieved after the upgrade. We're running everything on 
>> Windows, and we're using the generated thrift client in C# to access it.
>> 
>> I managed to make a minimal test to reproduce the error consistently:
>> 
>> First, I started up Cassandra 0.6.13 with an empty data directory, and a 
>> really simple config with a single keyspace with a single bytestype 
>> columnfamily.
>> I wrote two rows, each with a single column with a simple column name and a 
>> 1-byte value of "1". The first row had a key using only ascii chars ('foo'), 
>> and the second row had a key using unicode chars ('ドメインウ').
>> 
>> Using multi_get, and both those keys, I got both columns back, as expected.
>> Using multi_get_slice and both those keys, I got both columns back, as 
>> expected.
>> I also did a get_range_slices to get all rows in the columnfamily, and I got 
>> both columns back, as expected.
>> 
>> So far so good. Then I drain and shut down Cassandra 0.6.13, and start up 
>> Cassandra 0.7.5, pointing to the same data directory, with a config 
>> containing the same keyspace, and I run the schematool import command.
>> 
>> I then start up my test program that uses the new thrift api, and run some 
>> commands.
>> 
>> Using multi_get_slice, and those two keys encoded as UTF8 byte-arrays, I 
>> only get back one column, the one under the key 'foo'. The other row I 
>> simply can't retrieve.
>> 
>> However, when I use get_range_slices to get all rows, I get back two rows, 
>> with the correct column values, and the byte-array keys are identical to my 
>> encoded keys, and when I decode the byte-arrays as UTF8 strings, I get back 
>> my two original keys. This means that both my rows are still there, the keys 
>> as output by Cassandra are identical to the original string keys I used when 
>> I created the rows in 0.6.13, but it's just impossible to retrieve the 
>> second row.
>> 
>> To continue the test, I inserted a row with the key 'ドメインウ' encoded as UTF-8 
>> again, and gave it a similar column as the original, but with a 1-byte value 
>> of "2".
>> 
>> Now, when I use multi_get_slice with my two encoded keys, I get back two 
>> rows, the 'foo' row has the old value as expected, and the other row has the 
>> new value as expected.
>> 
>> However, when I use get_range_slices to get all rows, I get back *three* 
>> rows, two of which have the *exact same* byte-array key, one has the old 
>> column, one has the new column. 
>> 
>> 
>> How is this possible? How can there be two different rows with the exact 
>> same key? I'm guessing that it's related to the encoding of string keys in 
>> 0.6, and that the internal representation is off somehow. I checked the 
>> generated thrift client for 0.6, and it UTF8-encodes all keys before sending 
>> them to the server, so it should be UTF8 all the way, but apparently it 
>> isn't.
>> 
>> Has anyone else experienced the same problem? Is it a platform-specific 
>> problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 and not 
>> lose any rows? I would also really like to know which byte-array I should 
>> send in to get back that second row, there's gotta be some key that can be 
>> used to get it, the row is still there after all.
>> 
>> 
>> /Henrik Schröder
> 
>
