That's UTF-8, not UTF-16.

On May 5, 2011, at 1:57 PM, aaron morton wrote:
> The hardcore way to fix the data is to export to json with sstable2json, hand-edit, and then json2sstable it back.
>
> Also, to confirm: this only happens when data is written in 0.6 and then read back in 0.7?
>
> And what partitioner are you using? You can still see the keys?
>
> Can you use sstable2json against the 0.6 data?
>
> Looking at your last email, something looks fishy about the encoding:
> "My two keys that I send in my test program are 0xe695b0e69982e99693 and 0x666f6f, which decode to "数時間" and "foo" respectively."
>
> There are 9 bytes encoded there; I would expect a multiple of 2 for each character (using UTF-16 surrogate pairs, http://en.wikipedia.org/wiki/UTF-16/UCS-2).
>
> I looked the characters up and their encoding is different here:
> 数 0x6570 http://www.fileformat.info/info/unicode/char/6570/index.htm
> 時 0x6642 http://www.fileformat.info/info/unicode/char/6642/index.htm
> 間 0x9593 http://www.fileformat.info/info/unicode/char/9593/index.htm
>
> Am I missing something?
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 5 May 2011, at 23:09, Henrik Schröder wrote:
>
>> Yes, the keys were written in 0.6, but when I looked through the Thrift client code for 0.6, it explicitly converts all string keys to UTF-8 before sending them over to the server, so the encoding *should* be right. After the upgrade to 0.7.5, sstablekeys prints out the correct byte values for those keys, but Cassandra itself is unable to get those rows.
>>
>> I ran some more tests yesterday with a clean database where I only wrote two rows, one with an ASCII key and one with a unicode key, upgraded to 0.7.5, and ran nodetool cleanup, and that actually fixed it. After cleanup, the server could fetch both rows correctly.
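The 9-byte key under discussion is in fact consistent with UTF-8, where each of these CJK characters takes three bytes; the 2-bytes-per-character expectation above holds only for UTF-16. A minimal Python sketch (standing in for the C# client used in the thread) makes the difference visible:

```python
# The unicode key Henrik reported, encoded both ways.
key = "数時間"

utf8 = key.encode("utf-8")
print(utf8.hex())   # e695b0e69982e99693 -> 9 bytes, 3 per character

# UTF-16 (big-endian, no BOM) gives 2 bytes per character instead,
# matching the 0x6570 / 0x6642 / 0x9593 code points Aaron looked up:
utf16 = key.encode("utf-16-be")
print(utf16.hex())  # 657066429593 -> 6 bytes
```

So the bytes on disk were UTF-8 all along; no surrogate pairs are involved, since all three characters sit in the Basic Multilingual Plane.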
>> However, when I tried to do the same thing with a snapshot of our live database, where we have ~2 million keys, out of which ~1000 are unicode, cleanup failed with a lot of "Keys must be written in descending order" exceptions. I've tried various combinations of cleanup and scrub, running cleanup before upgrading, etc., but I've yet to find something that fixes all the problems without losing those rows.
>>
>> /Henrik
>>
>> On Thu, May 5, 2011 at 12:48, aaron morton <aa...@thelastpickle.com> wrote:
>> I take it back, the problem started in 0.6, where keys were strings. Looking into how 0.6 did its thing.
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 5 May 2011, at 22:36, aaron morton wrote:
>>
>>> Interesting, but as we are dealing with keys it should not matter, as they are treated as byte buffers.
>>>
>>> -----------------
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 5 May 2011, at 04:53, Daniel Doubleday wrote:
>>>
>>>> This is a bit of a wild guess, but Windows and encoding and 0.7.5 sounds like
>>>>
>>>> https://issues.apache.org/jira/browse/CASSANDRA-2367
>>>>
>>>> On May 3, 2011, at 5:15 PM, Henrik Schröder wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7, just to make sure that the change in how keys are encoded wouldn't cause us any data loss. Unfortunately, it seems that rows stored under a unicode key couldn't be retrieved after the upgrade. We're running everything on Windows, and we're using the generated Thrift client in C# to access it.
>>>>> I managed to make a minimal test that reproduces the error consistently:
>>>>>
>>>>> First, I started up Cassandra 0.6.13 with an empty data directory and a really simple config: a single keyspace with a single BytesType columnfamily. I wrote two rows, each with a single column with a simple column name and a 1-byte value of "1". The first row had a key using only ASCII chars ('foo'), and the second row had a key using unicode chars ('ドメインウ').
>>>>>
>>>>> Using multi_get with both those keys, I got both columns back, as expected. Using multi_get_slice with both those keys, I got both columns back, as expected. I also did a get_range_slices to get all rows in the columnfamily, and I got both columns back, as expected.
>>>>>
>>>>> So far so good. Then I drained and shut down Cassandra 0.6.13, started up Cassandra 0.7.5 pointing to the same data directory, with a config containing the same keyspace, and ran the schematool import command.
>>>>>
>>>>> I then started up my test program that uses the new Thrift API and ran some commands.
>>>>>
>>>>> Using multi_get_slice with those two keys encoded as UTF-8 byte arrays, I only get back one column, the one under the key 'foo'. The other row I simply can't retrieve.
>>>>>
>>>>> However, when I use get_range_slices to get all rows, I get back two rows with the correct column values, and the byte-array keys are identical to my encoded keys; when I decode the byte arrays as UTF-8 strings, I get back my two original keys. This means that both my rows are still there, and the keys as output by Cassandra are identical to the original string keys I used when I created the rows in 0.6.13, but it's just impossible to retrieve the second row.
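The retrieval step described above hinges on the byte-array key matching, byte for byte, whatever the 0.6 client originally wrote. A minimal sketch of the key-encoding step being assumed, in Python rather than the C# Thrift client (the hex value is simply what UTF-8 yields for this key):

```python
# Build the byte-array key that a 0.7 multi_get_slice call would need
# for a row written under the string key 'ドメインウ' by a 0.6 client
# that UTF-8-encodes string keys before sending them.
old_string_key = "ドメインウ"
byte_key = old_string_key.encode("utf-8")
print(byte_key.hex())  # e38389e383a1e382a4e383b3e382a6

# Round-tripping recovers the original string, which matches what Henrik
# saw when decoding the keys returned by get_range_slices:
assert byte_key.decode("utf-8") == old_string_key
```

If the server's stored key bytes differ from this encoding by even one byte, a direct lookup returns nothing while a range scan still surfaces the row.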
>>>>> To continue the test, I inserted a row with the key 'ドメインウ' encoded as UTF-8 again, and gave it a similar column as the original, but with a 1-byte value of "2".
>>>>>
>>>>> Now, when I use multi_get_slice with my two encoded keys, I get back two rows: the 'foo' row has the old value as expected, and the other row has the new value as expected.
>>>>>
>>>>> However, when I use get_range_slices to get all rows, I get back *three* rows, two of which have the *exact same* byte-array key; one has the old column, one has the new column.
>>>>>
>>>>> How is this possible? How can there be two different rows with the exact same key? I'm guessing that it's related to the encoding of string keys in 0.6, and that the internal representation is off somehow. I checked the generated Thrift client for 0.6, and it UTF-8-encodes all keys before sending them to the server, so it should be UTF-8 all the way, but apparently it isn't.
>>>>>
>>>>> Has anyone else experienced the same problem? Is it a platform-specific problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 without losing any rows? I would also really like to know which byte array I should send in to get back that second row; there has got to be some key that can be used to get it, the row is still there after all.
>>>>>
>>>>> /Henrik Schröder
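One plausible mechanism for two rows sharing the same visible key: assuming RandomPartitioner, a row's position on the ring is derived from the MD5 hash of its raw key bytes, so a row stored under one byte encoding of the key and a row stored under another land at different tokens even when the decoded strings print identically. A hypothetical Python sketch (the "mangled" encoding below is purely illustrative, not the actual bytes 0.6 produced on Windows):

```python
import hashlib

# The same visible key under two different byte encodings hashes to two
# different ring positions, so a direct lookup using one set of bytes can
# never find the row stored under the other.
utf8_key = "ドメインウ".encode("utf-8")
# Illustrative double-encoding mishap: UTF-8 bytes misread as Latin-1,
# then re-encoded as UTF-8.
mangled_key = utf8_key.decode("latin-1").encode("utf-8")

print(hashlib.md5(utf8_key).hexdigest())
print(hashlib.md5(mangled_key).hexdigest())
assert hashlib.md5(utf8_key).digest() != hashlib.md5(mangled_key).digest()
```

Under that reading, the pre-upgrade row and the freshly written row are distinct rows at distinct tokens whose keys merely decode to the same string, which would also explain why get_range_slices sees both while a keyed get sees only the new one.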