I can't run sstable2json on the data files from 0.7; it throws the same "Keys must be written in ascending order." error as compaction. I can run sstable2json on the 0.6 data files, but when I tested that, the unicode characters in the keys got completely mangled, since it outputs keys in string format, not byte format.
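If a tool emits keys as decoded strings rather than as raw bytes, any charset mismatch along the way garbles non-ASCII keys irreversibly, while a hex dump of the bytes is always recoverable. A minimal Java sketch of the difference (the default-charset decode here is illustrative, not the tool's actual code path):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class KeyFormats {
        // Hex-encode raw key bytes: lossless, byte-for-byte.
        static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02x", b));
            return sb.toString();
        }

        public static void main(String[] args) {
            byte[] key = "数時間".getBytes(StandardCharsets.UTF_8);

            // Byte format: prints e695b0e69982e99693; the key can always
            // be reconstructed from this output.
            System.out.println(toHex(key));

            // String format: decoding through the platform default charset
            // (e.g. windows-1252) mangles the key, and the original bytes
            // are not recoverable from what gets printed.
            System.out.println(new String(key, Charset.defaultCharset()));
        }
    }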
I'm using the random partitioner, and as I said before, in my little test I can see the keys using get_range_slices and sstablekeys, but not using get or multiget_slice. But the actual data files with our live data are too corrupted to get anything: sstablekeys throws the error, cleanup throws the error, etc.

/Henrik Schröder

On Thu, May 5, 2011 at 13:57, aaron morton <aa...@thelastpickle.com> wrote:
> The hard-core way to fix the data is to export to json with sstable2json, hand edit, and then json2sstable it back.
>
> Also to confirm, this only happens when data is written in 0.6 and then read back in 0.7?
>
> And what partitioner are you using? You can still see the keys?
>
> Can you use sstable2json against the 0.6 data?
>
> Looking at your last email, something looks fishy about the encoding...
> "My two keys that I send in my test program are 0xe695b0e69982e99693 and 0x666f6f, which decode to "数時間" and "foo" respectively."
>
> There are 9 bytes encoded there; I would expect a multiple of 2 for each character (using UTF-16 surrogate pairs, http://en.wikipedia.org/wiki/UTF-16/UCS-2).
>
> I looked the characters up and their encoding is different here:
> 数 0x6570 http://www.fileformat.info/info/unicode/char/6570/index.htm
> 時 0x6642 http://www.fileformat.info/info/unicode/char/6642/index.htm
> 間 0x9593 http://www.fileformat.info/info/unicode/char/9593/index.htm
>
> Am I missing something?
>
> Hope that helps.
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 5 May 2011, at 23:09, Henrik Schröder wrote:
>
> Yes, the keys were written to 0.6, but when I looked through the thrift client code for 0.6, it explicitly converts all string keys to UTF-8 before sending them over to the server, so the encoding *should* be right, and after the upgrade to 0.7.5, sstablekeys prints out the correct byte values for those keys, but Cassandra itself is unable to get those rows.
>
> I ran some more tests yesterday with a clean database where I only wrote two rows, one with an ascii key and one with a unicode key, upgraded to 0.7.5, and ran nodetool cleanup, and that actually fixed it. After cleanup, the server could fetch both rows correctly.
>
> However, when I tried to do the same thing with a snapshot of our live database, where we have ~2 million keys, out of which ~1000 are unicode, cleanup failed with a lot of "Keys must be written in ascending order" exceptions. I've tried various combinations of cleanup and scrub, running cleanup before upgrading, etc., but I've yet to find something that fixes all the problems without losing those rows.
>
> /Henrik
>
> On Thu, May 5, 2011 at 12:48, aaron morton <aa...@thelastpickle.com> wrote:
>> I take it back, the problem started in 0.6, where keys were strings. Looking into how 0.6 did its thing.
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 5 May 2011, at 22:36, aaron morton wrote:
>>
>> Interesting, but as we are dealing with keys it should not matter, as they are treated as byte buffers.
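For what it's worth, the 9 bytes are consistent: the key on disk is UTF-8, where each of these CJK characters takes 3 bytes, while the 2-bytes-per-character expectation (and the 0x6570/0x6642/0x9593 values quoted above) describes UTF-16 code units, not the bytes on disk. A quick Java check:

    import java.nio.charset.StandardCharsets;

    public class EncodingCheck {
        public static void main(String[] args) {
            String key = "数時間";

            // UTF-8 uses 3 bytes for each of these characters:
            // e6 95 b0  e6 99 82  e9 96 93 -- exactly the posted key.
            System.out.println(key.getBytes(StandardCharsets.UTF_8).length);    // 9

            // UTF-16 uses 2 bytes each (65 70  66 42  95 93), which is
            // where the multiple-of-2 expectation comes from.
            System.out.println(key.getBytes(StandardCharsets.UTF_16BE).length); // 6
        }
    }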
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 5 May 2011, at 04:53, Daniel Doubleday wrote:
>>
>> This is a bit of a wild guess, but Windows and encoding and 0.7.5 sounds like
>> https://issues.apache.org/jira/browse/CASSANDRA-2367
>>
>> On May 3, 2011, at 5:15 PM, Henrik Schröder wrote:
>>
>> Hey everyone,
>>
>> We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7, just to make sure that the change in how keys are encoded wouldn't cause us any data loss. Unfortunately it seems that rows stored under a unicode key couldn't be retrieved after the upgrade. We're running everything on Windows, and we're using the generated thrift client in C# to access it.
>>
>> I managed to make a minimal test to reproduce the error consistently:
>>
>> First, I started up Cassandra 0.6.13 with an empty data directory and a really simple config: a single keyspace with a single BytesType columnfamily. I wrote two rows, each with a single column with a simple column name and a 1-byte value of "1". The first row had a key using only ascii chars ('foo'), and the second row had a key using unicode chars ('ドメインウ').
>>
>> Using multiget with both those keys, I got both columns back, as expected. Using multiget_slice with both those keys, I got both columns back, as expected. I also did a get_range_slices to get all rows in the columnfamily, and I got both columns back, as expected.
>>
>> So far so good. Then I drained and shut down Cassandra 0.6.13, started up Cassandra 0.7.5 pointing to the same data directory, with a config containing the same keyspace, and ran the schematool import command.
>>
>> I then started up my test program that uses the new thrift api and ran some commands.
>>
>> Using multiget_slice with those two keys encoded as UTF-8 byte arrays, I only get back one column, the one under the key 'foo'. The other row I simply can't retrieve.
>>
>> However, when I use get_range_slices to get all rows, I get back two rows with the correct column values, the byte-array keys are identical to my encoded keys, and when I decode the byte arrays as UTF-8 strings, I get back my two original keys. This means that both my rows are still there, and the keys as output by Cassandra are identical to the original string keys I used when I created the rows in 0.6.13, but it's just impossible to retrieve the second row.
>>
>> To continue the test, I inserted a row with the key 'ドメインウ' encoded as UTF-8 again, and gave it a similar column as the original, but with a 1-byte value of "2".
>>
>> Now, when I use multiget_slice with my two encoded keys, I get back two rows: the 'foo' row has the old value as expected, and the other row has the new value as expected.
>>
>> However, when I use get_range_slices to get all rows, I get back *three* rows, two of which have the *exact same* byte-array key; one has the old column, one has the new column.
>>
>> How is this possible? How can there be two different rows with the exact same key? I'm guessing that it's related to the encoding of string keys in 0.6, and that the internal representation is off somehow. I checked the generated thrift client for 0.6, and it UTF-8-encodes all keys before sending them to the server, so it should be UTF-8 all the way, but apparently it isn't.
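One mechanism that would explain all three observations above (point reads miss the row, get_range_slices still sees it, and a re-insert produces a second row with identical key bytes): under RandomPartitioner a row is placed at a token derived from an MD5 hash of its key bytes, and a point read probes only the token computed from the bytes you pass in, while a range slice walks every token. If 0.6 on Windows ever hashed the key string through the platform default charset, the old row would sit at a token that a 0.7 lookup by UTF-8 bytes never probes. This is a hypothesis in line with the Windows/encoding ticket linked above, not a confirmed diagnosis; a sketch:

    import java.math.BigInteger;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class TokenCheck {
        // Roughly how RandomPartitioner derives a row's token:
        // the (absolute value of the) MD5 hash of the raw key bytes.
        static BigInteger token(byte[] keyBytes) throws Exception {
            return new BigInteger(MessageDigest.getInstance("MD5").digest(keyBytes)).abs();
        }

        public static void main(String[] args) throws Exception {
            String key = "ドメインウ";

            byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);
            // Hypothetical 0.6-on-Windows behaviour: encoding the key string
            // with the platform default charset; windows-1252 cannot represent
            // katakana and substitutes '?' for each character.
            byte[] mangled = key.getBytes(Charset.forName("windows-1252"));

            System.out.println(token(utf8));    // where a 0.7 point read looks
            System.out.println(token(mangled)); // where the old row may actually live
        }
    }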
>> Has anyone else experienced the same problem? Is it a platform-specific problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 without losing any rows? I would also really like to know which byte array I should send in to get back that second row; there's gotta be some key that can be used to get it, the row is still there after all.
>>
>> /Henrik Schröder
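For reference, the hand-repair route suggested at the top of the thread would look roughly like this against a 0.7 node (the keyspace, columnfamily, and file names are placeholders; check each tool's usage output for the exact arguments in your build):

    # Dump the sstable to JSON; in 0.7 row keys should appear hex-encoded,
    # so the affected keys can be fixed byte-for-byte in an editor.
    bin/sstable2json /var/lib/cassandra/data/MyKeyspace/MyCF-f-1-Data.db > rows.json

    # ... hand-edit rows.json ...

    # Write the edited rows back into an sstable.
    bin/json2sstable -K MyKeyspace -c MyCF rows.json MyCF-f-1-Data.db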