The two keys that I send in my test program are 0xe695b0e69982e99693 and
0x666f6f, which decode to "数時間" and "foo" respectively.
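For reference, the decoding can be checked with a couple of lines of Python
(assuming the keys are UTF-8, which is what my client uses):

```python
# Decode the two hex-encoded row keys back to their string form.
for hex_key in ("e695b0e69982e99693", "666f6f"):
    raw = bytes.fromhex(hex_key)
    print(hex_key, "->", raw.decode("utf-8"))
# e695b0e69982e99693 -> 数時間
# 666f6f -> foo
```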

So I ran my tests again: I started with a clean 0.6.13, wrote two rows with
those two keys, drained, shut down, started 0.7.5, and imported my keyspace.

In my test program, when I do multi_get_slice, I send in those two keys, and
get back a data structure that contains the exact same keys, but only the
structure under the key 0x666f6f contains any columns.

When I do a simple get with the first key, I get a NotFoundException. The
second key works fine.

Doing get_range_slices, I get back two KeySlices; the keys are exactly the
same, and both have their columns.

If I run sstablekeys on the datafile, it prints out:
e695b0e69982e99693
666f6f

If I run sstable2json on the datafile, it prints out:
{
"e695b0e69982e99693": [["00", "01", 1304519723589, false]],
"666f6f": [["00", "01", 1304519721274, false]]
}


After that I re-inserted a row with the first key and then ran my tests
again. Now both single gets work fine, and multi_get_slice works fine, but
get_range_slices returns a structure with three keys:
0xe695b0e69982e99693
0xe695b0e69982e99693
0x666f6f

I restarted Cassandra to make it flush the commitlog, and my data directory
now has two data files. When I run sstablekeys on the first one, it still
prints out:
e695b0e69982e99693
666f6f

And running it on the second datafile makes it print out:
e695b0e69982e99693


After all that, I forced a compaction with nodetool and restarted the
server, ending up with a single datafile. When I run sstable2json on that,
it prints out:
{
"e695b0e69982e99693": [["00", "01", 1304519723589, false]],
"e695b0e69982e99693": [["00", "02", 1304521931818, false]],
"666f6f": [["00", "01", 1304519721274, false]]
}
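As a side note on that output: a JSON object with duplicate keys is not
usable as-is, because standard JSON parsers silently collapse the
duplicates. For example, round-tripping that exact output through Python's
json module keeps only one of the two rows:

```python
import json

# The sstable2json output, verbatim (JSON uses lowercase false).
doc = """{
"e695b0e69982e99693": [["00", "01", 1304519723589, false]],
"e695b0e69982e99693": [["00", "02", 1304521931818, false]],
"666f6f": [["00", "01", 1304519721274, false]]
}"""

rows = json.loads(doc)
print(len(rows))  # 2 -- the duplicate key collapses; the later entry wins
print(rows["e695b0e69982e99693"])
```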

So I now have an SSTable with two rows with identical keys, except one of
the rows doesn't really work? So, now what? And how did I end up in this
state?
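As for how it could happen, I don't know, but an encoding mix-up somewhere
along the upgrade path would fit your earlier suggestion. Purely as an
illustration (the charsets here are my assumption, not anything I've
confirmed in the code), misreading the UTF-8 key bytes as ISO-8859-1 and
re-encoding them yields a different byte sequence, i.e. a different key:

```python
# Illustrative only: a wrong-charset round-trip silently changes the key.
original = bytes.fromhex("e695b0e69982e99693")  # "数時間" in UTF-8
mangled = original.decode("iso-8859-1").encode("utf-8")
print(mangled.hex())
print(mangled == original)  # False -- the key no longer matches
```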


/Henrik Schröder


On Tue, May 3, 2011 at 22:10, aaron morton <aa...@thelastpickle.com> wrote:

> Can you provide some details of the data returned when you do the
> get_range()? It will be interesting to see the raw bytes returned for
> the keys. The likely culprit is a change in the encoding. Can you also
> try to grab the bytes sent for the key when doing the single select that
> fails.
>
> You can grab these either on the client and/or by turning on DEBUG
> logging in conf/log4j-server.properties
>
> Thanks
> Aaron
>
> On 4 May 2011, at 03:19, Henrik Schröder wrote:
>
> > The way we solved this problem is that it turned out we had only a few
> hundred rows with unicode keys, so we simply extracted them, upgraded to
> 0.7, and wrote them back. However, this means that among the rows, there are
> a few hundred weird duplicate rows with identical keys.
> >
> > Is this going to be a problem in the future? Is there a chance that the
> good duplicate is cleaned out in favour of the bad duplicate so that we
> suddenly lose those rows again?
> >
> >
> > /Henrik Schröder
>
>
