Out of interest I've done some more digging. Not sure how much more I've
contributed, but here goes...

Ran this against a clean v0.6.12 and it works (I expected it to fail on the
first read):

    # -*- coding: utf-8 -*-
    import pycassa

    client = pycassa.connect()
    standard1 = pycassa.ColumnFamily(client, 'Keyspace1', 'Standard1')

    # utf-8 encode the unicode key before using it
    uni_str = u"数時間"
    uni_str = uni_str.encode("utf-8")

    print "Insert row", uni_str
    print uni_str, standard1.insert(uni_str, {"bar": "baz"})

    print "Read rows"
    print "???", standard1.get("???")
    print uni_str, standard1.get(uni_str)

Ran that against the current 0.6 head from the command line and it works. Run
against the same code started from inside IntelliJ, it fails as expected. The
code also fails as expected on 0.7.5.

At one stage I grabbed the buffer created by fastbinary.encode_binary in the
Python-generated batch_mutate_args.write(), and the key looked correctly utf-8
encoded (the bytes matched the earlier utf-8 encoding of that string).
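
For reference, this is roughly how that buffer can be captured. It's a sketch
from memory rather than the exact code: the cassandra.Cassandra /
cassandra.ttypes import paths, the batch_mutate_args field names, and the
timestamp value are assumptions about the 0.6 generated bindings.

    # -*- coding: utf-8 -*-
    # Sketch: serialize a batch_mutate_args struct into a memory buffer so the
    # raw thrift bytes (including the row key) can be inspected.
    from thrift.transport import TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra
    from cassandra.ttypes import (Column, ColumnOrSuperColumn, Mutation,
                                  ConsistencyLevel)

    key = u"数時間".encode("utf-8")
    mutation = Mutation(column_or_supercolumn=ColumnOrSuperColumn(
        column=Column(name="bar", value="baz", timestamp=0)))
    args = Cassandra.batch_mutate_args(
        keyspace="Keyspace1",
        mutation_map={key: {"Standard1": [mutation]}},
        consistency_level=ConsistencyLevel.ONE)

    # TBinaryProtocolAccelerated is the code path that calls
    # fastbinary.encode_binary inside the generated write().
    trans = TTransport.TMemoryBuffer()
    args.write(TBinaryProtocol.TBinaryProtocolAccelerated(trans))
    print repr(trans.getvalue())  # the utf-8 bytes of the key appear intact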

I've updated the git project https://github.com/amorton/cassandra-unicode-bug 

I'm going to leave it there unless there is interest in looking into it further.
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 8 May 2011, at 13:31, Jonathan Ellis wrote:

> Right, that's sort of a half-repair: it will repair differences in
> replies it got, but it won't double-check md5s on the rest in the
> background. So if you're doing CL.ONE reads this is a no-op.
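
A minimal pycassa sketch of that distinction; the read_consistency_level
keyword on get() and the ConsistencyLevel import from the generated
cassandra.ttypes module are assumptions about the client version in use.

    # With CL.ONE a read returns after a single replica responds, so there are
    # no differing replies for read repair to reconcile; a QUORUM (or ALL)
    # read gives the resolver something to compare.
    from cassandra.ttypes import ConsistencyLevel

    standard1.get(uni_str, read_consistency_level=ConsistencyLevel.ONE)     # nothing to repair
    standard1.get(uni_str, read_consistency_level=ConsistencyLevel.QUORUM)  # can repair differences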
> 
> On Sat, May 7, 2011 at 4:25 PM, aaron morton <aa...@thelastpickle.com> wrote:
>> I remembered something like that so had a look at 
>> RangeSliceResponseResolver.resolve()  in 0.6.12 and it looks like it 
>> schedules the repairs...
>> 
>>            protected Row getReduced()
>>            {
>>                ColumnFamily resolved = ReadResponseResolver.resolveSuperset(versions);
>>                ReadResponseResolver.maybeScheduleRepairs(resolved, table, key, versions, versionSources);
>>                versions.clear();
>>                versionSources.clear();
>>                return new Row(key, resolved);
>>            }
>> 
>> 
>> Is that right?
>> 
>> 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 8 May 2011, at 00:48, Jonathan Ellis wrote:
>> 
>>> range_slices respects consistencylevel, but only single-row reads and
>>> multiget do the *repair* part of RR.
>>> 
>>> On Sat, May 7, 2011 at 1:44 AM, aaron morton <aa...@thelastpickle.com> 
>>> wrote:
>>>> get_range_slices() does read repair if enabled (controlled by
>>>> DoConsistencyChecksBoolean in the config; it's on by default), so you
>>>> should be getting good reads. If you want belt-and-braces, run nodetool
>>>> repair first.
>>>> 
>>>> Hope that helps.
>>>> 
>>>> 
>>>> On 7 May 2011, at 11:46, Jeremy Hanna wrote:
>>>> 
>>>>> Great!  I just wanted to make sure you were getting the information you 
>>>>> needed.
>>>>> 
>>>>> On May 6, 2011, at 6:42 PM, Henrik Schröder wrote:
>>>>> 
>>>>>> Well, I already completed the migration program. Using get_range_slices 
>>>>>> I could migrate a few thousand rows per second, which means that 
>>>>>> migrating all of our data would take a few minutes, and we'll end up 
>>>>>> with pristine datafiles for the new cluster. Problem solved!
>>>>>> 
>>>>>> I'll see if I can create datafiles in 0.6 that are uncleanable in 0.7 so 
>>>>>> that you all can repeat this and hopefully fix it.
>>>>>> 
>>>>>> 
>>>>>> /Henrik Schröder
>>>>>> 
>>>>>> On Sat, May 7, 2011 at 00:35, Jeremy Hanna <jeremy.hanna1...@gmail.com> 
>>>>>> wrote:
>>>>>> If you're able, go into the #cassandra channel on freenode (IRC) and 
>>>>>> talk to driftx or jbellis or aaron_morton about your problem.  It could 
>>>>>> be that you don't have to do all of this based on a conversation there.
>>>>>> 
>>>>>> On May 6, 2011, at 5:04 AM, Henrik Schröder wrote:
>>>>>> 
>>>>>>> I'll see if I can make some example broken files this weekend.
>>>>>>> 
>>>>>>> 
>>>>>>> /Henrik Schröder
>>>>>>> 
>>>>>>> On Fri, May 6, 2011 at 02:10, aaron morton <aa...@thelastpickle.com> 
>>>>>>> wrote:
>>>>>>> The difficulty is the different thrift clients between 0.6 and 0.7.
>>>>>>> 
>>>>>>> If you want to roll your own solution I would consider:
>>>>>>> - write an app to talk to 0.6 and pull out the data using keys from the
>>>>>>> other system (so you can check referential integrity while you are at
>>>>>>> it). Dump the data to a flat file (see the sketch after this list).
>>>>>>> - write an app to talk to 0.7 to load the data back in.
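
A minimal sketch of the dump side of that approach, assuming the same pycassa
setup as the snippet at the top of the thread; get_keys() is a hypothetical
iterator over the keys held in the secondary system, and the output is simply
one JSON document per line.

    # Dump rows from the 0.6 cluster to a flat file, driven by externally
    # known keys so referential integrity can be checked along the way.
    import json
    import pycassa

    client = pycassa.connect()
    standard1 = pycassa.ColumnFamily(client, 'Keyspace1', 'Standard1')

    with open("standard1_dump.txt", "w") as out:
        for key in get_keys():   # hypothetical: keys from the other system
            columns = standard1.get(key)
            out.write(json.dumps({"key": key, "columns": columns}) + "\n")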
>>>>>>> 
>>>>>>> I've not given up digging on your migration problem; having to manually
>>>>>>> dump and reload when you've done nothing wrong is not the best solution.
>>>>>>> I'll try to find some time this weekend to test with:
>>>>>>> 
>>>>>>> - 0.6 server, random partitioner, standard CFs, byte columns
>>>>>>> - load with Python or the CLI on OS X or Ubuntu (don't have a Windows
>>>>>>> machine any more)
>>>>>>> - migrate and see what's going on.
>>>>>>> 
>>>>>>> If you can spare some sample data to load, please send it over on the
>>>>>>> user group or to my email address.
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>>> -----------------
>>>>>>> Aaron Morton
>>>>>>> Freelance Cassandra Developer
>>>>>>> @aaronmorton
>>>>>>> http://www.thelastpickle.com
>>>>>>> 
>>>>>>> On 6 May 2011, at 05:52, Henrik Schröder wrote:
>>>>>>> 
>>>>>>>> We can't do a straight upgrade from 0.6.13 to 0.7.5 because we have 
>>>>>>>> rows stored that have unicode keys, and Cassandra 0.7.5 thinks those 
>>>>>>>> rows in the sstables are corrupt, and it seems impossible to clean it 
>>>>>>>> up without losing data.
>>>>>>>> 
>>>>>>>> However, we can still read all rows perfectly via thrift, so we are now
>>>>>>>> looking at building a simple tool that will copy all rows from our
>>>>>>>> 0.6.13 cluster to a parallel 0.7.5 cluster. Our question now is how to
>>>>>>>> do that and ensure that we actually get all rows migrated. It's a
>>>>>>>> pretty small cluster: 3 machines, a single keyspace, a single
>>>>>>>> column family, ~2 million rows, a few GB of data, and a replication
>>>>>>>> factor of 3.
>>>>>>>> 
>>>>>>>> So what's the best way? Call get_range_slices and move through the 
>>>>>>>> entire token space? We also have all row keys in a secondary system;
>>>>>>>> would it be better to use that and make calls to get_multi or
>>>>>>>> get_multi_slices instead? Are we correct in assuming that if we use
>>>>>>>> the consistencylevel ALL we'll get all rows?
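
A rough sketch of the get_range_slices route, using pycassa's paging wrapper;
get_range(), the read_consistency_level keyword, and the migrate_row() helper
are assumptions rather than confirmed API for the versions involved.

    # Walk the whole token space on the 0.6 cluster and push each row into
    # the new cluster. Reading at ALL waits for every replica, so each row is
    # reconciled across all three copies rather than taken from just one.
    import pycassa
    from cassandra.ttypes import ConsistencyLevel

    client = pycassa.connect()   # the 0.6 cluster
    standard1 = pycassa.ColumnFamily(client, 'Keyspace1', 'Standard1')

    count = 0
    for key, columns in standard1.get_range(
            read_consistency_level=ConsistencyLevel.ALL):
        migrate_row(key, columns)   # hypothetical: writes the row to 0.7
        count += 1
    print "migrated", count, "rows"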
>>>>>>>> 
>>>>>>>> 
>>>>>>>> /Henrik Schröder
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>> 
>> 
> 
> 
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
