Out of interest I've done some more digging. Not sure how much more I've contributed, but here goes...

I ran this against a clean v0.6.12 and it works (I expected it to fail on the first read):

    client = pycassa.connect()
    standard1 = pycassa.ColumnFamily(client, 'Keyspace1', 'Standard1')

    uni_str = u"数時間"
    uni_str = uni_str.encode("utf-8")

    print "Insert row", uni_str
    print uni_str, standard1.insert(uni_str, {"bar" : "baz"})

    print "Read rows"
    print "???", standard1.get("???")
    print uni_str, standard1.get(uni_str)

I ran that against the current 0.6 head from the command line and it works. Run against the code running in IntelliJ, it fails as expected. The code also fails as expected on 0.7.5.

At one stage I grabbed the buffer created by fastbinary.encode_binary in the Python-generated batch_mutate_args.write(), and it looked like the key was correctly utf-8 encoded (the bytes matched the earlier utf-8 encoding of that string).
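If anyone wants to repeat that serialization check without attaching a debugger, here is a minimal sketch of the same idea. It uses Thrift's pure-Python TBinaryProtocol rather than the fastbinary C extension, so it only approximates what batch_mutate_args.write() produces, but a string field is encoded the same way either path: a 4-byte length prefix followed by the raw bytes.

    # Sketch: inspect what the Thrift binary protocol emits for a utf-8 key.
    # Pure-Python protocol, not the fastbinary C extension.
    from thrift.transport import TTransport
    from thrift.protocol import TBinaryProtocol

    uni_str = u"数時間".encode("utf-8")

    trans = TTransport.TMemoryBuffer()
    proto = TBinaryProtocol.TBinaryProtocol(trans)
    proto.writeString(uni_str)  # 4-byte length prefix, then the raw bytes

    buf = trans.getvalue()
    print "serialized:", buf.encode("hex")
    print "key bytes intact:", buf[4:] == uni_str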
I've updated the git project: https://github.com/amorton/cassandra-unicode-bug

I'm going to leave it there unless there is interest in looking into it further.

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 8 May 2011, at 13:31, Jonathan Ellis wrote:

> Right, that's sort of a half-repair: it will repair differences in
> replies it got, but it won't double-check md5s on the rest in the
> background. So if you're doing CL.ONE reads this is a no-op.
>
> On Sat, May 7, 2011 at 4:25 PM, aaron morton <aa...@thelastpickle.com> wrote:
>> I remembered something like that so had a look at
>> RangeSliceResponseResolver.resolve() in 0.6.12, and it looks like it
>> schedules the repairs...
>>
>>     protected Row getReduced()
>>     {
>>         ColumnFamily resolved = ReadResponseResolver.resolveSuperset(versions);
>>         ReadResponseResolver.maybeScheduleRepairs(resolved, table, key, versions, versionSources);
>>         versions.clear();
>>         versionSources.clear();
>>         return new Row(key, resolved);
>>     }
>>
>> Is that right?
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 8 May 2011, at 00:48, Jonathan Ellis wrote:
>>
>>> range_slices respects consistencylevel, but only single-row reads and
>>> multiget do the *repair* part of RR.
>>>
>>> On Sat, May 7, 2011 at 1:44 AM, aaron morton <aa...@thelastpickle.com> wrote:
>>>> get_range_slices() does read repair if enabled (check
>>>> DoConsistencyChecksBoolean in the config; it's on by default), so you
>>>> should be getting good reads. If you want belt-and-braces, run nodetool
>>>> repair first.
>>>>
>>>> Hope that helps.
>>>>
>>>> On 7 May 2011, at 11:46, Jeremy Hanna wrote:
>>>>
>>>>> Great! I just wanted to make sure you were getting the information you
>>>>> needed.
>>>>>
>>>>> On May 6, 2011, at 6:42 PM, Henrik Schröder wrote:
>>>>>
>>>>>> Well, I already completed the migration program. Using get_range_slices
>>>>>> I could migrate a few thousand rows per second, which means that
>>>>>> migrating all of our data would take a few minutes, and we'll end up
>>>>>> with pristine datafiles for the new cluster. Problem solved!
>>>>>>
>>>>>> I'll see if I can create datafiles in 0.6 that are uncleanable in 0.7 so
>>>>>> that you all can repeat this and hopefully fix it.
>>>>>>
>>>>>> /Henrik Schröder
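(For reference, a rough sketch of the kind of copy loop Henrik describes, using pycassa's get_range, which wraps get_range_slices. The host lists are placeholders, and the exact connect() signature varies between pycassa versions; treat this as a sketch rather than a drop-in tool:)

    import pycassa

    # Source (0.6) and destination (0.7) clusters; hosts are placeholders.
    src_client = pycassa.connect(['old-node:9160'])
    dst_client = pycassa.connect(['new-node:9160'])
    src_cf = pycassa.ColumnFamily(src_client, 'Keyspace1', 'Standard1')
    dst_cf = pycassa.ColumnFamily(dst_client, 'Keyspace1', 'Standard1')

    # get_range pages through the whole token space via get_range_slices,
    # restarting each slice from the last key it saw.
    count = 0
    for key, columns in src_cf.get_range():
        dst_cf.insert(key, columns)
        count += 1
    print "copied", count, "rows"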
>>>>>>
>>>>>> On Sat, May 7, 2011 at 00:35, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
>>>>>> If you're able, go into the #cassandra channel on freenode (IRC) and
>>>>>> talk to driftx or jbellis or aaron_morton about your problem. It could
>>>>>> be that you don't have to do all of this based on a conversation there.
>>>>>>
>>>>>> On May 6, 2011, at 5:04 AM, Henrik Schröder wrote:
>>>>>>
>>>>>>> I'll see if I can make some example broken files this weekend.
>>>>>>>
>>>>>>> /Henrik Schröder
>>>>>>>
>>>>>>> On Fri, May 6, 2011 at 02:10, aaron morton <aa...@thelastpickle.com> wrote:
>>>>>>> The difficulty is the different Thrift clients between 0.6 and 0.7.
>>>>>>>
>>>>>>> If you want to roll your own solution I would consider:
>>>>>>> - write an app to talk to 0.6 and pull out the data using keys from the
>>>>>>> other system (so you can check referential integrity while you are at
>>>>>>> it), and dump the data to a flat file.
>>>>>>> - write an app to talk to 0.7 to load the data back in.
>>>>>>>
>>>>>>> I've not given up digging on your migration problem; having to manually
>>>>>>> dump and reload when you've done nothing wrong is not the best solution.
>>>>>>> I'll try to find some time this weekend to test with:
>>>>>>>
>>>>>>> - 0.6 server, random partitioner, standard CFs, byte columns
>>>>>>> - load with Python or the CLI on OS X or Ubuntu (I don't have a Windows
>>>>>>> machine any more)
>>>>>>> - migrate and see what's going on.
>>>>>>>
>>>>>>> If you can spare some sample data to load, please send it over on the
>>>>>>> user list or to my email address.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> -----------------
>>>>>>> Aaron Morton
>>>>>>> Freelance Cassandra Developer
>>>>>>> @aaronmorton
>>>>>>> http://www.thelastpickle.com
>>>>>>>
>>>>>>> On 6 May 2011, at 05:52, Henrik Schröder wrote:
>>>>>>>
>>>>>>>> We can't do a straight upgrade from 0.6.13 to 0.7.5 because we have
>>>>>>>> rows stored that have unicode keys, and Cassandra 0.7.5 thinks those
>>>>>>>> rows in the sstables are corrupt, and it seems impossible to clean it
>>>>>>>> up without losing data.
>>>>>>>>
>>>>>>>> However, we can still read all rows perfectly via Thrift, so we are now
>>>>>>>> looking at building a simple tool that will copy all rows from our
>>>>>>>> 0.6.13 cluster to a parallel 0.7.5 cluster. Our question is now how to
>>>>>>>> do that and ensure that we actually get all rows migrated. It's a
>>>>>>>> pretty small cluster: 3 machines, a single keyspace, a single
>>>>>>>> columnfamily, ~2 million rows, a few GB of data, and a replication
>>>>>>>> factor of 3.
>>>>>>>>
>>>>>>>> So what's the best way? Call get_range_slices and move through the
>>>>>>>> entire token space? We also have all row keys in a secondary system;
>>>>>>>> would it be better to use that and make calls to get_multi or
>>>>>>>> get_multi_slices instead? Are we correct in assuming that if we use
>>>>>>>> consistencylevel ALL we'll get all rows?
>>>>>>>>
>>>>>>>> /Henrik Schröder
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
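(On Henrik's last question above: a read at ConsistencyLevel.ALL only succeeds if every replica responds, and per Jonathan's point upthread, multiget also does the repair part of read repair. So with the row keys already held in a secondary system, a multiget at CL.ALL is a reasonable belt-and-braces check that nothing was missed. A minimal sketch with pycassa; the load_keys() helper is hypothetical, and the ConsistencyLevel import path varies between pycassa versions:)

    import pycassa
    # Import path differs across pycassa releases; adjust as needed.
    from pycassa.cassandra.ttypes import ConsistencyLevel

    client = pycassa.connect()
    standard1 = pycassa.ColumnFamily(client, 'Keyspace1', 'Standard1',
                                     read_consistency_level=ConsistencyLevel.ALL)

    all_keys = load_keys()  # hypothetical: row keys from the secondary system

    missing = []
    for i in xrange(0, len(all_keys), 100):
        batch = all_keys[i:i + 100]
        rows = standard1.multiget(batch)  # dict of key -> columns
        missing.extend(k for k in batch if k not in rows)

    print "rows not found at CL.ALL:", len(missing)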