That was my initial thought, just wanted to see if there was anything else going on. Sounds like Henrik has a workaround so all is well.
Cheers
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 9 May 2011, at 18:10, Jonathan Ellis wrote:

> Strongly suspect that he has invalid unicode characters in his keys.
> 0.6 wasn't as good at validating those as 0.7.
>
> On Sun, May 8, 2011 at 8:51 PM, aaron morton <aa...@thelastpickle.com> wrote:
>> Out of interest I've done some more digging. Not sure how much more I've
>> contributed, but here goes...
>>
>> Ran this against a clean v0.6.12 and it works (I expected it to fail on
>> the first read):
>>
>>     client = pycassa.connect()
>>     standard1 = pycassa.ColumnFamily(client, 'Keyspace1', 'Standard1')
>>     uni_str = u"数時間"
>>     uni_str = uni_str.encode("utf-8")
>>
>>     print "Insert row", uni_str
>>     print uni_str, standard1.insert(uni_str, {"bar" : "baz"})
>>     print "Read rows"
>>     print "???", standard1.get("???")
>>     print uni_str, standard1.get(uni_str)
>>
>> Ran that against the current 0.6 head from the command line and it works.
>> Run against the code running in IntelliJ and it fails as expected.
>> The code also fails as expected on 0.7.5.
>>
>> At one stage I grabbed the buffer created by fastbinary.encode_binary in the
>> Python-generated batch_mutate_args.write() and it looked like the key was
>> correctly utf-8 encoded (matching bytes to the previous utf-8 encoding of
>> that string).
>>
>> I've updated the git project: https://github.com/amorton/cassandra-unicode-bug
>>
>> Am going to leave it there unless there is interest to keep looking into it.
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 8 May 2011, at 13:31, Jonathan Ellis wrote:
>>
>> Right, that's sort of a half-repair: it will repair differences in the
>> replies it got, but it won't double-check md5s on the rest in the
>> background. So if you're doing CL.ONE reads this is a no-op.
>>
>> On Sat, May 7, 2011 at 4:25 PM, aaron morton <aa...@thelastpickle.com> wrote:
>>
>> I remembered something like that so had a look at
>> RangeSliceResponseResolver.resolve() in 0.6.12 and it looks like it
>> schedules the repairs...
>>
>>     protected Row getReduced()
>>     {
>>         ColumnFamily resolved = ReadResponseResolver.resolveSuperset(versions);
>>         ReadResponseResolver.maybeScheduleRepairs(resolved, table, key, versions, versionSources);
>>         versions.clear();
>>         versionSources.clear();
>>         return new Row(key, resolved);
>>     }
>>
>> Is that right?
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 8 May 2011, at 00:48, Jonathan Ellis wrote:
>>
>> range_slices respects ConsistencyLevel, but only single-row reads and
>> multiget do the *repair* part of RR.
>>
>> On Sat, May 7, 2011 at 1:44 AM, aaron morton <aa...@thelastpickle.com> wrote:
>>
>> get_range_slices() does read repair if enabled (checked
>> DoConsistencyChecksBoolean in the config, it's on by default) so you should
>> be getting good reads. If you want belt-and-braces, run nodetool repair
>> first.
>>
>> Hope that helps.
>>
>> On 7 May 2011, at 11:46, Jeremy Hanna wrote:
>>
>> Great! I just wanted to make sure you were getting the information you
>> needed.
>>
>> On May 6, 2011, at 6:42 PM, Henrik Schröder wrote:
>>
>> Well, I already completed the migration program.
>> Using get_range_slices I could migrate a few thousand rows per second,
>> which means that migrating all of our data would take a few minutes, and
>> we'll end up with pristine datafiles for the new cluster. Problem solved!
>>
>> I'll see if I can create datafiles in 0.6 that are uncleanable in 0.7 so
>> that you all can repeat this and hopefully fix it.
>>
>> /Henrik Schröder
>>
>> On Sat, May 7, 2011 at 00:35, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
>>
>> If you're able, go into the #cassandra channel on freenode (IRC) and talk
>> to driftx or jbellis or aaron_morton about your problem. It could be that
>> you don't have to do all of this based on a conversation there.
>>
>> On May 6, 2011, at 5:04 AM, Henrik Schröder wrote:
>>
>> I'll see if I can make some example broken files this weekend.
>>
>> /Henrik Schröder
>>
>> On Fri, May 6, 2011 at 02:10, aaron morton <aa...@thelastpickle.com> wrote:
>>
>> The difficulty is the different Thrift clients between 0.6 and 0.7.
>> If you want to roll your own solution I would consider:
>>
>> - write an app to talk to 0.6 and pull out the data using keys from the
>>   other system (so you can check referential integrity while you are at
>>   it). Dump the data to a flat file.
>> - write an app to talk to 0.7 to load the data back in.
>>
>> I've not given up digging on your migration problem; having to manually
>> dump and reload when you've done nothing wrong is not the best solution.
>> I'll try to find some time this weekend to test with:
>>
>> - 0.6 server, random partitioner, standard CFs, byte columns
>> - load with Python or the CLI on OS X or Ubuntu (don't have a Windows
>>   machine any more)
>> - migrate and see what's going on.
>>
>> If you can spare some sample data to load, please send it over on the user
>> list or to my email address.
>>
>> Cheers
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 6 May 2011, at 05:52, Henrik Schröder wrote:
>>
>> We can't do a straight upgrade from 0.6.13 to 0.7.5 because we have rows
>> stored that have unicode keys, and Cassandra 0.7.5 thinks those rows in
>> the sstables are corrupt, and it seems impossible to clean it up without
>> losing data.
>>
>> However, we can still read all rows perfectly via Thrift, so we are now
>> looking at building a simple tool that will copy all rows from our 0.6.3
>> cluster to a parallel 0.7.5 cluster. Our question is now how to do that
>> and ensure that we actually get all rows migrated. It's a pretty small
>> cluster: 3 machines, a single keyspace, a single column family, ~2 million
>> rows, a few GB of data, and a replication factor of 3.
>>
>> So what's the best way? Call get_range_slices and move through the entire
>> token space? We also have all row keys in a secondary system; would it be
>> better to use that and make calls to get_multi or get_multi_slices
>> instead? Are we correct in assuming that if we use consistency level ALL
>> we'll get all rows?
>> /Henrik Schröder
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
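[Archive note] Below is a rough sketch, not from the thread itself, of the get_range_slices-based copy loop Henrik describes, written against the old-style pycassa API used earlier in the thread. The host names are placeholders, parameter names and the ConsistencyLevel import path vary between pycassa versions, and, as Aaron notes above, the 0.6 and 0.7 clusters speak different Thrift versions, so in practice the read and write halves would live in separate scripts rather than one process.

    import pycassa
    # ConsistencyLevel lives in the bundled Thrift module in pycassa releases
    # of this era; the exact import path depends on the client version.
    from pycassa.cassandra.ttypes import ConsistencyLevel

    # Placeholder host names -- point each connection at the right cluster.
    src_client = pycassa.connect(['old-cluster-host:9160'])
    dst_client = pycassa.connect(['new-cluster-host:9160'])

    # Read at ConsistencyLevel.ALL so every replica is consulted for each range.
    src_cf = pycassa.ColumnFamily(src_client, 'Keyspace1', 'Standard1',
                                  read_consistency_level=ConsistencyLevel.ALL)
    dst_cf = pycassa.ColumnFamily(dst_client, 'Keyspace1', 'Standard1')

    copied = 0
    # get_range wraps get_range_slices and walks the whole key range in pages,
    # yielding (key, columns) pairs. Raise column_count if rows can hold more
    # columns than the client's default slice size.
    for key, columns in src_cf.get_range():
        if columns:                      # skip range ghosts (deleted rows with no columns)
            dst_cf.insert(key, columns)  # re-insert under the same key
            copied += 1

    print "Copied %d rows" % copied

Driving the same loop from the row keys held in the secondary system would instead use multiget-style reads, which (per Jonathan's note above) are also the calls that trigger the repair half of read repair.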