I strongly suspect he has invalid Unicode characters in his keys. 0.6 wasn't as strict about validating those as 0.7.
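One quick way to test that theory, sketched here in the same Python 2 style as the script Aaron posts below (find_invalid_keys is a hypothetical helper, not anything in pycassa or Cassandra): dump the row keys from the 0.6 cluster and see which ones fail UTF-8 decoding; those are the rows 0.7.5 reports as corrupt.

# Hypothetical helper: report row keys that are not valid UTF-8.
# 0.6 stored such keys without complaint; 0.7's stricter validation
# treats the rows as corrupt.
def find_invalid_keys(keys):
    bad = []
    for key in keys:
        try:
            key.decode("utf-8")        # Python 2 str -> unicode
        except UnicodeDecodeError:
            bad.append(key)
    return bad

# A utf-8 encoded key decodes cleanly; the same text written as
# latin-1 bytes does not (the 0xf6 byte is not valid UTF-8 here).
keys = [u"数時間".encode("utf-8"), u"Schröder".encode("latin-1")]
print find_invalid_keys(keys)          # -> ['Schr\xf6der']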
On Sun, May 8, 2011 at 8:51 PM, aaron morton <aa...@thelastpickle.com> wrote:
> Out of interest I've done some more digging. Not sure how much more I've contributed, but here goes...
>
> Ran this against a clean v0.6.12 and it works (I expected it to fail on the first read):
>
> import pycassa
>
> client = pycassa.connect()
> standard1 = pycassa.ColumnFamily(client, 'Keyspace1', 'Standard1')
> uni_str = u"数時間"
> uni_str = uni_str.encode("utf-8")
>
> print "Insert row", uni_str
> print uni_str, standard1.insert(uni_str, {"bar" : "baz"})
> print "Read rows"
> print "???", standard1.get("???")
> print uni_str, standard1.get(uni_str)
>
> Ran that against the current 0.6 head from the command line and it works. Run against the code running in IntelliJ and it fails as expected. Code also fails as expected on 0.7.5.
>
> At one stage I grabbed the buffer created by fastbinary.encode_binary in the Python-generated batch_mutate_args.write() and it looked like the key was correctly utf-8 encoded (matching bytes to the previous utf-8 encoding of that string).
>
> I've updated the git project https://github.com/amorton/cassandra-unicode-bug
>
> Am going to leave it there unless there is interest to keep looking into it.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 8 May 2011, at 13:31, Jonathan Ellis wrote:
>
> Right, that's sort of a half-repair: it will repair differences in the replies it got, but it won't double-check md5s on the rest in the background. So if you're doing CL.ONE reads this is a no-op.
>
> On Sat, May 7, 2011 at 4:25 PM, aaron morton <aa...@thelastpickle.com> wrote:
> > I remembered something like that, so I had a look at RangeSliceResponseResolver.resolve() in 0.6.12 and it looks like it schedules the repairs...
> >
> > protected Row getReduced()
> > {
> >     ColumnFamily resolved = ReadResponseResolver.resolveSuperset(versions);
> >     ReadResponseResolver.maybeScheduleRepairs(resolved, table, key, versions, versionSources);
> >     versions.clear();
> >     versionSources.clear();
> >     return new Row(key, resolved);
> > }
> >
> > Is that right?
> >
> > -----------------
> > Aaron Morton
> > Freelance Cassandra Developer
> > @aaronmorton
> > http://www.thelastpickle.com
> >
> > On 8 May 2011, at 00:48, Jonathan Ellis wrote:
> >
> > range_slices respects consistency level, but only single-row reads and multiget do the *repair* part of RR.
> >
> > On Sat, May 7, 2011 at 1:44 AM, aaron morton <aa...@thelastpickle.com> wrote:
> >
> > get_range_slices() does read repair if enabled (check DoConsistencyChecksBoolean in the config; it's on by default) so you should be getting good reads. If you want belt-and-braces, run nodetool repair first.
> >
> > Hope that helps.
> >
> > On 7 May 2011, at 11:46, Jeremy Hanna wrote:
> >
> > Great! I just wanted to make sure you were getting the information you needed.
> >
> > On May 6, 2011, at 6:42 PM, Henrik Schröder wrote:
> >
> > Well, I already completed the migration program. Using get_range_slices I could migrate a few thousand rows per second, which means that migrating all of our data would take a few minutes, and we'll end up with pristine datafiles for the new cluster. Problem solved!
> >
> > I'll see if I can create datafiles in 0.6 that are uncleanable in 0.7 so that you all can repeat this and hopefully fix it.
> >
> > /Henrik Schröder
> >
> > On Sat, May 7, 2011 at 00:35, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
> >
> > If you're able, go into the #cassandra channel on freenode (IRC) and talk to driftx or jbellis or aaron_morton about your problem. It could be that you don't have to do all of this based on a conversation there.
> >
> > On May 6, 2011, at 5:04 AM, Henrik Schröder wrote:
> >
> > I'll see if I can make some example broken files this weekend.
> >
> > /Henrik Schröder
> >
> > On Fri, May 6, 2011 at 02:10, aaron morton <aa...@thelastpickle.com> wrote:
> >
> > The difficulty is the different thrift clients between 0.6 and 0.7.
> >
> > If you want to roll your own solution I would consider:
> >
> > - write an app to talk to 0.6 and pull out the data using keys from the other system (so you can check referential integrity while you are at it). Dump the data to a flat file.
> > - write an app to talk to 0.7 to load the data back in.
> >
> > I've not given up digging on your migration problem; having to manually dump and reload when you've done nothing wrong is not the best solution. I'll try to find some time this weekend to test with:
> >
> > - 0.6 server, random partitioner, standard CFs, byte columns
> > - load with python or the cli on osx or ubuntu (don't have a Windows machine any more)
> > - migrate and see what's going on.
> >
> > If you can spare some sample data to load, please send it over on the user group or to my email address.
> >
> > Cheers
> >
> > -----------------
> > Aaron Morton
> > Freelance Cassandra Developer
> > @aaronmorton
> > http://www.thelastpickle.com
> >
> > On 6 May 2011, at 05:52, Henrik Schröder wrote:
> >
> > We can't do a straight upgrade from 0.6.13 to 0.7.5 because we have rows stored that have unicode keys, Cassandra 0.7.5 thinks those rows in the sstables are corrupt, and it seems impossible to clean them up without losing data.
> >
> > However, we can still read all rows perfectly via thrift, so we are now looking at building a simple tool that will copy all rows from our 0.6.13 cluster to a parallel 0.7.5 cluster. Our question is now how to do that and ensure that we actually get all rows migrated. It's a pretty small cluster: 3 machines, a single keyspace, a single column family, ~2 million rows, a few GB of data, and a replication factor of 3.
> >
> > So what's the best way? Call get_range_slices and move through the entire token space? We also have all row keys in a secondary system; would it be better to use that and make calls to get_multi or get_multi_slices instead? Are we correct in assuming that if we use consistency level ALL we'll get all rows?
> >
> > /Henrik Schröder
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of DataStax, the source for professional Cassandra support
> > http://www.datastax.com
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
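For reference, the copy tool Henrik describes (walk the whole token range with get_range_slices on the 0.6 cluster, write each row into the 0.7 cluster, read at consistency level ALL) can be sketched in the same Python 2 / pycassa style as the script earlier in the thread. This is an illustrative sketch only, not Henrik's actual program: source_cf and dest_cf are assumed to be pycassa ColumnFamily objects already pointed at the old and new clusters (the connection setup differs between the pycassa versions that speak to 0.6 and 0.7), and the exact keyword names should be checked against the client version in use.

import pycassa

def migrate_all_rows(source_cf, dest_cf):
    copied = 0
    # get_range wraps get_range_slices and walks the entire token range.
    # Reading at ConsistencyLevel.ALL makes every replica answer, so with
    # RF=3 and all nodes up a single pass should see every row.
    for key, columns in source_cf.get_range(
            read_consistency_level=pycassa.ConsistencyLevel.ALL):
        if not columns:     # skip empty "range ghost" rows left by deletes
            continue
        dest_cf.insert(key, columns)
        copied += 1
        if copied % 10000 == 0:
            print "copied", copied, "rows"
    return copied

Since all row keys also live in the secondary system, a second pass that multigets that key list from the new cluster and compares counts is a cheap cross-check that nothing was dropped along the way.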