It sounds like what you're seeing is a client problem, but there was a separate duplicate-key bug in get_range_slice that was recently fixed on the cassandra-0.6 branch. The fix is slated for 0.6.5, which will probably be out sometime this month, judging by the pace of previous minor releases.
https://issues.apache.org/jira/browse/CASSANDRA-1145

On Aug 6, 2010, at 10:29 AM, Adam Crain wrote:

> Thanks Dave. I'm using 0.6.4 since I saw this issue in the JIRA, but I just
> discovered that the client I'm using mutates the order of keys after
> retrieving the result with the thrift API... pretty much making key iteration
> impossible. So time to fork and see if they'll fix it :(.
>
> I'll review yours as soon as I get the client fixed that I'm using.
>
> Adam
>
>
> -----Original Message-----
> From: davevi...@gmail.com on behalf of Dave Viner
> Sent: Fri 8/6/2010 11:28 AM
> To: user@cassandra.apache.org
> Subject: Re: error using get_range_slice with random partitioner
>
> Funny you should ask... I just went through the same exercise.
>
> You must use Cassandra 0.6.4. Otherwise you will get duplicate keys.
> However, here is a snippet of perl that you can use.
>
> our $WANTED_COLUMN_NAME = 'mycol';
> get_key_to_one_column_map('myKeySpace', 'myColFamily', 'mySuperCol', QUORUM,
>     \%map);
>
> sub get_key_to_one_column_map
> {
>     my ($keyspace, $column_family_name, $super_column_name,
>         $consistency_level, $returned_keys) = @_;
>
>
>     my($socket, $transport, $protocol, $client, $result, $predicate,
>        $column_parent, $keyrange);
>
>     $column_parent = new Cassandra::ColumnParent();
>     $column_parent->{'column_family'} = $column_family_name;
>     $column_parent->{'super_column'} = $super_column_name;
>
>     $keyrange = new Cassandra::KeyRange({
>         'start_key' => '', 'end_key' => '', 'count' => 10
>     });
>
>
>     $predicate = new Cassandra::SlicePredicate();
>     $predicate->{'column_names'} = [$WANTED_COLUMN_NAME];
>
>     eval
>     {
>         $socket = new Thrift::Socket($CASSANDRA_HOST, $CASSANDRA_PORT);
>         $transport = new Thrift::BufferedTransport($socket, 1024, 1024);
>         $protocol = new Thrift::BinaryProtocol($transport);
>         $client = new Cassandra::CassandraClient($protocol);
>         $transport->open();
>
>
>         my($next_start_key, $one_res, $iteration, $have_more, $value,
>            $local_count, $previous_start_key);
>
>         $iteration = 0;
>         $have_more = 1;
>         while ($have_more == 1)
>         {
>             $iteration++;
>             $result = undef;
>
>             $result = $client->get_range_slices($keyspace, $column_parent,
>                 $predicate, $keyrange, $consistency_level);
>
>             # on success, results is an array of objects.
>
>             if (scalar(@$result) == 1)
>             {
>                 # we only got 1 result... check to see if it's the
>                 # same key as the start key... if so, we're done.
>                 if ($result->[0]->{'key'} eq $keyrange->{'start_key'})
>                 {
>                     $have_more = 0;
>                     last;
>                 }
>             }
>
>             # check to see if we are starting with some value
>             # if so, we throw away the first result.
>             if ($keyrange->{'start_key'})
>             {
>                 shift(@$result);
>             }
>             if (scalar(@$result) == 0)
>             {
>                 $have_more = 0;
>                 last;
>             }
>
>             $previous_start_key = $keyrange->{'start_key'};
>             $local_count = 0;
>
>             for (my $r = 0; $r < scalar(@$result); $r++)
>             {
>                 $one_res = $result->[$r];
>                 $next_start_key = $one_res->{'key'};
>
>                 $keyrange->{'start_key'} = $next_start_key;
>
>                 if (!exists($returned_keys->{$next_start_key}))
>                 {
>                     $have_more = 1;
>                     $local_count++;
>                 }
>
>
>                 next if (scalar(@{ $one_res->{'columns'} }) == 0);
>
>                 $value = undef;
>
>                 for (my $i = 0; $i < scalar(@{ $one_res->{'columns'} }); $i++)
>                 {
>                     if ($one_res->{'columns'}->[$i]->{'column'}->{'name'} eq
>                         $WANTED_COLUMN_NAME)
>                     {
>                         $value =
>                             $one_res->{'columns'}->[$i]->{'column'}->{'value'};
>                         if (!exists($returned_keys->{$next_start_key}))
>                         {
>                             $returned_keys->{$next_start_key} = $value;
>                         }
>                         else
>                         {
>                             # NOTE: prior to Cassandra 0.6.4, the
>                             # get_range_slices returns duplicates sometimes.
>                             #warn "Found second value for key [$next_start_key] was [" . $returned_keys->{$next_start_key} . "] now [$value]!";
>                         }
>                     }
>                 }
>                 $have_more = 1;
>             } # end results loop
>
>             if ($keyrange->{'start_key'} eq $previous_start_key)
>             {
>                 $have_more = 0;
>             }
>
>         } # end while() loop
>
>         $transport->close();
>     };
>     if ($@)
>     {
>         warn "Problem with Cassandra: " . Dumper($@);
>     }
>
>     # cleanup
>     undef $client;
>     undef $protocol;
>     undef $transport;
>     undef $socket;
> }
>
>
> HTH
> Dave Viner
>
> On Fri, Aug 6, 2010 at 7:45 AM, Adam Crain
> <adam.cr...@greenenergycorp.com> wrote:
>
>> Thomas,
>>
>> That was indeed the source of the problem. I naively assumed that the token
>> range would help me avoid retrieving duplicate rows.
>>
>> If you iterate over the keys, how do you avoid retrieving duplicate keys? I
>> tried this morning and I seem to get odd results. Maybe this is just a
>> consequence of the random partitioner. I really don't care about the order
>> of the iteration, but only each key once and that I see all keys is
>> important.
>>
>> -Adam
>>
>>
>> -----Original Message-----
>> From: th.hel...@gmail.com on behalf of Thomas Heller
>> Sent: Fri 8/6/2010 7:27 AM
>> To: user@cassandra.apache.org
>> Subject: Re: error using get_range_slice with random partitioner
>>
>> Wild guess here, but are you using start_token/end_token here when you
>> should be using start_key? Looks to me like you are trying end_token
>> = ''.
>>
>> HTH,
>> /thomas
>>
>> On Thursday, August 5, 2010, Adam Crain <adam.cr...@greenenergycorp.com>
>> wrote:
>>> Hi,
>>>
>>> I'm on 0.6.4. Previous tickets in the JIRA and searching the web
>>> indicated that iterating over the keys in keyspace is possible, even
>>> with the random partitioner. This is mostly desirable in my case for
>>> testing purposes only.
>>>
>>> I get the following error:
>>>
>>> [junit] Internal error processing get_range_slices
>>> [junit] org.apache.thrift.TApplicationException: Internal error processing get_range_slices
>>>
>>> and the following server traceback:
>>>
>>> java.lang.NumberFormatException: Zero length BigInteger
>>>     at java.math.BigInteger.<init>(BigInteger.java:295)
>>>     at java.math.BigInteger.<init>(BigInteger.java:467)
>>>     at org.apache.cassandra.dht.RandomPartitioner$1.fromString(RandomPartitioner.java:100)
>>>     at org.apache.cassandra.thrift.CassandraServer.getRangeSlicesInternal(CassandraServer.java:575)
>>>
>>> I am using the scala cascal client, but am sure that get_range_slice is
>>> being called with start and stop set to "".
>>>
>>> 1) Is batch iteration possible with random partitioner?
>>>
>>> This isn't clear from the FAQ entry on the subject:
>>>
>>> http://wiki.apache.org/cassandra/FAQ#iter_world
>>>
>>> 2) The FAQ states that start argument should be "". What should the end
>>> argument be?
>>>
>>> thanks!
>>> Adam
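
For what it's worth, the paging approach in Dave's snippet boils down to this: fetch a page, remember the last new key, start the next page from that key, and drop anything already seen. Below is a minimal sketch of that loop, written in Python rather than Perl only so that it runs without a cluster. The names `iterate_keys`, `fetch_page` and `make_fake_fetch` are made up for illustration and are not part of any Cassandra client. The fake orders rows by the MD5 of the key, which only roughly mimics how the random partitioner orders rows. In real use, `fetch_page` would presumably wrap a get_range_slices call with start_key set and end_key left as ''.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, str]  # (row key, column value)


def iterate_keys(fetch_page: Callable[[str, int], List[Row]],
                 page_size: int = 10) -> Dict[str, str]:
    """Collect every row exactly once by paging on start_key.

    fetch_page(start_key, count) is a hypothetical stand-in for a
    get_range_slices call; it is assumed to return up to `count` rows
    beginning at start_key (inclusive), or at the first row for ''.
    """
    if page_size < 2:
        # Each page after the first begins with the previous start key,
        # so a page of one row could never make progress.
        raise ValueError("page_size must be at least 2")

    seen: Dict[str, str] = {}
    start_key = ""
    while True:
        last_new_key = None
        for key, value in fetch_page(start_key, page_size):
            if key not in seen:  # skip the repeated start row and any duplicates
                seen[key] = value
                last_new_key = key
        if last_new_key is None:
            break  # nothing new on this page, so iteration is complete
        start_key = last_new_key
    return seen


def make_fake_fetch(data: Dict[str, str]) -> Callable[[str, int], List[Row]]:
    """Build an in-memory stand-in for the server (assumption, not real API)."""
    def token(key: str) -> int:
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    # Rows come back in token order, not key order.
    ordered = sorted(data.items(), key=lambda kv: token(kv[0]))

    def fetch_page(start_key: str, count: int) -> List[Row]:
        if start_key == "":
            rows = ordered
        else:
            start = token(start_key)
            rows = [kv for kv in ordered if token(kv[0]) >= start]
        return rows[:count]

    return fetch_page
```

Calling `iterate_keys(make_fake_fetch(my_dict), page_size=10)` should hand back the same mapping that went in, regardless of key order. Tracking seen keys on the client also covers the case where the server returns a duplicate, which is the situation the fix above addresses.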