Funny you should ask... I just went through the same exercise. You must be on Cassandra 0.6.4; before that release, get_range_slices sometimes returns duplicate keys. Here is a snippet of Perl that you can use.

# Needed for Dumper() in the error handler at the bottom.
use Data::Dumper;

# Not defined in this snippet; the surrounding program has to provide them:
# %map, QUORUM, $CASSANDRA_HOST and $CASSANDRA_PORT.
our $WANTED_COLUMN_NAME = 'mycol';
get_key_to_one_column_map('myKeySpace', 'myColFamily', 'mySuperCol', QUORUM, \%map);

sub get_key_to_one_column_map {
    my ($keyspace, $column_family_name, $super_column_name, $consistency_level, $returned_keys) = @_;
    my ($socket, $transport, $protocol, $client, $result, $predicate, $column_parent, $keyrange);

    $column_parent = new Cassandra::ColumnParent();
    $column_parent->{'column_family'} = $column_family_name;
    $column_parent->{'super_column'} = $super_column_name;

    # Empty start and end keys mean "the whole range", fetched 10 rows at a time.
    $keyrange = new Cassandra::KeyRange({
        'start_key' => '',
        'end_key'   => '',
        'count'     => 10
    });

    $predicate = new Cassandra::SlicePredicate();
    $predicate->{'column_names'} = [$WANTED_COLUMN_NAME];

    eval {
        $socket = new Thrift::Socket($CASSANDRA_HOST, $CASSANDRA_PORT);
        $transport = new Thrift::BufferedTransport($socket, 1024, 1024);
        $protocol = new Thrift::BinaryProtocol($transport);
        $client = new Cassandra::CassandraClient($protocol);
        $transport->open();

        my ($next_start_key, $one_res, $iteration, $have_more, $value, $local_count, $previous_start_key);
        $iteration = 0;
        $have_more = 1;
        while ($have_more == 1) {
            $iteration++;
            $result = undef;

            $result = $client->get_range_slices($keyspace, $column_parent, $predicate, $keyrange, $consistency_level);

            # on success, $result is a reference to an array of objects.

            if (scalar(@$result) == 1) {
                # we only got 1 result... check to see if it's the
                # same key as the start key... if so, we're done.
                if ($result->[0]->{'key'} eq $keyrange->{'start_key'}) {
                    $have_more = 0;
                    last;
                }
            }

            # check to see if we are starting with some value
            # if so, we throw away the first result.
            if ($keyrange->{'start_key'}) {
                shift(@$result);
            }
            if (scalar(@$result) == 0) {
                $have_more = 0;
                last;
            }

            $previous_start_key = $keyrange->{'start_key'};
            $local_count = 0;

            for (my $r = 0; $r < scalar(@$result); $r++) {
                $one_res = $result->[$r];
                # the last key seen becomes the start key of the next page.
                $next_start_key = $one_res->{'key'};
                $keyrange->{'start_key'} = $next_start_key;

                if (!exists($returned_keys->{$next_start_key})) {
                    $have_more = 1;
                    $local_count++;
                }

                next if (scalar(@{ $one_res->{'columns'} }) == 0);

                $value = undef;

                for (my $i = 0; $i < scalar(@{ $one_res->{'columns'} }); $i++) {
                    if ($one_res->{'columns'}->[$i]->{'column'}->{'name'} eq $WANTED_COLUMN_NAME) {
                        $value = $one_res->{'columns'}->[$i]->{'column'}->{'value'};
                        if (!exists($returned_keys->{$next_start_key})) {
                            $returned_keys->{$next_start_key} = $value;
                        }
                        else {
                            # NOTE: prior to Cassandra 0.6.4, the get_range_slices returns duplicates sometimes.
                            #warn "Found second value for key [$next_start_key] was [" . $returned_keys->{$next_start_key} . "] now [$value]!";
                        }
                    }
                }
                $have_more = 1;
            } # end results loop

            # the start key did not advance, so there is nothing left to fetch.
            if ($keyrange->{'start_key'} eq $previous_start_key) {
                $have_more = 0;
            }

        } # end while() loop

        $transport->close();
    };
    if ($@) {
        warn "Problem with Cassandra: " . Dumper($@);
    }

    # cleanup
    undef $client;
    undef $protocol;
    undef $transport;
    undef $socket;
}

HTH
Dave Viner

On Fri, Aug 6, 2010 at 7:45 AM, Adam Crain <adam.cr...@greenenergycorp.com> wrote:

> Thomas,
>
> That was indeed the source of the problem. I naively assumed that the token
> range would help me avoid retrieving duplicate rows.
>
> If you iterate over the keys, how do you avoid retrieving duplicate keys? I
> tried this morning and I seem to get odd results. Maybe this is just a
> consequence of the random partitioner. I really don't care about the order
> of the iteration, but only each key once and that I see all keys is
> important.
>
> -Adam
>
>
> -----Original Message-----
> From: th.hel...@gmail.com on behalf of Thomas Heller
> Sent: Fri 8/6/2010 7:27 AM
> To: user@cassandra.apache.org
> Subject: Re: error using get_range_slice with random partitioner
>
> Wild guess here, but are you using start_token/end_token here when you
> should be using start_key? Looks to me like you are trying end_token
> = ''.
>
> HTH,
> /thomas
>
> On Thursday, August 5, 2010, Adam Crain <adam.cr...@greenenergycorp.com>
> wrote:
> > Hi,
> >
> > I'm on 0.6.4. Previous tickets in the JIRA in searching the web indicated
> > that iterating over the keys in keyspace is possible, even with the random
> > partitioner. This is mostly desirable in my case for testing purposes only.
> >
> > I get the following error:
> >
> > [junit] Internal error processing get_range_slices
> > [junit] org.apache.thrift.TApplicationException: Internal error
> > processing get_range_slices
> >
> > and the following server traceback:
> >
> > java.lang.NumberFormatException: Zero length BigInteger
> >         at java.math.BigInteger.<init>(BigInteger.java:295)
> >         at java.math.BigInteger.<init>(BigInteger.java:467)
> >         at org.apache.cassandra.dht.RandomPartitioner$1.fromString(RandomPartitioner.java:100)
> >         at org.apache.cassandra.thrift.CassandraServer.getRangeSlicesInternal(CassandraServer.java:575)
> >
> > I am using the scala cascal client, but am sure that get_range_slice is
> > being called with start and stop set to "".
> >
> > 1) Is batch iteration possible with random partioner?
> >
> > This isn't clear from the FAQ entry on the subject:
> >
> > http://wiki.apache.org/cassandra/FAQ#iter_world
> >
> > 2) The FAQ states that start argument should be "". What should the end
> > argument be?
> >
> > thanks!
> > Adam
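
P.S. The paging idea in the Perl snippet at the top of this message (start with an empty start key, use the last key of each page as the next start key, and drop the repeated first row) can be tried without a cluster. The sketch below is only an illustration under stated assumptions: fake_range_slice, iterate_all_keys and @RING are hypothetical names that stand in for get_range_slices and the partitioner's key order, and they are not part of any Cassandra or Thrift API.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the key order a partitioner would produce.
my @RING = qw(k07 k02 k11 k05 k09 k01 k04);

# Hypothetical stand-in for get_range_slices: returns up to $count keys,
# starting at $start_key inclusive ('' means "from the beginning").
sub fake_range_slice {
    my ($start_key, $count) = @_;
    my $from = 0;
    if ($start_key ne '') {
        ($from) = grep { $RING[$_] eq $start_key } 0 .. $#RING;
        return [] unless defined $from;
    }
    my $to = $from + $count - 1;
    $to = $#RING if $to > $#RING;
    return [ @RING[$from .. $to] ];
}

# Walks every key once. $page_size must be at least 2, because each page
# after the first spends one slot on the repeated start key.
sub iterate_all_keys {
    my ($page_size) = @_;
    my (%seen, @order);
    my $start = '';
    while (1) {
        my $page = fake_range_slice($start, $page_size);
        # The start key is inclusive, so every page after the first
        # begins with the last key of the previous page; drop it.
        shift @$page if $start ne '';
        last unless @$page;
        for my $key (@$page) {
            # The %seen check guards against duplicate keys from the server.
            push @order, $key unless $seen{$key}++;
        }
        $start = $page->[-1];
    }
    return \@order;
}

my $keys = iterate_all_keys(3);
print join(',', @$keys), "\n";
```

The loop stops when a page is empty after the repeated first row has been dropped, which plays the same role as the "start key did not advance" check in the snippet above.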