Funny you should ask... I just went through the same exercise.
You must use Cassandra 0.6.4; earlier versions sometimes return duplicate
keys from get_range_slices.
Here is a snippet of Perl that you can use.
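use Data::Dumper;    # provides Dumper(), used in the error handler below
# Assumed connection settings; adjust for your cluster (9160 is the default
# Thrift port).
our ($CASSANDRA_HOST, $CASSANDRA_PORT) = ('localhost', 9160);
our $WANTED_COLUMN_NAME = 'mycol';
my %map;             # filled in as key => value of the wanted column
# QUORUM stands for the consistency-level constant from your generated
# Thrift bindings.
get_key_to_one_column_map('myKeySpace', 'myColFamily', 'mySuperCol', QUORUM,
    \%map);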
sub get_key_to_one_column_map
{
    my ($keyspace, $column_family_name, $super_column_name,
        $consistency_level, $returned_keys) = @_;
    my ($socket, $transport, $protocol, $client, $result, $predicate,
        $column_parent, $keyrange);
    $column_parent = new Cassandra::ColumnParent();
    $column_parent->{'column_family'} = $column_family_name;
    $column_parent->{'super_column'} = $super_column_name;
    $keyrange = new Cassandra::KeyRange({
        'start_key' => '', 'end_key' => '', 'count' => 10
    });
    $predicate = new Cassandra::SlicePredicate();
    $predicate->{'column_names'} = [$WANTED_COLUMN_NAME];
    eval
    {
        $socket = new Thrift::Socket($CASSANDRA_HOST, $CASSANDRA_PORT);
        $transport = new Thrift::BufferedTransport($socket, 1024, 1024);
        $protocol = new Thrift::BinaryProtocol($transport);
        $client = new Cassandra::CassandraClient($protocol);
        $transport->open();
        my ($next_start_key, $one_res, $iteration, $have_more, $value,
            $local_count, $previous_start_key);
        $iteration = 0;
        $have_more = 1;
        while ($have_more == 1)
        {
            $iteration++;
            $result = undef;
            $result = $client->get_range_slices($keyspace, $column_parent,
                $predicate, $keyrange, $consistency_level);
            # on success, $result is an array of objects.
            if (scalar(@$result) == 1)
            {
                # we only got 1 result... check to see if it's the
                # same key as the start key... if so, we're done.
                if ($result->[0]->{'key'} eq $keyrange->{'start_key'})
                {
                    $have_more = 0;
                    last;
                }
            }
            # check to see if we are starting with some value
            # if so, we throw away the first result.
            if ($keyrange->{'start_key'})
            {
                shift(@$result);
            }
            if (scalar(@$result) == 0)
            {
                $have_more = 0;
                last;
            }
            $previous_start_key = $keyrange->{'start_key'};
            $local_count = 0;
            for (my $r = 0; $r < scalar(@$result); $r++)
            {
                $one_res = $result->[$r];
                $next_start_key = $one_res->{'key'};
                $keyrange->{'start_key'} = $next_start_key;
                if (!exists($returned_keys->{$next_start_key}))
                {
                    $have_more = 1;
                    $local_count++;
                }
                next if (scalar(@{ $one_res->{'columns'} }) == 0);
                $value = undef;
                for (my $i = 0; $i < scalar(@{ $one_res->{'columns'} });
                    $i++)
                {
                    if ($one_res->{'columns'}->[$i]->{'column'}->{'name'} eq
                        $WANTED_COLUMN_NAME)
                    {
                        $value =
                            $one_res->{'columns'}->[$i]->{'column'}->{'value'};
                        if (!exists($returned_keys->{$next_start_key}))
                        {
                            $returned_keys->{$next_start_key} = $value;
                        }
                        else
                        {
                            # NOTE: prior to Cassandra 0.6.4, the
                            # get_range_slices returns duplicates sometimes.
                            #warn "Found second value for key [$next_start_key] was ["
                            #    . $returned_keys->{$next_start_key} . "] now [$value]!";
                        }
                    }
                }
                $have_more = 1;
            } # end results loop
            if ($keyrange->{'start_key'} eq $previous_start_key)
            {
                $have_more = 0;
            }
        } # end while() loop
        $transport->close();
    };
    if ($@)
    {
        warn "Problem with Cassandra: " . Dumper($@);
    }
    # cleanup
    undef $client;
    undef $protocol;
    undef $transport;
    undef $socket;
}
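
If the paging logic is hard to follow with all the Thrift plumbing around it,
here is a stripped-down sketch of the same idea that runs without a cluster.
Everything in it is made up for illustration: fake_range_slice() and @ALL_KEYS
are hypothetical stand-ins for get_range_slices and the stored rows, and
iterate_all_keys() is just a name I picked. The only assumption carried over
from the real code is that the start key is inclusive, so every page after the
first begins with a key you have already seen.

```perl
use strict;
use warnings;

# Hypothetical data: keys in an arbitrary (non-sorted) order, the way a
# random partitioner would hand them back.
my @ALL_KEYS = qw(k07 k03 k11 k01 k09 k05 k02);

# Hypothetical stand-in for get_range_slices: returns up to $count keys,
# starting at $start_key (inclusive). An empty start key means "from the
# beginning".
sub fake_range_slice {
    my ($start_key, $count) = @_;
    my $from = 0;
    if ($start_key ne '') {
        ($from) = grep { $ALL_KEYS[$_] eq $start_key } 0 .. $#ALL_KEYS;
        return [] unless defined $from;
    }
    my $to = $from + $count - 1;
    $to = $#ALL_KEYS if $to > $#ALL_KEYS;
    return [ @ALL_KEYS[$from .. $to] ];
}

# Page through every key exactly once.
sub iterate_all_keys {
    my ($count) = @_;
    # With a page size of 1, every page would only repeat the start key.
    die "page size must be at least 2\n" if $count < 2;
    my @seen;
    my $start_key = '';
    while (1) {
        my $page = fake_range_slice($start_key, $count);
        # The start key is inclusive, so drop the repeated first entry
        # on every page after the first.
        shift @$page if $start_key ne '';
        last unless @$page;
        push @seen, @$page;
        $start_key = $page->[-1];    # resume from the last key seen
    }
    return \@seen;
}

print join(',', @{ iterate_all_keys(3) }), "\n";
```

The loop stops when a page contains nothing beyond the repeated start key,
which is the same stop condition the real code checks for.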
HTH
Dave Viner
On Fri, Aug 6, 2010 at 7:45 AM, Adam Crain <[email protected]> wrote:
> Thomas,
>
> That was indeed the source of the problem. I naively assumed that the token
> range would help me avoid retrieving duplicate rows.
>
> If you iterate over the keys, how do you avoid retrieving duplicate keys? I
> tried this morning and I seem to get odd results. Maybe this is just a
> consequence of the random partitioner. I really don't care about the order
> of the iteration, but only each key once and that I see all keys is
> important.
>
> -Adam
>
>
> -----Original Message-----
> From: [email protected] on behalf of Thomas Heller
> Sent: Fri 8/6/2010 7:27 AM
> To: [email protected]
> Subject: Re: error using get_range_slice with random partitioner
>
> Wild guess here, but are you using start_token/end_token here when you
> should be using start_key? Looks to me like you are trying end_token
> = ''.
>
> HTH,
> /thomas
>
> On Thursday, August 5, 2010, Adam Crain <[email protected]>
> wrote:
> > Hi,
> >
> > I'm on 0.6.4. Previous tickets in the JIRA in searching the web
> > indicated that iterating over the keys in keyspace is possible, even
> > with the random partitioner. This is mostly desirable in my case for
> > testing purposes only.
> >
> > I get the following error:
> >
> > [junit] Internal error processing get_range_slices
> > [junit] org.apache.thrift.TApplicationException: Internal error processing get_range_slices
> >
> > and the following server traceback:
> >
> > java.lang.NumberFormatException: Zero length BigInteger
> > at java.math.BigInteger.<init>(BigInteger.java:295)
> > at java.math.BigInteger.<init>(BigInteger.java:467)
> > at org.apache.cassandra.dht.RandomPartitioner$1.fromString(RandomPartitioner.java:100)
> > at org.apache.cassandra.thrift.CassandraServer.getRangeSlicesInternal(CassandraServer.java:575)
> >
> > I am using the scala cascal client, but am sure that get_range_slice
> > is being called with start and stop set to "".
> >
> > 1) Is batch iteration possible with random partitioner?
> >
> > This isn't clear from the FAQ entry on the subject:
> >
> > http://wiki.apache.org/cassandra/FAQ#iter_world
> >
> > 2) The FAQ states that start argument should be "". What should the
> > end argument be?
> >
> > thanks!
> > Adam