It sounds like what you're seeing is in the client, but there was another 
duplicate-key bug with get_range_slice that was recently fixed on the 
cassandra-0.6 branch.  The fix is slated for 0.6.5, which, judging by previous 
minor releases, will probably be out sometime this month.

https://issues.apache.org/jira/browse/CASSANDRA-1145
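
Until then, a client-side guard against repeated keys might look like the
sketch below. It is only an illustration: fetch_page and iterate_keys are
hypothetical names, and fetch_page is an in-memory stand-in for
get_range_slices that orders keys by MD5 digest to mimic the random
partitioner. It is not a real Thrift call.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-in for get_range_slices: returns up to $count keys whose
# MD5 digest sorts at or after that of $start_key (start inclusive), in digest
# order. An empty $start_key means "from the beginning".
sub fetch_page {
    my ($all_keys, $start_key, $count) = @_;
    my @sorted = sort { md5_hex($a) cmp md5_hex($b) } @$all_keys;
    if ($start_key ne '') {
        my $start_token = md5_hex($start_key);
        @sorted = grep { md5_hex($_) ge $start_token } @sorted;
    }
    my $last = $#sorted < $count - 1 ? $#sorted : $count - 1;
    return [ @sorted[0 .. $last] ];
}

# Walk all keys page by page. Each page restarts at the last key seen, so the
# first row of every later page is a repeat; %seen drops that row and any
# other duplicate the server might return.
sub iterate_keys {
    my ($fetch, $count) = @_;
    my (%seen, @keys);
    my $start_key = '';
    while (1) {
        my $page = $fetch->($start_key, $count);
        my $new  = 0;
        for my $key (@$page) {
            next if $seen{$key}++;
            push @keys, $key;
            $new++;
        }
        last if $new == 0;         # nothing new: the range is exhausted
        $start_key = $page->[-1];  # resume from the last key returned
    }
    return \@keys;
}

# Example data (made up): 25 row keys, fetched 10 at a time.
my @stored = map { "row$_" } 1 .. 25;
my $keys = iterate_keys(sub { fetch_page(\@stored, @_) }, 10);
printf "%d keys, each seen once\n", scalar @$keys;
```

The page size needs to be at least 2 here, since every page after the first
spends one slot on the repeated start key.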

On Aug 6, 2010, at 10:29 AM, Adam Crain wrote:

> Thanks Dave. I'm using 0.6.4 since I saw this issue in the JIRA, but I just 
> discovered that the client I'm using mutates the order of keys after 
> retrieving the result with the thrift API... pretty much making key iteration 
> impossible. So time to fork and see if they'll fix it :(.
> 
> I'll review yours as soon as I get the client fixed that I'm using.
> 
> Adam
> 
> 
> -----Original Message-----
> From: davevi...@gmail.com on behalf of Dave Viner
> Sent: Fri 8/6/2010 11:28 AM
> To: user@cassandra.apache.org
> Subject: Re: error using get_range_slice with random partitioner
> 
> Funny you should ask... I just went through the same exercise.
> 
> You must use Cassandra 0.6.4.  Otherwise you will get duplicate keys.
> However, here is a snippet of perl that you can use.
> 
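> use Data::Dumper;  # provides Dumper(), used in the error handler below
> 
> # $CASSANDRA_HOST and $CASSANDRA_PORT are assumed to be set elsewhere.
> # QUORUM is assumed to be the consistency-level constant from the
> # Thrift-generated bindings.
> our $WANTED_COLUMN_NAME = 'mycol';
> my %map;  # receives key => column value
> get_key_to_one_column_map('myKeySpace', 'myColFamily', 'mySuperCol', QUORUM,
> \%map);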
> 
> sub get_key_to_one_column_map
> {
>    my ($keyspace, $column_family_name, $super_column_name,
> $consistency_level, $returned_keys) = @_;
> 
> 
>    my($socket, $transport, $protocol, $client, $result, $predicate,
> $column_parent, $keyrange);
> 
>    $column_parent = new Cassandra::ColumnParent();
>    $column_parent->{'column_family'} = $column_family_name;
>    $column_parent->{'super_column'} = $super_column_name;
> 
>    $keyrange = new Cassandra::KeyRange({
>            'start_key' => '', 'end_key' => '', 'count' => 10
>    });
> 
> 
>    $predicate = new Cassandra::SlicePredicate();
>    $predicate->{'column_names'} = [$WANTED_COLUMN_NAME];
> 
>    eval
>    {
>        $socket = new Thrift::Socket($CASSANDRA_HOST, $CASSANDRA_PORT);
>        $transport = new Thrift::BufferedTransport($socket, 1024, 1024);
>        $protocol = new Thrift::BinaryProtocol($transport);
>        $client = new Cassandra::CassandraClient($protocol);
>        $transport->open();
> 
> 
>        my($next_start_key, $one_res, $iteration, $have_more, $value,
> $local_count, $previous_start_key);
> 
>        $iteration = 0;
>        $have_more = 1;
>        while ($have_more == 1)
>        {
>            $iteration++;
>            $result = undef;
> 
>            $result = $client->get_range_slices($keyspace, $column_parent,
> $predicate, $keyrange, $consistency_level);
> 
>            # on success, results is an array of objects.
> 
>            if (scalar(@$result) == 1)
>            {
>                # we only got 1 result... check to see if it's the
>                # same key as the start key... if so, we're done.
>                if ($result->[0]->{'key'} eq $keyrange->{'start_key'})
>                {
>                    $have_more = 0;
>                    last;
>                }
>            }
> 
>            # check to see if we are starting with some value
>            # if so, we throw away the first result.
>            if ($keyrange->{'start_key'})
>            {
>                shift(@$result);
>            }
>            if (scalar(@$result) == 0)
>            {
>                $have_more = 0;
>                last;
>            }
> 
>            $previous_start_key = $keyrange->{'start_key'};
>            $local_count = 0;
> 
>            for (my $r = 0; $r < scalar(@$result); $r++)
>            {
>                $one_res = $result->[$r];
>                $next_start_key = $one_res->{'key'};
> 
>                $keyrange->{'start_key'} = $next_start_key;
> 
>                if (!exists($returned_keys->{$next_start_key}))
>                {
>                    $have_more = 1;
>                    $local_count++;
>                }
> 
> 
>                next if (scalar(@{ $one_res->{'columns'} }) == 0);
> 
>                $value = undef;
> 
>                for (my $i = 0; $i < scalar(@{ $one_res->{'columns'} });
> $i++)
>                {
>                    if ($one_res->{'columns'}->[$i]->{'column'}->{'name'} eq
> $WANTED_COLUMN_NAME)
>                    {
>                        $value =
> $one_res->{'columns'}->[$i]->{'column'}->{'value'};
>                        if (!exists($returned_keys->{$next_start_key}))
>                        {
>                            $returned_keys->{$next_start_key} = $value;
>                        }
>                        else
>                        {
>                            # NOTE: prior to Cassandra 0.6.4,
>                            # get_range_slices returns duplicates sometimes.
>                            #warn "Found second value for key [$next_start_key]"
>                            #    . " was [" . $returned_keys->{$next_start_key} . "] now [$value]!";
>                        }
>                    }
>                }
>                $have_more = 1;
>            } # end results loop
> 
>            if ($keyrange->{'start_key'} eq $previous_start_key)
>            {
>                $have_more = 0;
>            }
> 
>        } # end while() loop
> 
>        $transport->close();
>    };
>    if ($@)
>    {
>        warn "Problem with Cassandra: " . Dumper($@);
>    }
> 
>    # cleanup
>    undef $client;
>    undef $protocol;
>    undef $transport;
>    undef $socket;
> }
> 
> 
> HTH
> Dave Viner
> 
> On Fri, Aug 6, 2010 at 7:45 AM, Adam Crain
> <adam.cr...@greenenergycorp.com> wrote:
> 
>> Thomas,
>> 
>> That was indeed the source of the problem. I naively assumed that the token
>> range would help me avoid retrieving duplicate rows.
>> 
>> If you iterate over the keys, how do you avoid retrieving duplicate keys? I
>> tried this morning and I seem to get odd results. Maybe this is just a
>> consequence of the random partitioner. I really don't care about the order
>> of the iteration; all that matters is that I see every key, and each key
>> only once.
>> 
>> -Adam
>> 
>> 
>> -----Original Message-----
>> From: th.hel...@gmail.com on behalf of Thomas Heller
>> Sent: Fri 8/6/2010 7:27 AM
>> To: user@cassandra.apache.org
>> Subject: Re: error using get_range_slice with random partitioner
>> 
>> Wild guess here, but are you using start_token/end_token here when you
>> should be using start_key? Looks to me like you are trying end_token
>> = ''.
>> 
>> HTH,
>> /thomas
>> 
>> On Thursday, August 5, 2010, Adam Crain <adam.cr...@greenenergycorp.com>
>> wrote:
>>> Hi,
>>> 
>>> I'm on 0.6.4. Previous tickets in the JIRA and searching the web indicated
>>> that iterating over the keys in a keyspace is possible, even with the random
>>> partitioner. This is mostly desirable in my case for testing purposes only.
>>> 
>>> I get the following error:
>>> 
>>> [junit] Internal error processing get_range_slices
>>> [junit] org.apache.thrift.TApplicationException: Internal error processing get_range_slices
>>> 
>>> and the following server traceback:
>>> 
>>> java.lang.NumberFormatException: Zero length BigInteger
>>>        at java.math.BigInteger.<init>(BigInteger.java:295)
>>>        at java.math.BigInteger.<init>(BigInteger.java:467)
>>>        at org.apache.cassandra.dht.RandomPartitioner$1.fromString(RandomPartitioner.java:100)
>>>        at org.apache.cassandra.thrift.CassandraServer.getRangeSlicesInternal(CassandraServer.java:575)
>>> 
>>> I am using the Scala cascal client, but am sure that get_range_slice is
>>> being called with start and stop set to "".
>>> 
>>> 1) Is batch iteration possible with the random partitioner?
>>> 
>>> This isn't clear from the FAQ entry on the subject:
>>> 
>>> http://wiki.apache.org/cassandra/FAQ#iter_world
>>> 
>>> 2) The FAQ states that the start argument should be "". What should the end
>>> argument be?
>>> 
>>> thanks!
>>> Adam
>>> 