RE: error using get_range_slice with random partitioner

Adam Crain Fri, 06 Aug 2010 15:06:40 -0700

I ran against the 0.6 branch and I still see similarly odd results. My test
cases prove that the set of keys has been successfully inserted, but usually I
either never see the first key again or I reach the first key before having
seen all of the keys.
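
For reference, the paging pattern discussed in this thread can be sketched as
below. This is only an in-memory illustration under stated assumptions, not the
Thrift API or the cascal client: `token`, `fake_get_range_slices` and
`iterate_all_keys` are hypothetical names, and the MD5-based ordering is a
simplified stand-in for how the random partitioner orders rows. It shows the
intended behaviour: each page starts at the last key of the previous page
(inclusive), the repeated first key is dropped, and iteration stops when
nothing new comes back.

```python
import hashlib


def token(key: str) -> int:
    """Simplified stand-in for a random-partitioner token (assumption)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def fake_get_range_slices(stored_keys, start_key, count):
    """Hypothetical in-memory range query.

    Returns up to `count` keys in token order, beginning at `start_key`
    (inclusive). An empty `start_key` means "from the start of the ring".
    """
    ordered = sorted(stored_keys, key=token)
    if start_key:
        start = token(start_key)
        ordered = [k for k in ordered if token(k) >= start]
    return ordered[:count]


def iterate_all_keys(stored_keys, batch_size=10):
    """Page through all keys, returning each key exactly once."""
    if batch_size < 2:
        # With a batch of 1, every page after the first would contain
        # only the (already seen) start key, so no progress is possible.
        raise ValueError("batch_size must be at least 2")
    seen = []
    start_key = ""
    while True:
        page = fake_get_range_slices(stored_keys, start_key, batch_size)
        # The start key is inclusive, so drop it on every page but the first.
        if start_key and page and page[0] == start_key:
            page = page[1:]
        if not page:
            break  # nothing new came back: iteration is complete
        seen.extend(page)
        start_key = page[-1]
    return seen


if __name__ == "__main__":
    keys = [f"key{i}" for i in range(1, 31)]
    result = iterate_all_keys(keys, batch_size=10)
    print(len(result), "keys seen,", len(set(result)), "distinct")
```

In this sketch the order of the returned keys looks random (it follows the
hash, not the key names), but every key appears once and the loop never wraps
around to the first key. A real cluster may behave differently from this model.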


-Adam



-----Original Message-----
From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com]
Sent: Fri 8/6/2010 4:25 PM
To: user@cassandra.apache.org
Subject: Re: error using get_range_slice with random partitioner
 
If you're willing to try it out, the easiest way to check whether it is 
resolved by the patch for CASSANDRA-1145 is to check out the 0.6 branch:

svn checkout http://svn.apache.org/repos/asf/cassandra/branches/cassandra-0.6/ 
cassandra-0.6

Then run `ant` to build the binaries.

On Aug 6, 2010, at 2:57 PM, Adam Crain wrote:

> Hi Jeremy,
> 
> So, I fixed my client so it preserves the ordering and I get results that may 
> be related to the bug.
> 
> If I insert 30 keys into the random partitioner with names [key1, key2, ... 
> key30] and then start the iteration (with a batch size of 10) I get the 
> following debug output during the iteration:
> 
> [junit] Query w/ Range(,,10) result size: 10
> [junit] key18
> [junit] key23
> [junit] key26
> [junit] key27
> [junit] key12
> [junit] key28
> [junit] key4
> [junit] key3
> [junit] key1
> [junit] key24
> [junit] Query w/ Range(key24,,10) result size: 10
> [junit] key24
> [junit] key5
> [junit] key17
> [junit] key29
> [junit] key19
> [junit] key8
> [junit] key15
> [junit] key22
> [junit] key6
> [junit] key25
> [junit] Query w/ Range(key25,,10) result size: 3
> [junit] key25
> [junit] key14
> [junit] key2
> [junit] Query w/ Range(key2,,10), result size: 1
> [junit] key2
> 
> I never make it back around to key18 as expected, and I never see all of the 
> keys.
> 
> -Adam
> 
> -----Original Message-----
> From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com]
> Sent: Fri 8/6/2010 11:45 AM
> To: user@cassandra.apache.org
> Subject: Re: error using get_range_slice with random partitioner
> 
> Sounds like what you're seeing is in the client, but there was another 
> duplicate-key bug with get_range_slice that was recently fixed on the 
> cassandra-0.6 branch.  It's slated for 0.6.5, which will probably be out 
> sometime this month, based on previous minor releases.
> 
> https://issues.apache.org/jira/browse/CASSANDRA-1145
> 
> On Aug 6, 2010, at 10:29 AM, Adam Crain wrote:
> 
>> Thanks Dave. I'm using 0.6.4 since I saw this issue in the JIRA, but I just 
>> discovered that the client I'm using mutates the order of keys after 
>> retrieving the result with the thrift API... pretty much making key 
>> iteration impossible. So time to fork and see if they'll fix it :(.
>> 
>> I'll review yours as soon as I get the client fixed that I'm using.
>> 
>> Adam
>> 
>> 
>> -----Original Message-----
>> From: davevi...@gmail.com on behalf of Dave Viner
>> Sent: Fri 8/6/2010 11:28 AM
>> To: user@cassandra.apache.org
>> Subject: Re: error using get_range_slice with random partitioner
>> 
>> Funny you should ask... I just went through the same exercise.
>> 
>> You must use Cassandra 0.6.4.  Otherwise you will get duplicate keys.
>> However, here is a snippet of perl that you can use.
>> 
>> use Data::Dumper;  # needed for Dumper() in the error handler below
>> 
>> our $WANTED_COLUMN_NAME = 'mycol';
>> get_key_to_one_column_map('myKeySpace', 'myColFamily', 'mySuperCol', QUORUM,
>> \%map);
>> 
>> sub get_key_to_one_column_map
>> {
>>   my ($keyspace, $column_family_name, $super_column_name,
>> $consistency_level, $returned_keys) = @_;
>> 
>> 
>>   my($socket, $transport, $protocol, $client, $result, $predicate,
>> $column_parent, $keyrange);
>> 
>>   $column_parent = new Cassandra::ColumnParent();
>>   $column_parent->{'column_family'} = $column_family_name;
>>   $column_parent->{'super_column'} = $super_column_name;
>> 
>>   $keyrange = new Cassandra::KeyRange({
>>           'start_key' => '', 'end_key' => '', 'count' => 10
>>   });
>> 
>> 
>>   $predicate = new Cassandra::SlicePredicate();
>>   $predicate->{'column_names'} = [$WANTED_COLUMN_NAME];
>> 
>>   eval
>>   {
>>       $socket = new Thrift::Socket($CASSANDRA_HOST, $CASSANDRA_PORT);
>>       $transport = new Thrift::BufferedTransport($socket, 1024, 1024);
>>       $protocol = new Thrift::BinaryProtocol($transport);
>>       $client = new Cassandra::CassandraClient($protocol);
>>       $transport->open();
>> 
>> 
>>       my($next_start_key, $one_res, $iteration, $have_more, $value,
>> $local_count, $previous_start_key);
>> 
>>       $iteration = 0;
>>       $have_more = 1;
>>       while ($have_more == 1)
>>       {
>>           $iteration++;
>>           $result = undef;
>> 
>>           $result = $client->get_range_slices($keyspace, $column_parent,
>> $predicate, $keyrange, $consistency_level);
>> 
>>           # on success, results is an array of objects.
>> 
>>           if (scalar(@$result) == 1)
>>           {
>>               # we only got 1 result... check to see if it's the
>>               # same key as the start key... if so, we're done.
>>               if ($result->[0]->{'key'} eq $keyrange->{'start_key'})
>>               {
>>                   $have_more = 0;
>>                   last;
>>               }
>>           }
>> 
>>           # check to see if we are starting with some value
>>           # if so, we throw away the first result.
>>           if ($keyrange->{'start_key'})
>>           {
>>               shift(@$result);
>>           }
>>           if (scalar(@$result) == 0)
>>           {
>>               $have_more = 0;
>>               last;
>>           }
>> 
>>           $previous_start_key = $keyrange->{'start_key'};
>>           $local_count = 0;
>> 
>>           for (my $r = 0; $r < scalar(@$result); $r++)
>>           {
>>               $one_res = $result->[$r];
>>               $next_start_key = $one_res->{'key'};
>> 
>>               $keyrange->{'start_key'} = $next_start_key;
>> 
>>               if (!exists($returned_keys->{$next_start_key}))
>>               {
>>                   $have_more = 1;
>>                   $local_count++;
>>               }
>> 
>> 
>>               next if (scalar(@{ $one_res->{'columns'} }) == 0);
>> 
>>               $value = undef;
>> 
>>               for (my $i = 0; $i < scalar(@{ $one_res->{'columns'} });
>> $i++)
>>               {
>>                   if ($one_res->{'columns'}->[$i]->{'column'}->{'name'} eq
>> $WANTED_COLUMN_NAME)
>>                   {
>>                       $value =
>> $one_res->{'columns'}->[$i]->{'column'}->{'value'};
>>                       if (!exists($returned_keys->{$next_start_key}))
>>                       {
>>                           $returned_keys->{$next_start_key} = $value;
>>                       }
>>                       else
>>                       {
>>                           # NOTE: prior to Cassandra 0.6.4,
>>                           # get_range_slices sometimes returns duplicates.
>>                           #warn "Found second value for key [$next_start_key]"
>>                           #    . " was [" . $returned_keys->{$next_start_key} . "]"
>>                           #    . " now [$value]!";
>>                       }
>>                   }
>>               }
>>               $have_more = 1;
>>           } # end results loop
>> 
>>           if ($keyrange->{'start_key'} eq $previous_start_key)
>>           {
>>               $have_more = 0;
>>           }
>> 
>>       } # end while() loop
>> 
>>       $transport->close();
>>   };
>>   if ($@)
>>   {
>>       warn "Problem with Cassandra: " . Dumper($@);
>>   }
>> 
>>   # cleanup
>>   undef $client;
>>   undef $protocol;
>>   undef $transport;
>>   undef $socket;
>> }
>> 
>> 
>> HTH
>> Dave Viner
>> 
>> On Fri, Aug 6, 2010 at 7:45 AM, Adam Crain
>> <adam.cr...@greenenergycorp.com>wrote:
>> 
>>> Thomas,
>>> 
>>> That was indeed the source of the problem. I naively assumed that the token
>>> range would help me avoid retrieving duplicate rows.
>>> 
>>> If you iterate over the keys, how do you avoid retrieving duplicate keys? I
>>> tried this morning and I seem to get odd results. Maybe this is just a
>>> consequence of the random partitioner. I really don't care about the order
>>> of the iteration; what matters is that I see every key, and each key only
>>> once.
>>> 
>>> -Adam
>>> 
>>> 
>>> -----Original Message-----
>>> From: th.hel...@gmail.com on behalf of Thomas Heller
>>> Sent: Fri 8/6/2010 7:27 AM
>>> To: user@cassandra.apache.org
>>> Subject: Re: error using get_range_slice with random partitioner
>>> 
>>> Wild guess here, but are you using start_token/end_token here when you
>>> should be using start_key? Looks to me like you are trying end_token
>>> = ''.
>>> 
>>> HTH,
>>> /thomas
>>> 
>>> On Thursday, August 5, 2010, Adam Crain <adam.cr...@greenenergycorp.com>
>>> wrote:
>>>> Hi,
>>>> 
>>>> I'm on 0.6.4. Previous tickets in the JIRA and searching the web indicated
>>>> that iterating over the keys in a keyspace is possible, even with the random
>>>> partitioner. This is mostly desirable in my case for testing purposes only.
>>>> 
>>>> I get the following error:
>>>> 
>>>> [junit] Internal error processing get_range_slices
>>>> [junit] org.apache.thrift.TApplicationException: Internal error processing get_range_slices
>>>> 
>>>> and the following server traceback:
>>>> 
>>>> java.lang.NumberFormatException: Zero length BigInteger
>>>>       at java.math.BigInteger.<init>(BigInteger.java:295)
>>>>       at java.math.BigInteger.<init>(BigInteger.java:467)
>>>>       at org.apache.cassandra.dht.RandomPartitioner$1.fromString(RandomPartitioner.java:100)
>>>>       at org.apache.cassandra.thrift.CassandraServer.getRangeSlicesInternal(CassandraServer.java:575)
>>>> 
>>>> I am using the Scala cascal client, but am sure that get_range_slice is
>>>> being called with start and stop set to "".
>>>> 
>>>> 1) Is batch iteration possible with the random partitioner?
>>>> 
>>>> This isn't clear from the FAQ entry on the subject:
>>>> 
>>>> http://wiki.apache.org/cassandra/FAQ#iter_world
>>>> 
>>>> 2) The FAQ states that the start argument should be "". What should the end
>>>> argument be?
>>>> 
>>>> thanks!
>>>> Adam
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
> 
> 
> 
