Re: error using get_range_slice with random partitioner

Thomas Heller Fri, 06 Aug 2010 15:35:59 -0700

Hey,

[junit] key24
[junit] Query w/ Range(key24,,10) result size: 10
[junit] key24


I think this is actually the expected result. Whenever you use
get_range_slices with start_key/end_key, the start key is inclusive,
so the last key of one batch comes back as the first key of the next.
You must either increment the last key you received and use that as
the next start_key, or reuse it and skip the first result. I also
tried to use tokens because of exactly that behaviour and the doc
talking about inclusive/exclusive.
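
A minimal sketch of that paging pattern, in Python against an in-memory
stand-in rather than a real cluster. Everything here is an assumption made
for illustration: fake_get_range_slices is a hypothetical helper that only
mimics "rows come back in token order, start_key is inclusive", and token()
assumes the RandomPartitioner-style MD5 ordering.

```python
import hashlib


def token(key: str) -> int:
    # Assumption: MD5 digest read as a signed big-endian integer, absolute value.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def fake_get_range_slices(rows, start_key, count):
    """Hypothetical stand-in for get_range_slices with an empty end_key.

    Returns up to `count` keys in token order; start_key is inclusive.
    """
    ordered = sorted(rows, key=token)
    if start_key != "":
        start = token(start_key)
        ordered = [k for k in ordered if token(k) >= start]
    return ordered[:count]


def iterate_all_keys(rows, batch_size=10):
    """Page through every key once, reusing the last key as the next start."""
    if batch_size < 2:
        # With a batch of 1, every page after the first would hold only the
        # repeated start key, so the loop would stop after one key.
        raise ValueError("batch_size must be at least 2")
    seen = []
    start_key = ""
    while True:
        batch = fake_get_range_slices(rows, start_key, batch_size)
        # The inclusive start key repeats the last key of the previous page.
        if start_key != "" and batch and batch[0] == start_key:
            batch = batch[1:]
        if not batch:
            break
        seen.extend(batch)
        start_key = batch[-1]
    return seen
```

Calling iterate_all_keys(["key%d" % i for i in range(1, 31)]) should visit
each of the 30 keys exactly once, in token order rather than name order.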

Tokens are actually what the Partitioner uses to decide which nodes
your data goes to, so in the case of RP (RandomPartitioner) it is the
MD5 hash of your actual key as a 128-bit BigInteger (just try nodetool
ring to see some Tokens ;). get_range_slices with start/end_token is
best used together with describe_ring/describe_splits so you can talk
to the nodes directly. The Hadoop/Pig stuff uses tokens, for example.
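
To make the token idea concrete, here is a rough Python sketch of how such a
token could be derived. It assumes the partitioner reads the MD5 digest as a
signed big-endian integer and takes the absolute value (the behaviour of
Java's BigInteger(byte[]).abs()); the function name is made up for this
example and is not part of any Cassandra API.

```python
import hashlib


def random_partitioner_token(key: bytes) -> int:
    """Sketch of an RP-style token: |MD5(key)| as a big integer (assumed)."""
    digest = hashlib.md5(key).digest()
    # Java's BigInteger(byte[]) is two's-complement big-endian, hence signed=True.
    return abs(int.from_bytes(digest, byteorder="big", signed=True))


keys = ["key%d" % i for i in range(1, 31)]
by_token = sorted(keys, key=lambda k: random_partitioner_token(k.encode("utf-8")))

# Rows come back in this order, not in key-name order.
for k in by_token[:5]:
    print(k, random_partitioner_token(k.encode("utf-8")))
```

Sorting by this value is why a range query over key1..key30 returns the keys
in what looks like a shuffled order.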


HTH,
/thomas

On Sat, Aug 7, 2010 at 12:06 AM, Adam Crain
<adam.cr...@greenenergycorp.com> wrote:
> I ran against the 0.6 branch and I still see similarly odd results. My test 
> cases prove that the set of keys has been successfully inserted, but usually 
> I either never see the first key again or I reach the first key before 
> having seen all of the keys.
>
> -Adam
>
>
>
> -----Original Message-----
> From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com]
> Sent: Fri 8/6/2010 4:25 PM
> To: user@cassandra.apache.org
> Subject: Re: error using get_range_slice with random partitioner
>
> If you're willing to try it out, the easiest way to check whether it is 
> resolved by the patch for CASSANDRA-1145 is to check out the 0.6 branch:
>
> svn checkout http://svn.apache.org/repos/asf/cassandra/branches/cassandra-0.6/ cassandra-0.6
>
> Then run `ant` to build the binaries.
>
> On Aug 6, 2010, at 2:57 PM, Adam Crain wrote:
>
>> Hi Jeremy,
>>
>> So, I fixed my client so it preserves the ordering and I get results that 
>> may be related to the bug.
>>
>> If I insert 30 keys into the random partitioner with names [key1, key2, ... 
>> key30] and then start the iteration (with a batch size of 10) I get the 
>> following debug output during the iteration:
>>
>> [junit] Query w/ Range(,,10) result size: 10
>> [junit] key18
>> [junit] key23
>> [junit] key26
>> [junit] key27
>> [junit] key12
>> [junit] key28
>> [junit] key4
>> [junit] key3
>> [junit] key1
>> [junit] key24
>> [junit] Query w/ Range(key24,,10) result size: 10
>> [junit] key24
>> [junit] key5
>> [junit] key17
>> [junit] key29
>> [junit] key19
>> [junit] key8
>> [junit] key15
>> [junit] key22
>> [junit] key6
>> [junit] key25
>> [junit] Query w/ Range(key25,,10) result size: 3
>> [junit] key25
>> [junit] key14
>> [junit] key2
>> [junit] Query w/ Range(key2,,10), result size: 1
>> [junit] key2
>>
>> I never make it back around to key 18 as expected, and I never see all of 
>> the keys.
>>
>> -Adam
>>
>> -----Original Message-----
>> From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com]
>> Sent: Fri 8/6/2010 11:45 AM
>> To: user@cassandra.apache.org
>> Subject: Re: error using get_range_slice with random partitioner
>>
>> Sounds like what you're seeing is in the client, but there was another 
>> duplicate bug with get_range_slice that was recently fixed on cassandra-0.6 
>> branch.  It's slated for 0.6.5 which will probably be out sometime this 
>> month, based on previous minor releases.
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-1145
>>
>> On Aug 6, 2010, at 10:29 AM, Adam Crain wrote:
>>
>>> Thanks Dave. I'm using 0.6.4 since I saw this issue in the JIRA, but I just 
>>> discovered that the client I'm using mutates the order of keys after 
>>> retrieving the result with the thrift API... pretty much making key 
>>> iteration impossible. So time to fork and see if they'll fix it :(.
>>>
>>> I'll review yours as soon as I get the client fixed that I'm using.
>>>
>>> Adam
>>>
>>>
>>> -----Original Message-----
>>> From: davevi...@gmail.com on behalf of Dave Viner
>>> Sent: Fri 8/6/2010 11:28 AM
>>> To: user@cassandra.apache.org
>>> Subject: Re: error using get_range_slice with random partitioner
>>>
>>> Funny you should ask... I just went through the same exercise.
>>>
>>> You must use Cassandra 0.6.4.  Otherwise you will get duplicate keys.
>>> However, here is a snippet of perl that you can use.
>>>
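>>> use Data::Dumper;   # provides Dumper(), used in the error handler below
>>> our $WANTED_COLUMN_NAME = 'mycol';
>>> my %map;            # filled in with key => column value
>>> get_key_to_one_column_map('myKeySpace', 'myColFamily', 'mySuperCol', QUORUM, \%map);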
>>>
>>> sub get_key_to_one_column_map
>>> {
>>>   my ($keyspace, $column_family_name, $super_column_name,
>>>       $consistency_level, $returned_keys) = @_;
>>>
>>>
>>>   my($socket, $transport, $protocol, $client, $result, $predicate,
>>>      $column_parent, $keyrange);
>>>
>>>   $column_parent = new Cassandra::ColumnParent();
>>>   $column_parent->{'column_family'} = $column_family_name;
>>>   $column_parent->{'super_column'} = $super_column_name;
>>>
>>>   $keyrange = new Cassandra::KeyRange({
>>>           'start_key' => '', 'end_key' => '', 'count' => 10
>>>   });
>>>
>>>
>>>   $predicate = new Cassandra::SlicePredicate();
>>>   $predicate->{'column_names'} = [$WANTED_COLUMN_NAME];
>>>
>>>   eval
>>>   {
>>>       $socket = new Thrift::Socket($CASSANDRA_HOST, $CASSANDRA_PORT);
>>>       $transport = new Thrift::BufferedTransport($socket, 1024, 1024);
>>>       $protocol = new Thrift::BinaryProtocol($transport);
>>>       $client = new Cassandra::CassandraClient($protocol);
>>>       $transport->open();
>>>
>>>
>>>       my($next_start_key, $one_res, $iteration, $have_more, $value,
>>>          $local_count, $previous_start_key);
>>>
>>>       $iteration = 0;
>>>       $have_more = 1;
>>>       while ($have_more == 1)
>>>       {
>>>           $iteration++;
>>>           $result = undef;
>>>
>>>           $result = $client->get_range_slices($keyspace, $column_parent,
>>>               $predicate, $keyrange, $consistency_level);
>>>
>>>           # on success, results is an array of objects.
>>>
>>>           if (scalar(@$result) == 1)
>>>           {
>>>               # we only got 1 result... check to see if it's the
>>>               # same key as the start key... if so, we're done.
>>>               if ($result->[0]->{'key'} eq $keyrange->{'start_key'})
>>>               {
>>>                   $have_more = 0;
>>>                   last;
>>>               }
>>>           }
>>>
>>>           # check to see if we are starting with some value
>>>           # if so, we throw away the first result.
>>>           if ($keyrange->{'start_key'})
>>>           {
>>>               shift(@$result);
>>>           }
>>>           if (scalar(@$result) == 0)
>>>           {
>>>               $have_more = 0;
>>>               last;
>>>           }
>>>
>>>           $previous_start_key = $keyrange->{'start_key'};
>>>           $local_count = 0;
>>>
>>>           for (my $r = 0; $r < scalar(@$result); $r++)
>>>           {
>>>               $one_res = $result->[$r];
>>>               $next_start_key = $one_res->{'key'};
>>>
>>>               $keyrange->{'start_key'} = $next_start_key;
>>>
>>>               if (!exists($returned_keys->{$next_start_key}))
>>>               {
>>>                   $have_more = 1;
>>>                   $local_count++;
>>>               }
>>>
>>>
>>>               next if (scalar(@{ $one_res->{'columns'} }) == 0);
>>>
>>>               $value = undef;
>>>
>>>               for (my $i = 0; $i < scalar(@{ $one_res->{'columns'} }); $i++)
>>>               {
>>>                   if ($one_res->{'columns'}->[$i]->{'column'}->{'name'} eq $WANTED_COLUMN_NAME)
>>>                   {
>>>                       $value = $one_res->{'columns'}->[$i]->{'column'}->{'value'};
>>>                       if (!exists($returned_keys->{$next_start_key}))
>>>                       {
>>>                           $returned_keys->{$next_start_key} = $value;
>>>                       }
>>>                       else
>>>                       {
>>>                           # NOTE: prior to Cassandra 0.6.4,
>>>                           # get_range_slices sometimes returns duplicates.
>>>                           #warn "Found second value for key [$next_start_key]  was [" . $returned_keys->{$next_start_key} . "] now [$value]!";
>>>                       }
>>>                   }
>>>               }
>>>               $have_more = 1;
>>>           } # end results loop
>>>
>>>           if ($keyrange->{'start_key'} eq $previous_start_key)
>>>           {
>>>               $have_more = 0;
>>>           }
>>>
>>>       } # end while() loop
>>>
>>>       $transport->close();
>>>   };
>>>   if ($@)
>>>   {
>>>       warn "Problem with Cassandra: " . Dumper($@);
>>>   }
>>>
>>>   # cleanup
>>>   undef $client;
>>>   undef $protocol;
>>>   undef $transport;
>>>   undef $socket;
>>> }
>>>
>>>
>>> HTH
>>> Dave Viner
>>>
>>> On Fri, Aug 6, 2010 at 7:45 AM, Adam Crain
>>> <adam.cr...@greenenergycorp.com> wrote:
>>>
>>>> Thomas,
>>>>
>>>> That was indeed the source of the problem. I naively assumed that the token
>>>> range would help me avoid retrieving duplicate rows.
>>>>
>>>> If you iterate over the keys, how do you avoid retrieving duplicate keys? I
>>>> tried this morning and I seem to get odd results. Maybe this is just a
>>>> consequence of the random partitioner. I really don't care about the order
>>>> of the iteration; what matters is that I see each key exactly once and
>>>> that I see all keys.
>>>>
>>>> -Adam
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: th.hel...@gmail.com on behalf of Thomas Heller
>>>> Sent: Fri 8/6/2010 7:27 AM
>>>> To: user@cassandra.apache.org
>>>> Subject: Re: error using get_range_slice with random partitioner
>>>>
>>>> Wild guess here, but are you using start_token/end_token here when you
>>>> should be using start_key? Looks to me like you are trying end_token
>>>> = ''.
>>>>
>>>> HTH,
>>>> /thomas
>>>>
>>>> On Thursday, August 5, 2010, Adam Crain <adam.cr...@greenenergycorp.com>
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> I'm on 0.6.4. Previous tickets in the JIRA and searches of the web indicated
>>>>> that iterating over the keys in a keyspace is possible, even with the random
>>>>> partitioner. This is mostly desirable in my case for testing purposes only.
>>>>>
>>>>> I get the following error:
>>>>>
>>>>> [junit] Internal error processing get_range_slices
>>>>> [junit] org.apache.thrift.TApplicationException: Internal error processing get_range_slices
>>>>>
>>>>> and the following server traceback:
>>>>>
>>>>> java.lang.NumberFormatException: Zero length BigInteger
>>>>>       at java.math.BigInteger.<init>(BigInteger.java:295)
>>>>>       at java.math.BigInteger.<init>(BigInteger.java:467)
>>>>>       at org.apache.cassandra.dht.RandomPartitioner$1.fromString(RandomPartitioner.java:100)
>>>>>       at org.apache.cassandra.thrift.CassandraServer.getRangeSlicesInternal(CassandraServer.java:575)
>>>>>
>>>>> I am using the Scala Cascal client, but am sure that get_range_slice is
>>>>> being called with start and stop set to "".
>>>>>
>>>>> 1) Is batch iteration possible with the random partitioner?
>>>>>
>>>>> This isn't clear from the FAQ entry on the subject:
>>>>>
>>>>> http://wiki.apache.org/cassandra/FAQ#iter_world
>>>>>
>>>>> 2) The FAQ states that the start argument should be "". What should the end
>>>>> argument be?
>>>>>
>>>>> thanks!
>>>>> Adam