RE: error using get_range_slice with random partitioner

Adam Crain Fri, 06 Aug 2010 08:31:49 -0700

Thanks Dave. I'm using 0.6.4 since I saw this issue in the JIRA, but I just 
discovered that the client I'm using mutates the order of keys after retrieving 
the result with the thrift API... pretty much making key iteration impossible. 
So time to fork and see if they'll fix it :(.
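
To illustrate why a reordered result breaks paging: the next request has to start from the last key in the order the server returned, not the last key in whatever order the client hands back. The Python sketch below uses a made-up ring order and a hypothetical stand-in for get_range_slices; none of the names come from cascal or the Thrift API.

```python
# Hypothetical ring order: the order in which the server would return these
# keys under the random partitioner (made up for illustration).
RING_ORDER = ["pear", "zebra", "apple", "mango", "kiwi", "fig"]

def get_range(start_key, count):
    """Stand-in for get_range_slices: start_key is inclusive, '' means the start."""
    start = 0 if start_key == "" else RING_ORDER.index(start_key)
    return RING_ORDER[start:start + count]

first_page = get_range("", 3)            # server order: pear, zebra, apple
correct_resume_key = first_page[-1]      # last key in ring order

reordered_page = sorted(first_page)      # what an order-mutating client hands back
wrong_resume_key = reordered_page[-1]    # last key alphabetically, not in ring order

# Resume from each key and keep only the keys not already seen on the first page.
good_next = [k for k in get_range(correct_resume_key, 3) if k not in first_page]
bad_next = [k for k in get_range(wrong_resume_key, 3) if k not in first_page]
```

Here the wrong resume key re-reads most of the first page and advances by only one new row. In the worst case the reordered "last" key is the key the page started from, and the same request is repeated without ever making progress.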


I'll review yours as soon as I get the client I'm using fixed.

Adam


-----Original Message-----
From: [email protected] on behalf of Dave Viner
Sent: Fri 8/6/2010 11:28 AM
To: [email protected]
Subject: Re: error using get_range_slice with random partitioner
 
Funny you should ask... I just went through the same exercise.

You must use Cassandra 0.6.4 or later. Before 0.6.4, get_range_slices
sometimes returns duplicate keys. Here is a snippet of Perl that you can use.

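use Data::Dumper;    # provides Dumper(), used in the error handler below

# $CASSANDRA_HOST and $CASSANDRA_PORT are assumed to be defined elsewhere.
our $WANTED_COLUMN_NAME = 'mycol';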
get_key_to_one_column_map('myKeySpace', 'myColFamily', 'mySuperCol', QUORUM,
                          \%map);

sub get_key_to_one_column_map
{
    my ($keyspace, $column_family_name, $super_column_name,
        $consistency_level, $returned_keys) = @_;


    my ($socket, $transport, $protocol, $client, $result, $predicate,
        $column_parent, $keyrange);

    $column_parent = new Cassandra::ColumnParent();
    $column_parent->{'column_family'} = $column_family_name;
    $column_parent->{'super_column'} = $super_column_name;

    $keyrange = new Cassandra::KeyRange({
            'start_key' => '', 'end_key' => '', 'count' => 10
    });


    $predicate = new Cassandra::SlicePredicate();
    $predicate->{'column_names'} = [$WANTED_COLUMN_NAME];

    eval
    {
        $socket = new Thrift::Socket($CASSANDRA_HOST, $CASSANDRA_PORT);
        $transport = new Thrift::BufferedTransport($socket, 1024, 1024);
        $protocol = new Thrift::BinaryProtocol($transport);
        $client = new Cassandra::CassandraClient($protocol);
        $transport->open();


        my ($next_start_key, $one_res, $iteration, $have_more, $value,
            $local_count, $previous_start_key);

        $iteration = 0;
        $have_more = 1;
        while ($have_more == 1)
        {
            $iteration++;
            $result = undef;

            $result = $client->get_range_slices($keyspace, $column_parent,
                $predicate, $keyrange, $consistency_level);

            # on success, $result is a reference to an array of result objects.

            if (scalar(@$result) == 1)
            {
                # we only got 1 result... check to see if it's the
                # same key as the start key... if so, we're done.
                if ($result->[0]->{'key'} eq $keyrange->{'start_key'})
                {
                    $have_more = 0;
                    last;
                }
            }

            # check to see if we are starting with some value
            # if so, we throw away the first result.
            if ($keyrange->{'start_key'})
            {
                shift(@$result);
            }
            if (scalar(@$result) == 0)
            {
                $have_more = 0;
                last;
            }

            $previous_start_key = $keyrange->{'start_key'};
            $local_count = 0;

            for (my $r = 0; $r < scalar(@$result); $r++)
            {
                $one_res = $result->[$r];
                $next_start_key = $one_res->{'key'};

                $keyrange->{'start_key'} = $next_start_key;

                if (!exists($returned_keys->{$next_start_key}))
                {
                    $have_more = 1;
                    $local_count++;
                }


                next if (scalar(@{ $one_res->{'columns'} }) == 0);

                $value = undef;

                for (my $i = 0; $i < scalar(@{ $one_res->{'columns'} }); $i++)
                {
                    if ($one_res->{'columns'}->[$i]->{'column'}->{'name'}
                        eq $WANTED_COLUMN_NAME)
                    {
                        $value =
                            $one_res->{'columns'}->[$i]->{'column'}->{'value'};
                        if (!exists($returned_keys->{$next_start_key}))
                        {
                            $returned_keys->{$next_start_key} = $value;
                        }
                        else
                        {
                            # NOTE: prior to Cassandra 0.6.4,
                            # get_range_slices sometimes returns duplicates.
                            #warn "Found second value for key [$next_start_key]"
                            #    . " was [" . $returned_keys->{$next_start_key} . "]"
                            #    . " now [$value]!";
                        }
                    }
                }
                $have_more = 1;
            } # end results loop

            if ($keyrange->{'start_key'} eq $previous_start_key)
            {
                $have_more = 0;
            }

        } # end while() loop

        $transport->close();
    };
    if ($@)
    {
        warn "Problem with Cassandra: " . Dumper($@);
    }

    # cleanup
    undef $client;
    undef $protocol;
    undef $transport;
    undef $socket;
}
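
The paging pattern in the snippet above can be condensed into a short, language-neutral sketch. This Python version is an illustration only: FakeStore, token, and iterate_all are hypothetical names, the store is an in-memory substitute for a Cassandra node, and ordering keys by their MD5 hash is a simplified assumption about how the random partitioner lays out rows.

```python
import hashlib

def token(key):
    # Simplified stand-in for the random partitioner: order keys by MD5 hash.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class FakeStore:
    """Hypothetical in-memory substitute for a Cassandra node."""

    def __init__(self, rows):
        # Rows are kept in token order, not in key order.
        self._rows = sorted(rows.items(), key=lambda kv: token(kv[0]))

    def get_range_slices(self, start_key, count):
        # start_key is inclusive; '' means "from the beginning".
        if start_key == "":
            start = 0
        else:
            start_token = token(start_key)
            start = next(
                (i for i, (k, _) in enumerate(self._rows)
                 if token(k) >= start_token),
                len(self._rows),
            )
        return self._rows[start:start + count]

def iterate_all(store, page_size=10):
    """Collect every key exactly once by paging with an inclusive start key."""
    if page_size < 2:
        # The first row of every later page repeats the previous page's last
        # row, so a page of one row could never make progress.
        raise ValueError("page_size must be at least 2")
    seen = {}
    start_key = ""
    while True:
        page = store.get_range_slices(start_key, page_size)
        if start_key != "":
            page = page[1:]          # drop the row already seen
        if not page:
            break                    # nothing new: iteration is complete
        for key, value in page:
            seen.setdefault(key, value)
        start_key = page[-1][0]      # resume from the last key in server order
    return seen
```

For example, `iterate_all(FakeStore({"a": 1, "b": 2, "c": 3}), page_size=2)` returns all three rows. Dropping the first row of each later page is what keeps the inclusive start key from producing a duplicate.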


HTH
Dave Viner

On Fri, Aug 6, 2010 at 7:45 AM, Adam Crain
<[email protected]> wrote:

> Thomas,
>
> That was indeed the source of the problem. I naively assumed that the token
> range would help me avoid retrieving duplicate rows.
>
> If you iterate over the keys, how do you avoid retrieving duplicate keys? I
> tried this morning and I seem to get odd results. Maybe this is just a
> consequence of the random partitioner. I really don't care about the order
> of the iteration; what matters is that I see every key, and each key only
> once.
>
> -Adam
>
>
> -----Original Message-----
> From: [email protected] on behalf of Thomas Heller
> Sent: Fri 8/6/2010 7:27 AM
> To: [email protected]
> Subject: Re: error using get_range_slice with random partitioner
>
> Wild guess here, but are you using start_token/end_token here when you
> should be using start_key? Looks to me like you are trying end_token
> = ''.
>
> HTH,
> /thomas
>
> On Thursday, August 5, 2010, Adam Crain <[email protected]>
> wrote:
> > Hi,
> >
> > I'm on 0.6.4. Previous tickets in the JIRA and a web search indicated that
> > iterating over the keys in a keyspace is possible, even with the random
> > partitioner. I mostly want this for testing purposes.
> >
> > I get the following error:
> >
> > [junit] Internal error processing get_range_slices
> > [junit] org.apache.thrift.TApplicationException: Internal error processing get_range_slices
> >
> > and the following server traceback:
> >
> > java.lang.NumberFormatException: Zero length BigInteger
> >         at java.math.BigInteger.<init>(BigInteger.java:295)
> >         at java.math.BigInteger.<init>(BigInteger.java:467)
> >         at org.apache.cassandra.dht.RandomPartitioner$1.fromString(RandomPartitioner.java:100)
> >         at org.apache.cassandra.thrift.CassandraServer.getRangeSlicesInternal(CassandraServer.java:575)
> >
> > I am using the Scala cascal client, but am sure that get_range_slice is
> > being called with start and stop set to "".
> >
> > 1) Is batch iteration possible with random partitioner?
> >
> > This isn't clear from the FAQ entry on the subject:
> >
> > http://wiki.apache.org/cassandra/FAQ#iter_world
> >
> > 2) The FAQ states that the start argument should be "". What should the
> > end argument be?
> >
> > thanks!
> > Adam
> >
