On Thu, Jun 2, 2011 at 3:44 PM, Russell Brown <russell.br...@me.com> wrote:

<<snip>>
> Upon further reflection, I realized that base64 encoding wouldn't work
> unless I stored the values that way, since it reads the utf8 as strings
> correctly. My plan is to store my values directly as byte values. This is
> easy with the protobuf bytes type. However, how would I then encode them
> in the includes section of the map/reduce JSON block?
>
> I'm confused here: I think you are asking how you can add the bucket/key
> inputs to a m/r job when the PB client MapReduceBuilder only allows
> Strings in the
>
>     addRiakObject(String bucket, String key)
>
> method. I guess you are asking this since you created the objects as
> byte[] values with
>
>     public RiakObject(ByteString bucket, ByteString key, ByteString content)
>
> and stored them with
>
>     public void store(RiakObject value)
>
> but I'm guessing.

Basically, you are correct. We have byte arrays that we use for keys today
that are specifically built to be as small as possible. We were hoping to
use these as-is. We actually moved away from using the current riak
protobuf client because we found it to be somewhat buggy and troublesome to
debug (partly due to the use of erlang coding styles). As users of protobuf
elsewhere, the api was pretty reasonable for us to implement directly,
especially since we really only need the high-speed protobuf api for put,
get and multiget (by using m/r).

> If that is the case then I think the best we can do with the current API
> is to generate a String from your bytes. I guess the ByteString class
> that the PB client uses sends your bytes unmolested, so if you want a
> String representation of your bytes you want to encode them with
> ISO-8859-1. Try
>
>     addRiakObject(new String(yourBucketBytes, "ISO-8859-1"),
>                   new String(yourKeyBytes, "ISO-8859-1"))
>
> when you create the m/r job with the PB MapReduceBuilder.
>
> Does that solve your problem?

I'm horrible at character encoding, but I don't think so once we use those
strings in the map/reduce JSON object.
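As an aside, the client-side half of Russell's suggestion does work: a minimal sketch (not from the thread) showing that ISO-8859-1 maps every byte value 0-255 to exactly one character, so the byte[] -> String -> byte[] round trip in Java is lossless:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        // Arbitrary binary key, including bytes outside printable ASCII.
        byte[] key = {(byte) 0x00, (byte) 0x8f, (byte) 0xff, 0x41};

        // ISO-8859-1 assigns a distinct character to every byte 0-255,
        // so nothing is lost in either direction of the conversion.
        String asString = new String(key, StandardCharsets.ISO_8859_1);
        byte[] back = asString.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(key, back)); // prints "true"
    }
}
```

This only covers the in-JVM conversion, though; it says nothing about what happens once that String is serialized into the map/reduce JSON, which is the concern raised next.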
Unless we remove a bunch of characters from the character set, I believe we
would choke the JSON parsing on the riak side, since ultimately the job
would be read as a utf8 string. It seems more likely that we'd do
bidirectional base85 encoding or something similar (basically mapping
reserved characters to something else). But this adds an extra step that
we'd like to avoid. It would also inflate the memory requirements for our
dataset by a third (not exactly: assuming a 15-byte bucket/key combination,
I guess it would take us from 55 to 60 bytes, ~10% growth given key
overhead). We're still not big fans, since we'd prefer not to fill the
cluster with 500-600mm values and then try to move to native
bytes/ByteStrings later.

<<snip>>

> I've noticed that there is a secondary erlang format that can be passed
> for map reduce jobs, must I use that? If so, does anyone have an example
> of generating one of these from within Java?
>
> The java PB client doesn't currently support the
> application/x-erlang-binary content-type for map/reduce jobs. I think
> that only the erlang pb client does.

I understand that the current client doesn't support this. I was more
thinking of using Jinterface
<http://www.erlang.org/doc/apps/jinterface/java/com/ericsson/otp/erlang/package-summary.html>
to generate the erlang version of the map/reduce job. I haven't worked with
it and really don't have any knowledge of erlang types, but figured it
might be possible. I guess the question was whether anybody thought this
was feasible.

<<snip>>

> 2. Are the request and response threads in Riak separate or sequential?
> For example, if I send 5 normal PbcGetReq requests in quick succession on
> a single socket, does Riak finish the first one before starting on
> requests 2-5? Or does it rather thread the requests out as they come in,
> so it will get 2-5 simultaneously? I'm asking this because I'm trying to
> figure out how much I should try to reuse a single socket connection.
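The size arithmetic above checks out. A quick sketch (using java.util.Base64 for illustration; any base64 codec behaves the same way): base64 emits 4 output characters per 3 input bytes, so a 15-byte bucket/key combination grows to 20 bytes.

```java
import java.util.Base64;

public class KeyEncodingOverhead {
    public static void main(String[] args) {
        byte[] bucketAndKey = new byte[15]; // 15-byte bucket/key combination

        // 4 output chars per 3 input bytes; 15 is a multiple of 3,
        // so there is no padding and the output is exactly 20 chars.
        String encoded = Base64.getEncoder().encodeToString(bucketAndKey);
        System.out.println(encoded.length()); // prints "20"

        // Against a 55-byte stored object, those 5 extra bytes are the
        // ~10% growth estimated above (integer percent shown).
        System.out.println(100 * (encoded.length() - 15) / 55); // prints "9"
    }
}
```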
> Talking about the java pb client? Each thread gets its own socket. If you
> want to do 5 concurrent gets, create 5 threads, pass a pb riak client to
> each, and have each thread do a get; you will get 5 open sockets and 5
> concurrent gets. If those threads do more operations within a second
> (there is a Timer thread reaping connections that are unused for 1
> second), they will reuse the same connection.

I was asking about the Riak side. Our real need is multi-get. We're looking
at regular pulls of 20 random bucket/key values, and we need to minimize
the latency of each of the pulls. I know that we could split this into 20
separate sockets. (Ick... this gets ugly when we're talking about many
simultaneous pulls from multiple servers. I'd rather not create pools of
100 sockets per requesting server if I could avoid it.) I was wondering
whether, if we sent four requests in a row to riak on the same socket, it
would work on them all at once or serially. I'm guessing serially.

> I hope I've gone some way to answering your questions.

Yes, I appreciate the help you're providing. Thank you. We're currently
utilizing LinkedIn's Project Voldemort but were hoping to transition to
Riak, because our primary goal is to minimize disk seeks. I know multi-get
was previously considered and rejected due to "lack of demand". I was
hoping we could utilize the map-reduce functionality to proxy this
functionality.

> I'd like to get your map/reduce query working, and it seems you've hit a
> genuine blind spot in the current API (storing a value with a byte[]
> bucket/key while the MapReduceBuilder requires Strings for bucket/key),
> so I want to find a workaround and make sure that it is in the next
> version of the API. Thanks for your patience: if you could send me some
> code that reproduces your problem (a github gist is ideal for this) then
> it'd make it easier.

I'm sure I can make it work utilizing the path you referenced above and
base85 or base64.
It is a simple return-all-values map job. Based on what you're saying, it
seems like we have three options:

- Use a string-based binary encoding for our bucket/key names (e.g. base64).
- Use a truckload of sockets. (How will Riak perform if we are generating,
  let's say, 200-500 connections per riak node?)
- Try to figure out encoding an erlang version of the map/reduce job using
  JInterface (assuming the MapReduce api supports binary buckets and keys
  when using an erlang content-type).

Am I missing any options? I'm inclined to see if option 3 is viable unless
someone says that is a fool's dream.

Thanks,
Jacques
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com