On Thu, Jun 2, 2011 at 3:44 PM, Russell Brown <russell.br...@me.com> wrote:

<<snip>>
> Upon further reflection, I realized that base64 encoding wouldn't work
> unless I stored the values that way, since it reads the utf8 as strings
> correctly. My plan is to store my values directly as byte values. This is
> easy with the protobuf bytes type. However, how would I then encode them
> in the includes section of the map/reduce JSON block?
>
> I'm confused here: I think you are asking how you can add the bucket/key
> inputs to a m/r job when the PB client MapReduceBuilder only allows
> Strings in the
>
>     addRiakObject(String bucket, String key)
>
> method. I guess you are asking this since you created the objects as
> byte[] values with
>
>     public RiakObject(ByteString bucket, ByteString key, ByteString content)
>
> and stored them with
>
>     public void store(RiakObject value)
>
> but I'm guessing.

Basically, you are correct. We have byte arrays that we use for keys today
that are specifically built to be as small as possible. We were hoping to
use these as-is. We actually moved away from using the current riak
protobuf client because we found it to be somewhat buggy and troublesome to
debug (partly due to the use of erlang coding styles). As users of protobuf
elsewhere, the api was pretty reasonable for us to implement directly,
especially since we really only need the high-speed protobuf api for put,
get and multiget (by using m/r).

> If that is the case then I think the best we can do with the current API
> is to generate a String from your bytes. I guess the ByteString class
> that the PB client uses sends your bytes unmolested, so if you want a
> String representation of your bytes you want to encode them with
> ISO-8859-1. Try
>
>     addRiakObject(new String(yourBucketBytes, "ISO-8859-1"),
>                   new String(yourKeyBytes, "ISO-8859-1"))
>
> when you create the m/r job with the PB MapReduceBuilder.
>
> Does that solve your problem?

I'm horrible at character encoding, but I don't think so once we use those
strings in the map/reduce JSON object.
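As an aside, the client-side half of Russell's suggestion does work: a minimal sketch (not from the thread) showing that ISO-8859-1 maps every byte value 0-255 to exactly one character, so the byte[] -> String -> byte[] round trip in Java is lossless:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        // Arbitrary binary key, including bytes outside printable ASCII.
        byte[] key = {(byte) 0x00, (byte) 0x8f, (byte) 0xff, 0x41};

        // ISO-8859-1 assigns a distinct character to every byte 0-255,
        // so nothing is lost in either direction of the conversion.
        String asString = new String(key, StandardCharsets.ISO_8859_1);
        byte[] back = asString.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(key, back)); // prints "true"
    }
}
```

This only covers the in-JVM conversion, though; it says nothing about what happens once that String is serialized into the map/reduce JSON, which is the concern raised next.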
Unless we remove a bunch of characters from the character set, I believe we
would choke the JSON parsing on the riak side, since ultimately the job
would be read as a utf8 string. It seems more likely that we'd do
bidirectional base85 encoding or something similar (basically mapping
reserved characters to something else). But this adds an extra step that
we'd like to avoid. It would also inflate the memory requirements for our
dataset by a third (not exactly: assuming a 15-byte bucket/key combination,
I guess it would take us from 55 to 60 bytes, ~10% growth given key
overhead). We're still not big fans, since we'd prefer not to fill the
cluster with 500-600mm values and then try to move to native
bytes/ByteStrings later.

<<snip>>

> I've noticed that there is a secondary erlang format that can be passed
> for map reduce jobs, must I use that? If so, does anyone have an example
> of generating one of these from within Java?
>
> The java PB client doesn't currently support the
> application/x-erlang-binary content-type for map/reduce jobs. I think
> that only the erlang pb client does.

I understand that the current client doesn't support this. I was more
thinking of using Jinterface
<http://www.erlang.org/doc/apps/jinterface/java/com/ericsson/otp/erlang/package-summary.html>
to generate the erlang version of the map/reduce job. I haven't worked with
it and really don't have any knowledge of erlang types, but figured it
might be possible. I guess the question was whether anybody thought this
was feasible.

<<snip>>

> 2. Are the request and response threads in Riak separate or sequential?
> For example, if I send 5 normal PbcGetReq requests in quick succession on
> a single socket, does Riak finish the first one before starting on
> requests 2-5? Or does it rather thread the requests out as they come in,
> so it will get 2-5 simultaneously? I'm asking this because I'm trying to
> figure out how much I should try to reuse a single socket connection.
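The size arithmetic above checks out. A quick sketch (using java.util.Base64 for illustration; any base64 codec behaves the same way): base64 emits 4 output characters per 3 input bytes, so a 15-byte bucket/key combination grows to 20 bytes.

```java
import java.util.Base64;

public class KeyEncodingOverhead {
    public static void main(String[] args) {
        byte[] bucketAndKey = new byte[15]; // 15-byte bucket/key combination

        // 4 output chars per 3 input bytes; 15 is a multiple of 3,
        // so there is no padding and the output is exactly 20 chars.
        String encoded = Base64.getEncoder().encodeToString(bucketAndKey);
        System.out.println(encoded.length()); // prints "20"

        // Against a 55-byte stored object, those 5 extra bytes are the
        // ~10% growth estimated above (integer percent shown).
        System.out.println(100 * (encoded.length() - 15) / 55); // prints "9"
    }
}
```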
> Talking about the java pb client? Each thread gets its own socket. If you
> want to do 5 concurrent gets, create 5 threads, pass a pb riak client to
> each, and have each thread do a get; you will get 5 open sockets and 5
> concurrent gets. If those threads do more operations within a second
> (there is a Timer thread reaping connections that are unused for 1
> second), they will reuse the same connection.

I was asking about the Riak side. Our real need is multi-get. We're looking
at regular pulls of 20 random bucket/key values, and we need to minimize
the latency of each of the pulls. I know that we could split this into 20
separate sockets. (Ick... this gets ugly when we're talking about many
simultaneous pulls from multiple servers. I'd rather not create pools of
100 sockets per requesting server if I could avoid it.) I was wondering
whether, if we sent four requests in a row to riak on the same socket, it
would work on them all at once or serially. I'm guessing serially.

> I hope I've gone some way to answering your questions.

Yes, I appreciate the help you're providing. Thank you. We're currently
utilizing LinkedIn's Project Voldemort but were hoping to transition to
Riak, because our primary goal is to minimize disk seeks. I know multi-get
was previously considered and rejected due to "lack of demand". I was
hoping we could utilize the map-reduce functionality to proxy this
functionality.

> I'd like to get your map/reduce query working, and it seems you've hit a
> genuine blind spot in the current API (storing a value with a byte[]
> bucket/key while the MapReduceBuilder requires Strings for bucket/key),
> so I want to find a workaround and make sure that it is in the next
> version of the API. Thanks for your patience: if you could send me some
> code that reproduces your problem (a github gist is ideal for this) then
> it'd make it easier.

I'm sure I can make it work utilizing the path you referenced above and
base85 or base64.
It is a simple return-all-values map job. Based on what you're saying, it
seems like we have three options:

- Use a string-based binary encoding for our bucket/key names (e.g. base64).
- Use a truckload of sockets. (How will Riak perform if we are generating,
  let's say, 200-500 connections per riak node?)
- Try to figure out encoding an erlang version of the map/reduce job using
  JInterface (assuming the MapReduce api supports binary buckets and keys
  when using an erlang content-type).

Am I missing any options? I'm inclined to see if option 3 is viable unless
someone says that is a fool's dream.

Thanks,
Jacques
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com