I want to follow up on the recent "Map phase timeout" thread [2], partly out
of curiosity and partly as a documentation cleanup. Should the documentation
at [1] be changed? Specifically, the docs say MR should be used:

   - *When you know the set of objects you want to MapReduce over (the
   bucket-key pairs)* (emphasis added)
   - When you want to return actual objects or pieces of the object – not
   just the keys, as do Search & Secondary Indexes
   - When you need utmost flexibility in querying your data. MapReduce
   gives you full access to your object and lets you pick it apart any way you
   want.
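
To make the first bullet concrete, here is a rough sketch of a job whose inputs are an explicit list of bucket-key pairs, shaped the way I understand the HTTP MapReduce interface to expect it. The bucket and key names are made up, the helper `build_mapreduce_job` is hypothetical, and I am assuming the built-in `Riak.mapValuesJson` map function is available. It only builds the JSON body and does not talk to a cluster.

```python
import json


def build_mapreduce_job(bucket_key_pairs):
    """Build a MapReduce job body from an explicit list of (bucket, key) pairs.

    Hypothetical helper: it only assembles the JSON structure.
    """
    return {
        # Explicit inputs: the job touches exactly these objects,
        # rather than listing every key in a bucket.
        "inputs": [[bucket, key] for bucket, key in bucket_key_pairs],
        "query": [
            {
                "map": {
                    "language": "javascript",
                    # Assumed built-in that returns each object's value parsed as JSON.
                    "name": "Riak.mapValuesJson",
                    "keep": True,
                }
            }
        ],
    }


# Made-up bucket and keys, for illustration only.
job = build_mapreduce_job([("orders", "order-1001"), ("orders", "order-1002")])
print(json.dumps(job, indent=2))
```

The point of the sketch is only that the caller already knows the keys; nothing in the job asks Riak to go and discover them.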

It seems to me that a lot of discussions around MR in Riak come down to
"You're close but this isn't the best use case of MapReduce in Riak." Would
it be better, for the purposes of a general discussion, to say that
MapReduce is the appropriate paradigm when you want to:

   - manipulate a large amount of data inside the Riak cluster in bulk,
   e.g. read all of my sales orders and, where the version is 1, perform the
   changes necessary to update the order format to version 2
   - burn a lot of I/O and make your admin sad
   - move data from one bucket to another
   - rewrite an entire bucket so all data is indexed for 2i, search, etc.
   - do anything where the query can be resumed with no knowledge of state at
   the time the last run of the query failed
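
As a rough illustration of what I mean by "resumed with no knowledge of state", here is a minimal sketch of the version 1 to version 2 order upgrade written so that re-running it is harmless. It uses a plain dict in place of a bucket, and the `currency` field is a hypothetical stand-in for whatever the real version 2 format would change; none of this is Riak client code.

```python
import copy


def upgrade_order(order):
    """Return a version-2 copy of a version-1 order; anything else is returned as-is."""
    if order.get("version") != 1:
        return order
    upgraded = copy.deepcopy(order)
    upgraded["version"] = 2
    # Hypothetical version-2 change: every order carries a currency.
    upgraded.setdefault("currency", "USD")
    return upgraded


def migrate(store):
    """Upgrade every order in the store; return how many were rewritten."""
    changed = 0
    for key, order in store.items():
        new_order = upgrade_order(order)
        if new_order is not order:
            store[key] = new_order
            changed += 1
    return changed


# A dict standing in for a bucket of orders (made-up data).
orders = {
    "order-1": {"version": 1, "total": 10},
    "order-2": {"version": 2, "total": 5, "currency": "EUR"},
}

first = migrate(orders)   # rewrites only the version-1 order
second = migrate(orders)  # a re-run finds nothing left to do
print(first, second)      # prints: 1 0
```

Because each object carries its own version, a failed run can simply be started again from the top without tracking which keys were already processed.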

Are there other use cases when MR is the better approach?

[1]:
http://docs.basho.com/riak/latest/tutorials/querying/MapReduce/#When-to-Use-MapReduce
[2]:
http://riak.markmail.org/search/?q=#query:+page:1+mid:4o27v64qf55ejzwc+state:results

---
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop
