I think making sure maps are not repeated might not be much of a problem... From this example on their site:

---------------------------------------------------------------
from disco.core import Job, result_iterator

def map(line, params):
  # emit ( word, 1 ) for every word on the input line
  for word in line.split():
    yield word, 1

def reduce(iter, params):
  from disco.util import kvgroup
  # group the sorted pairs by word and sum the counts per word
  for word, counts in kvgroup(sorted(iter)):
    yield word, sum(counts)

if __name__ == '__main__':
  input = ["http://discoproject.org/media/text/chekhov.txt"]
  job = Job().run(input=input, map=map, reduce=reduce)
  for word, count in result_iterator(job.wait()):
    print word, count
---------------------------------------------------------------

I can imagine putting the URL of an index in the __main__ section to get an array of keys, splitting that plain-text array of keys by lines and sending it to the map functions. I deduce from the example that each map call represents one line of the input; in it I would do a Riak GET of that key ( line ) and select the data I want from that object to send to the reduce function...

Example ( sketched in code below ):
RIAK INDEX: get all sales of today
MAP: get a key, check if the customer is a woman between 18 and 25 years old, then return 1 ( anything else returns 0 )
REDUCE: sum all the 1s to give a total counter
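
Something like this is what I have in mind for the map/reduce pair: a minimal, untested sketch in the same Python 2 style as the Disco example above. The bucket name ( "sales" ) and the record fields ( "gender", "age" ) are made up for illustration, and I'm doing plain HTTP GETs against Riak:

---------------------------------------------------------------
import json
import urllib2

RIAK = 'http://127.0.0.1:8098'  # any node of the Riak cluster
BUCKET = 'sales'                # assumed bucket holding the sale objects

def map(line, params):
  # each input line is one Riak key coming from the index query
  key = line.strip()
  sale = json.loads(urllib2.urlopen(
      '%s/buckets/%s/keys/%s' % (RIAK, BUCKET, key)).read())
  # 1 for a woman between 18 and 25, 0 for anything else
  hit = sale.get('gender') == 'female' and 18 <= sale.get('age', -1) <= 25
  yield 'women_18_25', 1 if hit else 0

def reduce(iter, params):
  from disco.util import kvgroup
  # sum all the 1s into one total counter
  for counter, ones in kvgroup(sorted(iter)):
    yield counter, sum(ones)
---------------------------------------------------------------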

Riak's MR can do this too, but from what we saw on the list some time ago, it is not wise to use MR for this kind of operation ( even more so on an on-demand basis ).

I still have to try it and see... but it would be a nice way to do a distributed multi-GET and reduce the data to a result without hammering Riak...
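
Continuing the sketch above, the __main__ part could fetch the key list from Riak's secondary-index HTTP endpoint ( which returns {"keys": [...]} ) and feed the keys to Disco. The date_bin index and the date value are invented, and I still have to confirm that Disco's raw:// input scheme really hands each key straight to a map call:

---------------------------------------------------------------
if __name__ == '__main__':
  from disco.core import Job, result_iterator
  # 2i query: all sales of today ( assumes a date_bin index on each object )
  url = '%s/buckets/%s/index/date_bin/2013-04-17' % (RIAK, BUCKET)
  keys = json.loads(urllib2.urlopen(url).read())['keys']
  # raw:// should hand each key directly to a map call as its "line"
  job = Job().run(input=['raw://%s' % k for k in keys],
                  map=map, reduce=reduce)
  for counter, total in result_iterator(job.wait()):
    print counter, total
---------------------------------------------------------------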

Thanks,
Rohman

On 17.04.2013 16:19, Jens Rantil wrote:

Hi,

I've been following the Disco Project for a couple of years. The tricky part with using Disco with Riak would be to make sure each map phase is not executed multiple times over the same data*. Also, since each map phase would (preferably) run on the same host as its data (for data locality), you would have to make sure to only iterate over data that is associated with the vnodes on that physical host.

If you can easily extract host-specific keys for a specific vnode, then this is doable. However, either the Disco master or the Disco job submitter will need to have all this data when a job is submitted.
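
For what it's worth, the grouping step itself would be trivial; the hard part is the primary_host() lookup below, which I am inventing here since, as far as I know, Riak does not expose the ring/preflist mapping through a public API:

---------------------------------------------------------------
from collections import defaultdict

def group_keys_by_host(keys):
  # bucket the key list per physical host, so each host's map tasks
  # only touch locally stored data (primary_host is hypothetical)
  by_host = defaultdict(list)
  for key in keys:
    by_host[primary_host(key)].append(key)
  return by_host
---------------------------------------------------------------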

Also, I'm not sure that it will help very much that both are written in Erlang.

Some ideas,

Jens

* Obviously, you could also chain your mapreduce jobs in Disco to remove duplicate maps, but this introduces overhead.
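
If I remember the Disco API correctly, such a chain would look roughly like this (untested; dedup_map and dedup_reduce are hypothetical functions that collapse duplicate pairs, and chain_reader is Disco's reader for consuming a previous job's results):

---------------------------------------------------------------
from disco.core import Job
from disco.worker.classic.func import chain_reader

# first pass: collapse duplicate (key, value) pairs
dedup = Job().run(input=inputs, map=dedup_map, reduce=dedup_reduce)
# second pass: the real job, reading the first job's results
final = Job().run(input=dedup.wait(), map_reader=chain_reader,
                  map=real_map, reduce=real_reduce)
---------------------------------------------------------------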

From: riak-users [mailto:riak-users-boun...@lists.basho.com] On Behalf Of Antonio Rohman Fernandez
Sent: 17 April 2013 13:15
To: riak-users@lists.basho.com
Subject: Riak + Disco (MapReduce alternative)

Hello everybody,

Has anyone tried to use Riak with Disco? [ http://discoproject.org ] I was looking for Hadoop alternatives ( as the RIAK-HADOOP connector project seems to be going nowhere ) and I think Disco is quite interesting; moreover, it is written in Erlang, same as Riak. Looks like it would be a good match!

As seen on the mailing list, it seems that Riak's built-in MapReduce is not suitable for many of the queries I would be interested in doing... My idea would be to offload the MapReduce work to a Hadoop ( or Disco, or other ) cluster that would do the GETs on the Riak cluster through an index ( as suggested on this list: do multi-gets instead of MR ) and reduce the data independently. Does anybody have suggestions about this?

Thanks,
Rohman

--
Antonio Rohman Fernandez
CEO, Founder & Lead Engineer
roh...@mahalostudio.com

Projects
MaruBatsu.es
PupCloud.com
Wedding Album

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
