[
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Noble Paul updated SOLR-5069:
-----------------------------
Comment: was deleted
(was: Commit 1571703 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1571703 ]
SOLR-5069 fixing test failure)
> MapReduce for SolrCloud
> -----------------------
>
> Key: SOLR-5069
> URL: https://issues.apache.org/jira/browse/SOLR-5069
> Project: Solr
> Issue Type: New Feature
> Components: SolrCloud
> Reporter: Noble Paul
> Assignee: Noble Paul
>
> Solr currently does not have a way to run long running computational tasks
> across the cluster. We can piggyback on the mapreduce paradigm so that users
> have smooth learning curve.
> * The mapreduce component will be written as a RequestHandler in Solr
> * Works only in SolrCloud mode. (No support for standalone mode)
> * Users can write MapReduce programs in Javascript or Java. First cut would
> be JS ( ? )
> h1. sample word count program
> h2.how to invoke?
> http://host:port/solr/collection-x/mapreduce?map=<map-script>&reduce=<reduce-script>&sink=collectionX
> h3. params
> * map : A javascript implementation of the map program
> * reduce : a Javascript implementation of the reduce program
> * sink : The collection to which the output is written. If this is not passed
> , the request will wait till completion and respond with the output of the
> reduce program and will be emitted as a standard solr response. . If no sink
> is passed the request will be redirected to the "reduce node" where it will
> wait till the process is complete. If the sink param is passed ,the rsponse
> will contain an id of the run which can be used to query the status in
> another command.
> * reduceNode : Node name where the reduce is run . If not passed an arbitrary
> node is chosen
> The node which received the command would first identify one replica from
> each slice where the map program is executed . It will also identify one
> another node from the same collection where the reduce program is run. Each
> run is given an id and the details of the nodes participating in the run will
> be written to ZK (as an ephemeral node).
> h4. map script
> {code:JavaScript}
> var res = $.streamQuery(*:*);//this is not run across the cluster. //Only on
> this index
> while(res.hasMore()){
> var doc = res.next();
> var txt = doc.get(“txt”);//the field on which word count is performed
> var words = txt.split(" ");
> for(i = 0; i < words.length; i++){
> $.map(words[i],{‘count’:1});// this will send the map over to //the
> reduce host
> }
> }
> {code}
> Essentially two threads are created in the 'map' hosts . One for running the
> program and the other for co-ordinating with the 'reduce' host . The maps
> emitted are streamed live over an http connection to the reduce program
> h4. reduce script
> This script is run in one node . This node accepts http connections from map
> nodes and the 'maps' that are sent are collected in a queue which will be
> polled and fed into the reduce program. This also keeps the 'reduced' data in
> memory till the whole run is complete. It expects a "done" message from all
> 'map' nodes before it declares the tasks are complete. After reduce program
> is executed for all the input it proceeds to write out the result to the
> 'sink' collection or it is written straight out to the response.
> {code:JavaScript}
> var pair = $.nextMap();
> var reduced = $.getCtx().getReducedMap();// a hashmap
> var count = reduced.get(pair.key());
> if(count === null) {
> count = {“count”:0};
> reduced.put(pair.key(), count);
> }
> count.count += pair.val().count ;
> {code}
> h4.example output
> {code:JavaScript}
> {
> “result”:[
> “wordx”:{
> “count”:15876765
> },
> “wordy” : {
> “count”:24657654
> }
>
> ]
> }
> {code}
> TBD
> * The format in which the output is written to the target collection, I
> assume the reducedMap will have values mapping to the schema of the collection
>
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]