[
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763527#comment-13763527
]
Phani Chaitanya Vempaty commented on SOLR-1301:
-----------------------------------------------
I wanted to look at the code, but after downloading solr-4.4.0 and applying the
patch I was unable to create the Eclipse project. It complains that there is no
ivy.xml in the solr-mr directory, and it is indeed missing. I created one, and
now everything works. Is ivy.xml missing because this is an initial cut, or am
I doing something wrong? My ivy.xml is below.
{code:xml}
<ivy-module version="2.0">
  <info organisation="org.apache.solr" module="solr-mr"/>
  <dependencies>
    <dependency org="org.apache.hadoop" name="hadoop-common" rev="2.0.5-alpha" transitive="false"/>
    <dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.1" transitive="false"/>
    <dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="2.0.5-alpha" transitive="false"/>
    <dependency org="org.apache.hadoop" name="hadoop-mapreduce" rev="2.0.5-alpha" transitive="false"/>
    <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client" rev="2.0.5-alpha" transitive="false"/>
    <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="2.0.5-alpha" transitive="false"/>
    <dependency org="org.apache.hadoop" name="hadoop-mapred" rev="0.22.0" transitive="false"/>
    <dependency org="org.apache.hadoop" name="hadoop-yarn" rev="2.0.5-alpha" transitive="false"/>
    <dependency org="com.codahale.metrics" name="metrics-core" rev="3.0.1" transitive="false"/>
    <dependency org="com.cloudera.cdk" name="cdk-morphlines-core" rev="0.7.0" transitive="false"/>
    <dependency org="com.cloudera.cdk" name="cdk-morphlines-solr-core" rev="0.7.0" transitive="false"/>
    <dependency org="org.skife.com.typesafe.config" name="typesafe-config" rev="0.3.0" transitive="false"/>
    <dependency org="net.sourceforge.argparse4j" name="argparse4j" rev="0.4.1" transitive="false"/>
    <exclude org="*" ext="*" matcher="regexp" type="${ivy.exclude.types}"/>
  </dependencies>
</ivy-module>
{code}
Though I can now open the Eclipse project and view the code, the project still
has compile errors, which I suspect come mainly from the Hadoop versions I
chose in the ivy.xml above for hadoop-core and the other artifacts (I could not
find a 2.0.5-alpha release under
http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/ - I did not look
into the CDH jars, though). For the same reason I am also unable to compile the
code base after applying this patch.
> Add a Solr contrib that allows for building Solr indexes via Hadoop's
> Map-Reduce.
> ---------------------------------------------------------------------------------
>
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
> Issue Type: New Feature
> Reporter: Andrzej Bialecki
> Assignee: Mark Miller
> Fix For: 4.5, 5.0
>
> Attachments: commons-logging-1.0.4.jar,
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar,
> hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch,
> log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch,
> SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java
>
>
> This patch contains a contrib module that provides distributed indexing
> (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is
> twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of
> OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS.
> SolrOutputFormat consumes data produced by reduce tasks directly, without
> storing it in intermediate files. Furthermore, by using an
> EmbeddedSolrServer, the indexing task is split into as many parts as there
> are reducers, and the data to be indexed is not sent over the network.
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat,
> which in turn uses SolrRecordWriter to write this data. SolrRecordWriter
> instantiates an EmbeddedSolrServer, and it also instantiates an
> implementation of SolrDocumentConverter, which is responsible for turning
> Hadoop (key, value) into a SolrInputDocument. This data is then added to a
> batch, which is periodically submitted to EmbeddedSolrServer. When the reduce
> task completes and the OutputFormat is closed, SolrRecordWriter calls
> commit() and optimize() on the EmbeddedSolrServer.
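> The batching behaviour described above (convert each (key, value) into a
> document, buffer documents into a batch, submit the batch periodically, and
> commit when the writer is closed) can be sketched generically. This is an
> illustration only: {{Doc}}, {{Server}}, and {{Writer}} below are hypothetical
> stand-ins, not the actual Solr or Hadoop classes.
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> public class BatchingWriterSketch {
>     // Stand-in for SolrInputDocument.
>     static class Doc {
>         final String id;
>         Doc(String id) { this.id = id; }
>     }
>
>     // Stand-in for EmbeddedSolrServer: counts submitted docs and
>     // remembers whether commit() was called.
>     static class Server {
>         int submitted = 0;
>         boolean committed = false;
>         void add(List<Doc> batch) { submitted += batch.size(); }
>         void commit() { committed = true; }
>     }
>
>     // Stand-in for SolrRecordWriter: buffers docs, flushes every
>     // batchSize docs, and commits on close, as the design describes.
>     static class Writer {
>         private final Server server;
>         private final int batchSize;
>         private final List<Doc> batch = new ArrayList<>();
>
>         Writer(Server server, int batchSize) {
>             this.server = server;
>             this.batchSize = batchSize;
>         }
>
>         void write(String key, String value) {
>             batch.add(new Doc(key)); // SolrDocumentConverter's role
>             if (batch.size() >= batchSize) flush();
>         }
>
>         private void flush() {
>             if (!batch.isEmpty()) {
>                 server.add(new ArrayList<>(batch));
>                 batch.clear();
>             }
>         }
>
>         void close() { // invoked when the reduce task finishes
>             flush();
>             server.commit();
>         }
>     }
>
>     public static void main(String[] args) {
>         Server server = new Server();
>         Writer writer = new Writer(server, 10);
>         for (int i = 0; i < 25; i++) writer.write("k" + i, "v" + i);
>         writer.close();
>         // All 25 docs reach the server; the final partial batch is
>         // flushed and committed by close().
>         System.out.println(server.submitted + " " + server.committed);
>     }
> }
> {code}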
> The API provides facilities to specify an arbitrary existing solr.home
> directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories
> as there were reduce tasks. The output shards are placed in the output
> directory on the default filesystem (e.g. HDFS). Such part-NNNNN directories
> can be used to run N shard servers. Additionally, users can specify the
> number of reduce tasks, in particular 1 reduce task, in which case the output
> will consist of a single shard.
> An example application is provided that processes large CSV files and uses
> this API. It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this
> issue; you should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor
> and approved for release under Apache License.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]