[
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790411#comment-13790411
]
Mark Miller commented on SOLR-1301:
-----------------------------------
I have a new patch I'm cleaning up that tackles some of the packaging:
* Split out solr-morphlines-core and solr-morphlines-cell into their own
modules.
* Updated to trunk; the new modules now use the new dependency version
tracking system.
* Fixed an issue where the TokenStream contract was being violated - the
latest trunk code detects this and failed a test; end() and close() are now
called (see the sketch after this list).
* Updated to use Morphlines from CDK 0.8.
* Set up the main class in the solr-mr jar manifest.
* Enabled a previously ignored test, which exposed a few bugs related to the
solr.xml now required in Solr 5.0 - those bugs have been addressed.
* Added a missing metrics health-check dependency that somehow popped up.
* I played around with naming the solr-mr artifact MapReduceIndexTool.jar, but
the build system really wants us to follow the artifact naming rules and have
something like solr-solr-mr-5.0.jar. Anything else causes assorted issues, such
as with javadoc, and if the name does not start with solr-, it will be changed
to start with lucene-. I'm not yet sure whether it's worth the trouble to
extend the system or use a different name, so for now it still uses the default
jar name based on the contrib module name (solr-mr).
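
For reference, the consumer contract in question looks roughly like this - an
illustrative sketch of the general pattern only, not the actual patch code:

{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Illustrative TokenStream consumer. The contract requires reset() before the
// first incrementToken(), end() after the last one, and close() when done -
// skipping end()/close() is what the new trunk checks catch.
public class TokenStreamContractExample {
  public static void consume(Analyzer analyzer, String field, String text)
      throws IOException {
    TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    try {
      stream.reset();
      while (stream.incrementToken()) {
        // use termAtt.toString() here
      }
      stream.end();   // records final offset/position state
    } finally {
      stream.close(); // releases resources
    }
  }
}
{code}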
Besides the naming issue, there are a couple of other things to button up:
* How we are going to set up the classpath - via a script, in the jar
manifest, left to the user and documented, etc.
* All dependencies are currently in solr-morphlines-core - this was a simple
way to split out the modules since solr-mr and solr-morphlines-cell depend on
solr-morphlines-core.
Finally, we will probably need some help from [~steve_rowe] to get the Maven
build set up correctly.
I spent a bunch of time trying to use ASM to work around the hacked test policy
issue and ran into multiple problems. One is that another module uses ASM 4.1
while Hadoop brings in ASM 3.1 - if you are doing ASM coding, this can cause
compile issues in your IDE (at least in Eclipse). It also ends up being really
hard to get an injection in the right place because of how the YARN code is
structured. After spending a lot of time trying to get this to work, I'm
backing out and considering other options.
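
The kind of injection attempted was along these lines - a minimal ASM 4.x
sketch with made-up class and method names, not the actual code. (Against the
ASM 3.1 jar this would not even compile, since there ClassVisitor and
MethodVisitor are interfaces rather than classes, which is where the IDE
confusion comes from.)

{code:java}
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Hypothetical ASM 4.x rewriter: injects a static hook call at the start of a
// chosen method. TestPolicyHook and the target method name are invented for
// illustration only.
public class PolicyInjector extends ClassVisitor {
  public PolicyInjector(ClassVisitor delegate) {
    super(Opcodes.ASM4, delegate);
  }

  @Override
  public MethodVisitor visitMethod(int access, String name, String desc,
                                   String signature, String[] exceptions) {
    MethodVisitor mv = super.visitMethod(access, name, desc, signature, exceptions);
    if (!"launchContainer".equals(name)) { // hypothetical injection point
      return mv;
    }
    return new MethodVisitor(Opcodes.ASM4, mv) {
      @Override
      public void visitCode() {
        super.visitCode();
        // call TestPolicyHook.install() at method entry (hypothetical hook)
        visitMethodInsn(Opcodes.INVOKESTATIC,
            "org/example/TestPolicyHook", "install", "()V");
      }
    };
  }

  public static byte[] rewrite(byte[] classfile) {
    ClassReader reader = new ClassReader(classfile);
    ClassWriter writer = new ClassWriter(reader, ClassWriter.COMPUTE_MAXS);
    reader.accept(new PolicyInjector(writer), 0);
    return writer.toByteArray();
  }
}
{code}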
> Add a Solr contrib that allows for building Solr indexes via Hadoop's
> Map-Reduce.
> ---------------------------------------------------------------------------------
>
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
> Issue Type: New Feature
> Reporter: Andrzej Bialecki
> Assignee: Mark Miller
> Fix For: 4.6
>
> Attachments: commons-logging-1.0.4.jar,
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar,
> hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch,
> log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch,
> SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java
>
>
> This patch contains a contrib module that provides distributed indexing
> (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is
> twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of
> OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS.
> SolrOutputFormat consumes data produced by reduce tasks directly, without
> storing it in intermediate files. Furthermore, by using an
> EmbeddedSolrServer, the indexing task is split into as many parts as there
> are reducers, and the data to be indexed is not sent over the network.
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat,
> which in turn uses SolrRecordWriter to write this data. SolrRecordWriter
> instantiates an EmbeddedSolrServer, and it also instantiates an
> implementation of SolrDocumentConverter, which is responsible for turning
> Hadoop (key, value) into a SolrInputDocument. This data is then added to a
> batch, which is periodically submitted to EmbeddedSolrServer. When the
> reduce task completes and the OutputFormat is closed, SolrRecordWriter
> calls commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home
> directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories
> as there were reduce tasks. The output shards are placed in the output
> directory on the default filesystem (e.g. HDFS). Such part-NNNNN directories
> can be used to run N shard servers. Additionally, users can specify the
> number of reduce tasks, in particular 1 reduce task, in which case the output
> will consist of a single shard.
> An example application is provided that processes large CSV files and uses
> this API. It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this
> issue; you should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor
> and approved for release under Apache License.
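
To make the flow in the quoted description a little more concrete, here is a
rough sketch of what a converter implementation might look like. The
SolrDocumentConverter base class and its convert() signature are assumed from
the description above rather than copied from the patch, and the field names
are invented.

{code:java}
import java.util.Collection;
import java.util.Collections;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical converter: turns one (offset, CSV line) pair emitted by a
// reduce task into a single SolrInputDocument, which SolrRecordWriter then
// batches into the EmbeddedSolrServer. The base class and method signature
// are assumptions based on the issue description.
public class CsvDocumentConverter extends SolrDocumentConverter<LongWritable, Text> {
  @Override
  public Collection<SolrInputDocument> convert(LongWritable key, Text value) {
    String[] cols = value.toString().split(",");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", cols[0]);
    doc.addField("text", cols[1]);
    return Collections.singletonList(doc);
  }
}
{code}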