[ https://issues.apache.org/jira/browse/SOLR-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13909623#comment-13909623 ]

Mark Miller commented on SOLR-5758:
-----------------------------------

As a start, here is the help text for the MapReduceIndexerTool:

{noformat}
usage: hadoop [GenericOptions]... jar solr-map-reduce-*.jar
       [--help] --output-dir HDFS_URI [--input-list URI] --morphline-file FILE
       [--morphline-id STRING] [--update-conflict-resolver FQCN] [--mappers INTEGER]
       [--reducers INTEGER] [--max-segments INTEGER] [--fair-scheduler-pool STRING]
       [--dry-run] [--log4j FILE] [--verbose] [--show-non-solr-cloud]
       [--zk-host STRING] [--go-live] [--collection STRING] [--go-live-threads INTEGER]
       [HDFS_URI [HDFS_URI ...]]

MapReduce batch job driver that takes a morphline and creates a set of Solr index shards
from a set of input files and writes the indexes into HDFS, in a flexible, scalable and
fault-tolerant manner. It also supports merging the output shards into a set of live
customer facing Solr servers, typically a SolrCloud. The program proceeds in several
consecutive MapReduce based phases, as follows:

1) Randomization phase: This (parallel) phase randomizes the list of input 
files in order to spread indexing load more evenly among the mappers of the 
subsequent phase.

2) Mapper phase: This (parallel) phase takes the input files, extracts the relevant
content, transforms it and hands SolrInputDocuments to a set of reducers. The ETL
functionality is flexible and customizable using chains of arbitrary morphline commands
that pipe records from one transformation command to another. Commands to parse and
transform a set of standard data formats such as Avro, CSV, Text, HTML, XML, PDF, Word,
Excel, etc. are provided out of the box, and additional custom commands and parsers for
additional file or data formats can be added as morphline plugins. This is done by
implementing a simple Java interface that consumes a record (e.g. a file in the form of
an InputStream plus some headers plus contextual metadata) and generates as output zero
or more records. Any kind of data format can be indexed, any Solr documents for any kind
of Solr schema can be generated, and any custom ETL logic can be registered and executed.
Record fields, including MIME types, can also be passed explicitly from the CLI to the
morphline, for example: hadoop ... -D morphlineField._attachment_mimetype=text/csv

3) Reducer phase: This (parallel) phase loads the mappers' SolrInputDocuments into one
EmbeddedSolrServer per reducer. Each such reducer and Solr server can be seen as a
(micro) shard. The Solr servers store their data in HDFS.

4) Mapper-only merge phase: This (parallel) phase merges the set of reducer shards into
the number of Solr shards expected by the user, using a mapper-only job. This phase is
omitted if the number of reducer shards is already equal to the number of shards expected
by the user.

5) Go-live phase: This optional (parallel) phase merges the output shards of 
the previous phase into a set of live customer facing Solr servers, typically a 
SolrCloud. If this phase is omitted you can explicitly point each Solr server 
to one of the HDFS output shard directories.

Fault Tolerance: Mapper and reducer task attempts are retried on failure per the standard
MapReduce semantics. On program startup all data in the --output-dir is deleted if that
output directory already exists. If the whole job fails you can retry simply by rerunning
the program using the same arguments.

positional arguments:
  HDFS_URI               HDFS URI of file or directory tree to index. (default: [])

optional arguments:
  --help, -help, -h      Show this help message and exit
  --input-list URI       Local URI or HDFS URI of a UTF-8 encoded file containing a
                         list of HDFS URIs to index, one URI per line in the file. If
                         '-' is specified, URIs are read from the standard input.
                         Multiple --input-list arguments can be specified.
  --morphline-id STRING  The identifier of the morphline that shall be executed within
                         the morphline config file specified by --morphline-file. If
                         the --morphline-id option is omitted the first (i.e. top-most)
                         morphline within the config file is used. Example: morphline1
  --update-conflict-resolver FQCN
                         Fully qualified class name of a Java class that implements the
                         UpdateConflictResolver interface. This enables deduplication
                         and ordering of a series of document updates for the same
                         unique document key. For example, a MapReduce batch job might
                         index multiple files in the same job where some of the files
                         contain old and new versions of the very same document, using
                         the same unique document key. Typically, implementations of
                         this interface forbid collisions by throwing an exception, or
                         ignore all but the most recent document version, or, in the
                         general case, order colliding updates ascending from least
                         recent to most recent (partial) update. The caller of this
                         interface (i.e. the Hadoop Reducer) will then apply the updates
                         to Solr in the order returned by the orderUpdates() method.
                         The default RetainMostRecentUpdateConflictResolver
                         implementation ignores all but the most recent document
                         version, based on a configurable numeric Solr field, which
                         defaults to the file_last_modified timestamp.
                         (default: org.apache.solr.hadoop.dedup.RetainMostRecentUpdateConflictResolver)
  --mappers INTEGER      Tuning knob that indicates the maximum number of MR mapper
                         tasks to use. -1 indicates use all map slots available on the
                         cluster. (default: -1)
  --reducers INTEGER     Tuning knob that indicates the number of reducers to index
                         into. -1 indicates use all reduce slots available on the
                         cluster. 0 indicates use one reducer per output shard, which
                         disables the mtree merge MR algorithm. The mtree merge MR
                         algorithm improves scalability by spreading load (in
                         particular CPU load) among a number of parallel reducers that
                         can be much larger than the number of Solr shards expected by
                         the user. It can be seen as an extension of concurrent lucene
                         merges and tiered lucene merges to the clustered case. The
                         subsequent mapper-only phase merges the output of said large
                         number of reducers to the number of shards expected by the
                         user, again by utilizing more available parallelism on the
                         cluster. (default: -1)
  --max-segments INTEGER
                         Tuning knob that indicates the maximum number of segments to
                         be contained on output in the index of each reducer shard.
                         After a reducer has built its output index it applies a merge
                         policy to merge segments until there are <= maxSegments lucene
                         segments left in this index. Merging segments involves reading
                         and rewriting all data in all these segment files, potentially
                         multiple times, which is very I/O intensive and time consuming.
                         However, an index with fewer segments can later be merged
                         faster, and it can later be queried faster once deployed to a
                         live Solr serving shard. Set maxSegments to 1 to optimize the
                         index for low query latency. In a nutshell, a small maxSegments
                         value trades indexing latency for subsequently improved query
                         latency. This can be a reasonable trade-off for batch indexing
                         systems. (default: 1)
  --fair-scheduler-pool STRING
                         Optional tuning knob that indicates the name of the fair
                         scheduler pool to submit jobs to. The Fair Scheduler is a
                         pluggable MapReduce scheduler that provides a way to share
                         large clusters. Fair scheduling is a method of assigning
                         resources to jobs such that all jobs get, on average, an equal
                         share of resources over time. When there is a single job
                         running, that job uses the entire cluster. When other jobs are
                         submitted, task slots that free up are assigned to the new
                         jobs, so that each job gets roughly the same amount of CPU
                         time. Unlike the default Hadoop scheduler, which forms a queue
                         of jobs, this lets short jobs finish in reasonable time while
                         not starving long jobs. It is also an easy way to share a
                         cluster between multiple users. Fair sharing can also work with
                         job priorities - the priorities are used as weights to
                         determine the fraction of total compute time that each job
                         gets.

Generic options supported are
  --conf <configuration file>
                         specify an application configuration file
  -D <property=value>    use value for given property
  --fs <local|namenode:port>
                         specify a namenode
  --jt <local|jobtracker:port>
                         specify a job tracker
  --files <comma separated list of files>
                         specify comma separated files to be copied to the map reduce
                         cluster
  --libjars <comma separated list of jars>
                         specify comma separated jar files to include in the classpath.
  --archives <comma separated list of archives>
                         specify comma separated archives to be unarchived on the
                         compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

Examples: 

# (Re)index an Avro based Twitter tweet file:
sudo -u hdfs hadoop \
  --config /etc/hadoop/conf.cloudera.mapreduce1 \
  jar target/solr-map-reduce-*.jar \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j src/test/resources/log4j.properties \
  --morphline-file ../search-core/src/test/resources/test-morphlines/tutorialReadAvroContainer.conf \
  --solr-home-dir src/test/resources/solr/minimr \
  --output-dir hdfs://c2202.mycompany.com/user/$USER/test \
  --shards 1 \
  hdfs:///user/$USER/test-documents/sample-statuses-20120906-141433.avro

# (Re)index all files that match all of the following conditions:
# 1) File is contained in dir tree hdfs:///user/$USER/solrloadtest/twitter/tweets
# 2) file name matches the glob pattern 'sample-statuses*.gz'
# 3) file was last modified less than 100000 minutes ago
# 4) file size is between 1 MB and 1 GB
# Also include extra library jar file containing JSON tweet Java parser:
hadoop jar target/solr-map-reduce-*.jar com.cloudera.cdk.morphline.hadoop.find.HdfsFindTool \
  -find hdfs:///user/$USER/solrloadtest/twitter/tweets \
  -type f \
  -name 'sample-statuses*.gz' \
  -mmin -1000000 \
  -size -100000000c \
  -size +1000000c \
| sudo -u hdfs hadoop \
  --config /etc/hadoop/conf.cloudera.mapreduce1 \
  jar target/solr-map-reduce-*.jar \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j src/test/resources/log4j.properties \
  --morphline-file ../search-core/src/test/resources/test-morphlines/tutorialReadJsonTestTweets.conf \
  --solr-home-dir src/test/resources/solr/minimr \
  --output-dir hdfs://c2202.mycompany.com/user/$USER/test \
  --shards 100 \
  --input-list -

# Go live by merging resulting index shards into a live Solr cluster
# (explicitly specify Solr URLs - for a SolrCloud cluster see next example):
sudo -u hdfs hadoop \
  --config /etc/hadoop/conf.cloudera.mapreduce1 \
  jar target/solr-map-reduce-*.jar \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j src/test/resources/log4j.properties \
  --morphline-file ../search-core/src/test/resources/test-morphlines/tutorialReadAvroContainer.conf \
  --solr-home-dir src/test/resources/solr/minimr \
  --output-dir hdfs://c2202.mycompany.com/user/$USER/test \
  --shard-url http://solr001.mycompany.com:8983/solr/collection1 \
  --shard-url http://solr002.mycompany.com:8983/solr/collection1 \
  --go-live \
  hdfs:///user/foo/indir

# Go live by merging resulting index shards into a live SolrCloud cluster
# (discover shards and Solr URLs through ZooKeeper):
sudo -u hdfs hadoop \
  --config /etc/hadoop/conf.cloudera.mapreduce1 \
  jar target/solr-map-reduce-*.jar \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j src/test/resources/log4j.properties \
  --morphline-file ../search-core/src/test/resources/test-morphlines/tutorialReadAvroContainer.conf \
  --output-dir hdfs://c2202.mycompany.com/user/$USER/test \
  --zk-host zk01.mycompany.com:2181/solr \
  --collection collection1 \
  --go-live \
  hdfs:///user/foo/indir
{noformat}
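
As additional material for the ref guide, here is an untested sketch that combines several of the tuning and deduplication options documented above in a single invocation. The morphline file name, input path and host names are placeholders; the resolver passed to --update-conflict-resolver is simply the default spelled out to show the flag, and a custom implementation of the UpdateConflictResolver interface would be passed the same way.

{noformat}
# Sketch (untested; placeholder paths and hosts): force the MIME type of every input
# record to text/csv via morphlineField, keep only the most recent version of each
# document key via the default RetainMostRecentUpdateConflictResolver, merge each output
# shard index down to a single segment, and go live against a SolrCloud cluster located
# via ZooKeeper.
sudo -u hdfs hadoop \
  --config /etc/hadoop/conf.cloudera.mapreduce1 \
  jar target/solr-map-reduce-*.jar \
  -D 'mapred.child.java.opts=-Xmx500m' \
  -D morphlineField._attachment_mimetype=text/csv \
  --log4j src/test/resources/log4j.properties \
  --morphline-file mymorphlines/readCsv.conf \
  --update-conflict-resolver org.apache.solr.hadoop.dedup.RetainMostRecentUpdateConflictResolver \
  --max-segments 1 \
  --output-dir hdfs://c2202.mycompany.com/user/$USER/test \
  --zk-host zk01.mycompany.com:2181/solr \
  --collection collection1 \
  --go-live \
  hdfs:///user/$USER/csv-input
{noformat}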

> need ref guide doc on building indexes with mapreduce (morphlines-cell contrib)
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-5758
>                 URL: https://issues.apache.org/jira/browse/SOLR-5758
>             Project: Solr
>          Issue Type: Task
>          Components: documentation
>            Reporter: Hoss Man
>            Assignee: Mark Miller
>             Fix For: 4.8
>
>
> This is marked experimental for 4.7, but we should have a section on it in 
> the ref guide in 4.8


