DIH should be able to read data directly from HDFS for indexing
---------------------------------------------------------------
Key: SOLR-2096
URL: https://issues.apache.org/jira/browse/SOLR-2096
Project: Solr
Issue Type: New Feature
Components: contrib - DataImportHandler
Affects Versions: 1.4.1
Reporter: Amit Nithian
Fix For: 1.4.2
Attachments: hdfs_reader.tar
DIH doesn't support reading from the hdfs:// protocol, which makes it hard to
index data generated by an M/R job. This tarball contains a subclass of
URLDataSource, along with an HDFSReader, that adds this support. The data is
assumed to be in text format so that it can be processed by the
LineEntityProcessor.
Here is an example DIH-Config snippet:
<dataConfig>
  <dataSource name="queryData"
              type="org.apache.solr.handler.dataimport.hdfs.HDFSDataSource"
              baseUrl="hdfs://<YOURSERVER>:9000/" encoding="UTF-8"
              connectionTimeout="5000" readTimeout="10000"/>
  <document name="autoSuggester">
    <entity name="jc" processor="LineEntityProcessor"
            url="<YOUR FOLDER>/part*" dataSource="queryData">
      <!-- Field mappings here if necessary -->
    </entity>
  </document>
</dataConfig>
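As a rough sketch of what the field mappings might look like: the
LineEntityProcessor hands each input line to the entity in a column named
rawLine, so one option is to split that column with the stock
RegexTransformer. The example below assumes each line has the form
key<TAB>value (the default layout of Hadoop's text output); the field names
"query" and "count" and the YOUR_FOLDER placeholder are hypothetical, and this
has not been tested against the attached HDFSDataSource.

```xml
<!-- Sketch only: assumes tab-separated "query<TAB>count" lines.
     The column names "query" and "count" are made up for illustration. -->
<entity name="jc" processor="LineEntityProcessor"
        transformer="RegexTransformer"
        url="YOUR_FOLDER/part*" dataSource="queryData">
  <!-- Each line read by LineEntityProcessor arrives in the rawLine column. -->
  <!-- Capture everything before the first tab. -->
  <field column="query" sourceColName="rawLine" regex="^([^\t]*)\t.*$"/>
  <!-- Capture everything after the first tab. -->
  <field column="count" sourceColName="rawLine" regex="^[^\t]*\t(.*)$"/>
</entity>
```

If the lines need no splitting, the transformer can be dropped and rawLine
mapped to a schema field directly.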