[
https://issues.apache.org/jira/browse/SOLR-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gary Taylor updated SOLR-7174:
------------------------------
Summary: DIH can't use TikaEntityProcessor as inner entity because it's not
capable of re-entry. (was: Can't index a directory of files using DIH with
BinFileDataSource and TikaEntityProcessor)
> DIH can't use TikaEntityProcessor as inner entity because it's not capable of
> re-entry.
> ---------------------------------------------------------------------------------------
>
> Key: SOLR-7174
> URL: https://issues.apache.org/jira/browse/SOLR-7174
> Project: Solr
> Issue Type: Bug
> Components: contrib - DataImportHandler
> Affects Versions: 5.0
> Environment: Windows 7. Ubuntu 14.04.
> Reporter: Gary Taylor
> Labels: dataimportHandler, tika, text-extraction
> Attachments: SOLR-7174.patch
>
>
> I downloaded Solr 5.0.0 on a Windows 7 PC, ran "solr start", and then ran
> "solr create -c hn2" to create a new core.
> I want to index a directory of epub files, so I created a data-import.xml
> (in solr\hn2\conf):
> <dataConfig>
> <dataSource type="BinFileDataSource" name="bin" />
> <document>
> <entity name="files" dataSource="null" rootEntity="false"
> processor="FileListEntityProcessor"
> baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
> onError="skip"
> recursive="true">
> <field column="fileAbsolutePath" name="id" />
> <field column="fileSize" name="size" />
> <field column="fileLastModified" name="lastModified" />
> <entity name="documentImport" processor="TikaEntityProcessor"
> url="${files.fileAbsolutePath}" format="text"
> dataSource="bin" onError="skip">
> <field column="file" name="fileName"/>
> <field column="Author" name="author" meta="true"/>
> <field column="title" name="title" meta="true"/>
> <field column="text" name="content"/>
> </entity>
> </entity>
> </document>
> </dataConfig>
> In my solrconfig.xml, I added a requestHandler entry to reference my
> data-import.xml:
> <requestHandler name="/dataimport"
> class="org.apache.solr.handler.dataimport.DataImportHandler">
> <lst name="defaults">
> <str name="config">data-import.xml</str>
> </lst>
> </requestHandler>
> I renamed managed-schema to schema.xml and ensured the following doc fields
> were set up:
> <field name="id" type="string" indexed="true" stored="true"
> required="true" multiValued="false" />
> <field name="fileName" type="string" indexed="true" stored="true" />
> <field name="author" type="string" indexed="true" stored="true" />
> <field name="title" type="string" indexed="true" stored="true" />
> <field name="size" type="long" indexed="true" stored="true" />
> <field name="lastModified" type="date" indexed="true" stored="true" />
> <field name="content" type="text_en" indexed="false" stored="true"
> multiValued="false"/>
> <field name="text" type="text_en" indexed="true" stored="false"
> multiValued="true"/>
> <copyField source="content" dest="text"/>
> I copied all the jars from dist and contrib\* into server\solr\lib.
> Stopping and restarting Solr then creates a new managed-schema file and
> renames schema.xml to schema.xml.bak.
> All good so far.
> Now I go to the web admin for dataimport
> (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to execute
> a full import.
> But the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed:1",
> i.e. it only adds one document (the very first one) even though it has
> iterated over 58!
> No errors are reported in the logs.
> I can reproduce this on Ubuntu 14.04 using the same steps, so it's not
> Windows-specific.
> -----------------
> If I change the data-import.xml to use FileDataSource and
> PlainTextEntityProcessor and parse txt files, e.g.:
> <dataConfig>
> <dataSource type="FileDataSource" name="bin" />
> <document>
> <entity name="files" dataSource="null" rootEntity="false"
> processor="FileListEntityProcessor"
> baseDir="c:/Users/gt/Documents/epub" fileName=".*txt">
> <field column="fileAbsolutePath" name="id" />
> <field column="fileSize" name="size" />
> <field column="fileLastModified" name="lastModified" />
> <entity name="documentImport"
> processor="PlainTextEntityProcessor"
> url="${files.fileAbsolutePath}" format="text"
> dataSource="bin">
> <field column="plainText" name="content"/>
> </entity>
> </entity>
> </document>
> </dataConfig>
> This works. So it's a combo of BinFileDataSource and TikaEntityProcessor
> that is failing.
> On Windows, I ran Process Monitor, and spotted that only the very first epub
> file is actually being read (repeatedly).
> With verbose and debug on when running the DIH, I get the following response:
> ....
> "verbose-output": [
> "entity:files",
> [
> null,
> "----------- row #1-------------",
> "fileSize",
> 2609004,
> "fileLastModified",
> "2015-02-25T11:37:25.217Z",
> "fileAbsolutePath",
> "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
> "fileDir",
> "c:\\Users\\gt\\Documents\\epub",
> "file",
> "issue018.epub",
> null,
> "---------------------------------------------",
> "entity:documentImport",
> [
> "document#1",
> [
> "query",
> "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
> "time-taken",
> "0:0:0.0",
> null,
> "----------- row #1-------------",
> "text",
> "< ... parsed epub text - snip ... >",
> "title",
> "Issue 18 title",
> "Author",
> "Author text",
> null,
> "---------------------------------------------"
> ],
> "document#2",
> []
> ],
> null,
> "----------- row #2-------------",
> "fileSize",
> 4428804,
> "fileLastModified",
> "2015-02-25T11:37:36.399Z",
> "fileAbsolutePath",
> "c:\\Users\\gt\\Documents\\epub\\issue019.epub",
> "fileDir",
> "c:\\Users\\gt\\Documents\\epub",
> "file",
> "issue019.epub",
> null,
> "---------------------------------------------",
> "entity:documentImport",
> [
> "document#2",
> []
> ],
> null,
> "----------- row #3-------------",
> "fileSize",
> 2580266,
> "fileLastModified",
> "2015-02-25T11:37:41.188Z",
> "fileAbsolutePath",
> "c:\\Users\\gt\\Documents\\epub\\issue020.epub",
> "fileDir",
> "c:\\Users\\gt\\Documents\\epub",
> "file",
> "issue020.epub",
> null,
> "---------------------------------------------",
> "entity:documentImport",
> [
> "document#2",
> []
> ],
> ....
> ....
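
The symptom above (one document for the first file, then empty "document#2" results for every later file) is what one would expect from an inner processor that is not re-entrant. The sketch below is a simplified, hypothetical model of that failure mode and is not Solr or DIH source code: `OneShotProcessor`, `runImport`, and the `resetOnInit` switch are names invented for illustration. It assumes the inner processor keeps a "done" flag that is set after its single row is emitted, and shows what happens when that flag is, or is not, cleared each time the outer entity hands over a new file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReentrySketch {

    /** Hypothetical stand-in for an inner entity processor that emits one row per file. */
    static class OneShotProcessor {
        private final boolean resetOnInit; // illustrative switch: clear state on each init?
        private boolean done = false;      // set once the single row has been emitted
        private String url;

        OneShotProcessor(boolean resetOnInit) {
            this.resetOnInit = resetOnInit;
        }

        /** Called once per outer row (once per file found by the outer entity). */
        void init(String url) {
            this.url = url;
            if (resetOnInit) {
                done = false; // makes the processor usable again for the next file
            }
        }

        /** Returns the single row for the current file, then null to signal the end. */
        String nextRow() {
            if (done) {
                return null;
            }
            done = true;
            return "parsed:" + url;
        }
    }

    /** Models the outer file loop driving one shared inner processor instance. */
    static List<String> runImport(List<String> files, boolean resetOnInit) {
        OneShotProcessor inner = new OneShotProcessor(resetOnInit);
        List<String> docs = new ArrayList<>();
        for (String file : files) {
            inner.init(file);
            String row;
            while ((row = inner.nextRow()) != null) {
                docs.add(row);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("issue018.epub", "issue019.epub", "issue020.epub");
        System.out.println("without reset: " + runImport(files, false).size() + " document(s)");
        System.out.println("with reset:    " + runImport(files, true).size() + " document(s)");
    }
}
```

In this model, the run without a reset yields a document only for the first file, while every file is still iterated by the outer loop, which is consistent with the "Fetched: 58, Processed:1" counts reported above. Whether the real processor fails for exactly this reason is for the attached patch to confirm.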