Apologies in advance if this question has already been answered. I have
scoured the docs, mail archives, and the web looking for an answer, with no
luck. I am sure I am just being dense or missing something obvious, so please
point out my stupidity; my head hurts from trying to get this working.
Solr 3.1
Java 1.6
Eclipse/Tomcat 7/Maven 2.x
Goal: extract manufacturer names from a repeating list of keywords, each
tagged with a Category (one of which is "Manufacturer"), and load them into a
MsgKeywordMF field (see the XML below).
I have XML files that I am loading via DIH. This is an abbreviated example of
the XML data (each file has repeating "Report" items, and each report has
repeating MsgSet, Msg, MsgList, etc. items). Notice the nested repeating
groups, namely MsgItems, within each document (Report):
<Report>
  <ReportMeta>
    <ReportDate>02/22/2011</ReportDate>
    …
  </ReportMeta>
  <MsgSet>
    <Msg>
      <SourceDocID>http://someurl.com/path/to/doc</SourceDocID>
      …
      <DocumentText>........blah blah</DocumentText>
      <MsgList>
        <MsgItem>
          <MsgType>SomeType</MsgType>
          <Category>Location</Category>
          <Keyword>USA</Keyword>
        </MsgItem>
        <MsgItem>
          <MsgType>AnotherType</MsgType>
          <Category>Manufacturer</Category>
          <Keyword>Apple</Keyword>
        </MsgItem>
        …
      </MsgList>
    </Msg>
  </MsgSet>
</Report>
<Report>
  …
</Report>
<Report>
  …
</Report>
…
Here is my data-config.xml:
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="fileload" rootEntity="false"
            processor="FileListEntityProcessor" fileName="^.*\.xml$"
            recursive="false" baseDir="/files/xml/">
      <entity name="report"
              rootEntity="true" pk="id"
              url="${fileload.fileAbsolutePath}"
              processor="XPathEntityProcessor"
              forEach="/Report/MsgSet/Msg" onError="skip"
              transformer="DateFormatTransformer,RegexTransformer">
        <field column="DocumentText" xpath="/Report/MsgSet/Msg/DocumentText"/>
        <field column="id" xpath="/Report/MsgSet/Msg/SourceDocID"/>
        <field column="MsgCategory"
               xpath="/Report/MsgSet/Msg/MsgList/MsgItem/Category" />
        <field column="MsgKeyword"
               xpath="/Report/MsgSet/Msg/MsgList/MsgItem/Keyword" />
        <field column="MsgKeywordMF"
               xpath="/Report/MsgSet/Msg/MsgList/MsgItem[Category='Manufacturer']/Keyword" />
        …
      </entity>
    </entity>
  </document>
</dataConfig>
As seen in my config and sample data above, I am extracting the repeating
"Keywords" into the MsgKeyword field. The part that does NOT work is
extracting into a separate field just the keywords that have a "Category" of
"Manufacturer":
<field column="MsgKeywordMF"
       xpath="/Report/MsgSet/Msg/MsgList/MsgItem[Category='Manufacturer']/Keyword" />
I have also tried:
<field column="MsgKeywordMF"
       xpath="/Report/MsgSet/Msg/MsgList/MsgItem[@Category='Manufacturer']/Keyword" />
…after changing the "Category" to an attribute of MsgItem (<MsgItem
Category="Location">), but it too fails to match.
I have tested my XPath expressions against my XML data file using various
XPath evaluator tools, such as the one in Eclipse, and they match perfectly,
but I can't get them to match during import.
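For reference, a minimal sketch of this kind of standalone check. It assumes Python's standard xml.etree.ElementTree rather than the Eclipse evaluator mentioned above, and the sample strings are the abbreviated XML from this post with the elided parts left out. The names SAMPLE, SAMPLE_ATTR, and find_keywords are hypothetical and exist only for this illustration. It only shows that both predicates select the intended Keyword under a general-purpose XPath evaluator; it says nothing about which XPath features DIH's XPathEntityProcessor accepts.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample from the post: Category as a child element of MsgItem.
SAMPLE = """
<Report><MsgSet><Msg>
  <SourceDocID>http://someurl.com/path/to/doc</SourceDocID>
  <MsgList>
    <MsgItem>
      <MsgType>SomeType</MsgType>
      <Category>Location</Category>
      <Keyword>USA</Keyword>
    </MsgItem>
    <MsgItem>
      <MsgType>AnotherType</MsgType>
      <Category>Manufacturer</Category>
      <Keyword>Apple</Keyword>
    </MsgItem>
  </MsgList>
</Msg></MsgSet></Report>
"""

# Variant from the post: Category moved to an attribute of MsgItem.
SAMPLE_ATTR = """
<Report><MsgSet><Msg>
  <MsgList>
    <MsgItem Category="Location"><Keyword>USA</Keyword></MsgItem>
    <MsgItem Category="Manufacturer"><Keyword>Apple</Keyword></MsgItem>
  </MsgList>
</Msg></MsgSet></Report>
"""


def find_keywords(xml_text, item_predicate):
    """Return the Keyword text of every MsgItem matching the given predicate."""
    root = ET.fromstring(xml_text)  # root element is <Report>
    # ElementTree paths are relative to the root, so "/Report" becomes ".".
    path = "./MsgSet/Msg/MsgList/MsgItem" + item_predicate + "/Keyword"
    return [keyword.text for keyword in root.findall(path)]


if __name__ == "__main__":
    # Child-element predicate, as in the first MsgKeywordMF attempt.
    print(find_keywords(SAMPLE, "[Category='Manufacturer']"))
    # Attribute predicate, as in the second attempt.
    print(find_keywords(SAMPLE_ATTR, "[@Category='Manufacturer']"))
```

Both calls select only the Manufacturer keyword from the sample, which matches what the evaluator tools report for the real files.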
As far as I understand it, DIH does not support nested/correlated entities,
at least not with XML data sources using nested entity tags. I've tried
without success to nest entities, but I can't "correlate" the nested entity
with the parent. I think the approach I'm trying should work, but no luck so
far.
BTW, I can't easily change the xml format, although it is possible with some
pain…
Any ideas?
TIA,
-- Eric