I posted this on the solr-user forum but have received no replies, so I thought I would try here next. Thanks.
I'm processing a zip file that contains an XML file. The TikaEntityProcessor opens the zip and reads the file, but it strips the XML tags even though I have supplied the htmlMapper="identity" attribute. It preserves any HTML contained in a CDATA section but seems to strip the other XML tags. Is this due to the recursive nature of opening the zip file, so that the identity value is somehow lost? My understanding is that this should work in this version, 4.8. Thanks.

Below is my config info.

<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <entity name="kmlfiles" dataSource="null" rootEntity="false"
            baseDir="mydirectory" fileName=".*\.kmz$" onError="skip"
            processor="FileListEntityProcessor" recursive="false">
      <field defs........................ />
      <entity name="kmlImport" processor="TikaEntityProcessor"
              datasource="kmlfiles" htmlMapper="identity" format="xml"
              transformer="TemplateTransformer"
              url="${kmlfiles.fileAbsolutePath}" recursive="true">
        <more field defs.... />
        <entity name="xml" processor="XPathEntityProcessor" ForEach="/kml"
                dataSource="fds" dataField="kmlImport.text">
          <field xpath="//name" column="name" />
          ...more field defs
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>

Note that it does wrap my data in HTML, but only after it has stripped all my XML tags. So the data I am interested in parsing, which would be

<name>something</name>
<description>something</description>
<coordinates>12345,12345,0</coordinates>

ends up like

<p>/n something /t/n something /n 12345,12345,0 ....etc.

--
View this message in context: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-stripping-all-xml-tags-tp4160432.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
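
To make the goal concrete, here is a minimal standalone Python sketch of the extraction I am hoping to end up with. It is not Solr, DIH, or Tika code, and it does not explain the tag stripping; it only shows the name, description, and coordinates values being pulled out while the XML structure is still intact. The function names extract_kml_fields and local_name are hypothetical, and the sketch assumes the .kmz is an ordinary zip archive holding one or more .kml members.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag name, if present."""
    return tag.rsplit("}", 1)[-1]


def extract_kml_fields(kmz_bytes, wanted=("name", "description", "coordinates")):
    """Return (tag, text) pairs, in document order, for the wanted elements
    found in every .kml member of a KMZ archive given as bytes."""
    results = []
    with zipfile.ZipFile(io.BytesIO(kmz_bytes)) as archive:
        for member in archive.namelist():
            # Assumption: only members ending in .kml hold the data of interest.
            if not member.lower().endswith(".kml"):
                continue
            root = ET.fromstring(archive.read(member))
            for element in root.iter():
                tag = local_name(element.tag)
                if tag in wanted:
                    results.append((tag, (element.text or "").strip()))
    return results
```

Calling extract_kml_fields(open(path, "rb").read()) on one of the files would give a list of pairs such as ("name", "something"), which is roughly what I would like the XPathEntityProcessor step to see.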