Re: Metadata and HTML ending up in searchable text

2016-06-02 Thread Simon Blandford
"ignored"? This would strip out the attr_ fields so they wouldn't even be indexed...if you don't want them. As for the HTML file, it looks like Tika is failing to strip out the style section. Try running the file alone with tika-app: java -jar tika-app.jar -t inputfi

Re: Metadata and HTML ending up in searchable text

2016-06-01 Thread Simon Blandford
be indexed...if you don't want them. As for the HTML file, it looks like Tika is failing to strip out the style section. Try running the file alone with tika-app: java -jar tika-app.jar -t inputfile.html. If you are finding the noise there, please open an issue on our JIRA: https://issues.
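
For comparison with the tika-app output, here is a minimal sketch of what "stripping the style section" means in practice. It is not Tika's implementation and not something proposed in the thread. It uses only Python's standard html.parser module, and the names VisibleTextExtractor and visible_text are hypothetical helpers introduced for illustration. If CSS rules show up in the tika-app text but not in the output of a filter like this, that would suggest the noise comes from the extraction step rather than from the file itself.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <style> or <script>.

    Hypothetical helper for illustration only; this is not how Tika works.
    """

    SKIPPED_TAGS = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside <style>/<script>.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    """Return the text of an HTML string without style or script content."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling visible_text() on the contents of inputfile.html and comparing the result with the output of the tika-app command above should show whether the style rules are the only source of the junk text.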

Re: Metadata and HTML ending up in searchable text

2016-05-31 Thread Simon Blandford
ilar. Not very helpful, I know. Regards, Alex. Newsletter and resources for Solr beginners and intermediates: http://www.solr-start.com/ On 27 May 2016 at 23:48, Simon Blandford wrote: Hi Timothy, Thanks for responding. java -jar tika-app-1.13.jar -t "/home/user/Documents

Re: Metadata and HTML ending up in searchable text

2016-05-27 Thread Simon Blandford
his would strip out the attr_ fields so they wouldn't even be indexed...if you don't want them. As for the HTML file, it looks like Tika is failing to strip out the style section. Try running the file alone with tika-app: java -jar tika-app.jar -t inputfile.html. If you are finding

Metadata and HTML ending up in searchable text

2016-05-26 Thread Simon Blandford
Hi, I am using Solr 6.0 on Ubuntu 14.04. I am ending up with loads of junk in the text body. The JSON entry output of a search result shows the indexed text starting with... body_txt_en: " stream_size 36499 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By" An