[
https://issues.apache.org/jira/browse/SOLR-7027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eric Pugh resolved SOLR-7027.
-----------------------------
Fix Version/s: (was: 5.2)
(was: 6.0)
Resolution: Won't Fix
In Solr 10 we are leveraging either Tika Server (running in its own separate
server process) or possibly Tika Pipes (again, running in a separate JVM).
Please revalidate your issue against Solr 10 with one of those options. If
the need is still present, I'm happy to work with you on a fix using the new
approach for Tika.
> ExtractingRequestHandler indiscriminately dumps all source HTML attributes
> into the catch-all field when captureAttr=false, but it should be more
> selective, something like only href, title, alt, etc. attributes
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-7027
> URL: https://issues.apache.org/jira/browse/SOLR-7027
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
> Affects Versions: 5.0
> Reporter: Steven Rowe
> Priority: Minor
>
> On line 283 in {{SolrContentHandler}}, the catch-all field gets *all* source
> HTML attribute values dumped into it:
> {code:java}
> 270: @Override
> 271: public void startElement(String uri, String localName, String qName,
> Attributes attributes) throws SAXException {
> 272: StringBuilder theBldr = fieldBuilders.get(localName);
> 273: if (theBldr != null) {
> 274: //we need to switch the currentBuilder
> 275: bldrStack.add(theBldr);
> 276: }
> 277: if (captureAttribs == true) {
> 278: for (int i = 0; i < attributes.getLength(); i++) {
> 279: addField(localName, attributes.getValue(i), null);
> 280: }
> 281: } else {
> 282: for (int i = 0; i < attributes.getLength(); i++) {
> 283: bldrStack.getLast().append(' ').append(attributes.getValue(i));
> 284: }
> 285: }
> 286: bldrStack.getLast().append(' ');
> 287: }
> {code}
> But this will contain lots of unwanted cruft: {{class}} and {{style}}
> attribute values, etc.
> It would be much better if only attribute values containing addresses or
> tooltip text, etc. were dumped into the catch-all field. Here are a couple
> of places where these kinds of attributes are described:
> http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html#includeAttribute(net.htmlparser.jericho.StartTag,%20net.htmlparser.jericho.Attribute)
> From Tika's {{HtmlHandler}} class:
> {code:java}
> // List of attributes that need to be resolved.
> private static final Set<String> URI_ATTRIBUTES =
> new HashSet<String>(Arrays.asList("src", "href", "longdesc", "cite"));
> {code}
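A minimal sketch of the selective behavior requested above, assuming an allow-list built from Tika's URI_ATTRIBUTES plus the title and alt attributes named in the issue summary. The class name, method names, and the allow-list itself are hypothetical; neither Solr nor Tika ships this helper, and it only illustrates how the loop at line 283 could skip attributes such as class and style.

```java
import java.util.Locale;
import java.util.Set;

import org.xml.sax.Attributes;

/** Hypothetical helper for illustration only; not part of Solr or Tika. */
final class SelectiveAttributeFilter {

    // Assumed allow-list: Tika's URI_ATTRIBUTES plus "title" and "alt".
    private static final Set<String> INDEXABLE_ATTRIBUTES =
        Set.of("src", "href", "longdesc", "cite", "title", "alt");

    private SelectiveAttributeFilter() {}

    /** Returns true if the attribute name is on the allow-list (case-insensitive). */
    static boolean isIndexable(String attributeName) {
        return attributeName != null
            && INDEXABLE_ATTRIBUTES.contains(attributeName.toLowerCase(Locale.ROOT));
    }

    /**
     * Appends the values of allow-listed attributes to the builder,
     * each preceded by a space, and ignores every other attribute.
     */
    static void appendIndexable(Attributes attributes, StringBuilder out) {
        for (int i = 0; i < attributes.getLength(); i++) {
            String name = attributes.getLocalName(i);
            // The local name can be empty when the parser is not namespace-aware.
            if (name == null || name.isEmpty()) {
                name = attributes.getQName(i);
            }
            if (isIndexable(name)) {
                out.append(' ').append(attributes.getValue(i));
            }
        }
    }
}
```

Under these assumptions, the else branch of startElement would call something like appendIndexable(attributes, bldrStack.getLast()) in place of the unconditional loop, so only address-like and tooltip-like values reach the catch-all field.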
--
This message was sent by Atlassian Jira
(v8.20.10#820010)