[ 
https://issues.apache.org/jira/browse/SOLR-7027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Pugh resolved SOLR-7027.
-----------------------------
    Fix Version/s:     (was: 5.2)
                       (was: 6.0)
       Resolution: Won't Fix

In Solr 10 we are leveraging either Tika Server (running in its own separate 
server process) or maybe Tika Pipes (again, running in a separate JVM).   
Please revalidate your issue against Solr 10 with one of those options, and if 
it is still a present need, I am happy to work with you on a fix using the new 
approach for Tika.

> ExtractingRequestHandler indiscriminately dumps all source HTML attributes 
> into the catch-all field when captureAttr=false, but it should be more 
> selective, something like only href, title, alt, etc. attributes
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7027
>                 URL: https://issues.apache.org/jira/browse/SOLR-7027
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 5.0
>            Reporter: Steven Rowe
>            Priority: Minor
>
> On line 283 in {{SolrContentHandler}}, the catch-all field gets *all* source 
> HTML attribute values dumped into it:
> {code:java}
> 270:  @Override
> 271:  public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
> 272:    StringBuilder theBldr = fieldBuilders.get(localName);
> 273:    if (theBldr != null) {
> 274:      //we need to switch the currentBuilder
> 275:      bldrStack.add(theBldr);
> 276:    }
> 277:    if (captureAttribs == true) {
> 278:      for (int i = 0; i < attributes.getLength(); i++) {
> 279:        addField(localName, attributes.getValue(i), null);
> 280:      }
> 281:    } else {
> 282:      for (int i = 0; i < attributes.getLength(); i++) {
> 283:        bldrStack.getLast().append(' ').append(attributes.getValue(i));
> 284:      }
> 285:    }
> 286:    bldrStack.getLast().append(' ');
> 287:  }
> {code}
> But this will contain lots of unwanted cruft: {{class}} and {{style}} 
> attribute values, etc.
> It would be much better if only attribute values containing addresses or 
> tooltip text, etc. were dumped into the catch-all field.  Here are a couple 
> of places where these kinds of attributes are described:
> http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html#includeAttribute(net.htmlparser.jericho.StartTag,%20net.htmlparser.jericho.Attribute)
> From Tika's {{HtmlHandler}} class:
> {code:java}
>     // List of attributes that need to be resolved.
>     private static final Set<String> URI_ATTRIBUTES =
>         new HashSet<String>(Arrays.asList("src", "href", "longdesc", "cite"));
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
