[jira] [Commented] (LUCENE-5943) HTML strip filter removes text between < and >

Steve Rowe (JIRA) Tue, 16 Sep 2014 07:42:44 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135540#comment-14135540
 ]


Steve Rowe commented on LUCENE-5943:
------------------------------------

[~sulemanmubarik], I can't reproduce the behavior you're seeing.

This test, added to {{o.a.l.analysis.common.HTMLStripCharFilterTest}}, succeeds 
on both {{trunk/}} and {{tags/lucene_solr_4_8_0/}}:

{code:java}
  public void testTrailingTag() throws Exception {
    String html = "I love <pizza hut>";
    String gold = "I love \n";
    assertHTMLStripsTo(html, gold, null);

    html = "I feel conflicted about <html>";
    gold = "I feel conflicted about \n";
    assertHTMLStripsTo(html, gold, null);
  }
{code}
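
For reference, the offsets quoted in the report are what one would expect if the whole {{<pizza  hut>}} span is dropped and the remaining tokens keep their positions in the original input. Below is a minimal plain-Java sketch of that arithmetic. It does not use Lucene at all: the class {{StrippedOffsets}} and its {{offsets}} method are hypothetical names, and the tag handling (everything from {{<}} through {{>}} is treated as a separator) is a crude stand-in for the real char filter, not a description of how it works internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: a rough approximation only.
class StrippedOffsets {

    /**
     * Returns "(start,end)" pairs, in original-input coordinates, for
     * whitespace-separated tokens that lie outside any <...> span.
     */
    static List<String> offsets(String input) {
        List<String> out = new ArrayList<>();
        boolean inTag = false;
        int start = -1; // start of the current token, or -1 if none is open

        // Iterate one position past the end so the last token gets closed.
        for (int i = 0; i <= input.length(); i++) {
            char c = i < input.length() ? input.charAt(i) : ' ';
            if (c == '<') {
                inTag = true;
            }
            // Assumption: tag contents behave like whitespace for tokenizing.
            boolean separator = inTag || Character.isWhitespace(c);
            if (!separator && start < 0) {
                start = i;
            } else if (separator && start >= 0) {
                out.add("(" + start + "," + i + ")");
                start = -1;
            }
            if (c == '>') {
                inTag = false;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Note the two spaces between "pizza" and "hut", as in the report.
        System.out.println(offsets("I love <pizza  hut> so much"));
        // prints [(0,1), (2,6), (20,22), (23,27)]
    }
}
```

The two spaces inside the tag matter here: with them, "so" starts at offset 20, which matches the offsets given in the issue description.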


> HTML strip filter removes text between < and >
> ----------------------------------------------
>
>                 Key: LUCENE-5943
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5943
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/index
>         Environment: Production
>            Reporter: suleman mubarik
>
> If I have this as input: “I love <pizza  hut> so much”
> When I apply the HTML strip filter, it removes “pizza  hut” and I get the tokens "i", 
> "love", "so", "much".
> These are the offsets I get back: ((0,1), (2,6), (20,22), (23,27)).
> The HTML strip filter should return "i", "love", "pizza", "hut", "so", "much".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
