[ 
https://issues.apache.org/jira/browse/LUCENE-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-5211:
-----------------------------

    Component/s:     (was: core/search)
    Description: 
StopFilterFactory supports a "format" option for controlling whether 
"getWordSet" or "getSnowballWordSet" is used to parse the file, but this option 
is not advertised. People can be confused by the example stopword files 
included in the releases (some of which are in the snowball format w/ "|" 
comments), try to use them w/o explicitly specifying {{format="snowball"}}, 
and silently get useless stopwords (that include the "| comments" as literal 
portions of the stopwords).

We need to better document the use of "format" and consider updating all of the 
example stopword files we ship that are in the snowball format with a note 
about the need to use {{format="snowball"}} with those files.
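
To make the difference concrete, here is a rough sketch of the two parsing styles applied to a line such as "ceci           |  this". It is a simplified stand-in written for illustration only: the class and method names ({{StopwordFormatSketch}}, {{parsePlain}}, {{parseSnowball}}) are hypothetical, it does not call Lucene, and the real loaders may differ in details such as how blank lines are handled.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified illustration of the two parsing styles; NOT Lucene's own code.
// All names here are hypothetical.
final class StopwordFormatSketch {

    // Plain style: each line, trimmed, becomes one entry.
    // A "|" comment is not recognized, so it stays part of the entry.
    static Set<String> parsePlain(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Snowball style: drop everything from the first '|' onward,
    // then split what remains on whitespace.
    static Set<String> parseSnowball(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            int comment = line.indexOf('|');
            String content = comment >= 0 ? line.substring(0, comment) : line;
            for (String word : content.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                " | French stop words",
                "ceci           |  this");
        System.out.println("plain:    " + parsePlain(lines));
        System.out.println("snowball: " + parseSnowball(lines));
    }
}
```

With the plain style the whole line, comment included, ends up as a single "stopword" that will never match a token; with the snowball style only "ceci" is kept.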

{panel:title=Initial Bug Report}

The StopFilterFactory builds a CharArraySet directly from the raw lines of the 
supplied words file. This causes a problem when using the stop word files 
supplied with the Solr/Lucene distribution. In particular, the comments in 
those files get added to the CharArraySet. A line like this...

ceci           |  this

Should result in the string "ceci" being added to the CharArraySet, but "ceci   
        |  this" is what actually gets added.

Workaround: Remove all comments from stop word files you are using.

Suggested fix: The StopFilterFactory should strip any comments, then strip 
trailing whitespace. The stop word files supplied with the distribution should 
be edited to conform to the supported comment format.
{panel}

  was:
The StopFilterFactory builds a CharArraySet directly from the raw lines of the 
supplied words file. This causes a problem when using the stop word files 
supplied with the Solr/Lucene distribution. In particular, the comments in 
those files get added to the CharArraySet. A line like this...

ceci           |  this

Should result in the string "ceci" being added to the CharArraySet, but "ceci   
        |  this" is what actually gets added.

Workaround: Remove all comments from stop word files you are using.

Suggested fix: The StopFilterFactory should strip any comments, then strip 
trailing whitespace. The stop word files supplied with the distribution should 
be edited to conform to the supported comment format.

       Priority: Minor  (was: Major)
       Assignee: Hoss Man
        Summary: StopFilterFactory docs do not advertise/explain the "format" 
option  (was: StopFilterFactory does not honor comments)
    
> StopFilterFactory docs do not advertise/explain the "format" option
> -------------------------------------------------------------------
>
>                 Key: LUCENE-5211
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5211
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.2
>            Reporter: Hayden Muhl
>            Assignee: Hoss Man
>            Priority: Minor
