[ 
https://issues.apache.org/jira/browse/LUCENE-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-5211:
-----------------------------

    Attachment: LUCENE-5211.code.patch
                LUCENE-5211.stopfilecomments.patch

Two patches, split to make review easier:

* A patch that improves the StopFilterFactory javadocs to mention the format 
param and improves the error handling for that param (includes tests).
* A patch that updates all the snowball-formatted files with a comment pointing 
out the need to use format="snowball" with those files.

FWIW: the second patch was generated by the following perl script...

{code}
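#!/usr/bin/perl -i -n
# -i edits each file in place; -n runs the body once per input line.

# Keep the note on one line: a newline inside q{} would be written to the file.
my $msg = q{NOTE: To use this file with StopFilterFactory, you must specify format="snowball"};
print $_;    # always echo the original line unchanged
# Right after the existing notice line, append a spacer and the note as "|" comments.
if (m/This notice was added\./) {
    print " |\n | $msg\n";
}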
{code}

Run as...
{{find -name \*.txt | xargs grep -l "This notice was added" | xargs ~/tmp/lucene5211.note.in.snowballfiles.pl}}

                
> StopFilterFactory docs do not advertise/explain the "format" option
> -------------------------------------------------------------------
>
>                 Key: LUCENE-5211
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5211
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.2
>            Reporter: Hayden Muhl
>            Assignee: Hoss Man
>            Priority: Minor
>         Attachments: LUCENE-5211.code.patch, 
> LUCENE-5211.stopfilecomments.patch
>
>
> StopFilterFactory supports a "format" option for controlling whether 
> "getWordSet" or "getSnowballWordSet" is used to parse the file, but this 
> option is not advertised. People can be confused by the example stopword 
> files included in the releases (some of which are in the snowball format 
> w/ "|" comments): they try to use them w/o explicitly specifying 
> {{format="snowball"}} and silently get useless stopwords (ones that include 
> the "| comments" as literal portions of the stopwords).
> We need to better document the use of "format" and consider updating all of 
> the example stopword files we ship that are in the snowball format with a 
> note about the need to use {{format="snowball"}} with those files.
> {panel:title=Initial Bug Report}
> The StopFilterFactory builds a CharArraySet directly from the raw lines of 
> the supplied words file. This causes a problem when using the stop word files 
> supplied with the Solr/Lucene distribution. In particular, the comments in 
> those files get added to the CharArraySet. A line like this...
> ceci           |  this
> Should result in the string "ceci" being added to the CharArraySet, but "ceci 
>           |  this" is what actually gets added.
> Workaround: Remove all comments from stop word files you are using.
> Suggested fix: The StopFilterFactory should strip any comments, then strip 
> trailing whitespace. The stop word files supplied with the distribution 
> should be edited to conform to the supported comment format.
> {panel}
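
To illustrate the difference described above, here is a minimal, self-contained 
Java sketch. It is not Lucene's code: the class StopwordFormats and the methods 
parsePlain and parseSnowball are hypothetical stand-ins, written on the 
assumption that the default format keeps each (trimmed) line as one entry while 
the snowball format drops everything from the first "|" onward and splits the 
remainder on whitespace.

```java
import java.util.ArrayList;
import java.util.List;

public class StopwordFormats {

    // Hypothetical stand-in for the default behavior in the bug report:
    // every non-empty line becomes one entry, so "|" comments stay in the word.
    public static List<String> parsePlain(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Hypothetical stand-in for snowball-style parsing: drop everything from
    // the first "|" onward, then split what is left on whitespace.
    public static List<String> parseSnowball(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            int bar = line.indexOf('|');
            String content = (bar >= 0 ? line.substring(0, bar) : line).trim();
            if (content.isEmpty()) {
                continue; // comment-only or blank line
            }
            for (String word : content.split("\\s+")) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(" | a comment-only line", "ceci           |  this");
        System.out.println("plain:    " + parsePlain(lines));
        System.out.println("snowball: " + parseSnowball(lines));
    }
}
```

With the sample line from the bug report, the plain variant keeps the whole 
line, comment included, as one entry, while the snowball variant yields only 
"ceci". That is why these files need {{format="snowball"}}.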

