[jira] Commented: (SOLR-1860) improve stopwords list handling

Lance Norskog (JIRA) Fri, 20 Aug 2010 20:47:43 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900969#action_12900969
 ]


Lance Norskog commented on SOLR-1860:
-------------------------------------

This is a nice piece of work. One thing I've learned is that configurations 
should be as flat and transparent as possible. Pushing all of these word lists 
out of the classes and into files is a great improvement.  The Greek Analyzer, 
for example, is (was) nothing but a default list of stopwords.

But having the stopwords as text files runs smack into character-encoding 
wackiness (why, yes, I do use Windows). Can the file format or importer at 
least support the XML or URL notations for Unicode characters? For example, a 
list of words could include prot&#x00E9;g&#x00E9; for protégé.
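
To make the request concrete, here is a minimal sketch of how an importer 
might expand XML numeric character references in each stopword line before 
adding it to the set. The class and method names (StopwordEntityDecoder, 
decode) are hypothetical and not part of Solr or Lucene; the sketch handles 
only numeric references such as &#233; and &#xE9;, not named entities or URL 
percent-escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an existing Solr/Lucene class.
final class StopwordEntityDecoder {
    // Matches hexadecimal (&#xE9;) and decimal (&#233;) numeric character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    /** Replaces each numeric character reference in the line with the character it names. */
    static String decode(String line) {
        Matcher m = NUMERIC_REF.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean hex = m.group(1) != null;
            int codePoint = Integer.parseInt(hex ? m.group(1) : m.group(2), hex ? 16 : 10);
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group(); // leave out-of-range references untouched
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this in place, StopwordEntityDecoder.decode("prot&#x00E9;g&#x00E9;") 
yields the word with both accented characters, so the list file itself can 
stay pure ASCII regardless of the editor or platform encoding.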


> improve stopwords list handling
> -------------------------------
>
>                 Key: SOLR-1860
>                 URL: https://issues.apache.org/jira/browse/SOLR-1860
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.1
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>         Attachments: SOLR-1860.patch
>
>
> Currently Solr makes it easy to use English stopwords for StopFilter or 
> CommonGramsFilter.
> Recently in Lucene, we added stopword lists (mostly, but not all, from 
> Snowball) to all the language analyzers.
> So it would be nice if a user could easily specify that they want to use a 
> French stopword list, and use it for StopFilter or CommonGrams.
> The ones from Snowball, however, are formatted differently from the 
> others (although in Lucene we have parsers to deal with this).
> Additionally, we abstract this from Lucene users by adding a static 
> getDefaultStopSet to all analyzers.
> There are two approaches. I think I prefer the first, but I'm 
> not sure it matters as long as we have good examples (maybe a foreign 
> language example schema?)
> 1. The user would specify something like:
>  <filter class="solr.StopFilterFactory" 
> fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../>
>  This would just grab the CharArraySet from the FrenchAnalyzer's 
> getDefaultStopSet method; who cares where it comes from or how it's loaded.
> 2. We add support for Snowball-formatted stopword lists, and the user could 
> specify something like:
> <filter class="solr.StopFilterFactory" 
> words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball" 
> ... />
> The disadvantage to this is they have to know where the list is, what format 
> it's in, etc. For example: Snowball doesn't provide Romanian or Turkish
> stopword lists to go along with its stemmers, so we had to add our own.
> Let me know what you guys think, and I will create a patch.
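
For readers unfamiliar with the Snowball list format mentioned in approach 2, 
the following is a rough sketch of a parser, assuming the usual conventions of 
those files: a '|' starts a comment that runs to the end of the line, and any 
remaining text holds one or more whitespace-separated words. The class name 
SnowballStopwordParser is hypothetical; this is an illustration, not the 
parser Lucene ships.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating the Snowball stopword-list conventions.
final class SnowballStopwordParser {
    /** Collects the words from Snowball-style lines, ignoring '|' comments and blank lines. */
    static Set<String> parse(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            // Everything from the first '|' onward is a comment.
            int comment = line.indexOf('|');
            if (comment >= 0) {
                line = line.substring(0, comment);
            }
            // A line may carry several words separated by whitespace.
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }
}
```

The lines could come from something like Files.readAllLines(path, 
StandardCharsets.UTF_8), which names the encoding explicitly instead of 
relying on the platform default.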

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


