[
https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901103#action_12901103
]
Lance Norskog commented on SOLR-1860:
-------------------------------------
bq. What wackiness? The files are all unicode UTF-8, which windows too supports.
'Supports' does not mean 'you can get it done without a pounding headache'.
On Windows, UTF-8 is not the default encoding and you cannot make it the
default. I'm guessing some Linux editors don't understand the byte order mark
(the bytes EF BB BF at the start of the file) that marks a UTF-8 file. Having
UTF-8 characters in the Java source blows up also. An XML file format would go
a long way toward usability.
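
To make the encoding point concrete, here is a rough sketch of reading a word list as UTF-8 regardless of the platform default, and dropping a leading byte order mark if one is there. The class and method names are made up for illustration; this is not Solr or Lucene code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr or Lucene.
public class Utf8WordListReader {

    /** Reads one word per line as UTF-8, ignoring the platform default charset. */
    public static List<String> readWords(InputStream in) {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            boolean first = true;
            String line;
            while ((line = reader.readLine()) != null) {
                // A UTF-8 BOM (bytes EF BB BF) decodes to the single char U+FEFF.
                // The decoder does not strip it, so drop it from the first line.
                if (first && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                first = false;
                line = line.trim();
                if (!line.isEmpty()) {
                    words.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // BOM followed by two words; non-ASCII text is written with escapes
        // so the source file itself stays plain ASCII.
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        byte[] body = "\u00e9t\u00e9\nle\n".getBytes(StandardCharsets.UTF_8);
        byte[] data = new byte[bom.length + body.length];
        System.arraycopy(bom, 0, data, 0, bom.length);
        System.arraycopy(body, 0, data, bom.length, body.length);
        System.out.println(readWords(new ByteArrayInputStream(data)).size());
    }
}
```

Passing the charset explicitly is what avoids the "not the default" problem; the BOM check covers files saved by editors that insist on writing one.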
> improve stopwords list handling
> -------------------------------
>
> Key: SOLR-1860
> URL: https://issues.apache.org/jira/browse/SOLR-1860
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Affects Versions: 3.1
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Attachments: SOLR-1860.patch
>
>
> Currently Solr makes it easy to use English stopwords for StopFilter or
> CommonGramsFilter.
> Recently in Lucene, we added stopword lists (mostly, but not all, from
> Snowball) to all the language analyzers.
> So it would be nice if a user could easily specify that they want to use a
> French stopword list, and use it for StopFilter or CommonGramsFilter.
> The ones from Snowball, however, are formatted differently from the
> others (although in Lucene we have parsers to deal with this).
> Additionally, we abstract this from Lucene users by adding a static
> getDefaultStopSet method to all analyzers.
> There are two approaches. I think I prefer the first, but I'm
> not sure it matters as long as we have good examples (maybe a foreign
> language example schema?).
> 1. The user would specify something like:
> <filter class="solr.StopFilterFactory"
> fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../>
> This would just grab the CharArraySet from the FrenchAnalyzer's
> getDefaultStopSet method; who cares where it comes from or how it's loaded.
> 2. We add support for Snowball-formatted stopword lists, and the user could
> specify something like:
> <filter class="solr.StopFilterFactory"
> words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball"
> ... />
> The disadvantage of this is that they have to know where the list is, what
> format it's in, etc. For example: Snowball doesn't provide Romanian or Turkish
> stopword lists to go along with its stemmers, so we had to add our own.
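
For reference on option 2, a minimal sketch of what parsing a Snowball-formatted list involves, assuming the usual Snowball conventions: "|" starts a comment that runs to the end of the line, and a line may hold several whitespace-separated words. The class name is hypothetical and this is a stand-in for illustration, not the Lucene parser.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the Snowball stopword file conventions.
public class SnowballStopwords {

    /** Collects stopwords from lines in Snowball format, preserving first-seen order. */
    public static Set<String> parse(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            // Everything from '|' to the end of the line is a comment.
            int bar = line.indexOf('|');
            if (bar >= 0) {
                line = line.substring(0, bar);
            }
            // A line may carry more than one word, separated by whitespace.
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                " | A French stop word list (sample lines only)",
                "au             |  a + le",
                "aux            |  a + les",
                "",
                "de du");
        System.out.println(parse(sample));
    }
}
```

A plain one-word-per-line loader would treat the comment text as stopwords, which is why the format has to be declared somewhere (the format="snowball" attribute in the proposal).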
> Let me know what you guys think, and I will create a patch.
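
For option 1, the factory side could plausibly be done with reflection: load the class named in the attribute and call its static getDefaultStopSet method. The sketch below assumes the method is public, static, takes no arguments, and returns some kind of Set. DemoFrenchAnalyzer is a made-up stand-in so the example runs without Lucene on the classpath, and stopSetFrom is a hypothetical helper, not an existing Solr method.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of resolving a stop set from an analyzer class name.
public class DefaultStopSetLookup {

    /** Stand-in for a language analyzer; NOT a real Lucene class. */
    public static class DemoFrenchAnalyzer {
        public static Set<String> getDefaultStopSet() {
            return Collections.unmodifiableSet(
                    new HashSet<>(Arrays.asList("le", "la", "les")));
        }
    }

    /** Calls the static, no-argument getDefaultStopSet() on the named class. */
    public static Set<?> stopSetFrom(String analyzerClassName) {
        try {
            Class<?> clazz = Class.forName(analyzerClassName);
            Method method = clazz.getMethod("getDefaultStopSet");
            // The receiver is null because the method is assumed to be static.
            return (Set<?>) method.invoke(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "No usable static getDefaultStopSet() on " + analyzerClassName, e);
        }
    }

    public static void main(String[] args) {
        // Mirrors fromAnalyzer="..." in the schema, using the stand-in class.
        Set<?> stopWords = stopSetFrom(DemoFrenchAnalyzer.class.getName());
        System.out.println(stopWords.contains("le"));
    }
}
```

The appeal is the one stated in the description: the schema names only the analyzer, and where the list lives or how it is formatted stays the analyzer's business.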