On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen <jans...@parc.com> wrote:
> I thought that since I'm updating UpLib's Lucene code, I should tackle
> the issue of document languages, as well. Right now I'm using an
> off-the-shelf language identifier, textcat, to figure out which language
> a Web page or PDF is (mainly) written in. I then want to analyze that
> document with an appropriate analyzer. I'd then like to map to the
> correct Lucene analyzer for that language, falling back to
> StandardAnalyzer if the installed Lucene library doesn't have an
> analyzer for that language.
>
> It would be *very* handy if Analyzer had a static method
>
>   static Analyzer getAnalyzerForLanguage(String rfc_4646_lang_tag);
>

I agree (not sure if it should be in Analyzer itself; maybe we could make
an Analyzer for this)... I mean, it sounds like what you want is for it to
work in a similar way to ResourceBundle's fallback mechanism?

And I agree with your idea of RFC 3066/4646, e.g. you might want to
specify subtags like "word" (SmartChineseAnalyzer) or "ngram"
(CJKAnalyzer) for Chinese somehow?

Shai Erera brought a similar idea up before, to use Locale, but my concern
is that it would be limited by Java's Locale mechanism... but we can figure
this out.

Maybe you want to create a JIRA issue to pursue this idea further? See
http://wiki.apache.org/lucene-java/HowToContribute

> Right now I'm consulting a hand-compiled mapping of
> langtag-to-Lucene-classname to figure out which Analyzer to use.
> Wearisome, and it will be out-of-date for future releases of Lucene
> which will presumably support more languages.
>

Yes, but it also brings up interesting backwards compatibility challenges.
If we add more analyzers, say EsperantoAnalyzer, then when you upgrade
Lucene your Esperanto queries are suddenly analyzed differently (whereas
they were handled by StandardAnalyzer before).
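
To make the fallback idea and the compatibility concern concrete, here is
a minimal sketch, not a proposal for the actual API. AnalyzerRegistry is a
hypothetical class (nothing like it exists in Lucene), it returns analyzer
class names as plain strings so it runs without Lucene on the classpath,
and the "zh-x-ngram" private-use tag is only an assumed spelling for the
"ngram" subtag idea.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical registry: language tag -> analyzer name, with subtag fallback. */
final class AnalyzerRegistry {
    private final Map<String, String> byTag = new HashMap<>();
    private final String fallback;

    AnalyzerRegistry(String fallback) {
        this.fallback = fallback;
    }

    void register(String tag, String analyzerName) {
        byTag.put(normalize(tag), analyzerName);
    }

    /** Try the full tag, then drop trailing subtags one by one, then fall back. */
    String lookup(String tag) {
        String candidate = normalize(tag);
        while (!candidate.isEmpty()) {
            String hit = byTag.get(candidate);
            if (hit != null) {
                return hit;
            }
            int cut = candidate.lastIndexOf('-');
            candidate = cut < 0 ? "" : candidate.substring(0, cut);
        }
        return fallback;
    }

    // Language tags are case-insensitive; also accept Locale-style underscores.
    private static String normalize(String tag) {
        return tag.trim().toLowerCase(Locale.ROOT).replace('_', '-');
    }

    public static void main(String[] args) {
        AnalyzerRegistry registry = new AnalyzerRegistry("StandardAnalyzer");
        registry.register("it", "ItalianAnalyzer");
        registry.register("zh", "SmartChineseAnalyzer");
        registry.register("zh-x-ngram", "CJKAnalyzer"); // assumed tag spelling

        System.out.println(registry.lookup("it-IT"));      // falls back to "it"
        System.out.println(registry.lookup("zh-x-ngram")); // exact match
        System.out.println(registry.lookup("eo"));         // nothing registered

        // Simulate an upgrade that ships a (hypothetical) EsperantoAnalyzer.
        registry.register("eo", "EsperantoAnalyzer");
        System.out.println(registry.lookup("eo"));
    }
}
```

The two "eo" lookups are the compatibility concern in miniature: the same
tag resolves to a different analyzer once a new one is registered, so
already-indexed Esperanto text and new queries would no longer be analyzed
the same way.
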
But this becomes less of a problem as we work on modularizing Lucene, so
we can remove Version from analyzers, and so you can just use an old
analyzers jar file (such as 4.1) but upgrade your Lucene core jar to, say,
version 4.3.

>
> Secondly, if I've got an instance of a SnowballAnalyzer, there's no way
> to look "inside" it, and see what language it's for. That's a problem
> on the search side. My QueryParser is a subclass of
> MultiFieldQueryParser, and it looks for a "special" FieldQuery on the
> field "_query_language", i.e., "_query_language:de" to tell the query
> parser to use a German analyzer on this query. What I'd like to be able
> to do is interrogate the current analyzer attached to the query parser
> instance, and throw an exception if it's not for the specified language.
> I can do this for non-Snowball analyzers, because of the brittle
> hand-compiled mapping mentioned above. But if it's a SnowballAnalyzer,
> there's no way to tell what the language inside it is. So it would be
> nice if SnowballAnalyzer grew a method
>

SnowballAnalyzer had more problems. It's actually deprecated in
trunk/branch_3x, and instead there is an Analyzer for each language
(English, Italian, etc.), which now has stopword lists and sometimes
special behavior (e.g. Turkish lowercases differently).

Put more simply, it's an implementation detail of ItalianAnalyzer that we
implement the stemming with SnowballFilter. One day we might change it to
use a less aggressive stemming algorithm (e.g. ItalianLightStemFilter) by
default.

> I'd really like to see the stopword work finished, so that a
> SnowballAnalyzer for a particular language has a decent set of
> stopwords.
>

See above, I think this is finished? The remaining work is actually Solr
integration.

In trunk and branch_3x, each analyzer has its own package. Here's Italian:

Source package: contains an Analyzer that uses SnowballFilter(Italian) and
loads the Italian snowball stopwords by default.
It also includes an alternative, less aggressive stemmer.
http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/it/

The snowball stopwords were all added to the resources directory. This is
where ItalianAnalyzer loads its set of stopwords from:
http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/

--
Robert Muir
rcm...@gmail.com