This won't be *really* helpful, but I remember this being discussed at some
length a while ago. You should find some good info if you search the list
archive, probably for "language".

I didn't pay much attention, since this isn't something I've been concerned
with lately, so I can't be of much real help...

Best
Erick

On 10/13/06, Antony Bowesman <[EMAIL PROTECTED]> wrote:

Hello,

I'm new to Lucene and wanted some advice on analyzers, stemmers and language
analysis.  I've got LIA, so have read its chapters.

I am writing a framework that needs to be able to index documents from a
range of languages where just the character set of the document is known.
Has anyone looked at, or is anyone using, language analysis to determine
the language of a document in ISO-8859-1?
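
To make the question concrete, here is a minimal sketch of one simple form
of language analysis: counting stop word hits per language. It is only an
illustration, not anything provided by Lucene. The class name
StopWordLanguageGuesser, the guess() methods and the tiny stop word lists
are all hypothetical, and a usable detector would need much larger lists
(or character n-gram statistics) and some handling of ties.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: guesses a language by counting stop word hits.
final class StopWordLanguageGuesser {

    // Tiny illustrative lists (an assumption); real lists would be far larger.
    private static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "of", "is", "to", "in")));
        STOP_WORDS.put("fr", new HashSet<>(Arrays.asList("le", "la", "les", "et", "est", "des")));
        STOP_WORDS.put("de", new HashSet<>(Arrays.asList("der", "die", "das", "und", "ist", "nicht")));
    }

    // Decodes the raw bytes as ISO-8859-1, then guesses the language.
    static String guess(byte[] latin1Bytes) {
        return guess(new String(latin1Bytes, StandardCharsets.ISO_8859_1));
    }

    // Returns the language code with the most stop word hits, or "unknown"
    // when no stop word from any list occurs in the text.
    static String guess(String text) {
        // Split on runs of non-letters; \p{L} keeps accented letters intact.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOP_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            // Strictly greater: on a tie the earlier language in the map wins.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

The guessed code could then be used to pick a per-language stop word set
or stemmer before indexing, falling back to a default when the result is
"unknown".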

Is it worth doing, or does StandardAnalyzer cope well with most European
languages as long as it is provided with a suitable multi-lingual set of
stop words?

What about stemming?  I see Google now says it does stemming, but again
here language detection seems to be a stumbling block in the way of
choosing the right stemmer.  Does stemming provide much of an index size
reduction and is it actually useful in search?

Antony

