Re: Language detection

Jack Krupansky Thu, 27 Jun 2013 10:11:53 -0700

You can use the LangDetectLanguageIdentifierUpdateProcessorFactory update processor to redirect text to language-specific fields, and then set the non-English fields to be "ignored". But the document would still be indexed, just without the redirected text fields.

(Examples of using that update processor are in my book - but not the "ignored" step.)


There is also a Tika-based processor:
TikaLanguageIdentifierUpdateProcessorFactory

If you really want to completely suppress the indexing of documents containing non-English text, you'll have to make an explicit check before sending the document to Solr. Tika also has language detection, so you could call Tika from an external process before sending the document to Solr.
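
As a rough sketch of that explicit check, the client-side filter below drops non-English text before anything is sent to Solr. The class name EnglishOnlyFilter and the toyDetector method are hypothetical, made up for illustration; the stop-word heuristic is only a placeholder so the example runs without dependencies. In practice you would plug in a real detector (for example, one backed by Tika's language detection) that returns an ISO 639-1 code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical pre-indexing filter: keeps only texts the detector labels "en".
public class EnglishOnlyFilter {
    // Assumption: the detector maps text to a language code such as "en".
    private final Function<String, String> detector;

    public EnglishOnlyFilter(Function<String, String> detector) {
        this.detector = detector;
    }

    public boolean isEnglish(String text) {
        return text != null && "en".equals(detector.apply(text));
    }

    // Returns only the documents that should be sent on to Solr.
    public List<String> keepEnglish(List<String> docs) {
        List<String> kept = new ArrayList<>();
        for (String doc : docs) {
            if (isEnglish(doc)) {
                kept.add(doc);
            }
        }
        return kept;
    }

    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "and", "of", "to", "on"));

    // Placeholder detector, NOT real language detection: reports "en" when
    // at least two common English stop words occur, "unknown" otherwise.
    // Swap this for a real library call in actual use.
    public static String toyDetector(String text) {
        int hits = 0;
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (STOP_WORDS.contains(token)) {
                hits++;
            }
        }
        return hits >= 2 ? "en" : "unknown";
    }
}
```

Construct the filter with whatever detector you trust, for example new EnglishOnlyFilter(EnglishOnlyFilter::toyDetector), and pass only the result of keepEnglish(...) to your indexing code. Documents that fail the check never reach Solr, which is the difference from the "ignored" field approach above.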


-- Jack Krupansky

-----Original Message-----
From: Hang Mang
Sent: Thursday, June 27, 2013 11:45 AM
To: java-user@lucene.apache.org
Subject: Language detection

Hello,

Is there some kind of filter or component that I could use to filter out
non-English text? I have a preprocessing step in which I only want to index
English documents.

Best,

Gucko


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
