Re: Language detection

2013-06-27 Thread Jack Krupansky
Oops... sorry, I just realized this was on the Lucene-user list. My response was for Solr-ONLY! -- Jack Krupansky -----Original Message----- From: Jack Krupansky Sent: Thursday, June 27, 2013 1:11 PM To: java-user@lucene.apache.org Subject: Re: Language detection You can use the

Re: Language detection

2013-06-27 Thread Jack Krupansky
You can use the LangDetectLanguageIdentifierUpdateProcessorFactory update processor to redirect languages to alternate fields, and then set the non-English fields to be "ignored". But, the document would still be indexed, just without the redirected text fields. (Examples of using that update

RE: Language Detection for Analysis?

2009-08-10 Thread Teruhiko Kurosaka
A shameless self-promotion: http://basistech.com/language-identification/ No, it's not free. Sorry. We have Lucene-compatible Tokenizers for those languages too: http://basistech.com/lucene/How-to-build-a-multilingual-search-engine.pdf Contact me if you have questions. -kuro > -Original Me

Re: Language Detection for Analysis?

2009-08-09 Thread Lucas F. A. Teixeira
Google Translate just released (last week) its language API with translation and LANGUAGE DETECTION. :) It's very simple to use, and you can query it with some text to determine which language it is. Here is a simple example using Groovy, but all you need is the URL to query: http://groovyconsole.ap

Re: Language Detection for Analysis?

2009-08-07 Thread Grant Ingersoll
There are several free Language Detection libraries out there, as well as a few commercial ones. I think Karl Wettin has even written one as a plugin for Lucene. Nutch also has one, AIUI. I would just Google "language detection". Also see http://www.lucidimagination.com/search/?q=languag

Re: Language Detection for Analysis?

2009-08-06 Thread Otis Gospodnetic
Bradford, If I may: Have a look at http://www.sematext.com/products/language-identifier/index.html And/or http://www.sematext.com/products/multilingual-indexer/index.html Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP,

Re: Language Detection for Analysis?

2009-08-06 Thread Shai Erera
Thanks Robert for the explanation. I thought that you meant something different, like doing stemming in some sophisticated manner by somehow detecting the language. Doing these normalizations makes sense of course, especially if the letters look similar. Thanks again, Shai On Thu, Aug 6, 2009 at

Re: Language Detection for Analysis?

2009-08-06 Thread Robert Muir
Shai, I mean doing language-agnostic things that apply to all of these since they are based on the same writing system, like normalizing all yeh characters (arabic yeh, farsi yeh, alef maksura) to the same form, removing harakat, the kinds of things in ArabicNormalizationFilter and PersianNormaliza
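A minimal sketch of the script-level normalization described above, in plain Python. It is not a port of ArabicNormalizationFilter or the Persian filter, which handle more cases; the function name, the choice of Arabic yeh (U+064A) as the target form, and the harakat range U+064B to U+0652 are assumptions made for illustration.

```python
import re

# Assumption: fold every yeh-like letter onto ARABIC LETTER YEH (U+064A).
YEH_VARIANTS = str.maketrans({
    "\u06CC": "\u064A",  # ARABIC LETTER FARSI YEH
    "\u0649": "\u064A",  # ARABIC LETTER ALEF MAKSURA
})

# Assumption: treat U+064B (fathatan) through U+0652 (sukun) as the harakat to drop.
HARAKAT = re.compile("[\u064B-\u0652]")


def normalize_arabic_script(text: str) -> str:
    """Strip harakat, then map yeh variants to one form (hypothetical helper)."""
    return HARAKAT.sub("", text).translate(YEH_VARIANTS)
```

Because the mapping depends only on the writing system, the same function can be applied to index text and query text without knowing which language either one is in.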

Re: Language Detection for Analysis?

2009-08-06 Thread Shai Erera
Robert - can you elaborate on what you mean by "just treat it at the script level"? On Thu, Aug 6, 2009 at 10:55 PM, Robert Muir wrote: > Bradford, there is an arabic analyzer in trunk. for farsi there is > currently a patch available: > http://issues.apache.org/jira/browse/LUCENE-1628 > > one o

Re: Language Detection for Analysis?

2009-08-06 Thread Robert Muir
Bradford, there is an arabic analyzer in trunk. for farsi there is currently a patch available: http://issues.apache.org/jira/browse/LUCENE-1628 one option is not to detect languages at all. it could be hard for short queries due to the languages you mentioned borrowing from each other. but you do

Re: Language detection library

2007-05-07 Thread Bob Carpenter
Anyone knows of a good language detection library that can detect what language a document (text) is ? Language detection is easy. It's just a simple text classification problem. One way you can do this is using Lucene itself. Create a so-called pseudo-document for each language consisting
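The snippet above is cut off before it says what goes into each pseudo-document, so the following is only a sketch of the general idea of language detection as text classification, not the method from the post. It assumes character trigram counts as features and cosine similarity as the score, and it uses plain Python instead of a Lucene index; all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count character n-grams of the lowercased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())  # missing grams count as 0
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def build_profiles(samples):
    """Build one n-gram profile (the 'pseudo-document') per language."""
    return {lang: char_ngrams(text) for lang, text in samples.items()}


def detect(text, profiles):
    """Return the language whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))
```

With realistic training text per language, `detect(some_text, build_profiles(samples))` returns the best-scoring language key. Very short inputs give few n-grams, so scores for related languages can be close.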

RE: Language detection library

2007-05-04 Thread Mordo, Aviran (EXP N-NANNATEK)
Thank you, I got the Nutch plugin, and it is working great. -----Original Message----- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Thursday, May 03, 2007 4:17 PM To: java-user@lucene.apache.org Subject: Re: Language detection library LingPipe - commercial unless your data/product

Re: Language detection library

2007-05-03 Thread karl wettin
On 4 May 2007, at 02:20, Chris Lu wrote: I suppose if a document is indexed as English or French, when users search the document, we need to parse the query as English or French also? If you do some language-specific token analysis such as stemming, yes. Detecting the language on such small t
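To illustrate why the query has to go through the same language-specific analysis as the document, here is a toy sketch. The suffix strippers are deliberately crude stand-ins for real stemmers, and the suffix lists, the index layout, and every function name are assumptions made for this example only.

```python
def make_suffix_stripper(suffixes):
    """Build a toy analyzer that lowercases, splits, and strips one suffix per word."""
    def analyze(text):
        tokens = []
        for word in text.lower().split():
            for suffix in suffixes:
                if word.endswith(suffix) and len(word) > len(suffix) + 2:
                    word = word[:-len(suffix)]
                    break
            tokens.append(word)
        return tokens
    return analyze


# Hypothetical per-language analyzers; real stemmers are far more careful.
ANALYZERS = {
    "en": make_suffix_stripper(("ing", "es", "s")),
    "fr": make_suffix_stripper(("ement", "es", "s")),
}


def index_document(index, doc_id, text, lang):
    """Store each analyzed token under a (language, token) key."""
    for token in ANALYZERS[lang](text):
        index.setdefault((lang, token), set()).add(doc_id)


def search(index, query, lang):
    """Analyze the query with the same analyzer and intersect the postings."""
    hits = [index.get((lang, token), set()) for token in ANALYZERS[lang](query)]
    return set.intersection(*hits) if hits else set()
```

A document indexed as English is only found when the query is also analyzed as English; analyzing the same query as French looks up different keys and finds nothing.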

Re: Language detection library

2007-05-03 Thread Chris Lu
I suppose if a document is indexed as English or French, when users search the document, we need to parse the query as English or French also? -- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.db

Re: Language detection library

2007-05-03 Thread karl wettin
On 3 May 2007, at 22:06, Mordo, Aviran (EXP N-NANNATEK) wrote: Anyone knows of a good language detection library that can detect what language a document (text) is ? I posted this some time back: https://issues.apache.org/jira/browse/LUCENE-826 A bit proof-of-concept-ish, but it does the job

Re: Language detection library

2007-05-03 Thread Andrzej Bialecki
Jason Pump wrote: http://software.wise-guys.nl/libtextcat/ ... which is what Nutch implements in its language-identifier plugin. -- Best regards, Andrzej Bialecki

Re: Language detection library

2007-05-03 Thread Jason Pump
http://software.wise-guys.nl/libtextcat/ Otis Gospodnetic wrote: LingPipe - commercial unless your data/product/service is free. Nutch language id plugin. Otis -- Simpy -- http://www.simpy.com/ - Tag - Search - Share - Origin

Re: Language detection library

2007-05-03 Thread Otis Gospodnetic
LingPipe - commercial unless your data/product/service is free. Nutch language id plugin. Otis -- Simpy -- http://www.simpy.com/ - Tag - Search - Share ----- Original Message ----- From: "Mordo, Aviran (EXP N-NANNATEK)" <[EMAIL PROTEC