Oops... sorry, I just realized this was on the Lucene-user list. My response
was for Solr-ONLY!
-- Jack Krupansky
-Original Message-
From: Jack Krupansky
Sent: Thursday, June 27, 2013 1:11 PM
To: java-user@lucene.apache.org
Subject: Re: Language detection
You can use the LangDetectLanguageIdentifierUpdateProcessorFactory update
processor to redirect languages to alternate fields, and then set the
non-English fields to be "ignored". But, the document would still be
indexed, just without the redirected text fields.
(Examples of using that update
A shameless self-promotion:
http://basistech.com/language-identification/
No, it's not free. Sorry.
We have Lucene-compatible Tokenizers for those languages too:
http://basistech.com/lucene/How-to-build-a-multilingual-search-engine.pdf
Contact me if you have questions.
-kuro
> -Original Me
Google Translate just released (last week) its language API with translation
and LANGUAGE DETECTION.
:)
It's very simple to use, and you can query it with some text to determine which
language it is.
Here is a simple example using groovy, but all you need is the url to
query: http://groovyconsole.ap
There are several free Language Detection libraries out there, as well
as a few commercial ones. I think Karl Wettin has even written one as
a plugin for Lucene. Nutch also has one, AIUI. I would just Google
"language detection".
Also see http://www.lucidimagination.com/search/?q=languag
Bradford,
If I may:
Have a look at http://www.sematext.com/products/language-identifier/index.html
And/or http://www.sematext.com/products/multilingual-indexer/index.html
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP,
Thanks Robert for the explanation. I thought that you meant something
different, like doing stemming in some sophisticated manner by somehow
detecting the language. Doing these normalizations makes sense of course,
especially if the letters look similar.
Thanks again,
Shai
On Thu, Aug 6, 2009 at
Shai, I mean doing language-agnostic things that apply to all of these
since they are based on the same writing system, like normalizing all
yeh characters (arabic yeh, farsi yeh, alef maksura) to the same form,
removing harakat, the kinds of things in ArabicNormalizationFilter and
PersianNormaliza
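
A minimal sketch of the script-level normalization described above, written in Python rather than against the Lucene filters themselves. It folds the three yeh characters into one form and strips harakat. The choice of Arabic yeh (U+064A) as the target form and the function name `normalize_script` are assumptions made for illustration; the message only says "the same form".

```python
import re

# Harakat: the combining marks U+064B..U+0652
# (fathatan through sukun, including shadda).
HARAKAT = re.compile("[\u064B-\u0652]")

# Fold Farsi yeh (U+06CC) and alef maksura (U+0649) into Arabic yeh (U+064A).
# Assumption: Arabic yeh is the target form.
YEH_VARIANTS = str.maketrans({"\u06CC": "\u064A", "\u0649": "\u064A"})


def normalize_script(text: str) -> str:
    """Apply language-agnostic, script-level normalization to Arabic-script text."""
    return HARAKAT.sub("", text.translate(YEH_VARIANTS))
```

Because the rules depend only on the writing system, the same function can be applied to Arabic and Farsi text alike without first detecting the language.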
Robert - can you elaborate on what you mean by "just treat it at the script
level"?
On Thu, Aug 6, 2009 at 10:55 PM, Robert Muir wrote:
> Bradford, there is an arabic analyzer in trunk. for farsi there is
> currently a patch available:
> http://issues.apache.org/jira/browse/LUCENE-1628
>
> one o
Bradford, there is an arabic analyzer in trunk. for farsi there is
currently a patch available:
http://issues.apache.org/jira/browse/LUCENE-1628
one option is not to detect languages at all.
it could be hard for short queries due to the languages you mentioned
borrowing from each other.
but you do
Does anyone know of a good language detection library that can detect what
language a document (text) is in?
Language detection is easy. It's just a simple
text classification problem.
One way you can do this is using Lucene
itself. Create a so-called pseudo-document
for each language consisting
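
The excerpt above is cut off, so the following is only a rough sketch of the general idea (language detection as text classification), not the poster's method and not Lucene code. It assumes each pseudo-document is a sample of text in that language, represented as character-trigram counts, and that the input is assigned to the most similar profile by cosine similarity. The sample sentences and the names `trigrams`, `cosine`, and `detect` are hypothetical.

```python
import math
from collections import Counter


def trigrams(text: str) -> Counter:
    """Count character trigrams of lowercased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical, tiny pseudo-documents; real profiles need far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the "
          "language of the document",
    "fr": "le renard brun saute par-dessus le chien paresseux et ceci est "
          "la langue du document",
}
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}


def detect(text: str) -> str:
    """Return the language whose profile is most similar to the text."""
    query = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))
```

For example, `detect("this is the dog")` picks the English profile with these samples. Very short inputs share few trigrams with any profile, so results on short queries are much less reliable.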
Thank you, I got the Nutch plugin, and it is working great.
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 03, 2007 4:17 PM
To: java-user@lucene.apache.org
Subject: Re: Language detection library
LingPipe - commercial unless your data/product
4 maj 2007 kl. 02.20 skrev Chris Lu:
I suppose if a document is indexed as English or French,
when users search for the document,
we need to parse the query as English or French also?
If you do some language specific token analysis such as stemming, yes.
Detecting the language on such small t
I suppose if a document is indexed as English or French,
when users search for the document,
we need to parse the query as English or French also?
--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.db
3 maj 2007 kl. 22.06 skrev Mordo, Aviran (EXP N-NANNATEK):
Does anyone know of a good language detection library that can detect what
language a document (text) is in?
I posted this some time back:
https://issues.apache.org/jira/browse/LUCENE-826
A bit of proof-of-concept:ish, but it does the job
Jason Pump wrote:
http://software.wise-guys.nl/libtextcat/
... which is what Nutch implements in its language-identifier plugin.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___
http://software.wise-guys.nl/libtextcat/
Otis Gospodnetic wrote:
LingPipe - commercial unless your data/product/service is free.
Nutch language id plugin.
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
- Origin
LingPipe - commercial unless your data/product/service is free.
Nutch language id plugin.
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
- Original Message
From: "Mordo, Aviran (EXP N-NANNATEK)" <[EMAIL PROTEC