Re: Language detection

2013-06-27 Thread Jack Krupansky
Oops... sorry, I just realized this was on the Lucene-user list. My response was for Solr-ONLY! -- Jack Krupansky -Original Message- From: Jack Krupansky Sent: Thursday, June 27, 2013 1:11 PM To: java-user@lucene.apache.org Subject: Re: Language detection You can use the

Re: Language detection

2013-06-27 Thread Jack Krupansky
an explicit check before sending the document to Solr. Tika also has language detection, so you could call Tika from an external process before sending the document to Solr. -- Jack Krupansky -Original Message- From: Hang Mang Sent: Thursday, June 27, 2013 11:45 AM To: java-user@

Language detection

2013-06-27 Thread Hang Mang
Hello, is there some kind of a filter or component that I could use to filter non-english text? I have a preprocessing step that I only want to index English documents. Best, Gucko

Re: Lightweight detection of whether a keyword is CJK or not (language detection)

2013-03-11 Thread Gili Nachum
This character lies in the CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A block. Added extensions detection, I assume (not really knowing) that all of these characters are not phonetic as well. import java.lang.Character.UnicodeBlock; import java.util.Arrays; import java.util.HashSet; import java.util.Set; i
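The check discussed in this thread can be sketched roughly as follows. This is not the poster's code (the snippet above is truncated) but a minimal illustration that assumes "CJK" means the unified ideographs block plus the extension and compatibility blocks listed below. The class name CjkCheck and method containsCjk are hypothetical. It walks code points rather than chars so that supplementary-plane characters such as Extension B are handled.

```java
import java.lang.Character.UnicodeBlock;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CjkCheck {
    // Blocks treated as CJK ideographs in this sketch (an assumption; extend as needed).
    private static final Set<UnicodeBlock> CJK_BLOCKS = new HashSet<>(Arrays.asList(
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B,
            UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS));

    // Returns true if any code point of the keyword falls in one of the blocks above.
    public static boolean containsCjk(String keyword) {
        return keyword.codePoints()
                .anyMatch(cp -> CJK_BLOCKS.contains(UnicodeBlock.of(cp)));
    }
}
```

With this set, the character raised earlier in the thread (㒨, which lies in Extension A) is matched as well as characters from the base block. Kana and Hangul are deliberately not included, since the thread is concerned with ideographs.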

Re: Lightweight detection of whether a keyword is CJK or not (language detection)

2013-03-10 Thread Trejkaz
On Sun, Mar 10, 2013 at 8:19 PM, Gili Nachum wrote: > Answering myself for next generations' sake. > Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS does the job. How about 㒨? TX

Re: Lightweight detection of whether a keyword is CJK or not (language detection)

2013-03-10 Thread Gili Nachum
Answering myself for next generations' sake. Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS does the job. Example: import junit.framework.Assert; import org.junit.Test; public class DetectCJK { @Test public void test1() { Assert.assertEquals(Character.UnicodeBlock.BASIC_LATIN, Ch

RE: Language Detection for Analysis?

2009-08-10 Thread Teruhiko Kurosaka
Original Message- > From: Bradford Stephens [mailto:bradfordsteph...@gmail.com] > Sent: Thursday, August 06, 2009 12:46 PM > To: solr-u...@lucene.apache.org; java-user@lucene.apache.org > Subject: Language Detection for Analysis? > > Hey there, > > We're trying to

Re: Language Detection for Analysis?

2009-08-09 Thread Lucas F. A. Teixeira
Google Translate just released (last week) its language API with translation and LANGUAGE DETECTION. :) It's very simple to use, and you can query it with some text to determine which language it is. Here is a simple example using groovy, but all you need is the url to query:

Re: Language Detection for Analysis?

2009-08-07 Thread Grant Ingersoll
There are several free Language Detection libraries out there, as well as a few commercial ones. I think Karl Wettin has even written one as a plugin for Lucene. Nutch also has one, AIUI. I would just Google "language detection". Also see http://www.lucidimagination.com

Re: Language Detection for Analysis?

2009-08-06 Thread Otis Gospodnetic
, NER, IR - Original Message > From: Bradford Stephens > To: solr-u...@lucene.apache.org; java-user@lucene.apache.org > Sent: Thursday, August 6, 2009 3:46:21 PM > Subject: Language Detection for Analysis? > > Hey there, > > We're trying to add foreign

Re: Language Detection for Analysis?

2009-08-06 Thread Shai Erera
> > >> > We're trying to add foreign language support into our new search > >> > engine -- languages like Arabic, Farsi, and Urdu (that don't work with > >> > standard analyzers). But our data source doesn't tell us

Re: Language Detection for Analysis?

2009-08-06 Thread Robert Muir
to our new search >> > engine -- languages like Arabic, Farsi, and Urdu (that don't work with >> > standard analyzers). But our data source doesn't tell us which >> > languages we're actually collecting -- we just get blocks of text. Has >> > an

Re: Language Detection for Analysis?

2009-08-06 Thread Shai Erera
gine -- languages like Arabic, Farsi, and Urdu (that don't work with > > standard analyzers). But our data source doesn't tell us which > > languages we're actually collecting -- we just get blocks of text. Has > > anyone here worked on language detection so we can f

Re: Language Detection for Analysis?

2009-08-06 Thread Robert Muir
trying to add foreign language support into our new search > engine -- languages like Arabic, Farsi, and Urdu (that don't work with > standard analyzers). But our data source doesn't tell us which > languages we're actually collecting -- we just get blocks of text. Has > anyon

Language Detection for Analysis?

2009-08-06 Thread Bradford Stephens
ext. Has anyone here worked on language detection so we can figure out what analyzers to use? Are there commercial solutions? Much appreciated! -- http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science

Re: Free software for language detection

2009-07-06 Thread Vinicius Carvalho
pache.org/jira/browse/LUCENE-1039 that > I've successfully used for language detection of user queries. > > karl > > 27 mar 2009 kl. 18.35 skrev Boris Aleksandrovsky: > > > Lisheng, >> >> You might want to look at the Nutch LanguageID plugin >> (h

Re: Free software for language detection

2009-03-29 Thread Karl Wettin
You can also look at https://issues.apache.org/jira/browse/LUCENE-1039 that I've successfully used for language detection of user queries. karl 27 mar 2009 kl. 18.35 skrev Boris Aleksandrovsky: Lisheng, You might want to look at the Nutch LanguageID plugin (http://wiki.apach

Re: Free software for language detection

2009-03-27 Thread Boris Aleksandrovsky
to:jochen.sc...@gmail.com]on Behalf Of > Jochen Frey > Sent: Friday, March 27, 2009 10:04 AM > To: java-user@lucene.apache.org > Subject: Re: Free software for language detection > > > Lisheng, > > Here's a package you could take a look at. I have used it in the past and i

RE: Free software for language detection

2009-03-27 Thread Zhang, Lisheng
Thanks very much! -Original Message- From: jochen.sc...@gmail.com [mailto:jochen.sc...@gmail.com]on Behalf Of Jochen Frey Sent: Friday, March 27, 2009 10:04 AM To: java-user@lucene.apache.org Subject: Re: Free software for language detection Lisheng, Here's a package you could t

Re: Free software for language detection

2009-03-27 Thread Jochen Frey
sheng.zh...@broadvision.com> wrote: > Hi, > > Are you aware of any free software for language detection (given certain > text, see if it is French, or Japanese)? I saw Bob Carpenter's previous > mail which explained the principle nicely, but could not locate free tools? > >

Free software for language detection

2009-03-27 Thread Zhang, Lisheng
Hi, Are you aware of any free software for language detection (given certain text, see if it is French, or Japanese)? I saw Bob Carpenter's previous mail which explained the principle nicely, but could not locate free tools. Thanks very much for your help, Li

Re: Language detection library

2007-05-07 Thread Bob Carpenter
Anyone knows of a good language detection library that can detect what language a document (text) is ? Language detection is easy. It's just a simple text classification problem. One way you can do this is using Lucene itself. Create a so-called pseudo-document for each language consi
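To make the "simple text classification" idea concrete, here is a minimal sketch. It is not the Lucene-based approach the message goes on to describe (that text is truncated here); instead it swaps in a plain in-memory character-trigram overlap score. The class name NgramLanguageGuesser, its methods, and the training samples used with it are all hypothetical, and a real profile would need far more text per language.

```java
import java.util.HashMap;
import java.util.Map;

public class NgramLanguageGuesser {
    // One trigram-count profile per language, built from a training sample.
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    // Counts overlapping character trigrams of the lower-cased text.
    private static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Registers (or replaces) the profile for a language.
    public void train(String language, String sample) {
        profiles.put(language, trigrams(sample));
    }

    // Returns the language whose profile shares the most distinct trigrams
    // with the text, or null if nothing has been trained.
    public String guess(String text) {
        Map<String, Integer> query = trigrams(text);
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            int score = 0;
            for (String gram : query.keySet()) {
                if (e.getValue().containsKey(gram)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Usage would be along the lines of calling train once per language with a representative sample and then guess on each incoming text. Very short inputs give unreliable scores, which is consistent with the minimum-length caveat mentioned elsewhere in these threads.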

RE: Language detection library

2007-05-04 Thread Mordo, Aviran (EXP N-NANNATEK)
Thank you, I got the Nutch plugin, and it is working great -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Thursday, May 03, 2007 4:17 PM To: java-user@lucene.apache.org Subject: Re: Language detection library LingPipe - commercial unless your data/product

Re: Language detection library

2007-05-03 Thread karl wettin
of a good language detection library that can detect what > language a document (text) is ? I posted this some time back: https://issues.apache.org/jira/browse/LUCENE-826 A bit of proof-of-concept:ish, but it does the job well if you ask me. Uses Weka (GPL) and requires at least 150 char

Re: Language detection library

2007-05-03 Thread Chris Lu
://search.dbsight.com Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes On 5/3/07, karl wettin <[EMAIL PROTECTED]> wrote: 3 maj 2007 kl. 22.06 skrev Mordo, Aviran (EXP N-NANNATEK): > Anyone knows of a good language detectio

Re: Language detection library

2007-05-03 Thread karl wettin
3 maj 2007 kl. 22.06 skrev Mordo, Aviran (EXP N-NANNATEK): Anyone knows of a good language detection library that can detect what language a document (text) is ? I posted this some time back: https://issues.apache.org/jira/browse/LUCENE-826 A bit of proof-of-concept:ish, but it does the

Re: Language detection library

2007-05-03 Thread Andrzej Bialecki
Jason Pump wrote: http://software.wise-guys.nl/libtextcat/ ... which is what Nutch implements in its language-identifier plugin. -- Best regards, Andrzej Bialecki

Re: Language detection library

2007-05-03 Thread Jason Pump
- Original Message From: "Mordo, Aviran (EXP N-NANNATEK)" <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, May 3, 2007 4:06:04 PM Subject: Language detection library Anyone knows of a good language detection library that can detect what language a do

Re: Language detection library

2007-05-03 Thread Otis Gospodnetic
t; <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, May 3, 2007 4:06:04 PM Subject: Language detection library Anyone knows of a good language detection library that can detect what language a document (text) is ?

Language detection library

2007-05-03 Thread Mordo, Aviran (EXP N-NANNATEK)
Anyone knows of a good language detection library that can detect what language a document (text) is ?

Re: Analyzers and multiple languages (language detection)

2006-11-21 Thread Bob Carpenter
Antony Bowesman wrote: Hello, I'm new to Lucene and wanted some advice on analyzers, stemmers and language analysis. I've got LIA, so have read its chapters. I am writing a framework that needs to be able to index documents from a range of languages where just the character set of the docu