Re: Language Identifier with Lucene?

2011-10-24 Thread Luca Rondanini
…languages/character in the world for *one person*... > Regards, > Mead > On Sun, Oct 23, 2011 at 1:27 AM, Petite Abeille wrote: > On Oct 22, 2011, at 2:49 AM, Luca Rondanini wrote: > I usually use Nutch for this but, just…

Language Identifier with Lucene?

2011-10-21 Thread Luca Rondanini
Hi all, I usually use Nutch for this but, just for fun, I tried to create a language identifier based on Lucene only. I had a really small set of "training data": 10 files (roughly 2 MB each) for 10 languages. I indexed those files using an NGram analyzer. I have to say that I was not expecting much…
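
A rough sketch of the idea described in this post, assuming the NGramTokenizer from the Lucene 2.x/3.x contrib analyzers (the class name CharNGramAnalyzer and the 2–4 gram range are made up for illustration, and the Analyzer API changed in later versions):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;

    // Tokenizes text into character 2- to 4-grams, so each training file
    // contributes per-language n-gram statistics to the index.
    public class CharNGramAnalyzer extends Analyzer {
        public TokenStream tokenStream(String field, Reader reader) {
            return new NGramTokenizer(reader, 2, 4);
        }
    }

Indexing would then add one document per training file, with a stored language field and the file's text in an n-gram-analyzed field; identifying an unknown snippet amounts to running it as a query against that field and taking the language of the top-scoring hit.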

Re: Best practices for multiple languages?

2011-01-19 Thread Luca Rondanini
Why not just use the StandardAnalyzer? It works pretty well even with Asian languages! On Wed, Jan 19, 2011 at 12:23 AM, Shai Erera wrote: > If you index documents, each in a different language, but all its fields are of the same language, then what you can do is the following: Create…
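
The quoted advice is cut off above; purely as a hedged sketch (not necessarily what Shai went on to describe), one common way to handle documents that are each entirely in one language is to pick the analyzer per document at indexing time, e.g. against the Lucene 3.x API current at the time of the thread (constructor signatures vary a little across 3.x releases; GermanAnalyzer here just stands in for whichever language-specific analyzers apply and ships in the contrib analyzers module):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.util.Version;

    public class PerLanguageIndexing {
        private final Analyzer english = new StandardAnalyzer(Version.LUCENE_30);
        private final Analyzer german  = new GermanAnalyzer(Version.LUCENE_30);

        // Each document is analyzed with the analyzer matching its language,
        // via the IndexWriter.addDocument(Document, Analyzer) overload.
        public void add(IndexWriter writer, String text, boolean isGerman) throws IOException {
            Document doc = new Document();
            doc.add(new Field("body", text, Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc, isGerman ? german : english);
        }
    }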

Re: best practice: 1.4 billion documents

2010-11-22 Thread Luca Rondanini
eheheheh, 1.4 billion documents = 1,400,000,000 documents, at almost 2T = 2 terabytes = 2,000 gigabytes on HD! On Mon, Nov 22, 2010 at 10:16 AM, wrote: > of course I will distribute my index over many machines: storing everything on one computer is just crazy, 1.4B docs is going to be…
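
Spelling the arithmetic out (rounded, and assuming the ~2T figure refers to the raw documents): 2 terabytes ≈ 2,000,000,000 KB, so 2,000,000,000 KB / 1,400,000,000 docs ≈ 1.4 KB per document on average.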

Re: best practice: 1.4 billion documents

2010-11-22 Thread Luca Rondanini
> There is an acknowledged/proven bug with a small unit test, but there is some disagreement about the internal reasons it fails. I have been unable to generate further discussion or a resolution. This was supposed to be added as a…

Re: best practice: 1.4 billion documents

2010-11-21 Thread Luca Rondanini
> We have a couple billion docs in our archives as well. Breaking them up by day worked well for us, but you'll need to do something. > -----Original Message----- > From: Luca Rondanini [mailto:luca.rondan...@gmail.com] > Sent: Sunday, November 21, 2010 8:13 PM…
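
A hedged sketch of the day-partitioning idea, written against the Lucene 3.x API current at the time of the thread (the shard paths and the ten-hit limit are made up for illustration): several daily indexes can still be searched as one.

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class DailyShardSearch {
        // Opens one sub-reader per daily index; MultiReader presents them
        // to the searcher as a single logical index.
        public static TopDocs search(Query query) throws IOException {
            IndexReader[] shards = new IndexReader[] {
                IndexReader.open(FSDirectory.open(new File("/indexes/2010-11-20"))),
                IndexReader.open(FSDirectory.open(new File("/indexes/2010-11-21")))
            };
            IndexSearcher searcher = new IndexSearcher(new MultiReader(shards));
            return searcher.search(query, 10);
        }
    }

Once the shards live on separate machines, the same merge has to happen one level up, e.g. by searching each shard remotely and combining the results.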

Re: best practice: 1.4 billion documents

2010-11-21 Thread Luca Rondanini
> On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini wrote: > Hi everybody, > I really need some good advice! I need to index in Lucene something like 1.4 billion documents. I have experience with Lucene, but I've never worked with such a big…

best practice: 1.4 billion documents

2010-11-21 Thread Luca Rondanini
Hi everybody, I really need some good advice! I need to index in Lucene something like 1.4 billion documents. I have experience with Lucene, but I've never worked with such a big number of documents. Also, this is just the number of docs at "start-up": they are going to grow, and fast. I don't have to…

Re: strange MultiFieldQueryParser error: java.lang.Integer

2007-08-03 Thread Luca Rondanini
Sometimes I feel stupid! ;) Thank you very much! Luca testn wrote: Boost must be Map<String, Float> Luca123 wrote: Hi all, I've always used the MultiFieldQueryParser class without problems, but now I'm experiencing a strange problem. This is my code: Map boost = new HashMap(); boost.put("field1",5); …

strange MultiFieldQueryParser error: java.lang.Integer

2007-08-03 Thread Luca Rondanini
Hi all, I've always used the MultiFieldQueryParser class without problems, but now I'm experiencing a strange problem. This is my code: Map boost = new HashMap(); boost.put("field1",5); boost.put("field2",1); Analyzer analyzer = new StandardAnalyzer(STOP_WORDS); String[] s_fields = new String[2…
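
For reference, a corrected version of the snippet above, sketched against the 2.x-era API the thread appears to use: as testn's reply points out, the boost values passed to MultiFieldQueryParser must be Floats, and the int literals above autobox to java.lang.Integer, which is where the error comes from. STOP_WORDS here is only a stand-in for the poster's own stop-word array, and the class and field names are illustrative.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.search.Query;

    public class BoostedMultiFieldParse {
        // Stand-in for the poster's own stop-word array.
        private static final String[] STOP_WORDS = new String[] { "the", "a" };

        public static Query parse(String userQuery) throws ParseException {
            Map<String, Float> boost = new HashMap<String, Float>();
            boost.put("field1", 5f);   // Float values, not Integer
            boost.put("field2", 1f);
            Analyzer analyzer = new StandardAnalyzer(STOP_WORDS);
            String[] fields = new String[] { "field1", "field2" };
            MultiFieldQueryParser parser = new MultiFieldQueryParser(fields, analyzer, boost);
            return parser.parse(userQuery);
        }
    }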