Re: Unable to add more than 1 document to Index

2008-04-23 Thread Hasan Diwan
Anshum: On 23/04/2008, Anshum <[EMAIL PROTECTED]> wrote: > The issue seems to be with the initialization of the index writer; try > initializing it with the last parameter as false, i.e. > *writer = new IndexWriter(indexLocation, new StandardAnalyzer(), false); writer = new IndexWriter(indexL

MergePolicy Exception

2008-04-23 Thread Jamie
Hi there, I am using the latest version of Lucene and have ten threads indexing documents. I am getting the following errors continually during the indexing process: Exception in thread "Thread-569" org.apache.lucene.index.MergePolicy$MergeException: java.io.FileNotFoundEx

Re: Unable to add more than 1 document to Index

2008-04-23 Thread Anshum
Hi Hasan, The issue seems to be with the initialization of the index writer; try initializing it with the last parameter as false, i.e. *writer = new IndexWriter(indexLocation, new StandardAnalyzer(), false); *If you initialize it with the last argument as true, it creates a new index each time
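
As a rough illustration of the point about the last constructor argument, the sketch below picks the flag at runtime: true only when no index exists yet, false otherwise, so earlier documents are kept. The class IndexCreateFlag and the method shouldCreate are hypothetical names, and checking for a file whose name starts with "segments" is an assumption about the on-disk layout, not Lucene's own check. The Lucene call appears only in a comment, copied from the thread, so the block compiles without Lucene on the classpath.

```java
import java.io.File;

class IndexCreateFlag {
    /**
     * Returns true when indexDir holds no file whose name starts with
     * "segments", i.e. when the writer would need create=true.
     * (Assumption: a segments file marks an existing index.)
     */
    static boolean shouldCreate(File indexDir) {
        String[] names = indexDir.list(); // null if missing or not a directory
        if (names == null) {
            return true;
        }
        for (String name : names) {
            if (name.startsWith("segments")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "index" is a placeholder path used only for this demonstration.
        File indexDir = new File(args.length > 0 ? args[0] : "index");
        boolean create = shouldCreate(indexDir);
        System.out.println("create=" + create);
        // With Lucene available, the flag would be passed as in the thread:
        // writer = new IndexWriter(indexLocation, new StandardAnalyzer(), create);
    }
}
```

Opening the writer with true on every request would explain why only the most recent document survives; computing the flag once, or hard-coding false after the first run, avoids that.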

Re: Binding lucene instance/threads to a particular processor(or core)

2008-04-23 Thread Anshum
Hi Glen, I am using Red Hat Enterprise Linux ES release 4, kernel: 2.6.9-55.ELsmp. It's a 32-bit dual-processor, HT-enabled machine with 12 GB of RAM. The JVM is: Java HotSpot(TM) Client VM (build 1.6.0_02-ea-b02, mixed mode). And yes, I am using a single searcher instance for all searches.'

Unable to add more than 1 document to Index

2008-04-23 Thread Hasan Diwan
writer = new IndexWriter(indexLocation, new StandardAnalyzer(), true); String string = request.getParameter("text"); this.log("Text is "+string); Date date = new Date(); String dateString = DateTools.dateToString(date,

Re: Lucene and Google Web 1T 5 Gram

2008-04-23 Thread Rafael Turk
Hi Mathieu, *What do you want to do?* A spell checker and related keyword suggestion. If you want an ngram => popularity map, just use a Berkeley DB, and use this information in your Lucene application. Lucene is a reversed index, Berkeley DB an index. *Great idea! Berkeley DB is definitely a t
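
To make the "ngram => popularity map" suggestion concrete, here is a minimal sketch that uses an in-memory HashMap in place of Berkeley DB; it is a stand-in for the idea, not the setup proposed in the thread. The class NgramCounts and its methods are hypothetical names, and the row format (n-gram, a tab, then the count) is an assumption about the corpus files.

```java
import java.util.HashMap;
import java.util.Map;

class NgramCounts {
    private final Map<String, Long> counts = new HashMap<>();

    /** Parses one row of the assumed form "w1 w2 ... wn<TAB>count". */
    void addLine(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            return; // skip rows without a tab-separated count
        }
        String ngram = line.substring(0, tab);
        long count = Long.parseLong(line.substring(tab + 1).trim());
        counts.merge(ngram, count, Long::sum); // sum counts for repeated n-grams
    }

    /** Returns the stored count, or 0 for an n-gram that was never seen. */
    long popularity(String ngram) {
        return counts.getOrDefault(ngram, 0L);
    }

    public static void main(String[] args) {
        NgramCounts map = new NgramCounts();
        map.addLine("new york\t1200");
        map.addLine("new yrok\t3");
        // A spell checker could prefer the candidate with the higher count.
        System.out.println(map.popularity("new york") > map.popularity("new yrok"));
    }
}
```

A map like this will not hold the whole corpus in memory, which is presumably why a disk-backed key/value store was suggested; the lookup pattern stays the same.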

Re: Lucene and Google Web 1T 5 Gram

2008-04-23 Thread Rafael Turk
Thanks Julien, I'll definitely give it a try!!! []s Rafael On Wed, Apr 23, 2008 at 8:38 AM, Julien Nioche < [EMAIL PROTECTED]> wrote: > Hi Raphael, > > We initially tried to do the same but ended up developing our own API for > querying the Web 1T. You can find more details on > http://digita

MoreLikeThis patch to support boost factor

2008-04-23 Thread Jonathan Ariel
This is a patch I made to be able to boost the terms with a specific factor besides the relevancy returned by MoreLikeThis. This is helpful when having more than one MoreLikeThis in the query, so words in field A (e.g. Title) can be boosted more than words in field B (e.g. Description). Any f

Re: Occasional Hang in IndexWriter.close()

2008-04-23 Thread Michael McCandless
Hi Stu, I just committed the fix for this, on 2.4 & 2.3.2. If you're able to test that this fixes your hang that'd be great. If not that's fine (I got a unit test to reproduce the issue). It's quite easy: svn checkout https://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3

Re: Binding lucene instance/threads to a particular processor(or core)

2008-04-23 Thread Glen Newton
Hi Anshum, 2008/4/23 Anshum <[EMAIL PROTECTED]>: > Hi Glen, > > As far as stats for index/search are concerned, here they are: > * Yes, it is a web based application > * I am currently facing issues when the number of concurrent searches goes > high. The search is not able to handle over 2.5 s

Re: MoreLikeThis over a subset of documents

2008-04-23 Thread Karl Wettin
Jonathan Ariel wrote: Yes, it will be too much to do in real time, but it is a good idea though. I don't know if a vector of term frequencies is stored with the document. Because I could search on the index to get the subset of documents and then take the term frequencies from there. In that case

Re: Occasional Hang in IndexWriter.close()

2008-04-23 Thread Michael McCandless
Stu Hood wrote: Thank you very much for looking into this issue! You're welcome! Thank you for catching it & reporting it. I originally switched to the SerialMergeScheduler to try and work around this bug: http://lucene.markmail.org/message/awkkunr7j24nh4qj . I switched back to the Conc

Re: Lucene and Google Web 1T 5 Gram

2008-04-23 Thread Mathieu Lecarme
Rafael Turk wrote: Hi Folks, I'm trying to load Google Web 1T 5 Gram to Lucene. (This corpus contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams.) I'm loading each n-gram (each row is an n-gram) as an

Re: MoreLikeThis over a subset of documents

2008-04-23 Thread Jonathan Ariel
Yes, it will be too much to do in real time, but it is a good idea though. I don't know if a vector of term frequencies is stored with the document. Because I could search on the index to get the subset of documents and then take the term frequencies from there. In that case I could change MoreLike

Re: Lucene and Google Web 1T 5 Gram

2008-04-23 Thread Julien Nioche
Hi Raphael, We initially tried to do the same but ended up developing our own API for querying the Web 1T. You can find more details on http://digitalpebble.com/resources.html There could be a way to reuse elements from Lucene e.g. the Term index only but I could not find an obvious way to achieve

A problem about additonal info(after some modification for lucene)

2008-04-23 Thread 王建新
I modified some of Lucene's code to allow a new usage like: doc=new Document(); byte[] additionalInfo=new byte[]{'x','x','x'}; doc.add(new Field("field1","aa aa",Field.Store.YES,Field.Index.TOKENIZED,Field.TermVector.NO,additionalInfo)); I changed the *.frp file as: if

Lucene and Google Web 1T 5 Gram

2008-04-23 Thread Rafael Turk
Hi Folks, I'm trying to load Google Web 1T 5 Gram to Lucene. (This corpus contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams.) I'm loading each n-gram (each row is an n-gram) as an individual Document. Th

Re: MoreLikeThis over a subset of documents

2008-04-23 Thread Karl Wettin
Jonathan Ariel wrote: Smart idea, but it won't help me. I have almost 50 categories and eventually I would like to "filter" not just on category but maybe also on language, etc. Karl: what do you mean by measure the distance between the term vectors and cluster them in real time? I mean exactly

Re: Binding lucene instance/threads to a particular processor(or core)

2008-04-23 Thread Toke Eskildsen
On Tue, 2008-04-22 at 09:40 +0530, Anshum wrote: > Any other suggestions for handling a concurrency of over 7 search requests > per second for an index size of over 15Gigs containing over 13 million > records? Our index is 30GB+ with 9 million records and a machine handles an average search in abo