Re: Using multiple drives and non-CFS format to improve search performance

2010-08-26 Thread Sanne Grinovero
Hi Stefan, you might want to consider org.apache.lucene.store.FileSwitchDirectory before going for the symlinks approach. Sorry, I don't know the effect nor the recommended file types; I would naively start by putting the smallest files on the SSD, then perform tests, but that's possibly not the best scenario: under
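A minimal sketch of the FileSwitchDirectory approach Sanne mentions, against the 3.0-era API; the directory paths and the choice of extensions (.tis/.tii term-dictionary files on the SSD) are illustrative assumptions, not a recommendation:

```java
import java.io.File;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.FileSwitchDirectory;

public class SwitchDirDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical paths: "ssd-index" would live on the solid-state drive,
        // "hdd-index" on the spinning disk.
        Directory ssd = FSDirectory.open(new File("ssd-index"));
        Directory hdd = FSDirectory.open(new File("hdd-index"));

        // Files whose extensions are in this set go to the primary (SSD)
        // directory; everything else goes to the secondary (HDD) directory.
        Set<String> primaryExtensions = new HashSet<String>();
        primaryExtensions.add("tis");  // term dictionary
        primaryExtensions.add("tii");  // term index

        Directory dir = new FileSwitchDirectory(primaryExtensions, ssd, hdd, true);
        // Pass `dir` to IndexWriter / IndexSearcher as usual; closing it
        // closes both underlying directories (the `true` flag above).
        dir.close();
    }
}
```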

Re: instantiated contrib

2010-08-26 Thread Li Li
" It is strange that it should take 20 second to gather fields," 20s including search and gather fields, it's the total time 2010/8/27 Karl Wettin : > My mail client died while sending this mail.. Sorry for any duplicate. > > It is strange that it should take 20 second to gather fields, this is th

Bettering search performance

2010-08-26 Thread Shelly_Singh
Hi, I have a Lucene index of 100 million documents, but the document size is very small - 5 fields with 1 or 2 terms each. Only 1 field is analyzed; the others are simply indexed. The index is optimized to 2 segments and the total index size is 7GB. I open a searcher with a termsInfoDiviso

Re: instantiated contrib

2010-08-26 Thread Li Li
If I index only 7k documents, the time comparison is: time1: 7602331019 time2: 4246878035 total1: 10736 total2: 7393. It seems InstantiatedIndex is faster than RAMDirectory. My indexed texts are all hotel names (Chinese and English, a little French). It has about 100k terms. Terms such as "hotel" are very frequent and ho

Re: Solr SynonymFilter in Lucene analyzer

2010-08-26 Thread Arun Rangarajan
Thanks, Lance. After exploring for a while, I used Lucene's ShingleFilter followed by the SynonymFilter from the Lucene in Action book. Then, using the type attribute, I removed all the shingles which did not belong to any category. On Wed, Aug 18, 2010 at 10:28 PM, Lance Norskog wrote: > Yes, you need

Using multiple drives and non-CFS format to improve search performance

2010-08-26 Thread Stefan Nikolic
Hi everyone, I'm trying to figure out the effects on search performance of using the non-CFS format and spreading the various underlying files to different disks/media types. For example, I'm considering moving a segment's various .t* term-related files onto a solid-state drive, the .fdx/.fdt stor
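For reference, switching a writer to the non-CFS format Stefan describes, as a sketch against the 3.0-era API (where IndexWriter still exposed setUseCompoundFile directly); the index path is hypothetical:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class NonCfsDemo {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("nocfs-index")),   // hypothetical path
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        // Write the individual .tis/.tii/.fdt/.fdx/... files instead of a
        // single .cfs compound file, so they can be placed on different media
        // (e.g. via symlinks or FileSwitchDirectory).
        writer.setUseCompoundFile(false);
        writer.close();
    }
}
```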

Re: Batch Operation and Commit

2010-08-26 Thread Amin Mohammed-Coleman
Hi Erick, Thanks for your response. I used the Lucene in Action 1st edition as a reference for batch indexing. I've just got my copy of the 2nd edition, which mentions that there is no point in using a RAMDirectory. Not saying I don't trust you :). I'll update my code to use the normal fs direc

Re: Batch Operation and Commit

2010-08-26 Thread Erick Erickson
I'm going to sidestep your question and ask why you're using a RAMDirectory in the first place. People often think it'll speed up their indexing because it's in RAM, but normal FS-based indexing caches in RAM too, and you can use various settings governing segments, RAM usage, etc. to control how
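Erick's suggestion (a plain FSDirectory plus writer tuning instead of a RAMDirectory staging step) can be sketched like this against the 3.0-era API; the path and the 64 MB / merge-factor values are illustrative assumptions:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class FsIndexingDemo {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("fs-index")),      // hypothetical path
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        // Buffer added documents in RAM and flush a new segment only when the
        // buffer reaches ~64 MB; this gives the in-memory speedup without a
        // separate RAMDirectory.
        writer.setRAMBufferSizeMB(64.0);
        // A higher merge factor means fewer, larger merges during bulk indexing.
        writer.setMergeFactor(20);
        writer.close();
    }
}
```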

Batch Operation and Commit

2010-08-26 Thread Amin Mohammed-Coleman
Hi, I have a list of batch tasks that need to be executed. Each batch contains 1000 documents. Basically, I use a RAMDirectory-based index writer, and at the end of adding 1000 documents to memory I perform the following: ramWriter.commit(); indexWriter.addIndexesNoOptimize(ramW
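The pattern Amin describes, filled out as a runnable 3.0-era sketch; the paths, field name, and dummy documents are invented for illustration:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class BatchDemo {
    public static void main(String[] args) throws Exception {
        IndexWriter indexWriter = new IndexWriter(
                FSDirectory.open(new File("disk-index")),    // hypothetical path
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // Stage one batch of 1000 documents in RAM.
        Directory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < 1000; i++) {
            Document doc = new Document();
            doc.add(new Field("body", "document " + i,
                    Field.Store.YES, Field.Index.ANALYZED));
            ramWriter.addDocument(doc);
        }
        ramWriter.commit();
        ramWriter.close();

        // Merge the in-memory batch into the on-disk index without
        // forcing an optimize.
        indexWriter.addIndexesNoOptimize(ramDir);
        indexWriter.commit();
        indexWriter.close();
    }
}
```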

Re: instantiated contrib

2010-08-26 Thread Karl Wettin
My mail client died while sending this mail.. Sorry for any duplicate. It is strange that it should take 20 seconds to gather fields; this is the only thing that really surprises me. I'd expect it to be instant compared to RAMDirectory. It is hard to say from the information you provided. Did

Span Query/Slop distance

2010-08-26 Thread Shashi Kant
Hello, I am familiar with the SpanQuery construct and setting an upper slop limit. 1. When I get the hit results, is there any way I can access the actual slop and the span text itself in that particular hit? 2. Also, it is possible to have multiple matches within the same document. So how do I acc
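On question 2: the Spans enumerator returns every matching span, so one document can come back several times, once per match. A sketch against the 3.0-era spans API; the field name and terms are made up, and `reader` is assumed to be an open IndexReader:

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class SpanDemo {
    // Enumerates every matching span; the actual "slop" of a hit can be
    // derived from end() - start() minus the number of terms in the clause.
    public static void printSpans(IndexReader reader) throws IOException {
        SpanNearQuery q = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("body", "quick")),
                new SpanTermQuery(new Term("body", "fox"))
        }, 3, true); // slop <= 3, in order

        Spans spans = q.getSpans(reader);
        while (spans.next()) {
            System.out.println("doc=" + spans.doc()
                    + " start=" + spans.start()  // position of the first term
                    + " end=" + spans.end());    // one past the last term's position
        }
    }
}
```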

Re: lucene scanning

2010-08-26 Thread Erick Erickson
Why do you care? By that I mean that nothing you've written gives us any clue whether you need to do anything about making things faster. "Making things faster" is a laudable goal, but not worth worrying about until you can confidently state you have performance issues. And you've provided no deta

Largest Lucene installation?

2010-08-26 Thread Nigel
I'm curious about what the largest Lucene installations are, in terms of: - Greatest number of documents (i.e. X billion docs) - Largest data size (i.e. Y terabytes of indexes) - Most machines (i.e. Z shards or servers) Apart from general curiosity, the obvious follow-up question would be what app

Re: Calculate Term Co-occurrence Matrix

2010-08-26 Thread Aida Hota
ok, thank you Ivan!! On Tue, Aug 24, 2010 at 5:13 PM, Ivan Provalov wrote: > Aida, > > Right now it will do two term collocation only. > > Ivan > > > --- On Mon, 8/23/10, Aida Hota wrote: > > > From: Aida Hota > > Subject: Re: Calculate Term Co-occurrence Matrix > > To: java-user@lucene.apache

lucene scanning

2010-08-26 Thread suman.holani
hi, 1. Will any search query scan all documents in the Lucene indexes? 2. I want search queries to be faster, so I thought that if I could reduce the number of docs Lucene needs to search through, given some search parameters, it would act a little faster. Can we make subset (subindexe

instantiated contrib

2010-08-26 Thread Li Li
I have about 70k documents; the total indexed size is about 15MB (the original text files' size). dir=new RAMDirectory(); IndexWriter writer=new IndexWriter(dir,...; for(loop){ writer.addDocument(doc); } writer
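Li Li's fragment, completed as a self-contained 3.0-era sketch; the field name and the synthetic hotel-name documents are placeholders:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class RamIndexDemo {
    public static RAMDirectory build() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        // ~70k small docs (hotel names) easily fit in RAM at ~15 MB of source text.
        for (int i = 0; i < 70000; i++) {
            Document doc = new Document();
            doc.add(new Field("name", "hotel " + i,
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        writer.close(); // commits and makes the index searchable
        return dir;
    }

    public static void main(String[] args) throws Exception {
        build();
    }
}
```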