Exclusion search

2009-07-21 Thread ba3
Hi, In the documents which contain the volunteer information:
Doc1: volunteer krish volunteer john volunteer Raj ...
Doc2: volunteer krish volunteer Raj volunteer Ganesh
Doc3: volunteer krish volunteer Raj
The documents having ONLY krish and Raj as the volunteers need to be found. As in a
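The "ONLY krish and Raj" requirement can be read as an exact set match: a document qualifies when its set of volunteers equals the requested set, not merely contains it. The following is a minimal sketch of that idea in plain Java, without Lucene. The class name `ExactVolunteerMatch`, the method `docsWithOnly`, and the third volunteer list for Doc1 (which is elided in the post) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ExactVolunteerMatch {
    // Returns the ids of documents whose volunteer set equals `wanted` exactly.
    // A document with any extra volunteer is excluded.
    static List<String> docsWithOnly(Map<String, Set<String>> volunteersByDoc,
                                     Set<String> wanted) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : volunteersByDoc.entrySet()) {
            if (e.getValue().equals(wanted)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical data modelled on the three documents in the post.
        Map<String, Set<String>> docs = new LinkedHashMap<>();
        docs.put("Doc1", new HashSet<>(Arrays.asList("krish", "john", "Raj")));
        docs.put("Doc2", new HashSet<>(Arrays.asList("krish", "Raj", "Ganesh")));
        docs.put("Doc3", new HashSet<>(Arrays.asList("krish", "Raj")));

        Set<String> wanted = new HashSet<>(Arrays.asList("krish", "Raj"));
        System.out.println(docsWithOnly(docs, wanted)); // prints [Doc3]
    }
}
```

In an index, one possible equivalent (again an assumption, not something stated in the thread) is to store the number of volunteers as an extra field at indexing time, then require both names and a count of 2.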

indexing 100GB of data

2009-07-21 Thread m.harig
Hello all, We have 100GB of data in doc, txt, pdf, ppt, and other formats, with a separate parser for each file format, so we are going to index the data with Lucene. (We were put off by the Nutch setup, which is why we didn't use it.) My question is: will it be scalable when I index those documents?

Re: PageRanking with Lucene

2009-07-21 Thread Grant Ingersoll
I'd probably look at the function package in Lucene. While the document boost can be used, it may not give you the granularity you need, as you only have something like 6 bits of representation. Some people have also done some things like a field with a single token that contains a payloa

Analysis Question

2009-07-21 Thread Christopher Condit
I'm trying to implement an analyzer that will compute a score based on vocabulary terms in the indexed content (i.e., a document field with more terms in the vocabulary will score higher). Although I can see the tokens, I can't seem to access the document from the analyzer to set a new field on it a

Strange(?) behaviour using MultiFieldQueryParser

2009-07-21 Thread Philip Puffinburger
We have code (using Lucene 2.4.1) that will build a query that looks like: fielda:"ruz an"~2 OR fieldb:"ruz an"~2 OR fieldc:"ruz an"~2 When passed to a MultiFieldQueryParser and parsed it comes back looking like: fielda:"ruz an"~2 fieldb:"ruz an"~2 fieldc:ruz It seems that whenever

Re: Sorting field containing NULL values consumes field cache memory

2009-07-21 Thread Shai Erera
FWIW, I had implemented a sort-by-payload feature which performs quite well. It has a very small memory footprint (actually close to 0), and reads values from a payload. Payloads, at least from my experience, perform better than stored fields. On a comparison I've once made, the sort-by-payload fe

Re: Sorting field containing NULL values consumes field cache memory

2009-07-21 Thread Chris Hostetter
: Right now, you can't really do anything about it. In the future, with the : new FieldCache API that may go in, you could plug in a custom implementation : that makes tradeoffs for a sparse array of some kind. The docid is currently : the index into the array, but with a custom impl you may be ab

Re: Setting Boost values

2009-07-21 Thread AHMET ARSLAN
> We have indexed various field related information, such as
> Title, Body, Meta text, H1, URL etc.
> What should be the values for these fields?
The boost value is multiplied into the score. In other words, it is a multiplication factor in the score calculation.
> Should they be relative?
Yes.
> Are
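To illustrate why boosts are relative, here is a small sketch in plain Java. It assumes a deliberately simplified scoring model (score = sum over fields of boost times a raw per-field score), which is not Lucene's actual formula. The class `RelativeBoosts`, its methods, and all numbers are hypothetical. Under this model, doubling every boost leaves the ranking unchanged, while changing the ratio between boosts can flip it.

```java
class RelativeBoosts {
    // Simplified model (an assumption, not Lucene's real scoring):
    // score = sum over fields of boost * raw field score.
    static double score(double[] boosts, double[] rawFieldScores) {
        double total = 0.0;
        for (int i = 0; i < boosts.length; i++) {
            total += boosts[i] * rawFieldScores[i];
        }
        return total;
    }

    // True when document a outranks document b under the given boosts.
    static boolean outranks(double[] boosts, double[] a, double[] b) {
        return score(boosts, a) > score(boosts, b);
    }

    public static void main(String[] args) {
        // Hypothetical field order: Title, Body, H1.
        double[] boosts = {4.0, 1.0, 2.0};
        double[] doubled = {8.0, 2.0, 4.0};  // same ratios, every boost doubled
        double[] flatter = {1.0, 1.0, 2.0};  // Title no longer favoured

        double[] docA = {0.5, 0.25, 0.0};    // matches mostly in Title
        double[] docB = {0.0, 0.75, 0.5};    // matches in Body and H1

        System.out.println(outranks(boosts, docA, docB));   // true  (2.25 vs 1.75)
        System.out.println(outranks(doubled, docA, docB));  // true  (4.5 vs 3.5)
        System.out.println(outranks(flatter, docA, docB));  // false (0.75 vs 1.75)
    }
}
```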

Re: Alternative way to simulate sorting without doing actual sort

2009-07-21 Thread Erick Erickson
Have you tried splitting your times into separate fields, perhaps one with YYYYMMDD and another with HHMM, then doing a primary sort on YYYYMMDD and a secondary sort on HHMM? That'll reduce your total unique values greatly and should improve your memory consumption. Best Erick On Tue, Jul 21, 2009 at 4:2
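The effect of splitting the timestamp can be checked with a rough count of unique values. The sketch below is plain Java and does not use Lucene. The class `SplitSortKeys`, the method `uniqueCounts`, and the sample date range are hypothetical. It compares the number of distinct combined minute-resolution keys with the number of distinct day keys and time-of-day keys over the same timestamps.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.HashSet;
import java.util.Set;

class SplitSortKeys {
    static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");
    static final DateTimeFormatter MINUTE = DateTimeFormatter.ofPattern("HHmm");

    // Walks timestamps from `start` for `days` days in steps of `stepMinutes`
    // and returns { unique combined keys, unique day keys, unique HHmm keys }.
    static int[] uniqueCounts(LocalDateTime start, int days, int stepMinutes) {
        Set<String> combined = new HashSet<>();
        Set<String> dayKeys = new HashSet<>();
        Set<String> minuteKeys = new HashSet<>();
        LocalDateTime end = start.plusDays(days);
        for (LocalDateTime t = start; t.isBefore(end); t = t.plusMinutes(stepMinutes)) {
            String d = DAY.format(t);
            String m = MINUTE.format(t);
            combined.add(d + m);  // single minute-resolution sort field
            dayKeys.add(d);       // primary sort field
            minuteKeys.add(m);    // secondary sort field
        }
        return new int[] { combined.size(), dayKeys.size(), minuteKeys.size() };
    }

    public static void main(String[] args) {
        // Hypothetical sample: one record per minute for 365 days.
        int[] c = uniqueCounts(LocalDateTime.of(2009, 1, 1, 0, 0), 365, 1);
        System.out.println("combined unique values: " + c[0]);
        System.out.println("day + minute unique values: " + (c[1] + c[2]));
    }
}
```

The time-of-day field can never hold more than 1440 distinct values, so the split fields grow only with the number of days covered, whereas the combined field grows with every distinct minute.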

Re: Range query and a proximity search

2009-07-21 Thread ba3
Excellent !! Thanks for pointing me towards the ComplexPhraseQueryParser. --Regards Ba3 Ahmet Arslan wrote: > > >> Can you please suggest me some pointers as to how a range >> query combined with proximity be done. > > Your remedy is ComplexPhraseQueryParser that utilizes SpanQuery family. >

Setting Boost values

2009-07-21 Thread Kushal Dave
Hi, We are implementing a search engine for a huge dataset (approximately 50 million html pages). We have indexed various field related information, such as Title, Body , Meta text, H1, URL etc. Lucene provides the setBoost() function to give weightage to these fields. What should be the values f

Alternative way to simulate sorting without doing actual sort

2009-07-21 Thread Ganesh
Hello all, I am sorting on datetime with minute resolution. It easily reaches the maximum heap size. I have almost 100M records and it is using 1.5 GB. I am now in a situation where I have to stop sorting and find some alternative. I tried adding document boost and field boost for date t

Re: Range query and a proximity search

2009-07-21 Thread AHMET ARSLAN
> Can you please suggest me some pointers as to how a range > query combined with proximity be done. Your remedy is ComplexPhraseQueryParser that utilizes SpanQuery family. https://issues.apache.org/jira/browse/LUCENE-1486 That accepts ranges, ORs, Wildcards inside Phrase queries. Using this new

Range query and a proximity search

2009-07-21 Thread ba3
Hi, I have around 100 documents which have undergone revisions. I want to find the documents which have undergone more than 40 revisions. The documents are all text based and the first few lines in the document contain the revision details. For eg: revision 35 This is a document regardin
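One alternative to combining a range with proximity at query time is to pull the revision number out of the text up front, so that "more than 40 revisions" becomes a plain numeric comparison. This is a different technique from the ComplexPhraseQueryParser suggested in the replies, sketched here in plain Java without Lucene. It assumes the header looks like `revision 35`, as in the example from the post. The class `RevisionFilter` and its method names are hypothetical.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RevisionFilter {
    // Assumed header format: the word "revision" followed by a number.
    static final Pattern REVISION =
            Pattern.compile("\\brevision\\s+(\\d{1,9})", Pattern.CASE_INSENSITIVE);

    // Returns the first revision number found in the text, if any.
    static OptionalInt revisionOf(String text) {
        Matcher m = REVISION.matcher(text);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1)))
                        : OptionalInt.empty();
    }

    // True when the text carries a revision number strictly above the threshold.
    static boolean hasMoreRevisionsThan(String text, int threshold) {
        OptionalInt revision = revisionOf(text);
        return revision.isPresent() && revision.getAsInt() > threshold;
    }

    public static void main(String[] args) {
        System.out.println(hasMoreRevisionsThan("revision 35\nSome text", 40)); // false
        System.out.println(hasMoreRevisionsThan("revision 41\nSome text", 40)); // true
    }
}
```

The extracted number could then be stored in its own field at indexing time, which is an assumption about how one might wire it up rather than something described in the thread.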