Hi,
I have documents that contain volunteer information:
Doc1 :
volunteer krish
volunteer john
volunteer Raj
...
Doc2 :
volunteer krish
volunteer Raj
volunteer Ganesh
Doc3 :
volunteer krish
volunteer Raj
I need to find the documents that have ONLY krish and Raj as volunteers.
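The requirement can be read as a set-equality test: a document matches only when its set of volunteer names is exactly {krish, Raj}. The following is a minimal sketch in plain Python, not Lucene code. The helper names `volunteers_of` and `docs_with_only` are hypothetical, and it assumes every relevant line has the form `volunteer <name>`.

```python
def volunteers_of(text):
    """Collect the names from lines of the form 'volunteer <name>'."""
    names = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "volunteer":
            names.add(parts[1])
    return names


def docs_with_only(docs, wanted):
    """Return the ids of documents whose volunteer set equals `wanted` exactly."""
    wanted = set(wanted)
    return [doc_id for doc_id, text in docs.items()
            if volunteers_of(text) == wanted]


# Sample data mirroring the three documents in the question
# (the entries elided from Doc1 are simply left out here).
docs = {
    "Doc1": "volunteer krish\nvolunteer john\nvolunteer Raj",
    "Doc2": "volunteer krish\nvolunteer Raj\nvolunteer Ganesh",
    "Doc3": "volunteer krish\nvolunteer Raj",
}

matches = docs_with_only(docs, {"krish", "Raj"})
print(matches)
```

Inside an index, one possible way to get a comparable effect (an assumption, not something stated in this thread) is to store the number of volunteers per document and require both names plus a count of 2.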
As in a
Hello all,
We've got 100GB of data in doc, txt, pdf, ppt, and other formats. We have a
separate parser for each file format, and we're going to index the data with
Lucene. (We didn't use Nutch because we were scared of its setup.) My
question is: will it scale when I index those documents?
I'd probably look at the function package in Lucene. While the
document boost can be used, it may not give you the granularity you
need, as you only have something like 6 bits of representation. Some
people have also done some things like a field with a single token
that contains a payloa
I'm trying to implement an analyzer that will compute a score based on
vocabulary terms in the indexed content (i.e. a document field with more terms in
the vocabulary will score higher). Although I can see the tokens, I can't seem
to access the document from the analyzer to set a new field on it a
We have code (using Lucene 2.4.1) that will build a query that looks like:
fielda:"ruz an"~2 OR fieldb:"ruz an"~2 OR fieldc:"ruz an"~2
When passed to a MultiFieldQueryParser and parsed it comes back looking
like:
fielda:"ruz an"~2 fieldb:"ruz an"~2 fieldc:ruz
It seems that whenever
FWIW, I had implemented a sort-by-payload feature which performs quite well.
It has a very small memory footprint (actually close to 0), and reads values
from a payload. Payloads, at least from my experience, perform better than
stored fields.
On a comparison I've once made, the sort-by-payload fe
: Right now, you can't really do anything about it. In the future, with the
: new FieldCache API that may go in, you could plug in a custom implementation
: that makes tradeoffs for a sparse array of some kind. The docid is currently
: the index into the array, but with a custom impl you may be ab
> We have indexed various field related information, such as
> Title, Body, Meta text, H1, URL etc.
> What should be the values for these fields?
The boost value is multiplied into the score; in other words, it is a
multiplication factor in the score calculation.
> Should they be relative?
Yes.
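To illustrate why only the relative values matter, here is a toy model in plain Python. It is a deliberate simplification and not Lucene's actual scoring formula: each field's raw score is just multiplied by that field's boost and summed. The field scores, boost values, and function names are all hypothetical.

```python
def score(field_scores, boosts):
    """Toy model: each field's raw score is multiplied by that field's boost."""
    return sum(boosts[f] * s for f, s in field_scores.items())


def rank(docs, boosts):
    """Order document ids by descending boosted score."""
    return sorted(docs, key=lambda d: score(docs[d], boosts), reverse=True)


# Hypothetical raw per-field scores for three pages.
docs = {
    "page_a": {"title": 0.9, "body": 0.1},
    "page_b": {"title": 0.2, "body": 0.8},
    "page_c": {"title": 0.5, "body": 0.5},
}

boosts = {"title": 4.0, "body": 1.0}
# Same ratios between fields, larger absolute numbers.
scaled = {f: 10 * b for f, b in boosts.items()}

ranking = rank(docs, boosts)
ranking_scaled = rank(docs, scaled)
print(ranking, ranking_scaled)
```

In this model, multiplying every boost by the same constant multiplies every score by that constant, so the ordering of results does not change; only the ratio between field boosts affects the ranking.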
> Are
Have you tried splitting your times into separate fields, perhaps one with
YYYYMMDD and another with HHMM, then doing a primary sort on the YYYYMMDD and
a secondary sort on HHMM? That'll reduce your total unique values greatly and
should improve your memory consumption.
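The effect on the number of unique values can be checked with a small sketch. This is plain Python on synthetic data, not Lucene code; the one-record-per-minute data set and the `split_fields` helper are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def split_fields(ts):
    """Split a timestamp into a date key (YYYYMMDD) and a time key (HHMM)."""
    return ts.strftime("%Y%m%d"), ts.strftime("%H%M")


# One synthetic record per minute for 30 days (hypothetical data).
start = datetime(2009, 7, 1)
stamps = [start + timedelta(minutes=i) for i in range(30 * 24 * 60)]

# Unique values for a single minute-resolution field versus the two split fields.
combined = {ts.strftime("%Y%m%d%H%M") for ts in stamps}
dates = {split_fields(ts)[0] for ts in stamps}
times = {split_fields(ts)[1] for ts in stamps}

# Sorting on (date, time) gives the same order as sorting on the combined value.
by_combined = sorted(reversed(stamps), key=lambda ts: ts.strftime("%Y%m%d%H%M"))
by_split = sorted(reversed(stamps), key=split_fields)

print(len(combined), len(dates), len(times))
```

For this data the single field has 43,200 distinct values, while the split fields have 30 and 1,440. The time-of-day field can never exceed 1,440 distinct values, and the date field grows by only one value per day.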
Best
Erick
On Tue, Jul 21, 2009 at 4:2
Excellent !!
Thanks for pointing me towards the ComplexPhraseQueryParser.
--Regards
Ba3
Ahmet Arslan wrote:
>
>
>> Can you please suggest me some pointers as to how a range
>> query combined with proximity be done.
>
> Your remedy is ComplexPhraseQueryParser that utilizes SpanQuery family.
>
Hi,
We are implementing a search engine for a huge dataset (approximately 50
million html pages).
We have indexed various field related information, such as Title, Body,
Meta text, H1, URL etc.
Lucene provides the setBoost() function to give weightage to these fields.
What should be the values f
Hello all
I am sorting on datetime with minute resolution. It easily reaches the maximum
heap size. I have almost 100M records and it is using 1.5 GB. I am now in
a situation where I have to stop sorting and find some alternative.
I tried adding document boost and field boost for date t
> Can you please suggest me some pointers as to how a range
> query combined with proximity be done.
Your remedy is ComplexPhraseQueryParser that utilizes SpanQuery family.
https://issues.apache.org/jira/browse/LUCENE-1486
It accepts ranges, ORs, and wildcards inside phrase queries.
Using this new
Hi,
I have around 100 documents that have undergone revisions. I want to find
the documents that have undergone more than 40 revisions. The documents
are all text based and the first few lines in each document contain the
revision details. For example:
revision 35
This is a document regardin
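
With only about 100 text documents, one straightforward option is to read the revision number from the header lines and compare it with the threshold. The sketch below is plain Python rather than Lucene code. It assumes the header line looks exactly like `revision <number>` and appears within the first five lines; the file names, sample texts, and function names are hypothetical.

```python
import re

# Matches a header line such as "revision 35" (case-insensitive).
REVISION_RE = re.compile(r"^\s*revision\s+(\d+)\s*$", re.IGNORECASE)


def revision_of(text, max_lines=5):
    """Return the revision number found in the first few lines, or None."""
    for line in text.splitlines()[:max_lines]:
        match = REVISION_RE.match(line)
        if match:
            return int(match.group(1))
    return None


def heavily_revised(docs, threshold=40):
    """Return the names of documents whose revision number exceeds `threshold`."""
    result = []
    for name, text in docs.items():
        revision = revision_of(text)
        if revision is not None and revision > threshold:
            result.append(name)
    return sorted(result)


# Hypothetical sample documents.
docs = {
    "a.txt": "revision 35\nSome body text.",
    "b.txt": "revision 41\nSome other body text.",
    "c.txt": "No revision header here.",
}

print(heavily_revised(docs))
```

If the documents are indexed instead, the same parsed number could be stored as a numeric field at index time so that the "more than 40" condition becomes a range query; that is a design suggestion, not something stated in the original message.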