Thanks to all for the replies.
I thought of a mechanism to achieve the results without reindexing or
updating the documents.
search1 = boolean query of (vol krish + vol Raj)
search2 = boolean query(vol - (vol krish and vol Raj))
Removing the results of search2 from search1 gave the desired results.
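As a sketch, the two-pass set difference described above can be folded into a single query string, since removing search2's hits from search1's hits is equivalent to requiring search1 AND NOT search2 (the helper and the exact query strings here are illustrative, not the poster's actual code):

```java
// Combine two queries so that hits of the second are excluded from
// hits of the first, using Lucene query-parser syntax.
public class QueryDifference {
    static String difference(String search1, String search2) {
        return "(" + search1 + ") AND NOT (" + search2 + ")";
    }

    public static void main(String[] args) {
        // Approximations of the poster's two searches:
        String search1 = "(vol AND krish) OR (vol AND raj)";
        String search2 = "vol AND NOT (krish AND raj)";
        System.out.println(difference(search1, search2));
    }
}
```

Running one combined query avoids materializing and diffing two result sets on the client side.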
Folks,
Thank you so much for your reply. We will share this with management.
-Pedro
-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thu 7/23/2009 4:41 PM
To: java-user@lucene.apache.org
Subject: Re: A question about the relevancy
Also, see http://wiki.apache.org/lucene-java/ScoresAsPercentages. The
relevancy here is that comparing scores across different queries is fairly
meaningless, even if you *do* know how that score was arrived at...
Best
Erick
On Thu, Jul 23, 2009 at 6:17 PM, Otis Gospodnetic <
otis_gospodne...@yaho
Hi Pedro,
Lucene's Explanation will show you all the juicy details:
http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Scorer.html#explain(int)
But with a query like that, I'm not sure if you'll be able to follow
everything. Maybe pick a super simple pair of queries instead,
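A minimal sketch of pulling the Explanation for the top hits, using the Lucene 2.4-era API Otis links to (the open IndexSearcher and the Query are assumed to come from elsewhere):

```java
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class ExplainHits {
    // Print the full scoring breakdown for each of the top 10 hits.
    static void explainTopHits(IndexSearcher searcher, Query query)
            throws Exception {
        TopDocs top = searcher.search(query, null, 10);
        for (ScoreDoc sd : top.scoreDocs) {
            Explanation exp = searcher.explain(query, sd.doc);
            System.out.println(exp.toString());
        }
    }
}
```

With a super-simple pair of queries, as Otis suggests, the two explain outputs are short enough to compare term by term.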
Hi there,
I have a question… we have two queries whose only difference is that
Query_1 includes phrase queries whereas Query_2 has the same phrase queries
converted into Boolean queries.
When each query is executed, Query_1 gives a relevancy of 1.0 and Query_2 gives
one of 0.34. The questio
> Couldn't you maybe get the same effect using some clever term boosting?
>
> I.. think something like
>
> "Term 1" OR "Term 2" OR "Term 3"^0.25
>
> would return in almost the exact order that you are asking for here, with
> the only real difference being that you would have some matches for only
Hi,
Thanks Shai and Mike for your suggestions. I went with Shai's second
approach. However, I'm confronted with this now:
After deleting that document from the index, I also delete it from a
copy of the directory that contained the original documents. With
this, I expected that both the directory
Looking at what you wrote:
I am doing a weighting system where I rank documents that have Term 1 AND
Term 2 AND Term 3 more highly than documents that have just Term 1 AND Term
2, and more highly than documents that just have Term 1 OR Term 2 but not
both.
Couldn't you maybe get the same effect
> do a search on "Term 1" AND "Term 2"
> do a search on "Term 1" AND "Term 2" AND "Term 3"
>
> This would ensure that you have two objects back, one of which is
> guaranteed to be a subset of the other.
I did start doing this after sending the email. My only concern is search
speed. Right now I
> What do you mean by "first"? Would you want to process a doc that did NOT
> have a "Term 3"?
>
> Let's say you have the following:
> doc1: "Term 1"
> doc2: "Term 2"
> doc3: "Term 1" "Term 2"
> doc4: "Term 3"
> doc5: "Term 1" "Term 2" "Term 3"
> doc6: "Term 2" "Term 3"
>
> Which docs do you want to
Erm.. I have to be missing something here; wouldn't you be able to just do
the following:
do a search on "Term 1" AND "Term 2"
do a search on "Term 1" AND "Term 2" AND "Term 3"
This would ensure that you have two objects back, one of which is
guaranteed to be a subset of the other.
Then, when yo
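The two-pass approach above can be sketched as query strings in Lucene's query-parser syntax; because the second query only adds a required clause, its hits are guaranteed to be a subset of the first query's hits (the term names are the thread's placeholders):

```java
public class RefinedQueries {
    // Broad first pass: docs containing both required phrases.
    static String baseQuery() {
        return "\"Term 1\" AND \"Term 2\"";
    }

    // Second pass: the same query with one more required phrase, so
    // every hit here is also a hit of baseQuery().
    static String refinedQuery() {
        return baseQuery() + " AND \"Term 3\"";
    }

    public static void main(String[] args) {
        System.out.println(baseQuery());
        System.out.println(refinedQuery());
    }
}
```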
What do you mean by "first"? Would you want to process a doc that did NOT
have a "Term 3"?
Let's say you have the following:
doc1: "Term 1"
doc2: "Term 2"
doc3: "Term 1" "Term 2"
doc4: "Term 3"
doc5: "Term 1" "Term 2" "Term 3"
doc6: "Term 2" "Term 3"
Which docs do you want to get from your search?
Hi,
I am doing a search on my index for a query like this:
query = "\"Term 1\" \"Term 2\" \"Term 3\""
Where I want to find Term 1, Term 2 and Term 3 in the index. However, I
only want to search for "Term 3" if I find "Term 1" and "Term 2" first, to
avoid doing processing on hits that only contai
I do not know much about RAM FS, but I know for sure if you have enough memory
for RAMDirectory, you should go for it. That gives you the fastest and the most
stable performance, no OS swaps, no sudden performance drops... Uwe's tip is
very good, if you/OS occasionally need RAM for other things
I haven't verified this myself, but I remember talking to somebody who tried
MMapDirectory and compared it to simply using tmpfs (RAM FS). The result was
that MMapDirectory had some memory overhead, so putting the index on tmpfs was
more memory-efficient. I guess this person had read-only indices.
On Jul 22, 2009, at 6:30 AM, prashant ullegaddi wrote:
Is it that the boost of a Document is stored in 6 bits?
Kind of, the boost is stored in the norm, which also includes other
factors like length normalization. There is one byte for all of those
factors, whereas w/ the function approach,
Thank you both.
> Date: Thu, 23 Jul 2009 11:55:58 -0400
> Subject: Re: Loading an index into memory
> From: erickerick...@gmail.com
> To: java-user@lucene.apache.org
>
What are you trying to accomplish? I'd ensure that my performance was a
> problem before doing anything. If you're thinking "it
What are you trying to accomplish? I'd ensure that my performance was a
problem before doing anything. If you're thinking "it's in RAM so it
has to be faster" you might be surprised.
So gather evidence that you have a problem before you jump to
providing a solution.
Erick
On Thu, Jul 23, 2009
The size is in bytes and RAMDirectory stores the same bytes, so the
in-memory size equals the on-disk size. I would suggest not copying the dir
into a RAMDirectory. It is
better to use MMapDirectory in this case, as it "swaps" the files into
address space like a normal OS swap file. The OS kernel will automatically
swa
Hi,
I have a question regarding RAMDirectory. I have a 5 GB index on disk and it
is opened like the following:
searcher = new IndexSearcher (new RAMDirectory (indexDirectory));
Approximately how much memory is needed to load the index? 5GB of memory or
10GB because of Unicode? Does the ent
walid, can you provide any more information other than "very poor result"?
Others have not measured much difference between morphological
analysis and light stemming:
http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
On Thu, Jul 23, 2009 at 7:34 AM, walid wrote:
> http://issues.apache.org/jira/browse
This was at least one of the threads that was bouncing around... I'm
fairly sure there were others as well.
Hopefully it's worth the read to you ^^
http://www.opensubscriber.com/message/java-...@lucene.apache.org/11079539.html
Phil Whelan wrote:
On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall w
http://issues.apache.org/jira/browse/LUCENE-1406
http://issues.apache.org/jira/browse/LUCENE-153
based on this, there are two options:
1- using the aramorph library
2- moving the code from trunk to the current release and using the
provided arabic analyzer
1- the library works very well in indexi
I think you could also delete by Query (using IndexWriter), concocting
a single large query that's something like MatchAllDocsQuery AND NOT
(Q1 OR Q2 OR Q3...) where Q1, Q2, Q3 are the queries that identify the
docs you want to keep.
Mike
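Mike's delete-by-query idea can be sketched as a query string (the keep-queries here are placeholders; `*:*` is the query-parser shorthand for MatchAllDocsQuery in later Lucene versions — with older parsers you would build the BooleanQuery programmatically instead):

```java
import java.util.Arrays;
import java.util.List;

public class DeleteAllBut {
    // Build a query matching every doc NOT matched by any keep-query,
    // i.e. MatchAllDocsQuery AND NOT (Q1 OR Q2 OR ...).
    static String deleteAllBut(List<String> keepQueries) {
        return "*:* AND NOT (" + String.join(" OR ", keepQueries) + ")";
    }

    public static void main(String[] args) {
        // Hypothetical keep-queries:
        List<String> keep = Arrays.asList("title:foo", "body:bar");
        System.out.println(deleteAllBut(keep));
    }
}
```

The resulting query would then be passed to IndexWriter's delete-by-query call, as Mike describes.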
On Wed, Jul 22, 2009 at 10:58 PM, Anuj Bhatt wrote:
> Hi,
I would propose not to sort the date/time by its string value; instead, I
would try to represent the date/time as an integer value (e.g. the long
returned by Date.getTime()). If you do not need precision to the
millisecond, you could divide it by some value, e.g.
Date.getTime()/(1000L*60L) to have it
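A minimal sketch of the minute-precision sort key suggested above, using plain java.util.Date (no Lucene required; the class name is illustrative):

```java
import java.util.Date;

public class DateSortKey {
    // Minute-precision numeric sort key: milliseconds since epoch
    // divided down to minutes, as the message suggests.
    static long minuteKey(Date d) {
        return d.getTime() / (1000L * 60L);
    }

    public static void main(String[] args) {
        Date a = new Date(10 * 60 * 1000L);           // 10 min after epoch
        Date b = new Date(10 * 60 * 1000L + 59_999L); // same minute
        // Both dates collapse to the same sort key (10).
        System.out.println(minuteKey(a) + " " + minuteKey(b));
    }
}
```

Sorting on this single numeric value is much cheaper than sorting on a date string.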
On behalf of the Data Intensive Infrastructure unit (DERI) [1], I'm
pleased to announce the first public version of SIREn (Semantic
Information Retrieval Engine).
SIREn, the Information Retrieval system at the core of the Semantic Web
Index Sindice, is now available for download and includes th
Another idea - instead of storing MMDDhhmm, as longs, store the
value as number of minutes since some start time, as integers. If my
sums are correct it should cope with several thousand years, and
sorting on integers should use less memory than sorting on longs.
--
Ian.
On Thu, Jul 23, 20
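Ian's integer-minutes idea above can be checked with a quick sketch (the start epoch is an arbitrary choice): an int holds about 2^31 minutes, which is roughly 4,085 years' worth.

```java
public class MinuteEpoch {
    // Minutes since a chosen start time, stored as an int to keep
    // sort memory lower than with longs.
    static int minutesSince(long startMillis, long nowMillis) {
        return (int) ((nowMillis - startMillis) / (1000L * 60L));
    }

    public static void main(String[] args) {
        // Capacity check: how many years fit in a signed 32-bit
        // count of minutes.
        long years = Integer.MAX_VALUE / (60L * 24L * 365L);
        System.out.println(years); // several thousand years, as Ian says
    }
}
```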
Generally you shouldn't hit OOM. But it may change depending on how you use
the index. For example, if you have millions of documents spread across the
100 GB, and you use sorting for various fields, then it will consume lots of
RAM. Also, if you run hundreds of queries in parallel, each with a doz
Thanks, all.
I am very thankful to all. I am tired of Hadoop settings; is it
good to read such a large index with Lucene alone? Will it hit OOM?
Please suggest.
--
View this message in context:
http://www.nabble.com/indexing-100GB-of-data-tp24600563p24620846.html
Sent