Indexing & search?

2007-03-06 Thread senthil kumaran

Hi,
   I've indexed 4 of my 5 fields with Field.Store.YES & Field.Index.NO, and
indexed the remaining one (say its field name is *content*) with
Field.Store.YES & Field.Index.TOKENIZED. Its value is the collective value
of the other 4 fields plus some more values, so my search is always based
on the *content* field.
   I've indexed 2 documents. In the 1st doc: f1:mybook, f2:contains, f3:all,
f4:information, content:mybook contains all information that you need;
and in the 2nd: f1:somebody, f2:want, f3:search, f4:information,
content:somebody want search information of mybook.
   I want to get search results for all docs where field1's value is
"mybook". My query is content:mybook, but it returns 2 matching documents
instead of 1.
   Is there any filter I can use for this?
   Is there any possible way other than changing f1 to
Field.Index.TOKENIZED? I want to avoid duplication in the index.


Re: question about ScoreDocComparator

2007-03-06 Thread Ulf Dittmer
Well, I am using a Sort object ("Hits = Searcher.search(Query, Filter,
Sort)", actually).
In setting up the SortField array for that Sort object with a
SortComparatorSource, the issue comes up that I need to access the
field value that is being used for sorting.


Maybe that's just the way Lucene works, but it seems that there  
should be an easy way to get at the field value without having to  
retrieve the document from the index.
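
For reference, this is the shape of thing I was hoping for (a minimal
sketch, assuming the Lucene 2.x FieldCache API; the external lookup from
ID to description would go where the cached values are compared):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreDocComparator;
import org.apache.lucene.search.SortComparatorSource;
import org.apache.lucene.search.SortField;

public class FieldCacheComparatorSource implements SortComparatorSource {
    public ScoreDocComparator newComparator(final IndexReader reader,
                                            final String fieldname)
            throws IOException {
        // One value per document, loaded once per reader by FieldCache,
        // so no per-hit IndexReader.document() call is needed.
        final String[] values = FieldCache.DEFAULT.getStrings(reader, fieldname);
        return new ScoreDocComparator() {
            public int compare(ScoreDoc i, ScoreDoc j) {
                // Assumes every doc has the field; add null checks otherwise.
                return values[i.doc].compareTo(values[j.doc]);
            }
            public Comparable sortValue(ScoreDoc i) {
                return values[i.doc];
            }
            public int sortType() {
                return SortField.CUSTOM;
            }
        };
    }
}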


Cheers,
Ulf


On 05.03.2007, at 00:05, Erick Erickson wrote:

Maybe I'm missing something in turn, but why not just use a Sort  
object at
search time?  You can have a Hits object or TopFieldDocs object
returned

(the Filter in some of these calls can be null).

Best
Erick


On 3/1/07, Ulf Dittmer <[EMAIL PROTECTED]> wrote:


Hello-

One of the fields in my index is an ID, which maps to a full text
description behind the scenes. Now I want to sort the search results
alphabetically according to the description, not the ID. This can be
done via SortComparatorSource and a ScoreDocComparator without
problems. But the code needed to do this is quite complicated - it
involves retrieving the document ID from the ScoreDoc, then looking
up the Document through an IndexReader, and then retrieving the ID
field from the document. It seems that there should be an easier way
to get at the ID field, since that is the one being used for the
sort. There is a related class FieldDoc, through which it seems
possible to get at the field values, but that doesn't seem  
applicable here.


I went through the custom sorting example of "Lucene In Action", but
that doesn't deal with this case. Am I missing something obvious?

Thanks in advance,
Ulf






Re: Clearing locks

2007-03-06 Thread John Haxby

MC Moisei wrote:

Is there an easy way to clear locks?

If I redeploy my war file while indexing happens to be in progress, the
lock is not cleared. I know I can tell the JVM to run the finalizers
before it exits, but in this case the JVM is not exiting, since it's a
hot deploy.
I'd do this by having a destroy() method in the servlet to explicitly 
shut down any operations.  Tomcat (or whatever the servlet container is) 
will call destroy() for you when it shuts down the servlet.
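
Something along these lines (a sketch; how you hold on to the writer is
up to you, the field here is illustrative):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import org.apache.lucene.index.IndexWriter;

public class IndexingServlet extends HttpServlet {
    private IndexWriter writer; // however your app holds it

    // Called by the container on undeploy/redeploy.
    public void destroy() {
        try {
            if (writer != null) {
                writer.close(); // releases the write lock
            }
        } catch (IOException e) {
            log("Failed to close IndexWriter cleanly", e);
        }
    }
}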


jch




Re: Caching of BitSets from filters and Query.equals()

2007-03-06 Thread Erik Hatcher

Have a look at CachingWrapperFilter.

It caches filters by IndexReader instance.
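
Usage is roughly (a sketch; wrapping a QueryFilter is just for
illustration):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

public class CachedFilterExample {
    // The wrapper caches the underlying filter's BitSet per IndexReader,
    // so repeated searches against the same reader reuse it.
    public static Filter cachedFilter() {
        return new CachingWrapperFilter(
                new QueryFilter(new TermQuery(new Term("f1", "mybook"))));
    }
}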

Erik


On Mar 6, 2007, at 2:03 AM, Antony Bowesman wrote:

Not sure if I'm going about this the right way, but I want to use
Query instances as keys to a HashMap to cache the BitSet instances
from filtering operations.  They are all for the same reader.
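
Roughly this shape (a sketch; QueryFilter stands in for whatever
produces the BitSets):

import java.io.IOException;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;

public class BitSetCache {
    private final Map cache = new HashMap(); // Query -> BitSet

    // Relies on each Query subclass implementing equals()/hashCode().
    public synchronized BitSet bits(IndexReader reader, Query query)
            throws IOException {
        BitSet bits = (BitSet) cache.get(query);
        if (bits == null) {
            bits = new QueryFilter(query).bits(reader);
            cache.put(query, bits);
        }
        return bits;
    }
}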


That means equals() for any instance of the same generic Query
would have to return true if the terms, boost and other factors of
the Query would result in the same BitSet.  Most of the Query
implementations override equals() and return true based on those
factors.  At least BooleanQuery would not give true, because it does
not base its equals on the encapsulated clauses, but in practice that
is not a problem, as BooleanQuery will not be used as a Filter.


It looks like it will work in the main, but I was wondering if
there was any unwritten, but expected, contract for a Query's
equals()?


Thanks
Antony





Re: Caching of BitSets from filters and Query.equals()

2007-03-06 Thread Antony Bowesman

Erik Hatcher wrote:

Have a look at CachingWrapperFilter.

It caches filters by IndexReader instance.


Doesn't that still have the same issue in terms of equality of the conditions 
that created the filter?  If I have conditions that filter Term X, then the 
cached Filter is only valid for new requests for Term X.  Term equality is 
defined by the Javadocs as having the same Field and Text, but to cache a 
Query, its equality must be deterministic in a similar way, and it isn't.


I was hoping that Query.equals() would be defined so that equality would be 
based on the results that Query generates for a given reader.


I'm hosting an indexing framework, so I've no idea what searches or filters a 
caller will want to perform.


Antony






Re: Caching of BitSets from filters and Query.equals()

2007-03-06 Thread Erik Hatcher


On Mar 6, 2007, at 6:35 AM, Antony Bowesman wrote:

Erik Hatcher wrote:

Have a look at CachingWrapperFilter.  It caches filters by IndexReader
instance.


Doesn't that still have the same issue in terms of equality of the
conditions that created the filter?  If I have conditions that
filter Term X, then the cached Filter is only valid for new
requests for Term X.  Term equality is defined by the Javadocs as
having the same Field and Text, but to cache a Query, its equality
must be deterministic in a similar way, and it isn't.


A Query's equality is defined as having the same structure and order  
as another Query.


I was hoping that Query.equals() would be defined so that equality  
would be based on the results that Query generates for a given reader.


That is certainly not the case, as stated above.  query1.equals(query2)
is true when all the nested clauses also report that they are .equals()
to one another.  This is very important in our unit tests - to construct
a query through the QueryParser and then through the API and compare them.
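
For example (a sketch; the analyzer choice is arbitrary):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryEqualsDemo {
    public static void main(String[] args) throws Exception {
        // The same query built two ways should compare equal,
        // because equality is structural.
        Query parsed = new QueryParser("content",
                new WhitespaceAnalyzer()).parse("mybook");
        Query api = new TermQuery(new Term("content", "mybook"));
        System.out.println(parsed.equals(api)); // true
    }
}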


I'm hosting an indexing framework, so I've no idea what searches or  
filters a caller will want to perform.


Have a look at Solr's caching mechanisms for filters, queries, and  
documents.  Very slick and scalable stuff.


Erik





Re: Indexing & search?

2007-03-06 Thread Erick Erickson

You could analyze all the documents returned in your query to see
if the "other fields" match. That is, you could cycle through each
document returned in, say, a Hits object to see if f1 actually matches.

This is almost certainly NOT what you want to do. Do you have any
clue whether saving the space is actually worth it? How big do
you expect your index to be? Disk space is cheap, and Lucene
handles pretty big indexes well. For instance, I've found that
search time in a 4G index is, maybe, 10-15% faster than in an 8G
index. So unless and until you *know*
there's a problem, you should index all the fields you want to search
on, keeping the design as simple as possible. Only after you *know*
there's a problem should you consider such efficiencies.

Best
Erick




Re: Indexing & search?

2007-03-06 Thread Steven Rowe
Hi senthil,


Your query is behaving as it should - since the "content" field in both
docs contains "mybook", they both match.

Although you say you want to avoid duplication in the index, I think you
already know what to do (you wrote "I want to get search results of all
docs where field1's value is 'mybook'") - index "field1" to make it
directly queryable.  If the information really needs to be distinct to
query properly, then make it so.
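
At index time that could look something like this (a sketch using the
values from your mail; UN_TOKENIZED keeps f1 as a single term):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class IndexBothFields {
    // f1 becomes directly queryable as one exact term, while content
    // stays tokenized for free-text search.
    public static Document build() {
        Document doc = new Document();
        doc.add(new Field("f1", "mybook",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("content",
                "mybook contains all information that you need",
                Field.Store.YES, Field.Index.TOKENIZED));
        return doc;
    }
}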

And if the index gets too large, you can try removing the duplication
from the "content" field, and include the other fields in your queries.

Steve




Search benchmark: 2.0 vs. 2.2-dev and heap sizing

2007-03-06 Thread Otis Gospodnetic
Hi,

I'm doing some Lucene search benchmarking (got to love massive query logs :)) 
and have 2 questions:

1) Has anyone compared Lucene 2.0 and 2.2-dev?  My benchmarks found 2.2-dev 
(freshly baked) to be somewhat slower than 2.0, despite all those performance 
improvements (see CHANGES.txt)... Has anyone else done the comparison?  My 
queries are a mixture of 2-3 required keywords (majority) and phrase queries 
with 2-3 keywords.

To give you an idea about how much slower 2.2-dev is for me, here are some 
counts for queries I considered slow (> 1s latency) during my benchmark with 8 
concurrent search threads and then 64 threads:


$ grep -c SLOW 5-shard-log-2.0/8.log 
1183
$ grep -c SLOW 5-shard-log-2.2-dev/8.log 
5479

$ grep -c SLOW 5-shard-log-2.0/64.log 
28657
$ grep -c SLOW 5-shard-log-2.2-dev/64.log 
33459

This is out of a total of 100K queries.

2) My benchmark was against 5 optimized compound Lucene indices, about 9GB 
each, on a box with 32GB of RAM and several CPUs.  I gave the JVM 22GB with Xms 
and Xmx.  However, I am wondering if giving it that much is actually smart.  
While I'm letting JVM use more RAM, I'm taking it away from the OS for FS 
caching.  So, I'm now thinking about running the same benchmark, but with a 
smaller max heap.  But how much should I give it?  I'm thinking about adding up 
sizes of all .tii files, adding some padding for the JVM, GC, etc., and using 
that.  Is there anything else I should consider here?

So I looked at one of the .cfs files:

_0.f0: 11164467 bytes
... other fields, same size, of course
_0.fdt: 381343723 bytes
_0.fdx: 89315736 bytes
_0.fnm: 78 bytes
_0.frq: 4591955197 bytes
_0.prx: 4242807266 bytes
_0.tii: 11498861 bytes
_0.tis: 829868070 bytes


Here, the .tii file is only about 11 MB.  That looks awfully small!  There is 
no way 5 x 11 MB + padding will be enough.  Should I be adding the size of some 
other file(s)?  .tis perhaps?

Thanks,
Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share






Re: Search benchmark: 2.0 vs. 2.2-dev and heap sizing

2007-03-06 Thread Doron Cohen
This is interesting.

Very large heaps can sometimes cause an expensive gc cycle ("Can heap be
too big?" - http://www.javaperformancetuning.com/news/qotm045.shtml), and
different memory allocation patterns between 2.0 and 2.2 could, I think,
play in too, so it would be interesting to know the numbers with smaller
heap sizes.

A few more questions:
- Readers reuse: Are all searches of the same thread sharing
searchers/readers? Are different threads sharing searchers/readers?
- What happens with a single thread?
- Is this degradation visible also by single queries, or are some queries
faster in 2.0 and some in 2.2?

Thanks,
Doron






Re: alternative scoring algorithm for PhraseQuery

2007-03-06 Thread Paul Elschot
Philipp,

First off: I have no solutions, just some existing things that might
be useful.

On Tuesday 06 March 2007 01:08, Philipp Nanz wrote:
> Hello folks,
> 
...
> 
> Now my problem is with scoring the deletion cases.
> 
> My initial idea was to penalize a missing term position with its
> maximum error.
> 
...
> 
> Does anyone know a better solution for scoring deletion cases?

The closest existing thing is the coord() factor from Similarity.
But using that will only allow you to delay the implementation.

You can also use the sloppyFreq() from Similarity, but then
you still need a way to determine the sloppiness for any possible
term order. Span queries do that by just taking the distance
between the first and last matched "term".
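
For example, a custom Similarity along these lines (just a sketch; the
penalty curve is illustrative, not a recommendation):

import org.apache.lucene.search.DefaultSimilarity;

public class EditPenaltySimilarity extends DefaultSimilarity {
    // sloppyFreq() is consulted for each sloppy phrase match; a steeper
    // curve than the default 1/(distance+1) penalizes reordering more.
    public float sloppyFreq(int distance) {
        return 1.0f / ((distance + 1) * (distance + 1));
    }

    // coord() rewards matching more of the query terms, which is one
    // existing hook for scoring the "deletion" cases.
    public float coord(int overlap, int maxOverlap) {
        return (float) overlap / maxOverlap;
    }
}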

Would your implementation also generalize to a SpanQuery?

Regards,
Paul Elschot




Re: Missing .tii File

2007-03-06 Thread Tim Patton



Tim Patton wrote:
I'm not sure how, but in moving an index over from 2.0 to 2.1 and 
changing my own code one of the .tii files got deleted.  I still have 
the .tis file though, can I rebuild the missing file so I can open my 
index?  Luke won't open it now and I just want to make sure everything 
is ok before opening a writer and possibly doing some permanent damage.





In the interest of helping out the next person with this problem, here 
is all my code to recover the missing .tii file when I still had the 
rest of the index.


Of course now I am also missing one .fN norm file so I will be trying to 
figure out how to recover that.  Looks like everything else is there, I 
have no idea how this happened.


package org.apache.lucene.index;

import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TIIRecover
{
    public static void main(String[] args)
    {
        try
        {
            String segment = "_aginr";
            Directory cfsDir =
                FSDirectory.getDirectory("c:/java/lib/lucene/recover");
            FieldInfos fieldInfos = new FieldInfos(cfsDir, segment + ".fnm");
            SegmentTermEnum origEnum = new SegmentTermEnum(
                cfsDir.openInput(segment + ".tis"), fieldInfos, false);

            // Read every term and its TermInfo back out of the .tis file.
            List<Term> termList = new LinkedList<Term>();
            List<TermInfo> termInfoList = new LinkedList<TermInfo>();
            int count = 0;
            while (origEnum.next())
            {
                Term term = origEnum.term();
                TermInfo ti = origEnum.termInfo();
                termList.add(term);
                termInfoList.add(ti);
                count++;
            }
            origEnum.close();
            System.out.println("Copied: " + count);
            count = 0;

            // Rewriting the terms regenerates both .tis and .tii;
            // 128 is the index interval taken from TermInfosWriter.java.
            TermInfosWriter termWriter = new TermInfosWriter(cfsDir, segment,
                fieldInfos, 128);

            Iterator<Term> termItr = termList.iterator();
            Iterator<TermInfo> termInfoItr = termInfoList.iterator();
            while (termItr.hasNext())
            {
                Term term = termItr.next();
                TermInfo ti = termInfoItr.next();
                termWriter.add(term, ti);
                count++;
            }
            termWriter.close();
            System.out.println("Saved: " + count);
        }
        catch (Throwable e)
        {
            System.err.println(e);
        }
    }
}





Re: alternative scoring algorithm for PhraseQuery

2007-03-06 Thread Chris Hostetter
: My initial idea was to penalize a missing term position with its
: maximum error.
:
: Consider this:
: Query:  a b c d
: Document A: b c d
:
: Term a is missing, score it as if it was at the worst position possible
:
: result:   b c d a
: pos. diffs: -1 -1 -1 +3

side comment: this doesn't sound very useful: a document containing "b c
d" matches equally to a doc containing "b c d a"? ... shouldn't a doc
containing "b c d a" be considered a much better match, since it at least
contains all of the terms close together?



-Hoss





Re: Caching of BitSets from filters and Query.equals()

2007-03-06 Thread Chris Hostetter

: I was hoping that Query.equals() would be defined so that equality would be
: based on the results that Query generates for a given reader.

if query1.equals(query2) then the results of query1 on an
indexreader should be identical to the results of query2 on the same
indexreader ... but the inverse cannot be guaranteed: if query1 and
query2 generate identical results when queried against an indexreader, that
says absolutely nothing about whether query1.equals(query2).

if you think about it, there's no possible way it ever could, because a
critical piece of information isn't available when testing the
.equals()ness of those queries: the indexreader.  if i have a completely
empty index then the queries "foo:bar" and "yak:wak" will both have the
exact same results, but those same queries on an index with a single
document added might now generate different results -- so how could an
algorithm like you describe possibly be implemented in a Query.equals()
method when the IndexReader isn't known?

in general, what you describe really isn't needed for caching query result
sets ... what matters is that if you've already seen the query before
(which you can tell using q1.equals(q2)) then you don't need to execute it
... whether or not it results in the same set of docs as a completely
unrelated query doesn't really tell you much (i suppose you could save
some space by reusing the same BitSet object ... but that can be done by
testing the equality of the resulting BitSet)




-Hoss





Re: Caching of BitSets from filters and Query.equals()

2007-03-06 Thread Antony Bowesman

Chris Hostetter wrote:

: I was hoping that Query.equals() would be defined so that equality would be
: based on the results that Query generates for a given reader.

if query1.equals(query2) then the results of query1 on an
indexreader should be identical to the results of query2 on the same
indexreader 


Thanks Hoss and Erik.  This is the case I wanted, but re-reading my desire 
above, I see it looks more like the inverse.  Sorry for the confusion.



... but the inverse cannot be guaranteed: if query1 and
query2 generate identical results when queried against an indexreader, that
says absolutely nothing about whether query1.equals(query2).


Yes, that's not what I was after - as you say, it's not possible to implement.


in general, what you describe really isn't needed for caching query result
sets ... what matters is that if you've already seen the query before
(which you can tell using q1.equals(q2)) then you don't need to execute it


Exactly, and to be sure of that you have to be able to rely on an overridden 
equals() to get q1.equals(q2).  The core Lucene Query implementations do 
override equals() to satisfy that test, but some of the contrib Query 
implementations do not, so you would never see the same Query twice, and 
caching BitSets for those Query instances would be a waste of time.


Antony








Re: Caching of BitSets from filters and Query.equals()

2007-03-06 Thread Chris Hostetter

: equals() to get q1.equals(q2).  The core Lucene Query implementations do
: override equals() to satisfy that test, but some of the contrib Query
: implementations do not, so you would never see the same Query twice, and
: caching BitSets for those Query instances would be a waste of time.

filing bugs about those Query instances would be helpful .. bugs with
patches that demonstrate the problem in unit tests and fix them would be
even more helpful :)

These classes may prove useful in submitting test cases...

http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/search/QueryUtils.java?view=log
http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/search/CheckHits.java



-Hoss





Using ParallelReader over large immutable index and small updatable index

2007-03-06 Thread Andy Liu

Is there a working solution out there that would let me use ParallelReader
to search over a large, immutable index and a smaller, auxiliary index that
is updated frequently?  Currently, from my understanding, the ParallelReader
fails when one of the indexes is updated, because the document IDs get out
of sync.  Using ParallelReader in this way is attractive for me because it
would allow me to quickly make updates to only the fields that change.

The alternative is to use one index.  However, an update would require me to
delete the entire document (which is quite large in my application) and
reinsert it after making updates.  This requires a lot more I/O and is a lot
slower, and I'd like to avoid it if possible.

I can think of other alternatives, but all involve storing data and/or
bitsets in memory, which is not very scalable.  I need to be able to handle
millions of documents.

I'm also open to any solution that doesn't involve ParallelReader and would
help me make quick updates in the most non-disruptive and scalable fashion.
But it just seems that ParallelReader would be perfect for my needs, if I
can get past this issue.

I've seen posts about this issue on the list, but nothing pointing to a
solution.  Can somebody help me out?

Andy


Re: Indexing & search?

2007-03-06 Thread Antony Bowesman

Hi,




The example shows the first 4 words of each 'content' value being stored as 
f1, f2, f3, f4.  If that is your intention, then you can use SpanFirstQuery 
to find words that were in f1.  It can also be used to find hits in words 
2-4, but you will have to examine the hits to work out the positional match.
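
For example (a sketch; assumes f1's value is the first token of the
content field, as in your example):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class FirstWordQuery {
    // Matches docs where "mybook" occurs within the first position of
    // content (span end <= 1), i.e. where it was the f1 value.
    public static SpanFirstQuery mybookFirst() {
        return new SpanFirstQuery(
                new SpanTermQuery(new Term("content", "mybook")), 1);
    }
}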


Antony






RE: Using ParallelReader over large immutable index and small updatable index

2007-03-06 Thread Alexey Lef
We use MultiSearcher for a similar scenario. This way you can keep the 
Searcher/Reader for the read-only index alive and refresh the small index's 
Searcher whenever an update is made. If you have any cached filters, they are 
mapped to a Reader, so the cached filters for the big index will stay alive as 
well. The only (small) problem I have found so far is how MultiSearcher handles 
custom Similarity (see https://issues.apache.org/jira/browse/LUCENE-789).
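
In outline (a sketch; paths and names are illustrative):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

public class TwoIndexSearcher {
    private final IndexSearcher big;   // stays open across updates
    private IndexSearcher small;       // reopened after each update
    private MultiSearcher combined;

    public TwoIndexSearcher(String bigPath, String smallPath)
            throws IOException {
        big = new IndexSearcher(bigPath);
        small = new IndexSearcher(smallPath);
        combined = new MultiSearcher(new Searchable[] { big, small });
    }

    // Call after the small index has been updated; only the small
    // searcher and the cheap MultiSearcher wrapper are rebuilt.
    public synchronized void refreshSmall(String smallPath)
            throws IOException {
        small.close();
        small = new IndexSearcher(smallPath);
        combined = new MultiSearcher(new Searchable[] { big, small });
    }

    public synchronized MultiSearcher getSearcher() {
        return combined;
    }
}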

Hope this helps,

Alexey 





Re: Caching of BitSets from filters and Query.equals()

2007-03-06 Thread Antony Bowesman

Chris Hostetter wrote:

: equals() to get q1.equals(q2).  The core Lucene Query implementations do
: override equals() to satisfy that test, but some of the contrib Query
: implementations do not, so you would never see the same Query twice, and
: caching BitSets for those Query instances would be a waste of time.

filing bugs about those Query instances would be helpful .. bugs with
patches that demonstrate the problem in unit tests and fix them would be
even more helpful :)


OK, I'll put it on my todo list, but I've got to get the product out of the door 
this month...



These classes may prove useful in submitting test cases...

http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/search/QueryUtils.java?view=log
http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/search/CheckHits.java


Thanks for those pointers.
Antony






Re: how to define a pool for Searcher?

2007-03-06 Thread Mohammad Norouzi

Hello Mark,
There is something vague for me about the Lucene-indexAccessor you created
as it relates to my problem.
As I read your code, you create an IndexSearcher and put it into a Map, and
the only thing that separates the entries is the Similarity they have. So
if, say, 1000 users with different Similarity implementations connect to my
application, there will be 1000 IndexSearchers, each with its own internal
Reader.
Now, in my case, I have an IndexResultSet, just like java.sql.ResultSet,
which contains a Hits. A user may go forward or backward through the Hits'
documents, and actually every user is doing this.

To do so, I have to find the Similarity the user is working with and then
find the right IndexSearcher in order to support pagination for her. Is this
right? I mean, can I trust the Similarity to find the right IndexReader that
a user has used before?

Another question: how about having one IndexReader for all my IndexSearchers
and managing them so they all access that single Reader simultaneously?
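
Something like this is what I mean (a sketch; assumes
IndexSearcher(IndexReader) plus a per-searcher setSimilarity call):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;

public class SharedReaderDemo {
    public static void main(String[] args) throws Exception {
        // One underlying reader shared by every searcher.
        IndexReader reader = IndexReader.open("/path/to/index");

        // IndexSearcher(IndexReader) is cheap to create; the per-user
        // Similarity lives on the searcher, not on the shared reader.
        IndexSearcher perUser = new IndexSearcher(reader);
        perUser.setSimilarity(new DefaultSimilarity()); // user's own here

        // Closing a searcher built on an external reader does not
        // close the reader itself.
        perUser.close();
        reader.close();
    }
}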

thank you very much in advance


On 2/22/07, Mark Miller <[EMAIL PROTECTED]> wrote:


I would not do this from scratch...if you are interested in Solr go that
route else I would build off
http://issues.apache.org/jira/browse/LUCENE-390

- Mark

Mohammad Norouzi wrote:
> Hi all,
> I am going to build a Searcher pooling. if any one has experience on
> this, I
> would be glad to hear his/her recommendation and suggestion. I want to
> know
> what issues I should be apply. considering I am going to use this on a
> web
> application with many user sessions.
>
> thank you very much in advance.






--
Regards,
Mohammad