Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

2009-06-10 Thread Paul J. Lucas
On Jun 10, 2009, at 5:02 PM, Yonik Seeley wrote: On Wed, Jun 10, 2009 at 7:58 PM, Daniel Noll wrote: It's a shame we don't have an inverted kind of HitCollector where we can say "give me the next hit", so that we can get the best of both worlds (like what StAX gives us in the XML world.) You

Re: Phrase search

2009-06-10 Thread Daniel Noll
On Fri, Jun 5, 2009 at 21:31, Abhi wrote: > Say I have indexed the following strings: > > 1. "cool gaming laptop" > 2. "cool gaming lappy" > 3. "gaming laptop cool" > > Now when I search with a query say "cool gaming computer", I want string 1 > and 2 to appear on top (where search terms are closer
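
The usual approach to this kind of proximity ranking, sketched against the Lucene 2.4-era API (the field name "product" and the slop/boost values are assumptions for illustration): match on the individual terms, then add a sloppy PhraseQuery so documents where the terms appear close together score higher.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class ProximityBoost {
    // Match any of the terms, but boost documents where they occur
    // near each other via a sloppy PhraseQuery clause.
    static BooleanQuery build(String field, String[] terms) {
        BooleanQuery bq = new BooleanQuery();
        for (String t : terms) {
            bq.add(new TermQuery(new Term(field, t)), BooleanClause.Occur.SHOULD);
        }
        PhraseQuery pq = new PhraseQuery();
        for (String t : terms) {
            pq.add(new Term(field, t));
        }
        pq.setSlop(3);      // allow a few positions of gaps/reordering
        pq.setBoost(2.0f);  // proximity matches rank above scattered ones
        bq.add(pq, BooleanClause.Occur.SHOULD);
        return bq;
    }
}
```

With this, "cool gaming laptop" and "cool gaming lappy" match the sloppy phrase (or most of its terms near each other) and rank above documents that merely contain the terms far apart.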

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

2009-06-10 Thread Yonik Seeley
On Wed, Jun 10, 2009 at 7:58 PM, Daniel Noll wrote: > It's a shame we don't have an inverted kind of HitCollector where we > can say "give me the next hit", so that we can get the best of both > worlds (like what StAX gives us in the XML world.) You can get a scorer and call next() yourself. -Yo
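
Yonik's suggestion, sketched against the 2.4-era API (the processing body is a placeholder; in 2.9 `next()`/`doc()` were replaced by `nextDoc()`/`docID()`):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;

public class ScorerPull {
    // Pull hits one at a time instead of having them pushed
    // into a HitCollector -- the "give me the next hit" style.
    static void iterate(IndexSearcher searcher, IndexReader reader, Query query)
            throws Exception {
        Weight weight = query.weight(searcher);
        Scorer scorer = weight.scorer(reader);
        while (scorer.next()) {          // advance to the next matching doc
            int doc = scorer.doc();
            float score = scorer.score();
            // process (doc, score) lazily here
        }
    }
}
```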

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

2009-06-10 Thread Daniel Noll
On Wed, Jun 10, 2009 at 20:17, Uwe Schindler wrote: > You are right, you can, but if you just want to retrieve all hits, this is > ineffective. A HitCollector is the correct way to do this (especially > because the order of hits is mostly not interesting when retrieving all > hits). Hits and TopDoc

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

2009-06-10 Thread Paul J. Lucas
On Jun 10, 2009, at 10:49 AM, Uwe Schindler wrote: To optimize, store the filename not as stored field, but as a non-tokenized, indexed term. How do you do that? - Paul
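
What Uwe describes, as a sketch in the Lucene 2.4 API (the field name "filename" is an assumption): index the value as a single untokenized term rather than a stored field.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FilenameField {
    // Index the filename as one untokenized term. Store.NO means the
    // value is not written to the stored-fields file at all, so there
    // is no per-hit stored-field lookup at search time.
    static Document withFilename(Document doc, String path) {
        doc.add(new Field("filename", path,
                Field.Store.NO, Field.Index.NOT_ANALYZED));
        return doc;
    }
}
```

With `Store.NO` you read the values back via the term dictionary (`TermEnum`/`TermDocs`) or a `FieldCache` rather than `IndexReader.document()`.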

Re: Lucene memory usage

2009-06-10 Thread Jason Rutherglen
Great! If I understand correctly it looks like RAM savings? Will there be an improvement in lookup speed? (We're using binary search here?). Is there a precedent in database systems for what was mentioned about placing the term dict, delDocs, and filters onto disk and reading them from there (wit

Lucene 2.9 Release

2009-06-10 Thread Mark Miller
So... how about we try to wrap up 2.9/3.0 and ship with what we have, now? It's been 8 months since 2.4.0 was released, and 2.9's got plenty of new stuff, and we are all itching to remove these deprecated APIs, switch to Java 1.5, etc. We should try to finish the issues that are open and under

Re: Lucene memory usage

2009-06-10 Thread Michael McCandless
Roughly, the current approach for the default terms dict codec in LUCENE-1458 is: * Create a separate class per-field (the String field in each Term is redundant). This is a big change over Lucene today * That class has String[] indexText and long[] indexPointer, each length = th
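
The parallel-arrays layout Mike describes can be sketched as follows (a hypothetical simplification, not the LUCENE-1458 code itself; by default only every 128th term is indexed, and a lookup binary-searches the in-RAM index to find where to start scanning the on-disk dictionary):

```java
import java.util.Arrays;

// Per-field terms-dict index: the text and file pointer of every Nth
// term, held in parallel sorted arrays. The field name itself is kept
// once per instance instead of redundantly in every Term.
public class FieldTermsIndex {
    final String[] indexText;   // text of every Nth term, sorted
    final long[] indexPointer;  // file offset of that term's dict entry

    FieldTermsIndex(String[] text, long[] pointer) {
        this.indexText = text;
        this.indexPointer = pointer;
    }

    // Pointer of the greatest indexed term <= target: where a
    // sequential scan of the on-disk dictionary would begin.
    long seekPointer(String target) {
        int pos = Arrays.binarySearch(indexText, target);
        if (pos < 0) pos = -pos - 2;  // (insertion point) - 1
        return pos < 0 ? indexPointer[0] : indexPointer[pos];
    }
}
```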

Re: Lucene memory usage

2009-06-10 Thread Jason Rutherglen
> LUCENE-1458 (flexible indexing) has these improvements, Mike, can you explain how it's different? I looked through the code once but yeah, it's in with a lot of other changes. On Wed, Jun 10, 2009 at 5:40 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > This (very large number of

Re: Lucene memory usage

2009-06-10 Thread Michael McCandless
Asking for top 100K docs will certainly consume more RAM than asking for top 2, but much less than 1 GB. More like maybe an added ~2-3 MB or so. Mike On Wed, Jun 10, 2009 at 1:30 PM, Zhang, Lisheng wrote: > Hi, > > Does this issue has anything to do with the line: > >> TopScoreDocCollector colle

RE: Migrating from Hit/Hits to TopDocs/TopDocCollector

2009-06-10 Thread Uwe Schindler
That looks good, but contains the inner search loop (looking up the stored fields from within the main search loop, which is the hit collector). For few results this is ok, but if you are collecting thousands of hits from a very large index that does not fit into memory, the collect gets slow becau
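
Uwe's objection suggests a two-phase pattern, sketched here against the Lucene 2.4 API (the anonymous collector and list-of-ids structure are my own illustration): the hot `collect()` loop records only doc ids, and stored fields are loaded after the search completes.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TwoPhaseCollect {
    static List<Document> collectThenLoad(IndexSearcher searcher, Query query)
            throws Exception {
        final List<Integer> ids = new ArrayList<Integer>();
        // Phase 1: no stored-field I/O inside the search loop.
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                ids.add(doc);
            }
        });
        // Phase 2: load stored fields outside the search loop,
        // in docid order (which minimizes disk seeks).
        List<Document> docs = new ArrayList<Document>(ids.size());
        for (int id : ids) {
            docs.add(searcher.getIndexReader().document(id));
        }
        return docs;
    }
}
```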

RE: Lucene memory usage

2009-06-10 Thread Zhang, Lisheng
Hi, Does this issue have anything to do with the line: > TopScoreDocCollector collector = new TopScoreDocCollector(10); If we do: > TopScoreDocCollector collector = new TopScoreDocCollector(2); instead (only see the top two documents), could memory usage be less? Best regards, Lisheng -Or

Re: Reloading RAM Directory from updated FS Directory

2009-06-10 Thread Michael Stoppelman
Another potential idea would be to break up the index into N indices such that each index is small enough to fit two in memory and then you can swap them. http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/index/MultiReader.html This is just an idea, I haven't tri
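
The MultiReader idea, as a sketch in the 2.4-era API (paths and the swapping policy are left to the caller; `IndexReader.open(String)` was later deprecated):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;

public class ShardedSearch {
    // Open N smaller indices and search them as one logical index.
    // Individual sub-readers can then be reopened or swapped
    // independently without reloading everything at once.
    static IndexSearcher open(String[] paths) throws Exception {
        IndexReader[] subs = new IndexReader[paths.length];
        for (int i = 0; i < paths.length; i++) {
            subs[i] = IndexReader.open(paths[i]);
        }
        return new IndexSearcher(new MultiReader(subs));
    }
}
```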

RE: Reloading RAM Directory from updated FS Directory

2009-06-10 Thread Diamond, Greg
Thanks for the responses. I am testing it out using MMapDirectory. Cheers! -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Wednesday, June 10, 2009 6:36 AM To: java-user@lucene.apache.org Subject: RE: Reloading RAM Directory from updated FS Directory There is cur

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

2009-06-10 Thread Paul J. Lucas
On Jun 10, 2009, at 3:17 AM, Uwe Schindler wrote: A HitCollector is the correct way to do this (especially because the order of hits is mostly not interesting when retrieving all hits). OK, here's what I came up with: Term t = /* ... */ Collection files = new LinkedList(); FieldS

Re: Lucene memory usage

2009-06-10 Thread Michael McCandless
This (very large number of unique terms) is a problem for Lucene currently. There are some simple improvements we could make to the terms dict format to not require so much RAM per term in the terms index... LUCENE-1458 (flexible indexing) has these improvements, but unfortunately tied in w/ lots

Lucene memory usage

2009-06-10 Thread Benedikt Boss
Hello, I have a question regarding Lucene's memory usage when launching a query. When I execute my query, Lucene eats up over 1 GB of heap memory even when my result set is only a single hit. I found out that this is due to the "ensureIndexIsRead()" method call in the "TermInfosReader" class, wh

Re: indexing performance problems

2009-06-10 Thread Michael McCandless
Thanks for bringing closure! Mike On Wed, Jun 10, 2009 at 4:42 AM, Mateusz Berezecki wrote: > Hi list! > > I'm forwarding as somehow I did not put the list in the CC but the > answer I think is noteworthy, so here it is. Please remember to use > StringBuffer before blaming lucene ;-) > > Actual t

RE: Reloading RAM Directory from updated FS Directory

2009-06-10 Thread Uwe Schindler
There is currently a patch/idea from Earwin around that modifies MMapDirectory to optionally call MappedByteBuffer.load() after mapping a file from the directory. MappedByteBuffer.load() tells the operating system kernel to try to swap as much as possible from the file into physical RAM. - Uwe
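
The `MappedByteBuffer.load()` mechanism Uwe mentions is plain `java.nio`, so it can be demonstrated without Lucene (this is only an illustration of the NIO call, not Earwin's actual patch):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapWarmup {
    // Map a file read-only and ask the kernel to pre-fault its pages
    // into physical RAM. load() is only a hint, not a guarantee.
    static MappedByteBuffer mapAndLoad(File f) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        try {
            FileChannel ch = raf.getChannel();
            MappedByteBuffer buf =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.load();   // warm the mapping: touch every page
            return buf;   // mapping stays valid after the channel closes
        } finally {
            raf.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("warmup", ".bin");
        f.deleteOnExit();
        FileOutputStream out = new FileOutputStream(f);
        out.write(new byte[] {1, 2, 3, 4});
        out.close();
        System.out.println(mapAndLoad(f).get(0));
    }
}
```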

Re: Reloading RAM Directory from updated FS Directory

2009-06-10 Thread eks dev
There is one case where MMAP does not beat RAM: initial warm-up after process restart. With MMAP it can take a while before you get up to speed. MMAP with reopen is the best if you run without restart. - Original Message > From: Uwe Schindler > To: java-user@lucene.apache.org >

RE: Migrating from Hit/Hits to TopDocs/TopDocCollector

2009-06-10 Thread Uwe Schindler
> You are wrong. > As the java doc reads: 'Finds the top n hits for query' > You can set n to whatever value you want, 'all' documents (not results!) > indexed in your index if you want, or 10 if you want the top 10. You are right, you can, but if you just want to retrieve all hits, this is ineffe

RE: Migrating from Hit/Hits to TopDocs/TopDocCollector

2009-06-10 Thread Wouter Heijke
You are wrong. As the javadoc reads: 'Finds the top n hits for query.' You can set n to whatever value you want: 'all' documents (not results!) indexed in your index if you want, or 10 if you want the top 10. Anyway, it's just an example to give a direction. Wouter > This code snippet would on

RE: Migrating from Hit/Hits to TopDocs/TopDocCollector

2009-06-10 Thread Uwe Schindler
This code snippet would only work if you want to iterate over e.g. the first 20 documents (which is n in your code). If he wants to iterate over all results, he should think about using a custom (Hit)Collector. The code below will be very slow for large result sets (because retrieving stored fie

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

2009-06-10 Thread Wouter Heijke
Will this do? IndexReader indexReader = searcher.getIndexReader(); TopDocs topDocs = searcher.search(query, n); for (int i = 0; i < topDocs.scoreDocs.length; i++) { Document document = indexReader.document( topDocs.scoreDocs[i].doc); final File f = new File( document.get( "FILE" ) )

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

2009-06-10 Thread Ian Lea
Hi The code below might do the job. Based on the example at http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Hits.html Completely uncompiled and untested of course. TopDocCollector collector = new TopDocCollector(hitsPerPage); final Term t = /* ... */; Query query = new Te

Re: Using lucene in a clustered app server

2009-06-10 Thread Ian Lea
I'd recommend using your favourite queueing service to pass all updates to a central process, the one and only process that updates the index. If you don't already have a favourite queueing service, http://en.wikipedia.org/wiki/Java_Message_Service#Provider_implementations lists several JMS implem
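
The single-writer pattern Ian recommends could look roughly like this on the producer side, using the standard JMS API (the queue name, message payload, and connection setup are all assumptions; any JMS provider would supply the `ConnectionFactory`):

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

public class IndexUpdateProducer {
    // Every node in the cluster sends its updates to this queue.
    // A single consumer process drains it and is the one and only
    // process holding an IndexWriter on the shared index.
    static void send(ConnectionFactory factory, Queue queue, String docPayload)
            throws Exception {
        Connection conn = factory.createConnection();
        try {
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(queue);
            producer.send(session.createTextMessage(docPayload));
        } finally {
            conn.close();
        }
    }
}
```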

Re: indexing performance problems

2009-06-10 Thread Mateusz Berezecki
Hi list! I'm forwarding as somehow I did not put the list in the CC but the answer I think is noteworthy, so here it is. Please remember to use StringBuffer before blaming lucene ;-) Actual time consumed by lucene is now ~130 minutes as opposed to 20 hours which is neat. I can do much more passes
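
The StringBuffer point generalizes: repeated `String` concatenation in a loop copies the whole accumulated string each time (quadratic), while a buffer appends in place. A minimal illustration (not Mateusz's actual indexing code):

```java
public class ConcatDemo {
    // O(n^2) overall: each += allocates a new String and copies
    // everything accumulated so far.
    static String slowJoin(String[] parts) {
        String s = "";
        for (String p : parts) {
            s += p;
        }
        return s;
    }

    // O(n) overall: appends into one growable buffer.
    // (StringBuilder is the unsynchronized StringBuffer.)
    static String fastJoin(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(p);
        }
        return sb.toString();
    }
}
```

The outputs are identical; only the cost differs, which is why switching cut the reported indexing time from roughly 20 hours to about 130 minutes.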