Re: Hit.getDocument performance

mark harwood Fri, 24 Nov 2006 04:46:14 -0800

Look in the latest SVN version - there is some new code for "Lazy field 
loading" i.e. not incurring the hit for retrieving *all* fields if you only 
want to retrieve a subset from a document. Not used it myself yet but it may be 
applicable.


If you *really* want all matching docs too then I wouldn't use Hits - I think 
it retains a limited set of doc ids and will rerun your query whenever you page 
your way through it's limited cache in order to get the next batch. Look at the 
HitCollector API instead.

To be honest though building an app which returns an unlimited volume
of PDF contents sounds like a good candidate for OutOfMemory exceptions
- are the results streamed in any way? Any buffering of the entire result could 
be a killer.

Cheers
Mark


----- Original Message ----
From: Luis Rodrigo Aguado <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 24 November, 2006 12:14:27 PM
Subject: Re: Hit.getDocument performance

I have just read in the API doc that going through the Hits returned is 
not really adviceable. However, I am not developing the final 
application, but a middleware that accesses Lucene, so I would not want 
to take the decision to cut the number of docs returned, but let the 
application do that. Is there any way to bypass this limitation?

Thanks!


Luis Rodrigo Aguado escribió:
>    Hi all,
>
>    I am having a performance bottleneck that is driving me crazy. 
> Maybe anyone there has a clue of the source...
>
>    I am working with an index of 2400 pdf files. For each of them, I 
> index the contents, and I store the filename and the creation date. 
> Nothing else. The resulting index is about 6Mb.
>
>    The application generates several queries for each user input, and, 
> depending on the queries I launch, it may take up to 10 minutes to get 
> the results!!!  It depends on the number of hits, being around 1500 
> docs the highest number of hits I have tested. After a profiling 
> session I have located the Hits.getDocument as the primary source of 
> time (and memory) waste.
>
>    Is this reasonable?  Maybe I did something wrong to create the 
> index?  Are there any workarounds that you imagine?
>
>    Thanks in advance!!!
>
>    Luis.
>
>
> ------------------------------------------------------------------------
>
> No virus found in this incoming message.
> Checked by AVG Free Edition.
> Version: 7.5.430 / Virus Database: 268.14.14/548 - Release Date: 23/11/2006 
> 15:22
>   

-- 

*Luis Rodrigo Aguado*

Innovation and R&D

Research Manager

lrodrigo(at)isoco.com

#T  +34913349777

C/Pedro de Valdivia, 10

28006, Madrid, Spain

* *

*iSOCO** *

            intelligent software for the networked economy

            www.isoco.com <http://www.isoco.com/>

 

Este mensaje se dirige exclusivamente a su destinatario y puede contener 
información privilegiada o confidencial. Si no es vd. el destinatario 
indicado, queda notificado de que la utilización, divulgación y/o copia 
sin autorización está prohibida en virtud de la legislación vigente. Si 
ha recibido este mensaje por error, le rogamos que nos lo comunique 
inmediatamente por esta misma vía y proceda a su destrucción.

 

This message is intended exclusively for its addressee and may contain 
information that is CONFIDENTIAL and protected by professional 
privilege. If you are not the intended recipient you are hereby notified 
that any dissemination, copy or disclosure of this communication is 
strictly prohibited by law. If this message has been received in error, 
please immediately notify us via e-mail and delete it.






        
        
                
___________________________________________________________ 
All new Yahoo! Mail "The new Interface is stunning in its simplicity and ease 
of use." - PC Magazine 
http://uk.docs.yahoo.com/nowyoucan.html

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Hit.getDocument performance

Reply via email to