RE: splitting docIds from a search by segment [SEC=UNOFFICIAL]

Stephen GRAY Sun, 03 Nov 2013 17:00:26 -0800

UNOFFICIAL

Hi Mike,


I ran it again and this time the two methods came out about the same: 168-288 ms 
to process 173,000 documents for the walking method and 160-205 ms for the 
MultiDocValues method. I don't know what was happening with my last test.

Here is my code:

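// Assumes docs.scoreDocs is in increasing docID order (for example, a search
// sorted by Sort.INDEXORDER): the loop below only ever moves forward through the leaves.
if (docs.totalHits > 0)
{
        int currentContextIndex = 0;
        List<AtomicReaderContext> leaves = searcher.getIndexReader().leaves();
        AtomicReaderContext currentContext = leaves.get(currentContextIndex);
        NumericDocValues values = getNumericDocValues(currentContext, "responseTime");
        for (ScoreDoc scoreDoc : docs.scoreDocs)
        {
                // Advance to the leaf whose range [docBase, docBase + maxDoc) contains this docID.
                while (scoreDoc.doc >= (currentContext.docBase + currentContext.reader().maxDoc()))
                {
                        currentContext = leaves.get(++currentContextIndex);
                        values = getNumericDocValues(currentContext, "responseTime");
                }

                // Convert the top-level docID to a segment-local docID before the lookup.
                int value = (int)values.get(scoreDoc.doc - currentContext.docBase);
                // do stuff
        }
}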

// Wraps the per-segment lookup so callers deal with a single application exception.
private NumericDocValues getNumericDocValues(final AtomicReaderContext context, final String field)
                throws ProfileException
{
        try
        {
                return context.reader().getNumericDocValues(field);
        }
        catch (IOException e)
        {
                throw new ProfileException("Unable to extract results from index for query 'read response times'.", e);
        }
}
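
For anyone comparing the two approaches, here is a minimal Lucene-free sketch of the docID arithmetic involved. It is only a model, not Lucene code: the class name SegmentLookupSketch, the plain docBase/maxDoc arrays and the values array are all hypothetical stand-ins for the leaves and their NumericDocValues. It assumes every segment is non-empty, and the walking variant assumes the docIDs arrive in increasing order.

```java
import java.util.Arrays;

/** Lucene-free model: segment i owns top-level docIDs [docBase[i], docBase[i] + maxDoc[i]). */
final class SegmentLookupSketch {

    /** Walking approach: one forward pass; requires docIds in increasing order. */
    static long sumByWalking(int[] docBase, int[] maxDoc, long[][] values, int[] sortedDocIds) {
        long sum = 0;
        int seg = 0;
        for (int docId : sortedDocIds) {
            // Move to the next segment once the docID is beyond the current one's end.
            while (docId >= docBase[seg] + maxDoc[seg]) {
                seg++;
            }
            // Subtract docBase to get the segment-local docID.
            sum += values[seg][docId - docBase[seg]];
        }
        return sum;
    }

    /** Binary-search approach: finds the owning segment for a docID in any order. */
    static int segmentOf(int[] docBase, int docId) {
        int idx = Arrays.binarySearch(docBase, docId);
        // A miss returns -(insertionPoint) - 1; the owning segment is insertionPoint - 1.
        return idx >= 0 ? idx : -idx - 2;
    }

    /** Same sum as above, but paying one binary search per lookup. */
    static long sumByBinarySearch(int[] docBase, long[][] values, int[] docIds) {
        long sum = 0;
        for (int docId : docIds) {
            int seg = segmentOf(docBase, docId);
            sum += values[seg][docId - docBase[seg]];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] docBase = {0, 3, 5};
        int[] maxDoc = {3, 2, 4};
        long[][] values = {{10, 20, 30}, {40, 50}, {60, 70, 80, 90}};
        System.out.println(sumByWalking(docBase, maxDoc, values, new int[] {1, 3, 4, 8}));
        System.out.println(sumByBinarySearch(docBase, values, new int[] {8, 1, 4, 3}));
    }
}
```

Both calls in main print 200. The binary-search variant tolerates docIDs in any order at the cost of a search per lookup, while the walking variant does a single forward pass but gives wrong answers (or runs off the end of the arrays) if the docIDs are not sorted.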

Thanks for the tip on using a custom Collector. This is in Lucene in Action 
(great book by the way).

Regards,
Steve


-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Monday, 4 November 2013 10:30 AM
To: Lucene Users
Subject: Re: splitting docIds from a search by segment [SEC=UNOFFICIAL]

It's very strange that you see faster performance using
MultiDocValues: that simply should not be the case.  Can you share your 
per-segment code?

Also, it's rather inefficient to collect all hits by passing maxDoc as n to 
IndexSearcher.search; if you really just want the docIDs and you don't care 
about order it's better to make a custom Collector that simply appends the 
docID to an array/list.  I believe Lucene in Action includes an example for 
this... (disclosure: I'm one of the authors).
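
The shape of such a collector can be sketched without Lucene on the classpath. The model below is an illustration under stated assumptions, not the book's example: SegmentCollector, DocIdListCollector and the search driver are hypothetical stand-ins. In Lucene 4.x the corresponding callbacks on Collector are setNextReader(AtomicReaderContext), called once per segment, and collect(int doc), which receives a segment-relative docID.

```java
import java.util.ArrayList;
import java.util.List;

final class DocIdCollectorSketch {

    /** Hypothetical stand-in for the per-segment callbacks of a collector. */
    interface SegmentCollector {
        void setNextSegment(int docBase); // called once before each segment's hits
        void collect(int localDoc);       // called per hit with a segment-local docID
    }

    /** Appends every hit's top-level docID to a list; no scoring, no sorting. */
    static final class DocIdListCollector implements SegmentCollector {
        private final List<Integer> docIds = new ArrayList<>();
        private int docBase;

        @Override
        public void setNextSegment(int docBase) {
            this.docBase = docBase;
        }

        @Override
        public void collect(int localDoc) {
            // Rebase the segment-local docID to a top-level docID.
            docIds.add(docBase + localDoc);
        }

        List<Integer> docIds() {
            return docIds;
        }
    }

    /** Hypothetical driver that plays the role of the searcher, segment by segment. */
    static void search(int[] docBase, int[][] hitsPerSegment, SegmentCollector collector) {
        for (int seg = 0; seg < docBase.length; seg++) {
            collector.setNextSegment(docBase[seg]);
            for (int localDoc : hitsPerSegment[seg]) {
                collector.collect(localDoc);
            }
        }
    }

    public static void main(String[] args) {
        DocIdListCollector collector = new DocIdListCollector();
        search(new int[] {0, 3, 5}, new int[][] {{1}, {0, 1}, {3}}, collector);
        System.out.println(collector.docIds());
    }
}
```

Because the callbacks already arrive segment by segment with local docIDs, a collector of this shape could also read the per-segment values directly inside collect, which would avoid both the walking loop and the binary search.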

Mike McCandless

http://blog.mikemccandless.com


On Sun, Nov 3, 2013 at 5:37 PM, Stephen GRAY <stephen.g...@immi.gov.au> wrote:
> UNOFFICIAL
>
> That's what I did. You just pass searcher.search a very large value for max 
> docs so you get them all, then iterate through the ScoreDoc[] array - the 
> docId is in scoreDoc.doc.
>
> Regards,
> Steve
>
> -----Original Message-----
> From: Kyle Judson [mailto:kvjud...@hotmail.com]
> Sent: Sunday, 3 November 2013 12:37 AM
> To: java-user@lucene.apache.org
> Subject: Re: splitting docIds from a search by segment 
> [SEC=UNOFFICIAL]
>
> All,
>
> Is the best way to get the docIDs in a case like this to use 
> IndexSearcher.search to get TopDocs and then get the ScoreDoc[] from 
> TopDocs.scoreDocs?
>
> Thanks
>
> Kyle
>
>
> On 10/30/13 4:56 AM, "Michael McCandless" <luc...@mikemccandless.com>
> wrote:
>
>>You should try MultiDocValues first; it's trivial to use and may not 
>>be horribly slow.
>>
>>It must do a binary-search for every docID lookup.
>>
>>And then if this is too slow, assuming you traverse the docIDs in 
>>order, you can use IndexReader.leaves() to get the sub-readers.  The 
>>docIDs are just "appended" from these sub-readers, so you'd walk your 
>>docIDs and also walk your sub-readers, moving to the next sub-reader 
>>once you have a docID that's beyond its end.  Each sub-reader spans 
>>AtomicReaderContext.docBase (inclusive) to docBase + 
>>AtomicReaderContext.reader().maxDoc() (exclusive).
>>
>>Mike McCandless
>>
>>http://blog.mikemccandless.com
>>
>>On Wed, Oct 30, 2013 at 2:21 AM, Stephen GRAY 
>><stephen.g...@immi.gov.au>
>>wrote:
>>> UNOFFICIAL
>>> Hi everyone,
>>>
>>> I am trying to write an application that loops through 500,000 -
>>>1,000,000 documents returned by a search and calculates some 
>>>statistics using the value in a stored field. Obviously this needs to 
>>>be as fast as possible so I am using a NumericDocValues field to store the 
>>>value.
>>>
>>> What I don't know is how to get the NumericDocValues value for each 
>>>docId returned by the search. What I've been told to do in a previous 
>>>thread was:
>>>
>>> 1.       Split the docIds according to the segment they belong to
>>>
>>> 2.       Get a per-segment NumericDocValues instance and use this to
>>>extract the values
>>>
>>> Can someone tell me how to do 1 and 2? I don't know how to discover 
>>>what segment a given docId is in, or how to convert a segment into a 
>>>NumericDocValues array.
>>>
>>> By the way it's also been suggested that I just use 
>>>MultiDocValues.getNumericValues, but I gather that this will be much 
>>>slower.
>>>
>>> I'd appreciate any help,
>>>
>>> Thanks,
>>> Steve
>>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
