Re: Algorithm of retrieving docs

Michael McCandless Thu, 13 Feb 2014 07:22:34 -0800

The bloom filter is only used by the postings format wrapper, and
we've had mixed results on whether it helps performance or not (seems
to depend heavily on the exact usage).
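
For readers unfamiliar with the idea, here is a minimal sketch of a Bloom
filter. It is not the FuzzySet code: the class name BloomSketch, the two
seeds, and the simple polynomial hash are assumptions chosen only for
illustration. It shows why such a filter can answer "definitely absent" or
"possibly present", but never "definitely present".

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration only; not Lucene's FuzzySet.
final class BloomSketch {
    private final BitSet bits;
    private final int numBits;

    BloomSketch(int numBits) {
        if (numBits <= 0) {
            throw new IllegalArgumentException("numBits must be positive");
        }
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
    }

    // Maps a term to one bit position; a different seed gives a different hash.
    private int slot(byte[] term, int seed) {
        int h = seed;
        for (byte b : term) {
            h = 31 * h + b;
        }
        h ^= (h >>> 16); // mix the high bits into the low bits
        return Math.floorMod(h, numBits);
    }

    // Sets the two bits that the term hashes to.
    void add(String term) {
        byte[] b = term.getBytes(StandardCharsets.UTF_8);
        bits.set(slot(b, 17));
        bits.set(slot(b, 101));
    }

    // false means definitely absent; true means only "possibly present",
    // because other terms may have set the same bits.
    boolean mightContain(String term) {
        byte[] b = term.getBytes(StandardCharsets.UTF_8);
        return bits.get(slot(b, 17)) && bits.get(slot(b, 101));
    }

    public static void main(String[] args) {
        BloomSketch filter = new BloomSketch(1024);
        filter.add("lucene");
        System.out.println(filter.mightContain("lucene"));
    }
}
```

A true result still requires a real lookup to confirm the term exists, so
the filter only pays off when it lets many lookups be skipped.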


We have bit set / iterator abstractions (oal.util.Bits,
oal.search.DocIdSet/Iterator) to manage "sets" of documents, but most
implementations don't use a hash set under the hood.
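
As a rough illustration of a bit-set-backed document set, here is a minimal
sketch built on java.util.BitSet rather than on Lucene's own classes. The
class DocSetSketch and its method names are hypothetical, and returning -1
at exhaustion is a simplification made for this example. Membership is a
direct bit test indexed by docID, with no hashing involved.

```java
import java.util.BitSet;

// Hypothetical illustration only; not a Lucene class.
final class DocSetSketch {
    private final BitSet bits; // one bit per docID in [0, maxDoc)
    private final int maxDoc;

    DocSetSketch(int maxDoc) {
        if (maxDoc < 0) {
            throw new IllegalArgumentException("maxDoc must not be negative");
        }
        this.maxDoc = maxDoc;
        this.bits = new BitSet(maxDoc);
    }

    void add(int docID) {
        if (docID < 0 || docID >= maxDoc) {
            throw new IndexOutOfBoundsException("docID out of range: " + docID);
        }
        bits.set(docID);
    }

    // Constant-time membership: the docID itself is the index of the bit.
    boolean contains(int docID) {
        return docID >= 0 && docID < maxDoc && bits.get(docID);
    }

    // Returns the first member >= target, or -1 when there is none.
    int advance(int target) {
        if (target >= maxDoc) {
            return -1;
        }
        return bits.nextSetBit(Math.max(target, 0));
    }

    public static void main(String[] args) {
        DocSetSketch docs = new DocSetSketch(16);
        docs.add(3);
        docs.add(9);
        // Iterate the members in increasing docID order.
        for (int d = docs.advance(0); d != -1; d = docs.advance(d + 1)) {
            System.out.println(d);
        }
    }
}
```

Because docIDs are dense ints, a bit per document is usually more compact
than a hash set of the same IDs, and iteration comes out in docID order.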

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 13, 2014 at 7:11 AM, Harshvardhan Ojha
<[email protected]> wrote:
> Hi Mike/Mikhail,
>
> Don't you think the contains(BytesRef value) method in
> org.apache.lucene.codecs.bloom.FuzzySet.java returns the probability of
> having a field, and that this is a place where hashing is used?
>
> Is there any other place in the source which, given a document id, could
> compute its hash and tell whether a document with this id is present or
> not in a single O(1) lookup?
>
> Regards
> Harshvardhan Ojha
>
>
> On Thu, Feb 13, 2014 at 5:11 PM, Michael McCandless
> <[email protected]> wrote:
>>
>> Lucene only assigns its int docID during indexing.
>>
>> Retrieving a previously stored document is O(1), but that involves a
>> disk seek which can be very costly when the page is not in the OS's IO
>> cache.  Lucene does not do any caching itself (relies on the OS
>> instead).
>>
>> Have a look at the current default stored fields codec format:
>>
>> lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41StoredFieldsFormat
>> for details.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, Feb 12, 2014 at 11:27 PM, Harshvardhan Ojha
>> <[email protected]> wrote:
>> > Hi All,
>> >
>> > I have a question regarding retrieval of documents by Lucene.
>> > I know Lucene uses many files on disk to keep documents, each comprising
>> > fields, and uses many IR algorithms and an inverted index to match
>> > documents.
>> >
>> > My questions are:
>> > 1. How does Lucene store these documents in the file system and
>> > retrieve them so fast?
>> > 2. Does Lucene use any hashing algorithm to get docs in O(1)? If not,
>> > which data structure does Lucene use?
>> > 3. Apart from the id we provide at indexing time, is there any other
>> > unique identifier that Lucene assigns to its documents?
>> >
>> > I would appreciate it if someone could point me to the source files
>> > where I can study these algorithms in detail.
>> >
>> > Regards
>> > Harshvardhan Ojha
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>

