Re: Lucene 4.2 DocValues

Adrien Grand Tue, 28 May 2013 13:44:00 -0700

On Tue, May 28, 2013 at 8:55 PM, Arun Kumar K <arunk...@gmail.com> wrote:
> Thanks for clarifying the things.
> I have some doubts regarding sorting :
>>
>> While you can do that, I don't recommend it. For example, if you have
>> 5 fields, loading all fields from stored fields requires at most 1
>> disk seek while loading all fields from doc values requires at least 5
>> disk seeks for disk-based doc values.
>
>
> 1> I am assuming those mentioned 5 fields are sortable fields upon which 
> sorting is done.
> In my understanding, loading stored fields takes 1 disk seek for finding file 
> pointer & 1 disk seek for getting all those fields.


This was correct until Lucene 4.0, but since 4.1, Lucene stores the
doc ID -> file pointer mapping in memory, ensuring at most 1 disk
seek.

> Since different file is maintained for a particular doc value field. We get 5 
> disk seeks + 1 disk seek for file pointer.

There is no general rule, since the exact number of seeks depends on the
doc values type and the codec implementation, but you have the right idea.

> If we have only one sortable field , which could be better ? I guess no diff.

Just to make things clear, before Lucene had doc values, sorting was
performed based on the inverted index (which was uninverted and stored
in memory using FieldCache), not stored fields. Stored fields are bad
for sorting because they are usually large and don't play nice with
the file system cache.

Doc values are very similar to FieldCache except that the hard work is
done at indexing time instead of at search time. This is a good
trade-off because it allows for faster loading of indexes and for
off-loading data to disk. It is never a bad idea to use doc values
for sorting.
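
To illustrate why a per-document column is convenient for sorting, here
is a minimal sketch in plain Java. It is a toy model, not Lucene's API:
the class name, the method name and the "price" column are all
hypothetical, and a real codec may keep the column on disk rather than in
a long[]. It only shows that sorting needs one value lookup per doc ID
and never has to load whole stored documents.

```java
import java.util.Arrays;

// Toy model only, NOT Lucene's API: a doc-values-like column is modelled
// as an array in which column[docId] holds that document's field value.
class DocValuesSortSketch {

    // Sorts the matching doc IDs by their value in the given column.
    // Only one array lookup per comparison is needed; no stored document
    // is ever loaded.
    static Integer[] sortHitsByColumn(int[] hits, long[] column) {
        Integer[] sorted = new Integer[hits.length];
        for (int i = 0; i < hits.length; i++) {
            sorted[i] = hits[i];
        }
        Arrays.sort(sorted, (a, b) -> Long.compare(column[a], column[b]));
        return sorted;
    }

    public static void main(String[] args) {
        long[] price = {30L, 10L, 20L, 5L}; // hypothetical "price" column, one entry per doc ID
        int[] hits = {0, 1, 2};             // doc IDs that matched some query
        System.out.println(Arrays.toString(sortHitsByColumn(hits, price))); // prints [1, 2, 0]
    }
}
```

In this model, FieldCache would build the price array at search time by
uninverting the index, whereas doc values would have it written out
already at indexing time.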

> Also, I vaguely remember that there is some performance loss for sorting 
> based on string in lucene 4.0
> Then, will the decision change for String field or based on type of field ?

I don't see why String sorting would be slower. However, it is true
that String sorting requires a lot of memory. If your field is a
number, you should definitely use a numeric field cache.

> 2> Also, In my understanding, if we need to use parser based queries for 
> docvalues, we need to have a storedfield for a doc with same name & value of 
> the doc's docvalue.
> Even term queries won't work. Am i right here?

QueryParser is completely unaware of your schema. If you want
QueryParser to use doc-values-based queries, you can override
QueryParser.newRangeQuery and/or QueryParser.newFieldQuery to return a
new ConstantScoreQuery that wraps a FieldCacheRangeFilter.

--
Adrien

