[jira] [Commented] (LUCENE-6065) remove "foreign readers" from merge, fix LeafReader instead.

Uwe Schindler (JIRA) Thu, 20 Nov 2014 07:56:04 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219513#comment-14219513
 ]


Uwe Schindler commented on LUCENE-6065:
---------------------------------------

I agree. Actually, what you wrap is something different from those readers. So
maybe have some other class that is used at the lower level during merging: one
class that holds all those FooReader implementations (the index view). On the
searching side, LeafReader is a basic interface without any implementation. So
maybe let it be a real Java interface implemented by the codec (SegmentReader).
But you would never pass a "LeafReader" to the merging API. Making everything
that is the real LeafReader interface a final implementation detail is just
wrong.

So just have a different type of API behind the scenes when merging that you 
can wrap. And keep LeafReader completely out when merging, just wrap something 
different.
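
A minimal sketch of the split suggested above, for illustration only. None of
these types exist in Lucene: MergeSource, SearchView, StoredFieldsSource,
Segment and ReversedMergeSource are hypothetical stand-ins, and stored fields
are reduced to plain strings. The point is only that the merge API accepts the
merge-side type, wrappers target that type, and the search-side interface is
never involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeSourceSketch {

    /** Hypothetical stand-in for a codec-level stored fields reader. */
    interface StoredFieldsSource {
        String document(int docID);
    }

    /** Hypothetical merge-side view of a segment: exposes only codec-level readers. */
    interface MergeSource {
        int maxDoc();
        StoredFieldsSource storedFields();
    }

    /** Hypothetical search-side view: a plain interface, never passed to merging. */
    interface SearchView {
        String document(int docID);
    }

    /** A segment implements both views; the codec would supply the readers. */
    static final class Segment implements MergeSource, SearchView {
        private final List<String> docs;
        Segment(List<String> docs) { this.docs = docs; }
        @Override public int maxDoc() { return docs.size(); }
        @Override public StoredFieldsSource storedFields() { return docs::get; }
        @Override public String document(int docID) { return docs.get(docID); }
    }

    /** A merge-side wrapper: reverses document order without touching SearchView. */
    static final class ReversedMergeSource implements MergeSource {
        private final MergeSource in;
        ReversedMergeSource(MergeSource in) { this.in = in; }
        @Override public int maxDoc() { return in.maxDoc(); }
        @Override public StoredFieldsSource storedFields() {
            StoredFieldsSource delegate = in.storedFields();
            int last = in.maxDoc() - 1;
            return docID -> delegate.document(last - docID);
        }
    }

    /** The merge entry point only knows MergeSource, so there is one code path. */
    static List<String> merge(List<? extends MergeSource> sources) {
        List<String> merged = new ArrayList<>();
        for (MergeSource source : sources) {
            StoredFieldsSource fields = source.storedFields();
            for (int doc = 0; doc < source.maxDoc(); doc++) {
                merged.add(fields.document(doc));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        MergeSource plain = new Segment(Arrays.asList("a0", "a1"));
        MergeSource wrapped = new ReversedMergeSource(new Segment(Arrays.asList("b0", "b1", "b2")));
        System.out.println(merge(Arrays.asList(plain, wrapped)));
    }
}
```

In this sketch a wrapper such as ReversedMergeSource cannot be handed to search
code by accident, because it does not implement SearchView.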

> remove "foreign readers" from merge, fix LeafReader instead.
> ------------------------------------------------------------
>
>                 Key: LUCENE-6065
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6065
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>         Attachments: LUCENE-6065.patch
>
>
> Currently, SegmentMerger supports two classes of citizens being merged:
> # SegmentReader
> # "foreign reader" (e.g. some FilterReader)
> It does an instanceof check and executes the merge differently. In the
> SegmentReader case, stored fields and term vectors are bulk-merged, norms and
> docvalues are transferred directly without piling up on the heap, CRC32
> verification runs with IO locality of the data being merged, etc. Otherwise,
> we treat it as a "foreign" reader and it's slow.
> This is just the low level; it gets worse as you wrap with more stuff. A
> great example there is SortingMergePolicy: not only will it have the
> low-level slowdowns listed above, it will e.g. cache/pile up OrdinalMaps for
> all string docvalues fields being merged and other silliness that just makes
> matters worse.
> Another use case is 5.0 users wishing to upgrade from fieldcache to
> docvalues. This should be possible to implement with a simple incremental
> transition based on a mergepolicy that uses UninvertingReader. But we
> shouldn't populate internal fieldcache entries unnecessarily on merge and
> spike RAM until all those segment cores are released, and other things like
> bulk merge of stored fields and not piling up norms should still work: it's
> completely unrelated.
> There are more problems we can fix if we clean this up:
> checkindex/checkreader can run efficiently, since it doesn't need to spike
> RAM like merging; we can remove the checkIntegrity() method completely from
> LeafReader, since it can always be accomplished on producers; etc. In general
> it would be nice to just have one codepath for merging that is as efficient
> as we can make it, and to support things like index modifications during
> merge.
> I spent a few weeks writing 3 different implementations to fix this
> (interface, optional abstract class, "fix LeafReader"), and the last is the
> only one I don't completely hate: I think our APIs should be efficient for
> indexing as well as search.
> So the proposal is simple: refactor LeafReader to just require the producer
> APIs as abstract methods (and FilterReaders should work on that). The
> search-oriented APIs can just be final methods that defer to those.
> So we would add 5 abstract methods, but implement 10 current methods as final 
> based on those, and then merging would always be efficient.
> {code}
>   // new abstract codec-based apis
>   /** 
>    * Expert: retrieve thread-private TermVectorsReader
>    * @throws AlreadyClosedException if this reader is closed
>    * @lucene.internal 
>    */
>   protected abstract TermVectorsReader getTermVectorsReader();
>   /** 
>    * Expert: retrieve thread-private StoredFieldsReader
>    * @throws AlreadyClosedException if this reader is closed
>    * @lucene.internal 
>    */
>   protected abstract StoredFieldsReader getFieldsReader();
>   
>   /** 
>    * Expert: retrieve underlying NormsProducer
>    * @throws AlreadyClosedException if this reader is closed
>    * @lucene.internal 
>    */
>   protected abstract NormsProducer getNormsReader();
>   
>   /** 
>    * Expert: retrieve underlying DocValuesProducer
>    * @throws AlreadyClosedException if this reader is closed
>    * @lucene.internal 
>    */
>   protected abstract DocValuesProducer getDocValuesReader();
>   
>   /** 
>    * Expert: retrieve underlying FieldsProducer
>    * @throws AlreadyClosedException if this reader is closed
>    * @lucene.internal  
>    */
>   protected abstract FieldsProducer getPostingsReader();
>   // user/search oriented public apis based on the above
>   public final Fields fields();
>   public final void document(int, StoredFieldVisitor);
>   public final Fields getTermVectors(int);
>   public final NumericDocValues getNumericDocValues(String);
>   public final Bits getDocsWithField(String);
>   public final BinaryDocValues getBinaryDocValues(String);
>   public final SortedDocValues getSortedDocValues(String);
>   public final SortedNumericDocValues getSortedNumericDocValues(String);
>   public final SortedSetDocValues getSortedSetDocValues(String);
>   public final NumericDocValues getNormValues(String);
> {code}
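
To make the shape of the quoted proposal concrete, here is a small runnable
sketch under loud assumptions: SketchLeafReader, SketchSegmentReader,
SketchFilterReader, UppercasingReader, NormsSource and StoredFieldsSource are
hypothetical stand-ins, not Lucene classes, and only two of the listed methods
are modelled. It shows abstract producer-style methods, final search-oriented
methods that defer to them, and a filter that overrides a producer so that
search and merge observe the same change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class LeafReaderSketch {

    /** Hypothetical stand-in for a norms producer; returns null for unknown fields. */
    interface NormsSource {
        long[] getNorms(String field);
    }

    /** Hypothetical stand-in for a stored fields reader. */
    interface StoredFieldsSource {
        String document(int docID);
    }

    abstract static class SketchLeafReader {
        // Producer-style APIs: the only methods subclasses and filters implement.
        protected abstract NormsSource getNormsReader();
        protected abstract StoredFieldsSource getFieldsReader();
        public abstract int maxDoc();

        // Search-oriented APIs: final, defined purely in terms of the producers.
        public final long[] getNormValues(String field) {
            return getNormsReader().getNorms(field);
        }
        public final String document(int docID) {
            return getFieldsReader().document(docID);
        }
    }

    static final class SketchSegmentReader extends SketchLeafReader {
        private final String[] docs;
        private final Map<String, long[]> norms;
        SketchSegmentReader(String[] docs, Map<String, long[]> norms) {
            this.docs = docs;
            this.norms = norms;
        }
        @Override protected NormsSource getNormsReader() { return norms::get; }
        @Override protected StoredFieldsSource getFieldsReader() { return docID -> docs[docID]; }
        @Override public int maxDoc() { return docs.length; }
    }

    /** Filters delegate at the producer level rather than the search level. */
    static class SketchFilterReader extends SketchLeafReader {
        protected final SketchLeafReader in;
        SketchFilterReader(SketchLeafReader in) { this.in = in; }
        @Override protected NormsSource getNormsReader() { return in.getNormsReader(); }
        @Override protected StoredFieldsSource getFieldsReader() { return in.getFieldsReader(); }
        @Override public int maxDoc() { return in.maxDoc(); }
    }

    /** Example filter: upper-cases stored documents by wrapping the producer. */
    static final class UppercasingReader extends SketchFilterReader {
        UppercasingReader(SketchLeafReader in) { super(in); }
        @Override protected StoredFieldsSource getFieldsReader() {
            StoredFieldsSource delegate = in.getFieldsReader();
            return docID -> delegate.document(docID).toUpperCase(Locale.ROOT);
        }
    }

    /** Merging talks to producers directly: one code path, no instanceof check. */
    static List<String> mergeStoredFields(SketchLeafReader... readers) {
        List<String> merged = new ArrayList<>();
        for (SketchLeafReader reader : readers) {
            StoredFieldsSource fields = reader.getFieldsReader();
            for (int doc = 0; doc < reader.maxDoc(); doc++) {
                merged.add(fields.document(doc));
            }
        }
        return merged;
    }
}
```

Because document() and getNormValues() are final here, a filter has no way to
change search behaviour without also changing what the merge path reads.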


