[jira] [Commented] (LUCENE-7579) Sorting on flushed segment

Ferenczi Jim (JIRA) Mon, 05 Dec 2016 14:21:42 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723545#comment-15723545
 ]


Ferenczi Jim commented on LUCENE-7579:
--------------------------------------

Thanks Mike, 

{quote}
Can we rename freezed to frozen in BinaryDocValuesWriter?
But: why would freezed ever be true when we call flush?
Shouldn't it only be called once, even in the sorting case?
{quote}

This is a leftover that is no longer needed. The naming was wrong ;) and the
flag is useless, so I removed it.

{quote}
I also like how you were able to re-use the SortingXXX from
SortingLeafReader. Later on we can maybe optimize some of these;
e.g. SortingFields and CachedXXXDVs should be able to take
advantage of the fact that the things they are sorting are all already
in heap (the indexing buffer), the way you did with
MutableSortingPointValues (cool).
{quote}

Totally agree, we can revisit later and see if we can optimize memory. I think
it's already an optimization over master in terms of memory usage, since we
only "sort" the segment being flushed instead of all "unsorted" segments
during the merge.

{quote}
Can we block creating a SortingLeafReader now (make its
constructor private)? We only now ever use its inner classes I think?
And it is a dangerous class in the first place... if we can do that,
maybe we rename it SortingCodecUtils or something, just for its
inner classes.
{quote}

We still need to wrap unsorted segments during the merge for backward
compatibility (BWC), so SortingLeafReader should remain. I have no idea when
we can remove it, since indices created by older versions should still be
compatible with the new one?


{quote}
Do any of the exceptions tests for IndexWriter get angry? Seems like
if we hit an IOException e.g. during the renaming that
SortingStoredFieldsConsumer.flush does we may leave undeleted
files? Hmm or perhaps IW takes care of that by wrapping the directory
itself...
{quote}

Honestly I have no idea. I will dig.

{quote}
Can't you just pass sortMap::newToOld directly (method reference)
instead of making the lambda here?:
{quote}

Indeed, thanks.
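
As a rough illustration of that change (not the code from the patch), the
sketch below shows that a lambda which only forwards to newToOld and the
method reference sortMap::newToOld behave identically. SortMapSketch and
remap are hypothetical stand-ins; only the name newToOld comes from the
discussion above.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

// Hypothetical stand-in for the sort map; not the real Lucene class.
final class SortMapSketch {
    private final int[] newToOld;

    SortMapSketch(int[] newToOld) {
        this.newToOld = newToOld.clone();
    }

    // Maps a doc ID in the sorted order back to its original doc ID.
    int newToOld(int docID) {
        return newToOld[docID];
    }

    // Hypothetical consumer: applies any int -> int mapping to 0..size-1.
    static int[] remap(int size, IntUnaryOperator mapping) {
        int[] out = new int[size];
        for (int i = 0; i < size; i++) {
            out[i] = mapping.applyAsInt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        SortMapSketch sortMap = new SortMapSketch(new int[] {2, 0, 1});
        // Lambda that does nothing but forward the call.
        int[] viaLambda = remap(3, docID -> sortMap.newToOld(docID));
        // Equivalent method reference, as suggested in the review.
        int[] viaMethodRef = remap(3, sortMap::newToOld);
        System.out.println(Arrays.equals(viaLambda, viaMethodRef));
    }
}
```

The method reference avoids the extra parameter name and makes it obvious
that no additional logic is applied to the doc ID.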

{quote}
I think the 6.x back port here is going to be especially tricky 
{quote}

I bet, but as it stands the main part is done by reusing the SortingLeafReader
inner classes that already exist in 6.x.

I've also removed a nocommit in AssertingLiveDocsFormat, which now checks
live docs even when they are sorted.



 

> Sorting on flushed segment
> --------------------------
>
>                 Key: LUCENE-7579
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7579
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Ferenczi Jim
>
> Today flushed segments built by an index writer with an index sort specified
> are not sorted. The merge is responsible for sorting these segments,
> potentially with others that are already sorted (resulting from another
> merge).
> I'd like to investigate the cost of sorting the segment directly during the
> flush. This could make the merge faster since there are some cheap
> optimizations that can be done only if all segments to be merged are sorted.
> For instance the merge of the points could use the bulk merge instead of
> rebuilding the points from scratch.
> I made a small prototype which sorts the segment on flush here:
> https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort
> The idea is simple, for points, norms, docvalues and terms I use the 
> SortingLeafReader implementation to translate the values that we have in RAM 
> in a sorted enumeration for the writers.
> For stored fields I use a two pass scheme where the documents are first 
> written to disk unsorted and then copied to another file with the correct 
> sorting. I use the same stored field format for the two steps and just remove 
> the file produced by the first pass at the end of the process.
> This prototype has no implementation for index sorting that uses term vectors
> yet. I'll add this later if the tests are good enough.
> Speaking of testing, I tried this branch on [~mikemccand]'s benchmark scripts
> and compared master with index sorting against my branch with index sorting 
> on flush. I tried with sparsetaxis and wikipedia and the first results are 
> weird. When I use the SerialScheduler and only one thread to write the docs,  
> index sorting on flush is slower. But when I use two threads the sorting on 
> flush is much faster even with the SerialScheduler. I'll continue to run the 
> tests in order to be able to share something more meaningful.
> The tests are passing except one about concurrent DV updates. I don't know 
> this part at all so I did not fix the test yet. I don't even know if we can 
> make it work with index sorting ;).
>  [~mikemccand] I would love to have your feedback about the prototype. Could
> you please take a look? I am sure there are plenty of bugs, ... but I think
> it's a good start to evaluate the feasibility of this feature.
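
The two-pass stored-fields scheme described in the issue can be sketched
roughly as follows. This is a simplified illustration under stated
assumptions, not the prototype's code: documents are plain text lines, the
sort key is a single long per document, and the class, method and file names
(TwoPassFlushSketch, flush, stored.tmp, stored.sorted) are all hypothetical.
The finally block shows one way to avoid leaving the first-pass file behind
if the second pass fails.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical sketch of "write unsorted, then copy in sorted order".
final class TwoPassFlushSketch {

    // newToOld[i] is the original position of the doc that lands at sorted
    // position i. Stream.sorted is stable, so ties keep their original order.
    static int[] newToOld(long[] sortKeys) {
        return IntStream.range(0, sortKeys.length)
            .boxed()
            .sorted(Comparator.comparingLong(i -> sortKeys[i]))
            .mapToInt(Integer::intValue)
            .toArray();
    }

    static Path flush(Path dir, List<String> docs, long[] sortKeys) throws IOException {
        Path unsorted = dir.resolve("stored.tmp");
        Path sorted = dir.resolve("stored.sorted");
        try {
            // Pass 1: write the documents in arrival (unsorted) order.
            Files.write(unsorted, docs);

            // Pass 2: read them back and copy them in sorted order,
            // using the same (here: line-based) format for both files.
            List<String> buffered = Files.readAllLines(unsorted);
            List<String> reordered = new ArrayList<>(buffered.size());
            for (int oldDoc : newToOld(sortKeys)) {
                reordered.add(buffered.get(oldDoc));
            }
            Files.write(sorted, reordered);
            return sorted;
        } finally {
            // Remove the first-pass file whether or not pass 2 succeeded.
            Files.deleteIfExists(unsorted);
        }
    }
}
```

For example, flushing the documents "a", "b", "c" with sort keys 30, 10, 20
writes them to the final file in the order "b", "c", "a" and leaves no
temporary file in the directory.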



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

