Ferenczi Jim created LUCENE-7579:
------------------------------------
Summary: Sorting on flushed segment
Key: LUCENE-7579
URL: https://issues.apache.org/jira/browse/LUCENE-7579
Project: Lucene - Core
Issue Type: Bug
Reporter: Ferenczi Jim
Today flushed segments built by an index writer with an index sort specified
are not sorted. The merge is responsible of sorting these segments potentially
with others that are already sorted (resulted from another merge).
I'd like to investigate the cost of sorting the segment directly during the
flush. This could make the merge faster since they are some cheap optimizations
that can be done only if all segments to be merged are sorted.
For instance the merge of the points could use the bulk merge instead of
rebuilding the points from scratch.
I made a small prototype which sort the segment on flush here:
https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort
The idea is simple, for points, norms, docvalues and terms I use the
SortingLeafReader implementation to translate the values that we have in RAM in
a sorted enumeration for the writers.
For stored fields I use a two pass scheme where the documents are first written
to disk unsorted and then copied to another file with the correct sorting. I
use the same stored field format for the two steps and just remove the file
produced by the first pass at the end of the process.
This prototype has no implementation for index sorting that use term vectors
yet. I'll add this later if the tests are good enough.
Speaking of testing, I tried this branch on [~mikemccand] benchmark scripts and
compared master with index sorting against my branch with index sorting on
flush. I tried with sparsetaxis and wikipedia and the first results are weird.
When I use the SerialScheduler and only one thread to write the docs, index
sorting on flush is slower. But when I use two threads the sorting on flush is
much faster even with the SerialScheduler. I'll continue to run the tests in
order to be able to share something more meaningful.
The tests are passing except one about concurrent DV updates. I don't know this
part at all so I did not fix the test yet. I don't even know if we can make it
work with index sorting ;).
[~mikemccand] I would love to have your feedback about the prototype. Could
you please take a look ? I am sure there are plenty of bugs, ... but I think
it's a good start to evaluate the feasibility of this feature.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]