[
https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710238#comment-16710238
]
Toke Eskildsen commented on LUCENE-8585:
----------------------------------------
Thank you for your suggestions, [~jpountz]. I am a bit surprised that you
advocate for having the jump-tables for {{IndexedDISI}} and {{vBPV}} on heap?
They are not that big, but it still means heap + extra startup-time to fetch
them!? It also means that it will be done for all fields, regardless of whether
they are used or not, if I understand Entry-loading correctly.
I have a hard time discerning what the principal difference is between this and
search-time build of the jump-tables. The only difference I see is that the
index-time version loads the data contiguously instead of having to visit every
block, which boils down to segment-open performance. I am not opposed to
pre-computing and having them on heap - if nothing else it will make lookups
slightly faster - I just can't follow the logic.
I tinkered a bit and got the {{IndexedDISI}} block jump-tables to work when
stored as I described in the first comment (after the regular blocks, accessed
as needed). The scary codec beast is not _that_ scary once you get to know it.
It should be easy enough to flush the data to meta instead.
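For reference, a minimal sketch of one such block jump-table entry, following the layout in the issue description (one {{long}} per 65536 documents, 33 bits of offset and 31 bits of index). This is not the code from the patch: the class and method names are hypothetical, and placing the offset in the upper bits is an assumption, since the description only gives the bit widths.

```java
// Sketch only: hypothetical helper, not part of Lucene or of the patch.
final class BlockJumpEntry {
    private static final int INDEX_BITS = 31;                  // set bits before the block
    private static final long INDEX_MASK = (1L << INDEX_BITS) - 1;
    private static final long MAX_OFFSET = (1L << 33) - 1;     // offset must fit in 33 bits

    /** Packs a block offset and the number of set bits before the block into one long. */
    static long pack(long offset, int index) {
        if (offset < 0 || offset > MAX_OFFSET) {
            throw new IllegalArgumentException("offset does not fit in 33 bits: " + offset);
        }
        if (index < 0) {
            throw new IllegalArgumentException("index must be non-negative: " + index);
        }
        // Assumption: offset in the upper 33 bits, index in the lower 31 bits.
        return (offset << INDEX_BITS) | index;
    }

    /** Extracts the block offset; unsigned shift because the top bit may be set. */
    static long offset(long entry) {
        return entry >>> INDEX_BITS;
    }

    /** Extracts the number of set bits before the block. */
    static int index(long entry) {
        return (int) (entry & INDEX_MASK);
    }

    /** One entry per 65536 documents, so the entry number is docID / 65536. */
    static int entryFor(int docID) {
        return docID >>> 16;
    }

    public static void main(String[] args) {
        long entry = pack(123_456_789L, 4242);
        System.out.println("offset=" + offset(entry) + " index=" + index(entry)
            + " entry for doc 200000=" + entryFor(200_000));
    }
}
```

Whether the entries live on heap or are read from the index as needed, the lookup is the same: pick the entry for the target block, seek to its offset, and continue counting from its index.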
As this has the potential to be more of a collaborative project, what is the
easiest way to work together on it (ping [~jim.ferenczi])? Sending patches back
and forth seems a bit heavy to me, so perhaps a git branch? Should I create one
in the Apache git repository or on GitHub?
> Create jump-tables for DocValues at index-time
> ----------------------------------------------
>
> Key: LUCENE-8585
> URL: https://issues.apache.org/jira/browse/LUCENE-8585
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: master (8.0)
> Reporter: Toke Eskildsen
> Priority: Minor
>
> As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid
> long iterative walks. This is implemented in LUCENE-8374 at search-time
> (first request for DocValues from a field in a segment), with the benefit of
> working without changes to existing Lucene 7 indexes and the downside of
> introducing a startup time penalty and a memory overhead.
> As discussed in LUCENE-8374, the codec should be updated to create these
> jump-tables at index time. This eliminates the segment-open time & memory
> penalties, with the potential downside of increasing index-time for DocValues.
> The three elements of LUCENE-8374 should be transferable to index-time
> without much alteration of the core structures:
> * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for
> every 65536 documents, containing the offset of the block in 33 bits and the
> index (number of set bits) up to the block in 31 bits.
> It can be built sequentially and should be stored as a simple sequence of
> consecutive longs for caching of lookups.
> As it is fairly small, relative to document count, it might be better to
> simply memory cache it?
> * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16
> bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block.
> Each {{short}} represents the number of set bits up to right before the
> corresponding sub-block of 512 docIDs.
> The {{shorts}} can be computed sequentially or when the DENSE block is
> flushed (probably the easiest). They should be stored as a simple sequence of
> consecutive shorts for caching of lookups, one logically independent sequence
> for each DENSE block. The logical position would be one sequence at the start
> of every DENSE block.
> Whether it is best to read all 128 {{shorts}} (256 bytes) up front when a
> DENSE block is accessed or whether it is best to only read any individual
> {{short}} when needed is not clear at this point.
> * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric
> values. Each {{long}} holds the offset to the corresponding block of values.
> The offsets can be computed sequentially and should be stored as a simple
> sequence of consecutive {{longs}} for caching of lookups.
> The vBPV-offsets have the largest space overhead of the 3 jump-tables and a
> lot of the 64 bits in each long are not used for most indexes. They could be
> represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}},
> with the downsides of a potential lookup-time overhead and the need for doing
> the compression after all offsets have been determined.
> I have no experience with the codec-parts responsible for creating
> index-structures. I'm quite willing to take a stab at this, although I
> probably won't do much about it before January 2019. Should anyone else wish
> to adopt this JIRA-issue or co-work on it, I'll be happy to share.
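
The DENSE rank table and the vBPV offset lookup from the description above could look roughly like the sketch below. It assumes a DENSE block is held as 1024 {{longs}} and that each rank is stored as an unsigned 16-bit value in a {{short}}. The class and method names are hypothetical and nothing here is taken from the actual codec.

```java
// Sketch only: hypothetical helper, not part of Lucene.
final class JumpTableSketch {
    static final int LONGS_PER_BLOCK = 65536 / 64;                                // 1024
    static final int LONGS_PER_SUB_BLOCK = 8;                                     // 512 docIDs
    static final int RANKS_PER_BLOCK = LONGS_PER_BLOCK / LONGS_PER_SUB_BLOCK;     // 128

    /** Builds one rank per sub-block: the number of set bits before that sub-block. */
    static short[] buildRanks(long[] blockBits) {
        if (blockBits.length != LONGS_PER_BLOCK) {
            throw new IllegalArgumentException("expected " + LONGS_PER_BLOCK + " longs");
        }
        short[] ranks = new short[RANKS_PER_BLOCK];
        int setBits = 0;
        for (int i = 0; i < LONGS_PER_BLOCK; i++) {
            if (i % LONGS_PER_SUB_BLOCK == 0) {
                // Stored as unsigned 16 bits; read back with & 0xFFFF.
                ranks[i / LONGS_PER_SUB_BLOCK] = (short) setBits;
            }
            setBits += Long.bitCount(blockBits[i]);
        }
        return ranks;
    }

    /** Counts the set bits before inBlockDocID (0..65535), touching at most 8 longs. */
    static int rank(long[] blockBits, short[] ranks, int inBlockDocID) {
        int subBlock = inBlockDocID >>> 9;            // 512 docIDs per sub-block
        int count = ranks[subBlock] & 0xFFFF;
        int targetLong = inBlockDocID >>> 6;
        for (int i = subBlock * LONGS_PER_SUB_BLOCK; i < targetLong; i++) {
            count += Long.bitCount(blockBits[i]);
        }
        long below = (1L << (inBlockDocID & 63)) - 1; // bits below the target bit
        return count + Long.bitCount(blockBits[targetLong] & below);
    }

    /** vBPV: one offset per 16384 values, so the block number is valueIndex / 16384. */
    static long vbpvBlockOffset(long[] offsets, long valueIndex) {
        return offsets[(int) (valueIndex >>> 14)];
    }

    public static void main(String[] args) {
        long[] bits = new long[LONGS_PER_BLOCK];
        java.util.Arrays.fill(bits, 0x5555555555555555L); // every even docID is set
        short[] ranks = buildRanks(bits);
        System.out.println("set bits before docID 1000: " + rank(bits, ranks, 1000));
    }
}
```

The rank lookup reads one {{short}} plus at most 8 {{longs}}, which is the point of the sub-block table: without it, a lookup near the end of a DENSE block has to count bits across up to 1024 {{longs}}.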