[
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877104#comment-16877104
]
Mike Sokolov commented on LUCENE-8781:
--------------------------------------
The extension of this feature to more use cases is tracked in LUCENE-8895, so I
think I'll close again. You can see the proposed changes in the attached PR
there.
> Explore FST direct array arc encoding
> --------------------------------------
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Mike Sokolov
> Assignee: Mike Sokolov
> Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized
> arrays so Arcs are addressed directly by label, avoiding binary search that
> we use today for arrays of Arcs. PR:
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs
> while
> implementing which are more difficult to chase down when uncovered by the
> randomized testing we already do. They don't really test anything new;
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
> ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls
> failed while scanning class
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest'
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException:
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed;
> that change isn't include here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various
> conditions.
> h5. English words
> A test of looking up existing words in a dictionary of ~170000 English words
> shows improvements; the numbers listed are % change in FST size, time to look
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random
> strings that are not in the dict. The comparison is against the current
> codebase with the optimization disabled. A separate comparison of showed no
> significant change of the baseline (no opto applied) vs the current master
> FST impl with no code changes applied.
> || load=2 || load=4 || load=16 ||
> | +4, -6, -7 | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc
> encoding is used;
> namely when the number of outgoing arcs was > load * (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a
> larger improvement, around 20% (load=4). This is presumably the best case for
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the
> PKLookup case. Other results seem noisy, with perhaps a small improvment in
> some of the queries.
> {noformat}
> Task QPS base StdDev QPS opto StdDev
> Pct diff
> OrHighHigh 6.93 (3.0%) 6.89 (3.1%)
> -0.5% ( -6% - 5%)
> OrHighMed 45.15 (3.9%) 44.92 (3.5%)
> -0.5% ( -7% - 7%)
> Wildcard 8.72 (4.7%) 8.69 (4.6%)
> -0.4% ( -9% - 9%)
> AndHighLow 274.11 (2.6%) 273.58 (3.1%)
> -0.2% ( -5% - 5%)
> OrHighLow 241.41 (1.9%) 241.11 (3.5%)
> -0.1% ( -5% - 5%)
> AndHighMed 52.23 (4.1%) 52.41 (5.3%)
> 0.3% ( -8% - 10%)
> MedTerm 1026.24 (3.1%) 1030.52 (4.3%)
> 0.4% ( -6% - 8%)
> HighTerm 1111.10 (3.4%) 1116.70 (4.0%)
> 0.5% ( -6% - 8%)
> HighTermDayOfYearSort 14.59 (8.2%) 14.73 (9.3%)
> 1.0% ( -15% - 20%)
> AndHighHigh 13.45 (6.2%) 13.61 (4.4%)
> 1.2% ( -8% - 12%)
> HighTermMonthSort 63.09 (12.5%) 64.13 (10.9%)
> 1.6% ( -19% - 28%)
> LowTerm 1338.94 (3.3%) 1383.90 (5.5%)
> 3.4% ( -5% - 12%)
> PKLookup 120.45 (2.5%) 130.91 (3.5%)
> 8.7% ( 2% - 15%)
> {noformat}
> h4.FST perf tests
> I ran LookupBenchmarkTest to see the impact on the suggesters which make
> heavy use of FSTs. Some show little or no improvement, but in some cases
> there are substantial gains.
> !FST-2-4.png!
> !FST-6-9.png!
> !FST-size.png!
> chart TK
> h3. API change / implementation notes
> The only change in the public FST API is that the Builder constructor now
> takes an additional
> boolean ("useDirectArcAddressing") controlling whether or not this new
> optimization is
> applied. However, FST's internal details are not really hidden, so in
> practice any change to its
> encoding can have ripple effects in other classes.
> This is because the FST decoding is repeated in several places in the code
> base, sometimes with
> subtle variations: eg FST, FSTEnum, and fst.Util have very similar, but not
> shared, traversal code,
> and there are a few other places this same or similar decoding algorithm
> appears as well (eg
> blocktreeords codec). I do think it would be useful to make this code more
> DRY, and to hide the
> traversal implementation within a single class (or at least in the package),
> but it seemed better to
> do that separately. I also think there are possibilities for improving this
> encoding further, maybe
> by encoding Arcs in contiguous blocks with gaps in between, but before
> attempting anything like
> that, I think it would good to do that refactor to make the code in its
> current form easier to
> understand and work with.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]