nitirajrathore commented on issue #12627:
URL: https://github.com/apache/lucene/issues/12627#issuecomment-1920943319
Hi @benwtrent,
I left Amazon but I was able to run some tests with open dataset and also
with Amazon dataset before leaving. I cannot share whole lot of detail about
Amazon dataset but some aggregated results are shown. Rest assured I am picking
this issue on priority now.
I was able to run tests trying out different heuristics. Basically, I tried
the Lucene `default` diversity check, which generates disconnected even with
50K nodes of minilm dataset. I further tried different variations as suggested
in HNSW paper. The `extended candidates` approach reduced the disconnectedness
but did not eliminate it completely and it increased the indexing time by many
times. Next, `keepPruned candidates` approach with keeping max-conn connections
increased the disconnected nodes. So I tried `keepPruned candidates` with
keeping pruned connections only till max-conn/2. This also did not show any
improvement. Next, I tried the new proposed hueristic but without the component
of removeing the two way connections, so basically the new hueristic with just
removing one side of the duplex connection in the graph. Interestingly, this
also did not change the disconnectedness. This was a bit surprising to me. Then
I tried `new heuristic with remove-otherhalf`, basically re
moving the two way connections completely. This completely removed
disconnecteness and number of disconnected nodes at all levels was zero. But
unfortunately this means that the edges at some nodes can grow beyond the
max-conn. I did not get chance to find the counts and distribution of total
connections at each node which goes beyond max-conn, but I assume this is
something that we may not want in lucene. So I thought may be the removing
duplex edges (i.e. the remove-otherhalf) is the key behind decreasing
disconnectedness. So I tried `default` algo and `keep prunned` both with the
`remove-otherhalf`. Interestingly, those two experiments also did not decrease
number of disconnected nodes.
Now, I was left with `new heuristic with remove other half` as the best
option. To make sure that total connections per node do not grow beyond
max-conn, I modified the algo to remove some random node in case it is not able
to find the best node to remove (`new heuristic with remove otherhalf and
honour max-conn`). This did help to keep the overall disconnectedness to zero,
but it did show some nodes at level1 to be disconnected still (see comments)
for minilm dataset. I tried with max-conn=8, max-conn=16, num_docs=50K and
num_docs=100k. All gave zero overall disconnectedness and zero disconnected
nodes at level0 but there were some disconnected nodes at level1 graph.
Anyway, I also did the experiment using Amazon dataset for `new heuristic
with remove otherhalf and honour max-conn`. I had to do code changes again to
adopt it to Lucene 9.7 version. I saw similar results. Here also `default`
algorithm gave lot of disconnected nodes but the new heuristic gave zero
disconnected nodes at level0 and zero overall disconnected nodes. But at level1
and sometimes at level2 also there were some nodes that were disconnected.
I am more confident of the new heuristic now, but won't mind to run more
tests and perf tests.
PR with all heuristics : https://github.com/apache/lucene/pull/12783/commits
| Algo | no.
of Disconnected Nodes at zeroth level | %age overall disconnected (nodes) |
Comments
| index time |
| -------------------------------------------------------------------- |
----------------------------------------- | --------------------------------- |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| ---------- |
| Baseline Equivalent | 90
| 0.1740 (87 nodes) |
Same as baseline but with some extra parameters to allow more experiments
| 33s |
| Baseline | 90
| 0.1740 (87) |
Exactly the code as in production
| |
| Extend Candidates | 39
| 0.0760 (38) |
| 280s |
| keep-pruned till max-conn | 154
| 0.2880 (144) |
disconnected at level1=4 and at level2=1
| 36s |
| keep-pruned till half of max-conn | 97
| 0.1860 (93) |
| 34s |
| new heuristic without remove-otherhalf | 119
| 0.2240 (112) |
| 35s |
| new heuristic with remove-otherhalf | 0
| 0 |
fully connnected at both 50K docs and 100K docs but there were errors at
max-conn=8 as the size of neighbour array cannot grow and this algo allows more
than max-conn connections. | 33s |
| baseline with remove-otherhalf | 91
| 0.1720 (86) |
remove-otherhalf does not give zero disconnectedness with mainline algo
| |
| keep-half-pruned with remove-otherhalf | 90
| 0.1740 (87) | no
effect of remove-otherhalf with keep-half pruned
| 33s |
| new heuristic with remove otherhalf and honour max-conn | 0
| 0 | but
for max-conn=8 and docs=100k I was 35 disconnected nodes at level1. There were
no disconnected nodes at level0 even with max-conn=8
| 36s |
| Amazon data set baseline (lucene-9.7 release) | (75
- 338) in each segment | 0.26 - 0.48 %
| docs=7.5M, segments=13
| |
| Amazon data new heuristic with remove otherhalf and honour max-conn | 77
in 3 out of 10 segments | 0% overall |
docs=7.5M, segments=10, Indexing time 4% increase, 6% increase in red-line
queries/sec, <br>26% decrease in hnsw nodes visisted but 8% increase in avg
query latency. | |
| |
| |
| |
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]