kotman12 opened a new pull request, #2382:
URL: https://github.com/apache/solr/pull/2382

   # Description
   
   The module hopes to simplify distributing and scaling query indexes for 
monitoring and alerting workflows (also known as reverse search) by providing a 
bridge between solr-managed search indexes and lucene-monitor's efficient reverse 
search algorithms.
   
   Here is some evidence that the community might find this useful.
   1. [Blog-post](https://opensourceconnections.com/blog/2016/02/05/luwak/) 
that partly inspired the current approach
   1. Users asking about a percolator-like feature on 
[stackoverflow](https://stackoverflow.com/questions/30473406/does-solr-support-percolation).
   1. Someone contributed [this 
extension](https://github.com/SOLR4189/solcolator) but it doesn't really 
provide percolator-like functionality and because it wasn't upstreamed it fell 
out of maintenance.
   1. Plug for [my own 
question](https://www.mail-archive.com/users@solr.apache.org/msg07027.html) on 
the issue!
   
   # Solution
   
   This is still a WiP but I am opening it as a PR to get community feedback. 
The current approach is to ingest queries as solr documents, 
[decompose](https://github.com/apache/lucene/blob/main/lucene/monitor/src/java/org/apache/lucene/monitor/QueryDecomposer.java)
 them for performance, and then use the child-document feature [to index the 
decomposed 
subqueries](https://github.com/kotman12/solr/blob/solr-monitor/solr/modules/monitor/src/java/org/apache/solr/monitor/update/MonitorUpdateRequestProcessor.java#L112)
 under one atomic parent document block. On the search side the latest approach 
is to use a dedicated component that creates hooks into lucene-monitor's 
[Presearcher](https://github.com/apache/lucene/blob/main/lucene/monitor/src/java/org/apache/lucene/monitor/Presearcher.java),
 
[QueryTermFilter](https://github.com/apache/lucene/blob/baecaf556f8fe5db69d130f0a9094e83c2f5f226/lucene/monitor/src/java/org/apache/lucene/monitor/QueryIndex.java#L136)
 and [CandidateMatcher](https://github.com/apache/lucene/blob/baecaf556f8fe5db69d130f0a9094e83c2f5f226/lucene/monitor/src/java/org/apache/lucene/monitor/CandidateMatcher.java).
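
   To make the indexing side concrete, here is a minimal, stdlib-only sketch of the idea: split a top-level disjunction into its disjuncts and group them under one parent document. It is a toy model, not the PR's code and not lucene-monitor's `QueryDecomposer`; the class name, the string-based splitting, the field names and the child id scheme are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model: split a top-level disjunction and group the parts as one parent/child block. */
final class DecomposedQueryBlock {

  /** Splits on top-level " OR " only; parenthesized groups are left intact. */
  static List<String> decompose(String query) {
    List<String> parts = new ArrayList<>();
    int depth = 0;
    int start = 0;
    for (int i = 0; i < query.length(); i++) {
      char c = query.charAt(i);
      if (c == '(') {
        depth++;
      } else if (c == ')') {
        depth--;
      } else if (depth == 0 && query.startsWith(" OR ", i)) {
        parts.add(query.substring(start, i).trim());
        start = i + 4; // skip past " OR "
        i += 3;        // the loop increment moves to the next disjunct
      }
    }
    parts.add(query.substring(start).trim());
    return parts;
  }

  /** Builds one parent document holding the original query plus one child per disjunct. */
  static Map<String, Object> toBlock(String queryId, String query) {
    List<Map<String, Object>> children = new ArrayList<>();
    List<String> disjuncts = decompose(query);
    for (int i = 0; i < disjuncts.size(); i++) {
      Map<String, Object> child = new LinkedHashMap<>();
      child.put("id", queryId + "/" + i); // hypothetical child id scheme
      child.put("subquery", disjuncts.get(i));
      children.add(child);
    }
    Map<String, Object> parent = new LinkedHashMap<>();
    parent.put("id", queryId);
    parent.put("query", query);
    parent.put("children", children);
    return parent;
  }
}
```

   Because the children live in the same block as the parent, deleting or replacing the parent query can remove every disjunct with it, which is the property `testNoDanglingDecomposition` checks.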
 
   
   The current [optional cache 
implementation](https://github.com/kotman12/solr/blob/solr-monitor/solr/modules/monitor/src/java/org/apache/solr/monitor/cache/SharedMonitorCache.java)
 uses caffeine instead of lucene-monitor's [simpler 
ConcurrentHashMap](https://github.com/apache/lucene/blob/baecaf556f8fe5db69d130f0a9094e83c2f5f226/lucene/monitor/src/java/org/apache/lucene/monitor/WritableQueryIndex.java#L68).
 It's worth noting that this cache should likely be quite a bit larger than 
your average query or document cache since query parsing involves a non-trivial 
amount of compute and disk I/O (especially for large results and/or queries). 
It's also worth noting that lucene-monitor will keep _all_ the indexed queries 
cached in memory in its default configuration. A unique solr-monitor 
feature is a bespoke cache warmer that tries to populate the 
cache with _approximately_ all the queries updated since the last 
commit. This approach was added to have a baseline when comparing with 
lucene-monitor performance. The goal was to make it 
possible to effectively cache all queries in memory (since that is what 
lucene-monitor enables by default) but not necessarily require it.
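
   The caching and warming behaviour described above can be sketched roughly as follows. This is an illustration only: it uses a plain `LinkedHashMap` in LRU mode as a stand-in for caffeine, and the class name, method names and the `updatedSinceLastCommit` map are hypothetical rather than taken from `SharedMonitorCache`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy bounded cache of parsed queries with a warmer for recently updated entries. */
final class WarmableQueryCache<Q> {
  private final Function<String, Q> parser;
  private final LinkedHashMap<String, Q> entries;
  private int parseCount = 0;

  WarmableQueryCache(int maxSize, Function<String, Q> parser) {
    this.parser = parser;
    // accessOrder=true keeps the least recently used entry first, so it is evicted first.
    this.entries = new LinkedHashMap<String, Q>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Q> eldest) {
        return size() > maxSize;
      }
    };
  }

  /** Returns the cached parse, or parses and caches the query on a miss. */
  synchronized Q get(String queryId, String queryString) {
    Q cached = entries.get(queryId);
    if (cached == null) {
      cached = parse(queryString);
      entries.put(queryId, cached);
    }
    return cached;
  }

  /** Parses queries changed since the last commit so later searches skip the parse cost. */
  synchronized void warm(Map<String, String> updatedSinceLastCommit) {
    updatedSinceLastCommit.forEach((id, query) -> entries.put(id, parse(query)));
  }

  private Q parse(String queryString) {
    parseCount++;
    return parser.apply(queryString);
  }

  synchronized int parseCount() {
    return parseCount;
  }

  synchronized int cachedCount() {
    return entries.size();
  }
}
```

   With a `maxSize` at least as large as the number of registered queries, this behaves like lucene-monitor's keep-everything default; with a smaller bound, rarely used queries are re-parsed on demand.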
   
   Currently the PR has some visitor classes in the `org.apache.lucene.monitor` 
package that expose certain lucene-monitor internals. If this approach gets 
accepted then the lucene project will likely need to be updated to expose what 
is necessary.
   
   # Tests
   
   1. **testMonitorQuery**: basic functionality before and after an update
   1. **testNoDocListInResponse**: The current API allows for two types of 
responses, a special `monitorDocuments` response that can relay 
lucene-monitor's response structure and unique features such as "reverse" 
highlights. The other response structure is a regular solr document list with 
each "response" document really referring to a query that matches the "real" 
document that is being matched. This test ensures you can disable the solr 
document list from coming in the response. An outstanding task would be to 
allow the disabling of the bespoke `monitorDocuments` response section as well.
   1. **testDefaultParser**: validate that solr-monitor routes to the default 
parser when none is selected.
   1. **testDisjunctionQuery**: validate that subqueries of a disjunction get 
indexed separately.
   1. **testNoDanglingDecomposition**: validate that deleting a top-level 
query also removes all the child disjuncts.
   1. **testNotQuery**
   1. **testWildCardQuery**
   1. **manySegmentsQuery**: The cache warmer has reader-leaf-dependent logic 
so this was included to verify everything works on a multi-segment index.
   
   All of the above are also tested with the custom configurations below:
   1. Parallel matcher - lucene-monitor allows for running the final, 
most-expensive matching step in a multi-threaded environment. The current 
solr-monitor implementation allows for this with some restrictions. For 
instance, it is difficult to populate a document response list from a fully 
asynchronous matching component because it would require awkwardly opening and 
closing leaf collectors on-demand. The more idiomatic solr approach would be to 
just run this on many shards and gain parallelism as recommended 
[here](https://www.mail-archive.com/solr-user@lucene.apache.org/msg129138.html).
 Still, during testing I found that a fully async postfilter in a single shard 
had better performance than an equally parallel multi-sharded, synchronous 
postfilter so I've decided to keep it in the initial proposal. On top of that, 
it helps achieve greater feature parity with lucene-monitor (which obviously 
has no concept of sharding so can only parallelize with a special matcher).
   1. Stored monitor query - allow storing queries with `stored="true"` instead 
of using the recommended `docValues`. docValues have stricter single-value size 
limits, so this is mainly to accommodate humongous queries.
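
   As a rough picture of what the parallel matcher configuration buys, the sketch below runs the expensive per-query match step on a thread pool within a single shard. It is not lucene-monitor's parallel matcher nor the PR's postfilter: queries are modelled as plain `Predicate<String>` instances, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

/** Toy model: evaluate candidate queries against one document on a thread pool. */
final class ParallelMatchSketch {

  /** Returns the ids of matching queries, in the iteration order of the candidates map. */
  static List<String> match(
      Map<String, Predicate<String>> candidates, String document, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<String> ids = new ArrayList<>(candidates.keySet());
      List<Future<Boolean>> results = new ArrayList<>();
      for (String id : ids) {
        Predicate<String> query = candidates.get(id);
        Callable<Boolean> task = () -> query.test(document);
        results.add(pool.submit(task));
      }
      List<String> matched = new ArrayList<>();
      for (int i = 0; i < ids.size(); i++) {
        if (results.get(i).get()) {
          matched.add(ids.get(i));
        }
      }
      return matched;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while matching", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("query evaluation failed", e);
    } finally {
      pool.shutdown();
    }
  }
}
```

   Collecting the futures in submission order keeps the result deterministic even though the match calls themselves may finish in any order.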
   
   I also have some local performance tests which are 
difficult to port but that helped guide some of the decisions so far. I've also 
"manually" tested the custom tlog deserialization of the [derived query 
field](https://github.com/kotman12/solr/blob/solr-monitor/solr/modules/monitor/src/java/org/apache/solr/monitor/update/MonitorUpdateRequestProcessor.java#L203)
 but this should probably go somewhere in the special `TlogReplay` test. I 
haven't gone down that rabbit hole yet as I wanted to poll for some feedback 
first. The reason we skip TLog for the derived query fields is because these 
fields wrap a tokenstream which in itself is difficult to serialize without a 
custom analyzer. The goal was to let users leverage their existing document 
schema as often as possible instead of having to create something custom for 
the query-monitoring use-case.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://github.com/apache/solr/blob/main/CONTRIBUTING.md) and my 
code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [ ] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`. **TODO some apparently unrelated test 
failures**
   - [x] I have added tests for my changes.
   - [ ] I have added documentation for the [Reference 
Guide](https://github.com/apache/solr/tree/main/solr/solr-ref-guide)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

