Also, I think this is something you could easily trial: just take out the Kafka step, replace it with, say, an insert into a Solr collection, and see what happens.
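A rough sketch of that trial, in case it helps. The Python below only builds the expression and the HTTP request; it swaps the custom kafka() step from your pseudocode for Solr's update() stream decorator. The email_alerts and doc_collection names come from your pseudocode, while alert_matches, the helper function names, the batch size, and the localhost URL are my own assumptions for illustration, not anything from a real deployment.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_alert_daemon(alert_id, query,
                       checkpoint_collection="email_alerts",
                       source_collection="doc_collection",
                       destination_collection="alert_matches",  # hypothetical name
                       run_interval_ms=1000, batch_size=100):
    """Return a daemon expression that writes topic matches into a Solr collection."""
    # The values are pasted into the expression verbatim, so this sketch
    # simply refuses input it cannot quote safely.
    if '"' in alert_id or '"' in query:
        raise ValueError("this sketch does not handle embedded double quotes")
    return (
        f'daemon(id="{alert_id}",\n'
        f'       runInterval="{run_interval_ms}",\n'
        f'       update({destination_collection},\n'
        f'              batchSize={batch_size},\n'
        f'              topic({checkpoint_collection},\n'
        f'                    {source_collection},\n'
        f'                    q="{query}",\n'
        f'                    fl="id, title, abstract",\n'
        f'                    id="{alert_id}",\n'
        f'                    initialCheckpoint=0)))'
    )


def stream_request(solr_base_url, collection, expr):
    """Build (but do not send) a POST to the collection's /stream handler."""
    url = f"{solr_base_url.rstrip('/')}/{collection}/stream"
    body = urlencode({"expr": expr}).encode("utf-8")
    return Request(url, data=body)


expr = build_alert_daemon("alertId", "email query")
req = stream_request("http://localhost:8983/solr", "email_alerts", expr)  # assumed URL
print(expr)
# To actually start the daemon: urllib.request.urlopen(req)
```

Nothing is sent until you call urlopen on the request, so you can inspect the expression first and then watch what lands in the destination collection.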
Monitoring the daemon process is easy too ;-)

> On Sep 7, 2021, at 8:50 AM, Joel Bernstein <joels...@gmail.com> wrote:
>
> There was a design implemented in Streaming Expression for large scale
> alerting described here:
>
> https://joelsolr.blogspot.com/2017/01/deploying-solrs-new-parallel-executor.html
>
> In this design you would store each alert in Solr as a topic expression.
> Then a single daemon can run all the topics or it can be parallelized.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Sep 7, 2021 at 6:32 AM Charlie Hull <ch...@opensourceconnections.com> wrote:
>
>> Hi Dan,
>>
>> Yuval and my suggestions both rely on the same underlying code (Luwak,
>> now called Lucene Monitor). This lets you store a set of Lucene queries
>> and run them against every new document.
>>
>> The Lucene Monitor allows for very high-performance matching (I know of
>> situations with around 1m stored queries, monitoring 1m new documents a
>> day running on a few tens of nodes) and it does this with some clever
>> optimisations: effectively it builds an index of your stored queries,
>> and turns each new document into a query across this index (I know it
>> sounds confusing!). It's a 'reverse search'. Check out the original
>> Luwak project as it's got links to several presentations and blogs
>> showing how others have implemented these systems.
>>
>> The bit you'll have to build is the Solr layer and then the code that
>> uses this to generate alerts - and Solcolator and
>> https://github.com/o19s/solr-monitor are two examples of how to do the
>> first part, which you can build on. The facility to do a reverse search
>> is not built into Solr - yet, unlike Elasticsearch's Percolator.
>>
>> Best
>>
>> Charlie
>>
>> On 07/09/2021 10:24, Dan Rosher wrote:
>>> Thanks Eric, Charlie and Yuval for all the feedback and suggestions.
>>>
>>> Eric: Yes, I thought the monitoring might be a bit of a pain, esp. with
>>> millions of them. I'll have to check out the topic code, but I wondered if
>>> I can look at the checkpoint collections for uniqueIds that haven't been
>>> updated for a 'while', which might suggest the daemon had stopped/died,
>>> rather than checking each daemon individually?
>>>
>>> I was also wondering whether it's possible, or a useful enhancement, to look
>>> at the replica index version (as opposed to _version_) for the topic
>>> streaming expression, to skip queries where the replica index is the same as
>>> what we might store in the checkpoint collection? For collections that
>>> update infrequently I think this might be useful.
>>>
>>> Charlie: It was for email alerts, so a user stores a query for collection
>>> docs to match against, and then the system emails matches to the user. Do
>>> you think solr-monitor can be used for this purpose?
>>>
>>> Yuval: I like the idea of using the UpdateProcessor, at least there's no
>>> need for daemons or monitoring of them, but would this scale for millions
>>> of email queries though?
>>>
>>> Many thanks again to all.
>>>
>>> Kind regards,
>>> Dan
>>>
>>> On Mon, 6 Sept 2021 at 18:47, Yuval Paz <yuval.p...@mail.huji.ac.il> wrote:
>>>
>>>> My team and I are building upon this Solcolator:
>>>> https://github.com/SOLR4189/solcolator
>>>>
>>>> Currently the processor is built for Solr 6.5.1. We are working on updating
>>>> our Solr, and I hope to release a complete version of our Solcolator as
>>>> open source then (it will be for version 8.6.x).
>>>>
>>>> It is implemented as an update processor: either make it the last element,
>>>> replacing the usual processor that indexes the document, or use it as the
>>>> one-from-last processor in the collection, which also allows monitoring
>>>> atomic updates [which is relatively costly].
>>>>
>>>> By making it an update processor we don't rely on the streaming daemon,
>>>> which we found unsatisfying as we wish to allow users to define their own
>>>> monitors over the index.
>>>>
>>>> On Mon, Sep 6, 2021, 8:25 PM Charlie Hull <ch...@opensourceconnections.com> wrote:
>>>>
>>>>> Are you trying to monitor a stream of emails for certain patterns? In
>>>>> which case you might look at the Lucene Monitor
>>>>> https://lucene.apache.org/core/8_2_0/monitor/index.html?overview-summary.html
>>>>> https://issues.apache.org/jira/browse/LUCENE-8766, which was originally
>>>>> Luwak - at my previous company Flax we helped build several large-scale
>>>>> monitoring systems with this https://github.com/flaxsearch/luwak . It's
>>>>> not officially surfaced in Solr yet although my colleague Scott Stults
>>>>> has been working on some ideas: https://github.com/o19s/solr-monitor
>>>>>
>>>>> best
>>>>> Charlie
>>>>>
>>>>> On 06/09/2021 14:32, Dan Rosher wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I was wondering if anyone had tried email alerts with streaming
>>>>>> expressions, and what their experience was if attempting this with say 12
>>>>>> million emails / day? Traditionally this might have been done with a
>>>>>> database cursor iterator daily.
>>>>>>
>>>>>> I was thinking of something like the following pseudocode expression, with
>>>>>> 'kafka' as a custom push expression:
>>>>>>
>>>>>> daemon(id="alertId",
>>>>>>        runInterval="1000",
>>>>>>        kafka(
>>>>>>            kafka_topic,
>>>>>>            alertId,
>>>>>>            topic(email_alerts,
>>>>>>                  doc_collection,
>>>>>>                  q="email query",
>>>>>>                  fl="id, title, abstract",
>>>>>>                  id="alertId",
>>>>>>                  initialCheckpoint=0)
>>>>>>        )
>>>>>> )
>>>>>>
>>>>>> If you have done something like this, 'where' would you typically run the
>>>>>> daemon: on replicas away from replicas running web queries?
>>>>>>
>>>>>> Many thanks in advance for any advice / suggestions,
>>>>>>
>>>>>> Dan
>>>>>>
>>>>> --
>>>>> Charlie Hull - Managing Consultant at OpenSource Connections Limited
>>>>> <www.o19s.com>
>>>>> Founding member of The Search Network <https://thesearchnetwork.com/>
>>>>> and co-author of Searching the Enterprise
>>>>> <https://opensourceconnections.com/about-us/books-resources/>
>>>>> tel/fax: +44 (0)8700 118334
>>>>> mobile: +44 (0)7767 825828
>>>>>
>>>>> OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
>>>>> Amtsgericht Charlottenburg | HRB 230712 B
>>>>> Geschäftsführer: John M. Woodell | David E. Pugh
>>>>> Finanzamt: Berlin Finanzamt für Körperschaften II

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.