Also, I think this is something you could easily trial: just take out the Kafka
step, replace it with, say, an insert into a Solr collection, and see what
happens.
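
A minimal sketch of that trial, assuming Solr's stock update() decorator
stands in for the custom 'kafka' push expression from the pseudocode quoted
further down. The helper name build_alert_daemon and the destination
collection alert_matches are hypothetical; the other collection and field
names are taken from that pseudocode. It only builds the expression string,
which would then be sent as the expr parameter to a collection's /stream
handler.

```python
def build_alert_daemon(
    alert_id: str,
    query: str,
    *,
    checkpoint_collection: str = "email_alerts",
    source_collection: str = "doc_collection",
    destination_collection: str = "alert_matches",  # hypothetical name
    fields: str = "id, title, abstract",
    run_interval_ms: int = 1000,
    batch_size: int = 100,
) -> str:
    """Return a daemon(update(topic(...))) streaming expression as text.

    The topic() call finds documents that newly match the stored query and
    update() indexes those matches into the destination collection.
    """
    # Simplification for this sketch: reject embedded double quotes instead
    # of guessing at an escaping rule for the expression parser.
    if '"' in query or '"' in alert_id:
        raise ValueError("double quotes are not handled in this sketch")
    return (
        f'daemon(id="{alert_id}",\n'
        f'       runInterval="{run_interval_ms}",\n'
        f'       update({destination_collection},\n'
        f'              batchSize={batch_size},\n'
        f'              topic({checkpoint_collection},\n'
        f'                    {source_collection},\n'
        f'                    q="{query}",\n'
        f'                    fl="{fields}",\n'
        f'                    id="{alert_id}",\n'
        f'                    initialCheckpoint=0)))'
    )


if __name__ == "__main__":
    print(build_alert_daemon("alertId", "title:solr"))
```

Writing matches to a collection first makes it possible to watch how the
daemon behaves before any custom push expression is involved.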

Monitoring the daemon process is easy too  ;-)


> On Sep 7, 2021, at 8:50 AM, Joel Bernstein <joels...@gmail.com> wrote:
> 
> There was a design implemented in Streaming Expression for large scale
> alerting described here:
> 
> https://joelsolr.blogspot.com/2017/01/deploying-solrs-new-parallel-executor.html
> 
> In this design you would store each alert in Solr as a topic expression.
> Then a single daemon can run all the topics or it can be parallelized.
> 
> 
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> 
> On Tue, Sep 7, 2021 at 6:32 AM Charlie Hull <ch...@opensourceconnections.com>
> wrote:
> 
>> Hi Dan,
>> 
>> Yuval's suggestion and mine both rely on the same underlying code (Luwak,
>> now called Lucene Monitor). This lets you store a set of Lucene queries
>> and run them against every new document.
>> 
>> The Lucene Monitor allows for very high-performance matching (I know of
>> situations with around 1m stored queries, monitoring 1m new documents a
>> day running on a few tens of nodes) and it does this with some clever
>> optimisations: effectively it builds an index of your stored queries,
>> and turns each new document into a query across this index (I know it
>> sounds confusing!). It's a 'reverse search'. Check out the original
>> Luwak project as it's got links to several presentations and blogs
>> showing how others have implemented these systems.
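
The 'reverse search' idea described above can be illustrated with a toy
sketch. This is not the Lucene Monitor API: the class ToyMonitor and its
methods are hypothetical, and stored queries are simplified to "all of these
terms must appear". It only shows the shape of the technique: the stored
queries are indexed by term, and each new document is used to look up
candidate queries, which are then verified.

```python
from collections import defaultdict


class ToyMonitor:
    """Toy reverse search: index the queries, then query with the document."""

    def __init__(self):
        self._queries = {}                # query id -> required terms
        self._by_term = defaultdict(set)  # term -> ids of queries using it

    def register(self, query_id, terms):
        """Store a query that matches documents containing all given terms."""
        required = frozenset(term.lower() for term in terms)
        if not required:
            raise ValueError("a stored query needs at least one term")
        if query_id in self._queries:
            raise ValueError(f"query {query_id!r} is already registered")
        self._queries[query_id] = required
        for term in required:
            self._by_term[term].add(query_id)

    def match(self, text):
        """Return the sorted ids of stored queries matching the document."""
        doc_terms = set(text.lower().split())
        # Step 1: the document acts as a query over the index of queries,
        # so only queries sharing at least one term become candidates.
        candidates = set()
        for term in doc_terms:
            candidates |= self._by_term.get(term, set())
        # Step 2: verify each candidate fully against the document.
        return sorted(
            qid for qid in candidates if self._queries[qid] <= doc_terms
        )


if __name__ == "__main__":
    monitor = ToyMonitor()
    monitor.register("q1", ["solr", "alerts"])
    monitor.register("q2", ["kafka"])
    print(monitor.match("Solr alerts at scale"))
```

The point of the candidate step is that most stored queries are never
evaluated for a given document, which is what makes very large query sets
feasible.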
>> 
>> The bit you'll have to build is the Solr layer and then the code that
>> uses this to generate alerts - and Solcolator and
>> https://github.com/o19s/solr-monitor are two examples of how to do the
>> first part, which you can build on. The facility to do a reverse search
>> is not built into Solr - yet, unlike Elasticsearch's Percolator.
>> 
>> Best
>> 
>> Charlie
>> 
>> On 07/09/2021 10:24, Dan Rosher wrote:
>>> Thanks Eric, Charlie and Yuval for all the feedback and suggestions.
>>> 
>>> Eric: Yes, I thought the monitoring might be a bit of a pain, especially
>>> with millions of them. I'll have to check out the topic code, but I
>>> wondered whether I can look at the checkpoint collections for uniqueIds
>>> that haven't been updated for a 'while', which might suggest the daemon
>>> had stopped/died, rather than checking each daemon individually?
>>> 
>>> I was also wondering whether it's possible, or a useful enhancement, to
>>> look at the replica index version (as opposed to _version_) for the topic
>>> streaming expression, to skip queries where the replica index version is
>>> the same as what we might store in the checkpoint collection? For
>>> collections that update infrequently I think this might be useful.
>>> 
>>> Charlie: It was for email alerts, so a user stores a query for collection
>>> docs to match against, and then the system emails matches to the user. Do
>>> you think solr-monitor can be used for this purpose?
>>> 
>>> Yuval: I like the idea of using the UpdateProcessor, at least there's no
>>> need for daemons or monitoring of them, but would this scale for millions
>>> of email queries though?
>>> 
>>> Many thanks again to all.
>>> 
>>> Kind regards,
>>> Dan
>>> 
>>> 
>>> 
>>> 
>>> On Mon, 6 Sept 2021 at 18:47, Yuval Paz <yuval.p...@mail.huji.ac.il> wrote:
>>> 
>>>> Me and my team are building upon this solcolator:
>>>> https://github.com/SOLR4189/solcolator
>>>> 
>>>> Currently the processor is built for Solr 6.5.1. We are working on
>>>> updating our Solr, and I hope to release a complete version of our
>>>> Solcolator as open source then (it will be for version 8.6.x).
>>>> 
>>>> We make it an update processor, either as the last element, replacing
>>>> the usual processor that indexes the document, or as the second-to-last
>>>> processor in the collection's chain, which also allows monitoring atomic
>>>> updates [which is relatively costly].
>>>> 
>>>> By making it an update processor we don't rely on the streaming daemon,
>>>> which we found unsatisfying, as we wish to allow users to define their
>>>> own monitors over the index.
>>>> 
>>>> On Mon, Sep 6, 2021, 8:25 PM Charlie Hull <ch...@opensourceconnections.com>
>>>> wrote:
>>>> 
>>>>> Are you trying to monitor a stream of emails for certain patterns? In
>>>>> which case you might look at the Lucene Monitor
>>>>> https://lucene.apache.org/core/8_2_0/monitor/index.html?overview-summary.html
>>>>> https://issues.apache.org/jira/browse/LUCENE-8766, which was originally
>>>>> Luwak - at my previous company Flax we helped build several large-scale
>>>>> monitoring systems with this https://github.com/flaxsearch/luwak . It's
>>>>> not officially surfaced in Solr yet, although my colleague Scott Stults
>>>>> has been working on some ideas: https://github.com/o19s/solr-monitor
>>>>> 
>>>>> best
>>>>> Charlie
>>>>> 
>>>>> On 06/09/2021 14:32, Dan Rosher wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I was wondering if anyone had tried email alerts with streaming
>>>>>> expressions, and what their experience was if attempting this with, say,
>>>>>> 12 million emails / day? Traditionally this might have been done with a
>>>>>> database cursor iterator daily.
>>>>>> 
>>>>>> I was thinking of something like the following pseudocode expression,
>>>>>> with 'kafka' as a custom push expression:
>>>>>> 
>>>>>> daemon(id="alertId",
>>>>>>         runInterval="1000",
>>>>>>         kafka(
>>>>>>          kafka_topic,
>>>>>>          alertId,
>>>>>>          topic(email_alerts,
>>>>>>            doc_collection,
>>>>>>            q="email query",
>>>>>>            fl="id, title, abstract",
>>>>>>            id="alertId",
>>>>>>            initialCheckpoint=0)
>>>>>>          )
>>>>>>        )
>>>>>> 
>>>>>> If you have done something like this, 'where' would you typically run
>>>>>> the daemon: on replicas away from the replicas running web queries?
>>>>>> 
>>>>>> Many thanks in advance for any advice / suggestions,
>>>>>> 
>>>>>> Dan
>>>>>> 
>>>>> --
>>>>> Charlie Hull - Managing Consultant at OpenSource Connections Limited
>>>>> <www.o19s.com>
>>>>> Founding member of The Search Network <https://thesearchnetwork.com/>
>>>>> and co-author of Searching the Enterprise
>>>>> <https://opensourceconnections.com/about-us/books-resources/>
>>>>> tel/fax: +44 (0)8700 118334
>>>>> mobile: +44 (0)7767 825828
>>>>> 
>>>>> OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
>>>>> Amtsgericht Charlottenburg | HRB 230712 B
>>>>> Geschäftsführer: John M. Woodell | David E. Pugh
>>>>> Finanzamt: Berlin Finanzamt für Körperschaften II
>>>>> 
>> 

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
    
This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.
