Re: Email alerts with streaming expressions

2021-09-07 Thread Dan Rosher
Thanks Eric, Charlie and Yuval for all the feedback and suggestions.

Eric: Yes, I thought the monitoring might be a bit of a pain, especially with
millions of them. I'll have to check out the topic code, but I wondered if
I could look at the checkpoint collections for uniqueIds that haven't been
updated for a while, which might suggest the daemon had stopped or died,
rather than checking each daemon individually?

I was also wondering whether it's possible, or a useful enhancement, to look
at the replica index version (as opposed to _version_) for the topic
streaming expression, to skip queries where the replica index is the same as
what we might store in the checkpoint collection? For collections that
update infrequently I think this might be useful.

Charlie: It was for email alerts, so a user stores a query for collection
docs to match against, and then the system emails matches to the user. Do
you think solr-monitor can be used for this purpose?

Yuval: I like the idea of using the UpdateProcessor; at least there's no
need for daemons or monitoring of them. But would this scale to millions
of email queries?

Many thanks again to all.

Kind regards,
Dan




On Mon, 6 Sept 2021 at 18:47, Yuval Paz  wrote:

> Me and my team are building upon this solcolator:
> https://github.com/SOLR4189/solcolator
>
> Currently the processor is built for Solr 6.5.1. We are working on updating
> our Solr, and I hope to release a complete version of our Solcolator as
> open source then (it will be for version 8.6.x).
>
> We made it an update processor: either as the last element, replacing the
> usual processor that indexes the document, or as the second-from-last
> processor in the chain, which also allows monitoring atomic updates
> (though that is relatively costly).
>
> By making it an update processor we don't rely on the streaming daemon,
> which we found unsatisfying, as we wish to allow users to define their own
> monitors over the index.
>
> On Mon, Sep 6, 2021, 8:25 PM Charlie Hull  >
> wrote:
>
> > Are you trying to monitor a stream of emails for certain patterns? In
> > which case you might look at the Lucene Monitor
> >
> >
> https://lucene.apache.org/core/8_2_0/monitor/index.html?overview-summary.html
> > https://issues.apache.org/jira/browse/LUCENE-8766, which was originally
> > Luwak - at my previous company Flax we helped build several large-scale
> > monitoring systems with this https://github.com/flaxsearch/luwak . It's
> > not officially surfaced in Solr yet although my colleague Scott Stults
> > has been working on some ideas: https://github.com/o19s/solr-monitor
> >
> > best
> > Charlie
> >
> > On 06/09/2021 14:32, Dan Rosher wrote:
> > > Hi,
> > >
> > > I was wondering if anyone had tried email alerts with streaming
> > > expressions, and what their experience was if attempting this with say
> 12
> > > million emails / day? Traditionally this might have been done with a
> > > database cursor iterator daily.
> > >
> > > I was thinking if something like the following pseudocode expression
> with
> > > 'kafka' as a custom push expression:
> > >
> > > daemon(id="alertId",
> > >   runInterval="1000",
> > >   kafka(
> > >     kafka_topic,
> > >     alertId,
> > >     topic(email_alerts,
> > >       doc_collection,
> > >       q="email query",
> > >       fl="id, title, abstract",
> > >       id="alertId",
> > >       initialCheckpoint=0)
> > >   )
> > > )
> > >
> > > If you have done something like this 'where' would you typically run
> the
> > > daemon, on replicas away from replicas running web queries?
> > >
> > > Many thanks in advance for any advice / suggestions,
> > >
> > > Dan
> > >
> >
> > --
> > Charlie Hull - Managing Consultant at OpenSource Connections Limited
> > 
> > Founding member of The Search Network 
> > and co-author of Searching the Enterprise
> > 
> > tel/fax: +44 (0)8700 118334
> > mobile: +44 (0)7767 825828
> >
> > OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
> > Amtsgericht Charlottenburg | HRB 230712 B
> > Geschäftsführer: John M. Woodell | David E. Pugh
> > Finanzamt: Berlin Finanzamt für Körperschaften II
> >
>


Solr | OOM | Lazyfield

2021-09-07 Thread HariBabu kuruva
Hi All,

We are getting OOM errors in the Solr logs, but only for specific Solr stores.

In the Solr logs we also see the error below. Could the OOM errors be
caused by this error?

I also see that the LazyField error below spans thousands of lines.

Please advise. This is a PROD environment.

-
AsyncLogger error handling event seq=163,
value='Logger=org.apache.solr.handler.RequestHandlerBase Level=ERROR
Message=org.apache.solr.common.SolrException: ERROR: [doc=1-121180294489]
multiple values encountered for non multiValued field quoteSLOCurrentStatus:
[org.apache.lucene.document.LazyDocument$LazyField@2b119ab6,
org.apache.lucene.document.LazyDocument$LazyField@70dc55ed,
org.apache.lucene.document.LazyDocument$LazyField@7dba464d,
org.apache.lucene.document.LazyDocument$LazyField@6cfa4d94,
org.apache.lucene.document.LazyDocument$LazyField@4b19be97,
org.apache.lucene.document.LazyDocument$LazyField@5978e924,
org.apache.lucene.document.LazyDocument$LazyField@76
--


Thanks and Regards,
 Hari
Mobile:9790756568


Re: Email alerts with streaming expressions

2021-09-07 Thread Charlie Hull

Hi Dan,

Yuval's and my suggestions both rely on the same underlying code (Luwak, 
now called Lucene Monitor). This lets you store a set of Lucene queries 
and run them against every new document.


The Lucene Monitor allows for very high-performance matching (I know of 
situations with around 1m stored queries, monitoring 1m new documents a 
day running on a few tens of nodes) and it does this with some clever 
optimisations: effectively it builds an index of your stored queries, 
and turns each new document into a query across this index (I know it 
sounds confusing!). It's a 'reverse search'. Check out the original 
Luwak project as it's got links to several presentations and blogs 
showing how others have implemented these systems.


The bit you'll have to build is the Solr layer and then the code that 
uses this to generate alerts - and Solcolator and 
https://github.com/o19s/solr-monitor are two examples of how to do the 
first part, which you can build on. The facility to do a reverse search 
is not built into Solr - yet, unlike Elasticsearch's Percolator.
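To make the 'reverse search' idea concrete, here is a toy sketch in Python. This is emphatically not the Lucene Monitor API; the class and method names are invented purely to illustrate the optimisation Charlie describes: an index is built over the stored queries' terms, and each incoming document acts as a lookup against that index, so only a small candidate set of queries is fully evaluated.

```python
# Toy illustration of the "reverse search" / presearcher idea behind
# Luwak / Lucene Monitor. All names here are made up for illustration.
from collections import defaultdict

class ToyMonitor:
    def __init__(self):
        self.queries = {}                   # query_id -> set of required terms
        self.term_index = defaultdict(set)  # term -> query_ids (the "index of queries")

    def register(self, query_id, terms):
        """Store a query, indexed by the terms it requires."""
        required = set(terms)
        self.queries[query_id] = required
        for t in required:
            self.term_index[t].add(query_id)

    def match(self, doc_text):
        """Treat the document as a lookup over the index of stored queries."""
        doc_terms = set(doc_text.lower().split())
        # Cheap candidate selection: any query sharing a term with the doc.
        candidates = set()
        for t in doc_terms:
            candidates |= self.term_index.get(t, set())
        # Full evaluation only on candidates (here: all terms must appear).
        return sorted(q for q in candidates if self.queries[q] <= doc_terms)

monitor = ToyMonitor()
monitor.register("alert1", ["solr", "streaming"])
monitor.register("alert2", ["zookeeper"])
print(monitor.match("Email alerts with Solr streaming expressions"))
# -> ['alert1']
```

The real Monitor does far more (analyzes queries, decomposes them, scores matches), but the shape of the trick is the same: most stored queries are never evaluated against most documents.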


Best

Charlie

On 07/09/2021 10:24, Dan Rosher wrote:


Re: Solr | OOM | Lazyfield

2021-09-07 Thread Charlie Hull
I doubt these are related; the second error looks like something 
triggered by indexing: attempting to add multiple values to a field 
that hasn't been defined in the schema as multivalued. I'd review your 
indexing process and schema first.
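For illustration, the schema check might look like the sketch below. Only the field name comes from the error message; the type and other attributes are assumptions.

```xml
<!-- managed-schema (sketch; type and attributes are assumptions).
     If the field should hold a single value, keep multiValued="false"
     and fix the indexing side so only one value is sent; if multiple
     values are legitimate, change it to multiValued="true". -->
<field name="quoteSLOCurrentStatus" type="string" indexed="true"
       stored="true" multiValued="false"/>
```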


Charlie

On 07/09/2021 11:00, HariBabu kuruva wrote:



Re: Solr | OOM | Lazyfield

2021-09-07 Thread HariBabu kuruva
Thank you Charlie.

Please let me know if anything is required from my end.

Also, we have added the field below as a workaround:
false in solrconfig.xml

On Tue, Sep 7, 2021 at 4:03 PM Charlie Hull 
wrote:



-- 

Thanks and Regards,
 Hari
Mobile:9790756568


Re: Solr | OOM | Lazyfield

2021-09-07 Thread Charlie Hull
Sorry, I should have worded that better: I suggest that you review your 
schema file and indexing process, so that you can understand why 
multiple values are being sent to the field 'quoteSLOCurrentStatus' when 
it's probably not defined as multivalued.


C

On 07/09/2021 12:30, HariBabu kuruva wrote:



Re: Email alerts with streaming expressions

2021-09-07 Thread Joel Bernstein
There was a design implemented in Streaming Expressions for large-scale
alerting, described here:

https://joelsolr.blogspot.com/2017/01/deploying-solrs-new-parallel-executor.html

In this design you would store each alert in Solr as a topic expression.
Then a single daemon can run all the topics, or the work can be parallelized.
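A rough sketch of that design in streaming-expression form is below. The collection names and the expr_s field are assumptions for illustration; see the blog post for the real details.

```
daemon(id="alertRunner",
       runInterval="10000",
       executor(threads=4,
                topic(checkpoints,      # checkpoint collection (assumed name)
                      alerts,           # collection holding one doc per alert
                      q="*:*",
                      fl="id,expr_s",   # expr_s assumed to hold each alert's topic expression
                      id="alertTopics")))
```

The executor compiles and runs the expression carried in each tuple it reads; wrapping the inner stream in parallel() spreads the alerts across worker nodes.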



Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, Sep 7, 2021 at 6:32 AM Charlie Hull 
wrote:


Re: Solr | OOM | Lazyfield

2021-09-07 Thread HariBabu kuruva
Hi Charlie,

The multiple-values error is not occurring only for that field; it's also
occurring for multiple fields and multiple stores (collections).
Ex:
2021-09-07 13:05:53.646 ERROR (qtp1198197478-1200) [c:quoteStore s:shard1
r:core_node8 x:quoteStore_shard1_replica_n7] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException: ERROR: [doc=1-116339633161] multiple
values encountered for non multiValued field quoteAgeCompleted: [N, N]

Our Dev Team says this started only after the Solr upgrade from 8.1 to 8.8.1.

Also, Solr is going down very frequently (roughly every hour), but only on
the nodes where a specific store (quotestore) is available. Other nodes
are fine.
For the nodes that are going down, I see an OOM error on one node, and on
the other node we see a ZooKeeper connection timeout error, as below.


2021-09-07 13:06:13.907 WARN  (main-SendThread(zookeeperhost.corp.equinix.com:2185)) [   ]
o.a.z.ClientCnxn Client session timed out, have not heard from server in
23226ms for session id 0x4139b3e00d3
2021-09-07 13:06:13.907 WARN  (main-SendThread(lxeisprdas10.corp.equinix.com:2185)) [   ]
o.a.z.ClientCnxn Session 0x4139b3e00d3 for server
zookeeperhost.corp.equinix.com/10.250.12.54:2185, Closing socket
connection. Attempting reconnect except it is a SessionExpiredException. =>
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session
timed out, have not heard from server in 23226ms for session id 0x4139b3e00d3
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243)
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session
timed out, have not heard from server in 23226ms for session id 0x4139b3e00d3
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243)
~[zookeeper-3.6.2.jar:3.6.2]
2021-09-07 13:06:14.010 WARN  (zkConnectionManagerCallback-13-thread-1) [   ]
o.a.s.c.c.ConnectionManager Watcher
org.apache.solr.common.cloud.ConnectionManager@5521aab name:
ZooKeeperConnection Watcher: zk1.corp.equinix.com:2181,zk2.corp.equinix.com:2182,
zk3.corp.equinix.com:2183,zk4.corp.equinix.com:2184,zk5.corp.equinix.com:2185
got event WatchedEvent state:Disconnected type:None path:null path: null type: None
2021-09-07 13:06:14.010 WARN  (zkConnectionManagerCallback-13-thread-1) [   ]
o.a.s.c.c.ConnectionManager zkClient has disconnected
2021-09-07 13:06:39.494 WARN  (main-SendThread(lxeisprdas09.corp.equinix.com:2184)) [   ]
o.a.z.ClientCnxn Client session timed out, have not heard from server in
23533ms for session id 0x4139b3e00d3
2021-09-07 13:06:39.494 WARN  (main-SendThread(lxeisprdas09.corp.equinix.com:2184)) [   ]
o.a.z.ClientCnxn Session 0x4139b3e00d3 for server
zk4.corp.equinix.com/10.250.**.**:2184, Closing socket connection.
Attempting reconnect except it is a SessionExpiredException. =>
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session
timed out, have not heard from server in 23533ms for session id 0x4139b3e00d3
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243)
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session
timed out, have not heard from server in 23533ms for session id 0x4139b3e00d3
---

Could you please suggest.







On Tue, Sep 7, 2021 at 5:49 PM Charlie Hull 
wrote:


Re: Email alerts with streaming expressions

2021-09-07 Thread Eric Pugh
Also, I think this is something you could easily trial: just take out the Kafka
step, replace it with, say, an insert into a Solr collection, and see what
happens.
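For example, swapping the custom kafka() stage for the built-in update() stream might look like the sketch below (the alerts_out destination collection is made up; the rest mirrors the pseudocode earlier in the thread):

```
daemon(id="alertId",
       runInterval="1000",
       update(alerts_out,          # destination collection for matched docs (assumed name)
              batchSize=100,
              topic(email_alerts,
                    doc_collection,
                    q="email query",
                    fl="id, title, abstract",
                    id="alertId",
                    initialCheckpoint=0)))
```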

Monitoring the daemon process is easy too  ;-)


> On Sep 7, 2021, at 8:50 AM, Joel Bernstein  wrote:

Re: trailing space added to fields

2021-09-07 Thread Alexandre Rafalovitch
The general answer is to add an UpdateRequestProcessor chain. That gives
you a lot of post-processing flexibility.

But you can also try having the XPath specify /text(); maybe that will
deal with the space specifically. I did not test it myself, though; just a
thought.

Regards,
Alex

On Mon., Sep. 6, 2021, 11:10 p.m. Scott Derrick,  wrote:

> I'm indexing .xml documents and using the XPathEntityProcessor for data
> importing.  Here is a snippet of my conf file
>
>   <entity dataSource="myfilereader"
>           processor="XPathEntityProcessor"
>           url="${jcurrent.fileAbsolutePath}"
>           stream="false"
>           forEach="/TEI/teiHeader/fileDesc"
>           xsl="xslt/meta.xsl">
>     <!-- field definitions were stripped by the mail archiver;
>          several of them used flatten="true" -->
>   </entity>
> 
>
> I noticed spaces at the ends of my elements when exporting a result into
> json or xml.
>
> I thought it was my javascript fetch call that was appending the string,
> but looking at the query page on the Solr admin site I can clearly see a
> trailing space. It doesn't matter whether the field is stored as string or
> text_general; the result is the same.
>
> here is a snippet of the query response
>
> { "date":"1884-09-09 September 9, 1884 ", "note":"Handwritten by Mary on
> a postcard from Boston, Massachusetts. ", "country":"USA ",
> "origGeo":"42.3584308 -71.0597732 ", "author":"Mary ", "authorString":"Mary
> ", "origin":"1884-09-09 ",
> "originSort":"1884-09-09 ", "accession":"639P3.65.026 ",
> "accessionSort":"639P3.65.026 ", "title":"\n Mary to Mary Baker Eddy, \n
> September 9, 1884 \n \n ", "titleSort":"\n Mary to Mary Baker Eddy, \n
> September 9, 1884 \n \n ", "when":"1884-09-09 ",
> "settlement":"Boston ", "recipient":"Mary Baker Eddy",
> "recipientString":"Mary Baker Eddy", "publisher":"The Mary Baker Eddy
> Library ", "origPlace":"places.xml#boston_ma ", "region":"MA ",
> "type":"incoming_correspondence", "places":"Boston ",
> "placesString":"Boston ", "people":"Mary ", "peopleString":"Mary ",
> "body":"Paper rec received Thanks, Just looked it over, good . Have moved
> at last! Will find me at cor: Shawmut Ave. & Pleasant St. a few doors from
> 66 S. Ave, further downtown. Hope
> you will find time to come in. Not yet settled, but like much better. Hope
> you are prospering. Wanted to see you last Sabbath eve but too tired In
> love Mary – ", "closer":"Boston Sept 9. 1884 . ",
> "id":"3272bf21-e6c2-4053-85ef-db3ec5a7f0ae",
> "_version_":1710182653070671872},
>
>
>
> I'm guessing it's the XPathEntityProcessor that is doing it, but I'm
> certainly open to pilot error!
>
> Any ideas how I can get rid of the trailing space?
>
> thanks,
>
> Scott
>
>
>


Using CloudSolrClient with Basic Authentication

2021-09-07 Thread Gerald Bonfiglio
We're using Solr 6.6.6, with SolrJ 7.7.1.

We've started on supporting use of Basic Authentication with Solr, so we need 
to include credentials when connecting and sending requests.  It's clear that 
we can include the credentials in each SolrRequest.  However, we are not always 
building individual requests, but using many of the helper methods of 
CloudSolrClient inherited from SolrClient (e.g. add, deleteById, deleteByQuery, 
etc.).

However, I'm not even able to get a CloudSolrClient instance built, because I 
cannot find a way to provide the credentials for the builder to use, which I 
seem to need when using Solr URLs instead of a ZooKeeper connection to build 
the client.

This is a snippet of what I'm doing:

CloudSolrClient.Builder builder = new CloudSolrClient.Builder(solrUrls);
builder.withParallelUpdates(true);
CloudSolrClient solrServer = builder.build();

builder.build() throws an exception, that authentication is required.

Caused by: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://localhost:8983/solr: Expected mime type 
application/octet-stream but got text/html. 


Error 401 require authentication

HTTP ERROR 401
Problem accessing /solr/admin/collections. Reason:
require authentication



Stepping through it, exception is coming from:

stateProvider = new HttpClusterStateProvider(solrUrls, httpClient);

Searching the web, I've seen several possible solutions, all of which seem 
clumsy (creating custom HttpClients and/or LBHttpSolrClients). One of them, 
using a global CredentialsProvider, will not work in our case, as our product 
supports several distinct SolrCloud clusters, each of which may have different 
credentials. For something so fundamental, it seems awfully difficult. Are 
these the only ways to use authentication? As mentioned above, we are not 
always building SolrRequest structures directly, so it would be optimal to 
have the client connections include the credentials automatically.
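
For illustration, here is a sketch of one commonly suggested approach, using
SolrJ's HttpClientUtil to build a per-cluster HttpClient that carries its own
basic-auth credentials (class, property, and builder-method names are from my
reading of the SolrJ 7.x API; verify against your exact version):

```java
import java.util.List;

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.impl.HttpClientUtil;
import org.apache.solr.common.params.ModifiableSolrParams;

public class AuthenticatedClientFactory {

    // Sketch only: each SolrCloud cluster gets its own HttpClient with
    // its own credentials, so no global CredentialsProvider is needed.
    public static CloudSolrClient build(List<String> solrUrls,
                                        String user, String pass) {
        ModifiableSolrParams params = new ModifiableSolrParams();
        // Property names as defined in HttpClientUtil (SolrJ 7.x).
        params.set(HttpClientUtil.PROP_BASIC_AUTH_USER, user);
        params.set(HttpClientUtil.PROP_BASIC_AUTH_PASS, pass);
        CloseableHttpClient httpClient = HttpClientUtil.createClient(params);

        return new CloudSolrClient.Builder(solrUrls)
                .withHttpClient(httpClient) // also used for cluster-state probing
                .withParallelUpdates(true)
                .build();
    }
}
```

If this works for your version, the credentials should be applied to the
cluster-state requests as well as the helper methods (add, deleteById, etc.),
since they all go through the same HttpClient.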

Thank you for your assistance.
Jerry







Re: Solr | OOM | Lazyfield

2021-09-07 Thread Charlie Hull
I don't know, but perhaps 8.8.1 checks it more strictly than 8.1? It's still 
worth your dev team reviewing, as I suggested, as it may be linked to why 
the nodes are going down. The error is occurring for a reason.


Charlie

On 07/09/2021 14:30, HariBabu kuruva wrote:

Hi Charlie,

The multiple-values error is not occurring only for that field. It's also
occurring for multiple fields and multiple stores (collections).
Ex:
2021-09-07 13:05:53.646 ERROR (qtp1198197478-1200) [c:quoteStore s:shard1
r:core_node8 x:quoteStore_shard1_replica_n7] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException: ERROR: [doc=1-116339633161] multiple
values encountered for non multiValued field quoteAgeCompleted: [N, N]

Our Dev Team says that its only after the Solr upgrade from 8.1 to 8.8.1

Also, Solr is going down very frequently (after every one hour) on the
nodes only where a specific store (quotestore) is available. Other nodes
are fine.
For the nodes which are going down, I see OOM error for one node and for
the other node we see zookeeper connection timeout error as below.


2021-09-07 13:06:13.907 WARN  (main-SendThread(zookeeperhost.corp.equinix.com:2185))
[   ] o.a.z.ClientCnxn Client session timed out, have not heard from server
in 23226ms for session id 0x4139b3e00d3
2021-09-07 13:06:13.907 WARN  (main-SendThread(lxeisprdas10.corp.equinix.com:2185))
[   ] o.a.z.ClientCnxn Session 0x4139b3e00d3 for server
zookeeperhost.corp.equinix.com/10.250.12.54:2185, Closing socket connection.
Attempting reconnect except it is a SessionExpiredException. =>
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session
timed out, have not heard from server in 23226ms for session id 0x4139b3e00d3
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243)
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session
timed out, have not heard from server in 23226ms for session id 0x4139b3e00d3
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243)
        ~[zookeeper-3.6.2.jar:3.6.2]
2021-09-07 13:06:14.010 WARN  (zkConnectionManagerCallback-13-thread-1) [   ]
o.a.s.c.c.ConnectionManager Watcher
org.apache.solr.common.cloud.ConnectionManager@5521aab name:
ZooKeeperConnection Watcher: zk1.corp.equinix.com:2181,zk2.corp.equinix.com:2182,zk3.corp.equinix.com:2183,zk4.corp.equinix.com:2184,zk5.corp.equinix.com:2185
got event WatchedEvent state:Disconnected type:None path:null path: null type: None
2021-09-07 13:06:14.010 WARN  (zkConnectionManagerCallback-13-thread-1) [   ]
o.a.s.c.c.ConnectionManager zkClient has disconnected
2021-09-07 13:06:39.494 WARN  (main-SendThread(lxeisprdas09.corp.equinix.com:2184))
[   ] o.a.z.ClientCnxn Client session timed out, have not heard from server
in 23533ms for session id 0x4139b3e00d3
2021-09-07 13:06:39.494 WARN  (main-SendThread(lxeisprdas09.corp.equinix.com:2184))
[   ] o.a.z.ClientCnxn Session 0x4139b3e00d3 for server
zk4.corp.equinix.com/10.250.**.**:2184, Closing socket connection.
Attempting reconnect except it is a SessionExpiredException. =>
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session
timed out, have not heard from server in 23533ms for session id 0x4139b3e00d3
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243)
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session
timed out, have not heard from server in 23533ms for session id 0x4139b3e00d3
---

Could you please suggest.







On Tue, Sep 7, 2021 at 5:49 PM Charlie Hull 
wrote:


Sorry I should have worded that better - I suggest that you review your
schema file and indexing process, so that you can understand why
multiple values are being sent to the field 'quoteSLOCurrentStatus' when
it's probably not defined as multivalued.

C

On 07/09/2021 12:30, HariBabu kuruva wrote:

Thank you Charlie.

Please let me know if anything is required from my end.

Also, we have added the below field as a workaround:
false in solrconfig.xml

On Tue, Sep 7, 2021 at 4:03 PM Charlie Hull <

ch...@opensourceconnections.com>

wrote:


I doubt these are related, that second error looks like something
triggered by indexing - attempting to add multiple values to a field
that hasn't been defined in the schema as multivalued. I'd review your
indexing  process and schema first.

Charlie

On 07/09/2021 11:00, HariBabu kuruva wrote:

Hi All,

We are getting OOM errors in the solr logs for only specific solr

stores.

And in the solr logs we see the below error. Could the OOM error be
caused by the error below?

Also i see below Lazyfield error is spanned across thousands of lines.

Please advise. This is a PROD environment.

-
AsyncLogger error handling event seq=163,
value='Logger=org.apache.solr.handler.RequestHandlerBase Level=ERROR
Message=org.apache.solr.
common.Sol

Re: Index dependent groups of data

2021-09-07 Thread lstusr 5u93n4
> How long are you waiting between the hard commit and the query?
> Are you waiting for the commit operation to return a response before you
try to
> query?

Well, that's kind of the crux of the issue. We're issuing a hard commit,
which (from what I've read) appears to be a synchronous operation. So, when
the call comes back with a 200 HTTP response code, we can be assured that
the operation has gone through. But there's no artificial "wait time" after
that, because how are we to know how long that should be?

> I actually don't know whether a commit operation will wait for
> all replicas when you're in cloud mode.

Seems like our experimentation is showing that it doesn't, at least for TLOG
replica types. If we bound the query to the leaders, we can get accurate
results immediately after the commit. If we don't add that restriction, the
results sometimes won't show the groups of data that were indexed in the
previous step.

At this point we're proceeding with a strategy of only querying the leaders
during this operation... Seems to be working out so far.

Thanks!

Kyle




On Fri, 3 Sept 2021 at 12:16, Shawn Heisey  wrote:

> On 9/3/2021 9:19 AM, lstusr 5u93n4 wrote:
> > What we're seeing is the following:
> >   - index some data
> >   - issue a hard commit
> >   - issue a query for that data
> >   - sometimes the query gets routed to a replica that is not yet updated,
> > and doesn't contain the data.
>
> How long are you waiting between the hard commit and the query? Are you
> waiting for the commit operation to return a response before you try to
> query?  I actually don't know whether a commit operation will wait for
> all replicas when you're in cloud mode.  I don't have a lot of
> experience with SolrCloud yet.  I did set up a cloud deployment at an
> old job, but it was VERY small.  All my large-index experience is in
> standalone mode.
>
> Commits can sometimes be very slow.  This is mostly dependent on your
> cache autowarm configuration and any manual warming queries that you
> have defined.
>
> Thanks,
> Shawn
>
>


Re: Index dependent groups of data

2021-09-07 Thread lstusr 5u93n4
>  Is there a particular reason for using TLOG replica types?

We used to use NRT replica types, but we switched to TLOG a year or two ago
in order to prioritize indexing speed above all else, understanding that it
might take a while for query results to be identical across replicas. This
is the first time we've had a use case where we need to query immediately
after indexing. Had we known then what we know now, maybe we wouldn't have
switched... but that's hindsight I guess.

With an NRT replica type, do you know whether a commit we issue applies to
all replicas? We're not too far down the path that we couldn't switch back,
and I assume that the effect would be minimized if we did so. However, I'd
like to know that the issue would be completely GONE, not just reduced in
frequency, if we did switch back...

Thanks!

Kyle

On Fri, 3 Sept 2021 at 13:02, Nick Vladiceanu 
wrote:

> Is there a particular reason for using TLOG replica types? For such a
> small cluster and the scenario you’ve described it sounds more reasonable
> to use NRT, that will (almost) guarantee that once you write your data -
> it’ll be (almost) immediately available on all the nodes.
>
>
> > On 3. Sep 2021, at 6:16 PM, Shawn Heisey  wrote:
> >
> > On 9/3/2021 9:19 AM, lstusr 5u93n4 wrote:
> >> What we're seeing is the following:
> >>  - index some data
> >>  - issue a hard commit
> >>  - issue a query for that data
> >>  - sometimes the query gets routed to a replica that is not yet updated,
> >> and doesn't contain the data.
> >
> > How long are you waiting between the hard commit and the query? Are you
> waiting for the commit operation to return a response before you try to
> query?  I actually don't know whether a commit operation will wait for all
> replicas when you're in cloud mode.  I don't have a lot of experience with
> SolrCloud yet.  I did set up a cloud deployment at an old job, but it was
> VERY small.  All my large-index experience is in standalone mode.
> >
> > Commits can sometimes be very slow.  This is mostly dependent on your
> cache autowarm configuration and any manual warming queries that you have
> defined.
> >
> > Thanks,
> > Shawn
> >
>
>


Re: Index dependent groups of data

2021-09-07 Thread Walter Underwood
> On Sep 7, 2021, at 9:01 AM, lstusr 5u93n4  wrote:
> 
> Well that's kind of the crux of the issue. We're issuing a hard commit
> which (from what I've read) appears to be a synchronous operation. So. when
> the call comes back with a 200 http response code, we can be assured that
> the operation has gone through

This is your mistake. Solr is not transactional. You are assuming ACID 
properties,
but Solr does not guarantee those, especially cluster-wide.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Index dependent groups of data

2021-09-07 Thread Walter Underwood
How about doing your queries against the leader only?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 7, 2021, at 9:06 AM, lstusr 5u93n4  wrote:
> 
>> Is there a particular reason for using TLOG replica types?
> 
> We used to use NRT replica types, but we switched to TLOG a year or two ago
> in order to prioritize indexing speed above all else, understanding that it
> might take a while for query results to be identical across replicas. This
> is the first time we've had a use case where we need to query immediately
> after indexing. Had we known then what we know now, maybe we wouldn't have
> switched... but that's hindsight I guess.
> 
> With an NRT replica type, do you know if we issue a commit does it apply to
> all replicas? We're not too far down the path that we couldn't switch back,
> and I assume that the effect would be minimized if we did so. However, I'd
> like to know that the issue would be completely GONE, not just reduced in
> frequency if we did switch back...
> 
> Thanks!
> 
> Kyle
> 
> On Fri, 3 Sept 2021 at 13:02, Nick Vladiceanu 
> wrote:
> 
>> Is there a particular reason for using TLOG replica types? For such a
>> small cluster and the scenario you’ve described it sounds more reasonable
>> to use NRT, that will (almost) guarantee that once you write your data -
>> it’ll be (almost) immediately available on all the nodes.
>> 
>> 
>>> On 3. Sep 2021, at 6:16 PM, Shawn Heisey  wrote:
>>> 
>>> On 9/3/2021 9:19 AM, lstusr 5u93n4 wrote:
 What we're seeing is the following:
 - index some data
 - issue a hard commit
 - issue a query for that data
 - sometimes the query gets routed to a replica that is not yet updated,
 and doesn't contain the data.
>>> 
>>> How long are you waiting between the hard commit and the query? Are you
>> waiting for the commit operation to return a response before you try to
>> query?  I actually don't know whether a commit operation will wait for all
>> replicas when you're in cloud mode.  I don't have a lot of experience with
>> SolrCloud yet.  I did set up a cloud deployment at an old job, but it was
>> VERY small.  All my large-index experience is in standalone mode.
>>> 
>>> Commits can sometimes be very slow.  This is mostly dependent on your
>> cache autowarm configuration and any manual warming queries that you have
>> defined.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>> 
>> 



Re: Index dependent groups of data

2021-09-07 Thread lstusr 5u93n4
>  How about doing your queries against the leader only?

This seems to work. We haven't been able to produce an instance where the
primary data isn't there in the case where we bound the queries only to the
leaders.

> Solr is not transactional. You are assuming ACID properties,
> but Solr does not guarantee those, especially cluster-wide.

Yeah, understood. Trying to determine if there's a way we could
understand if a save + commit + query (optionally to leader) approaches a
"transaction", or if that's simply a non-starter given Solr's nature.

Kyle
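
As an aside for readers with newer clusters: recent Solr releases can express
this "leader only" routing as a query parameter instead of client-side
bookkeeping. My understanding is that `shards.preference=replica.leader:true`
was added around Solr 8.7, so check the Reference Guide for your version
before relying on it; the collection name below is illustrative:

```shell
# Hypothetical example: ask SolrCloud to route the query to shard leaders.
# Requires a Solr version whose shards.preference supports replica.leader.
curl 'http://localhost:8983/solr/mycollection/select?q=*:*&shards.preference=replica.leader:true'
```

This avoids having to track leadership changes yourself, since the routing
decision is made per-request by Solr.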

On Tue, 7 Sept 2021 at 12:14, Walter Underwood 
wrote:

> How about doing your queries against the leader only?
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Sep 7, 2021, at 9:06 AM, lstusr 5u93n4  wrote:
> >
> >> Is there a particular reason for using TLOG replica types?
> >
> > We used to use NRT replica types, but we switched to TLOG a year or two
> ago
> > in order to prioritize indexing speed above all else, understanding that
> it
> > might take a while for query results to be identical across replicas.
> This
> > is the first time we've had a use case where we need to query immediately
> > after indexing. Had we known then what we know now, maybe we wouldn't
> have
> > switched... but that's hindsight I guess.
> >
> > With an NRT replica type, do you know if we issue a commit does it apply
> to
> > all replicas? We're not too far down the path that we couldn't switch
> back,
> > and I assume that the effect would be minimized if we did so. However,
> I'd
> > like to know that the issue would be completely GONE, not just reduced in
> > frequency if we did switch back...
> >
> > Thanks!
> >
> > Kyle
> >
> > On Fri, 3 Sept 2021 at 13:02, Nick Vladiceanu 
> > wrote:
> >
> >> Is there a particular reason for using TLOG replica types? For such a
> >> small cluster and the scenario you’ve described it sounds more
> reasonable
> >> to use NRT, that will (almost) guarantee that once you write your data -
> >> it’ll be (almost) immediately available on all the nodes.
> >>
> >>
> >>> On 3. Sep 2021, at 6:16 PM, Shawn Heisey  wrote:
> >>>
> >>> On 9/3/2021 9:19 AM, lstusr 5u93n4 wrote:
>  What we're seeing is the following:
>  - index some data
>  - issue a hard commit
>  - issue a query for that data
>  - sometimes the query gets routed to a replica that is not yet
> updated,
>  and doesn't contain the data.
> >>>
> >>> How long are you waiting between the hard commit and the query? Are you
> >> waiting for the commit operation to return a response before you try to
> >> query?  I actually don't know whether a commit operation will wait for
> all
> >> replicas when you're in cloud mode.  I don't have a lot of experience
> with
> >> SolrCloud yet.  I did set up a cloud deployment at an old job, but it
> was
> >> VERY small.  All my large-index experience is in standalone mode.
> >>>
> >>> Commits can sometimes be very slow.  This is mostly dependent on your
> >> cache autowarm configuration and any manual warming queries that you
> have
> >> defined.
> >>>
> >>> Thanks,
> >>> Shawn
> >>>
> >>
> >>
>
>


Re: trailing space added to fields

2021-09-07 Thread Scott Derrick

Alexandre,

perfect!!!

There is a built-in whitespace trim factory, 
TrimFieldUpdateProcessorFactory, that I added to the default chain, and now 
all is good!

thanks again,

Scott
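
For readers following along, the chain Scott describes might look roughly
like this in solrconfig.xml (a sketch; the chain name and making it the
default are illustrative choices, the processor class names are standard
Solr ones):

```xml
<!-- Sketch: run TrimFieldUpdateProcessorFactory on every update so
     leading/trailing whitespace is stripped before indexing. -->
<updateRequestProcessorChain name="trim-fields" default="true">
  <processor class="solr.TrimFieldUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

RunUpdateProcessorFactory must stay last, since it is the processor that
actually indexes the document.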

On 9/7/21 7:32 AM, Alexandre Rafalovitch wrote:

The general answer is to add UpdateRequestProcessor pipeline. That gives
you a lot of post processing flexibility.

But you can also try having the xpath specify  /text(), maybe that will
deal with space specifically.  Did not test it myself though, just a
thought.

Regards,
 Alex

On Mon., Sep. 6, 2021, 11:10 p.m. Scott Derrick,  wrote:


I'm indexing .xml documents and using the XPathEntityProcessor for data
importing.  Here is a snippet of my conf file

I noticed spaces at the ends of my elements when exporting a result into
json or xml.

I thought it was my javascript fetch call that was appending the string,
but looking at the query page on the Solr admin site I can clearly see a
trailing space. It doesn't matter whether the field is stored as string or
text_general; the result is the same.

here is a snippet of the query response

{ "date":"1884-09-09 September 9, 1884 ", "note":"Handwritten by Mary on
a postcard from Boston, Massachusetts. ", "country":"USA ",
"origGeo":"42.3584308 -71.0597732 ", "author":"Mary ", "authorString":"Mary
", "origin":"1884-09-09 ",
"originSort":"1884-09-09 ", "accession":"639P3.65.026 ",
"accessionSort":"639P3.65.026 ", "title":"\n Mary to Mary Baker Eddy, \n
September 9, 1884 \n \n ", "titleSort":"\n Mary to Mary Baker Eddy, \n
September 9, 1884 \n \n ", "when":"1884-09-09 ",
"settlement":"Boston ", "recipient":"Mary Baker Eddy",
"recipientString":"Mary Baker Eddy", "publisher":"The Mary Baker Eddy
Library ", "origPlace":"places.xml#boston_ma ", "region":"MA ",
"type":"incoming_correspondence", "places":"Boston ",
"placesString":"Boston ", "people":"Mary ", "peopleString":"Mary ",
"body":"Paper rec received Thanks, Just looked it over, good . Have moved
at last! Will find me at cor: Shawmut Ave. & Pleasant St. a few doors from
66 S. Ave, further downtown. Hope
you will find time to come in. Not yet settled, but like much better. Hope
you are prospering. Wanted to see you last Sabbath eve but too tired In
love Mary – ", "closer":"Boston Sept 9. 1884 . ",
"id":"3272bf21-e6c2-4053-85ef-db3ec5a7f0ae",
"_version_":1710182653070671872},



I'm guessing it's the XPathEntityProcessor that is doing it, but I'm
certainly open to pilot error!

Any ideas how I can get rid of the trailing space?

thanks,

Scott










Limits + locks when multiple clients /update at once

2021-09-07 Thread Andy Lester
* Are there any constraints as to how to safely call the /update handler for a 
core from multiple clients?

* Are there locking issues we should be aware of?

* Is there any way multiple /update calls can corrupt a core?

Backstory:

Back in 2013, when we first started using Solr 4.2.0, we had problems with our 
core getting corrupted if we tried to have multiple processes run the 
DataImportHandler at the same time.  We updated our app to make sure that could 
no longer happen.  Everything was fine, and we lived on Solr 4.2 for years.

Now, we are running 8.9, and we have moved our indexer from using the DIH to 
using /update handlers.  That works very nicely as well.  We have kept the same 
app constraints that guarantees that our app can only POST to /update one at a 
time.

However, we are adding a new core (I’ll call it userinfo) to our Solr instance 
that will require multiple clients to be updating a core at the same time.  
Each time a web user logs in to the site, we will /update a record in the 
userinfo core.  We could have, say, 100 users all updating the same core at the 
same time.  It’s also possible that there could be two clients updating the 
same record in the userinfo core at the same time.

My questions:

1) Are there any limits as to how many clients can post to 
/solr/userinfo/update at once?

2) Are there are problems with multiple clients trying to update the same 
record at the same time?  Will Solr just handle the requests sequentially, and 
the last client POSTing is the one that “wins”?  (I’m talking about updating 
the entire record, not doing partial updates at the field level)

3) Is there any way we could corrupt our Solr core through /update POSTs?

My assumption is that the answers are “No, this is safe to do.”  However, I 
can’t find anything in the docs that explicitly say that.  I also can’t find 
anything in the docs saying “Don’t do that.”  We want to make sure before we 
move forward.

Can someone please help point to something to address these questions?

Thanks,
Andy
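
One detail relevant to question 2: Solr supports optimistic concurrency via
the `_version_` field, so a client can detect (rather than silently lose) a
conflicting concurrent update. A sketch, with the core name taken from the
message above and the document fields and version value purely illustrative:

```shell
# Sketch: send the _version_ previously read for the document. If another
# client has updated it since that read, Solr rejects the update with an
# HTTP 409 conflict instead of letting the last writer silently win.
curl -X POST 'http://localhost:8983/solr/userinfo/update' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "user42", "last_login": "2021-09-07T12:00:00Z", "_version_": 1710182653070671872}]'
```

Without `_version_`, concurrent whole-document updates to the same id are
simply applied in arrival order and the last one wins.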

Re: Solr | OOM | Lazyfield

2021-09-07 Thread Shawn Heisey

On 9/7/2021 4:00 AM, HariBabu kuruva wrote:

We are getting OOM errors in the solr logs for only specific solr stores.

And in the solr logs we see the below error. Could the OOM error be
caused by the error below?



There are precisely two ways to deal with OOME.  One is to increase the 
size of the resource that has been depleted.  The other is to change 
something so less of that resource is required.  Very frequently it is 
not possible to accomplish the second option.  Increasing the resource 
is very often the only solution.


Note that it might not actually be memory that is being depleted.  Java 
throws OOME for several different resource exhaustion scenarios.  Some 
examples of things that might run out before memory are max processes 
per user or max open files.


You haven't shown us the OOME error, so we cannot advise you about what 
you need to do.  Assuming that it is actually memory that is depleted... 
Out of the box, Solr's max heap defaults to 512MB.  This is VERY small 
and almost every user will need to increase it.  We made the default 
heap small so that Solr would start on just about any hardware without 
changing the config.


It is very unlikely that the place in the code where the OOME occurred 
will reveal anything useful.  We just want to see it so we can see the 
message logged at the beginning.  Also, any other errors you are seeing 
are likely unrelated to the OOME.


If you're running Solr on a non-windows system, the bin/solr script 
starts Solr with a Java option that causes Solr to kill itself when OOME 
occurs.  It does this to protect itself -- Java program operation after 
OOME is completely unpredictable and in the case of Solr/Lucene, could 
corrupt the index.  We haven't yet done this for Windows.


Thanks,
Shawn
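
For reference, the heap increase Shawn mentions is typically done in the
include script that starts Solr; a sketch (the value is illustrative and
should be sized to your data and confirmed by monitoring):

```shell
# solr.in.sh (solr.in.cmd on Windows): raise the heap from the 512MB default.
SOLR_HEAP="4g"
# Equivalent alternative: set min and max explicitly.
# SOLR_JAVA_MEM="-Xms4g -Xmx4g"
```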



Re: Index dependent groups of data

2021-09-07 Thread Shawn Heisey

On 9/7/2021 10:01 AM, lstusr 5u93n4 wrote:

Seems like our experimentation is showing that it doesn't, at least for TLOG
replica types. If we bound the query to the leaders, we can get accurate
results immediately after the commit. If we don't add that restriction, the
results sometimes won't show the groups of data that were indexed in the
previous step.



Info you might already know:  TLOG (and PULL) replicas do not index, 
unless a TLOG replica is the leader, in which case it behaves exactly 
like NRT.  A PULL replica can never become leader.


When you have TLOG or PULL replicas, Solr is only going to do indexing 
on the shard leaders.  When a commit finishes, it should be done on all 
cores that participate in indexing.


Replication of the completed index segments to TLOG and PULL replicas 
will happen AFTER the commit is done, not concurrently.  I don't think 
there's a reliable way of asking Solr to tell you when all replications 
are complete.


If all replicas were NRT, then I think you wouldn't have this problem.  
But indexing is slower, because all replicas are going to do it, mostly 
concurrently.  In some cases the slowdown might be significant.


Does your "query only the leaders" code check clusterstate in ZK to 
figure out which replicas are leader?  Leaders can change in response to 
problems.


Thanks,
Shawn



Re: Index dependent groups of data

2021-09-07 Thread Shawn Heisey

On 9/7/2021 3:08 PM, Shawn Heisey wrote:
I don't think there's a reliable way of asking Solr to tell you when 
all replications are complete. 



You could use the replication handler (/solr/corename/replication) to 
gather this info and compare info from the leader index with info from 
the follower index(es).  For this to be reliable, you would need to 
check clusterstate in ZK so you're absolutely sure which cores are 
leaders.  I do not know off the top of my head what parameters need to 
be sent to the replication handler to gather that info.


Thanks,
Shawn
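
For anyone attempting this, the replication handler exposes the index
version and generation, which could be compared between leader and follower
cores; a sketch (endpoint and `command=indexversion` per the replication
handler API; host and core names are illustrative):

```shell
# Sketch: compare the index generation on a leader core vs. a follower core.
# Replication has caught up when the follower reports the same generation.
curl 'http://leader-host:8983/solr/mycoll_shard1_replica_n1/replication?command=indexversion&wt=json'
curl 'http://follower-host:8983/solr/mycoll_shard1_replica_n2/replication?command=indexversion&wt=json'
```

As Shawn notes, you would still need to consult ZooKeeper clusterstate to be
sure which core is currently the leader.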



Issue with MMapDirectory

2021-09-07 Thread Roman Voronin
Hi All,

Recently one of our Solr nodes threw an error:

...
2021-09-07 00:15:03.258 ERROR (qtp1278677872-985164)
[c:userRequestResults_----0001 s:shard1
r:core_node7
x:userRequestResults_----0001_shard1_replica_n4]
o.a.s.c.SolrCore java.lang.IllegalArgumentException: Unknown directory:
NRTCachingDirectory(MMapDirectory@/data/solrhome/userRequestResults_----0001_shard1_replica_n4/data/snapshot_metadata
lockFactory=org.apache.lucene.store.NativeFSLockFactory@4489a5c4;
maxCacheMB=48.0 maxMergeSizeMB=4.0) {}
  at
org.apache.solr.core.CachingDirectoryFactory.release(CachingDirectoryFactory.java:427)
  at org.apache.solr.core.SolrCore.close(SolrCore.java:1654)
  at org.apache.solr.servlet.HttpSolrCall.destroy(HttpSolrCall.java:654)
  at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:442)
  at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:351)
  at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
...

Further work with the "unknown" collection was not possible until the node
was restarted.

This error isn't new to us. Some time ago, we tried to use the /export
handler, and while exporting about 200-300K docs per query we faced this
error from time to time on a random node. But until recently, queries with
the /select handler were always executed without problems.

We are quite new to Solr, and suppose this is a consequence of incorrect
Solr configuration. Can you help us understand this problem? According to
search results, MMapDirectory requires as much free RAM as possible, but
attempts to increase free memory (+10GB) did nothing.

Our cluster specs:
8 machines with 40Gb RAM, 500 SSD.
Solr version 8.3.1
Solr heap 31Gb
Docs count ~15M
Collections ~350
Total index size per machine ~150Gb
One part of the collections is configured with 4 shards and 2 replicas and
the other with 1 shard and 8 replicas (we do join operation with them).

Any help you could provide would be much appreciated,
Roman