Re: Commit strategy for Heavy Bulk Indexing into solr

2021-07-28 Thread Endika Posadas
There were some big changes related to child indexing in Solr 8.8, under this 
ticket: https://issues.apache.org/jira/browse/SOLR-14923 
It's worth updating Solr to the latest 8.8 release and trying again; perhaps your 
indexing issue has already been fixed.
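
For reference, here is a minimal sketch of the workaround described in the quoted message below (delete the stage 1 Articles, then add each Article together with its children as a single block), written as Solr JSON update commands built in Python. The field names `id`, `type` and `keyword_id` and the sample values are assumptions for illustration only; the thread does not show the real schema. `_childDocuments_` is the key Solr's JSON update format accepts for nested children, and the 6-hour `commitWithin` is the value mentioned in the thread.

```python
import json

SIX_HOURS_MS = 6 * 60 * 60 * 1000  # commitWithin value mentioned in the thread


def delete_articles_command(type_field="type"):
    # Hypothetical discriminator field; the thread does not name it.
    return {"delete": {"query": f"{type_field}:Article"}}


def add_block_command(article, children, commit_within_ms=SIX_HOURS_MS):
    # Parent and children travel in one document so they are indexed as a block.
    doc = dict(article)  # copy, so the caller's dict is not modified
    doc["_childDocuments_"] = list(children)
    return {"add": {"doc": doc, "commitWithin": commit_within_ms}}


article = {"id": "A1", "type": "Article"}
joins = [{"id": "A1-K7", "type": "ArticleKeywordJoin", "keyword_id": "K7"}]

print(json.dumps(delete_articles_command()))
print(json.dumps(add_block_command(article, joins)))
```

These payloads would be sent to the collection's `/update` handler; with SolrJ the same block is built with `SolrInputDocument.addChildDocument`. The sketch only shows the shape of the requests and says nothing about why the overwrite case is slower.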

On 2021/07/27 19:44:13, Pratik Patel  wrote: 
> So it looks like I have narrowed down where the problem is and have also
> found a workaround but I would like to understand more.
> 
> As I had mentioned, we have two stages in our bulk indexing operation.
> 
> stage 1 : index Article documents [A1, A2..An]
> stage 2 : index Article documents with children [A1 with children, A2 with
> children..An with children]
> 
> We were always running into issues in stage 2.
> After some time in stage 2, *solrClient.add( ,
> commitWithin )* starts to timeout and then these timeouts happen
> consistently. Even the socketTimeout of 30 mins was exceeded by add call
> and we got socketTimeoutException.
> 
> We have set commitWithin to be 6 hours to avoid unnecessary soft commits.
> Auto commit interval is 1 min with openSearcher=false and autoSoftCommit
> interval is 5 min.
> 
> As mentioned above, we first index just the Articles in stage 1 and then in
> stage 2, the same set of Articles are indexed with children (block join). I
> had a suspicion that the huge amount of time taken by *solrClient.add* call
> can have something to do with the *block join updates *that take place in
> stage 2. Adding fresh joins of Articles with children on an empty
> collection was much faster and ran without SocketTimeout. So I modified our
> indexing pipeline to be as follows.
> 
> 1. stage 1 : index Article documents [A1, A2..An]
> 2. delete all the Article documents
> 3. stage 2 : index Article documents with children [A1 with children, A2
> with children..An with children]
> 
> With this change, stage 2 would be a simple *add operation and not an
> update operation.* I tested the bulk indexing with this change and it
> finished successfully without any issues in a shorter time period!
> 
> It will be very helpful to know what is the difference between
> A: When we add a document with children when collection does not already
> have the same document
> B: When we add a document with children when collection already has the
> same document without children
> 
> I understand that *update *takes place in B but how can we explain such a
> difference in performance between A and B.
> 
> Please note that we use RxJava and call solrClient.add() in parallel
> threads with a set of Article documents and the socketTimeout issue seems
> to pop up after we have already indexed about 90% of the documents.
> 
> Some more clarity on what could be happening will be very useful.
> 
> Thanks
> 
> On Fri, Jul 23, 2021 at 2:31 PM Pratik Patel  wrote:
> 
> > Hi All,
> >
> > *tl;dr* : running into long GC pauses and solr client socket timeouts
> > when indexing bulk of documents into solr. Commit strategy in essence is to
> > do hard commits at the interval of 50k documents (maxDocs=50k) and disable
> > soft commit altogether during bulk indexing. Simple solr cloud set up with
> > one node and one shard.
> >
> > *Details*:
> > We have about 6 million documents which we are trying to index into solr.
> > From these, about 500k documents have a text field which holds Abstracts of
> > scientific papers/Articles. We extract keywords from these Abstracts and we
> > index these keywords as well into solr.
> >
> > We have a many to many kind of relationship between Articles and keywords.
> > To store this, we have following structure.
> >
> > Article documents
> > Keyword documents
> > Article-Keyword Join documents
> >
> > We use block join to index Articles with "Article-Keyword" join documents
> > and Keyword documents are indexed independently.
> >
> > In other words, we have blocks of "Article + Article-Keyword Joins" and we
> > have Keyword documents(they hold some additional metadata about keyword ).
> >
> > We have a bulk processing operation which creates these documents and
> > indexes them into solr. During this bulk indexing, we don't need documents
> > to be searchable. We need to search against them only after ALL the
> > documents are indexed.
> >
> > *Based on this, this is our current strategy. *
> > Soft commits are disabled and Hard commits are done at an interval of 50k
> > documents with openSearcher=false. Our code triggers explicit commits 4
> > times after various stages of bulk indexing. Transaction logs are enabled
> > and have default settings.
> >
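> > <autoCommit>
> >   <maxTime>${solr.autoCommit.maxTime:-1}</maxTime>
> >   <maxDocs>${solr.autoCommit.maxDocs:5}</maxDocs>
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> >
> > <autoSoftCommit>
> >   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> > </autoSoftCommit>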
> >
> > Other Environmental Details:
> > Xms=8g and Xmx=14g, solr client socketTimeout=7 minutes and
> > zkClienttimeout=2 mins
> > Our indexing operation triggers many "add" operations in parallel using
> > RxJava (15 to 30 threads) each "add" operation is p

Re: Commit strategy for Heavy Bulk Indexing into solr

2021-07-28 Thread Pratik Patel
Thanks Endika!

https://issues.apache.org/jira/browse/SOLR-14923

@DavidSmiley do you think this could be related to the issue I have
described?

I will certainly update our solr image but it will be good to know the root
cause of the issue. Your comment on this would be very helpful.

Thanks


On Wed, Jul 28, 2021 at 7:16 AM Endika Posadas 
wrote:

> There were some big changes related to child indexing in solr 8.8, under
> this ticket: https://issues.apache.org/jira/browse/SOLR-14923
> It's worth updating solr to latest 8.8 and trying again, perhaps your
> indexing issue has already been fixed.
>
> On 2021/07/27 19:44:13, Pratik Patel  wrote:
> > So it looks like I have narrowed down where the problem is and have also
> > found a workaround but I would like to understand more.
> >
> > As I had mentioned, we have two stages in our bulk indexing operation.
> >
> > stage 1 : index Article documents [A1, A2..An]
> > stage 2 : index Article documents with children [A1 with children, A2
> with
> > children..An with children]
> >
> > We were always running into issues in stage 2.
> > After some time in stage 2, *solrClient.add( ,
> > commitWithin )* starts to timeout and then these timeouts happen
> > consistently. Even the socketTimeout of 30 mins was exceeded by add call
> > and we got socketTimeoutException.
> >
> > We have set commitWithin to be 6 hours to avoid unnecessary soft commits.
> > Auto commit interval is 1 min with openSearcher=false and autoSoftCommit
> > interval is 5 min.
> >
> > As mentioned above, we first index just the Articles in stage 1 and then
> in
> > stage 2, the same set of Articles are indexed with children (block
> join). I
> > had a suspicion that the huge amount of time taken by *solrClient.add*
> call
> > can have something to do with the *block join updates *that take place in
> > stage 2. Adding fresh joins of Articles with children on an empty
> > collection was much faster and ran without SocketTimeout. So I modified
> our
> > indexing pipeline to be as follows.
> >
> > 1. stage 1 : index Article documents [A1, A2..An]
> > 2. delete all the Article documents
> > 3. stage 2 : index Article documents with children [A1 with children, A2
> > with children..An with children]
> >
> > With this change, stage 2 would be a simple *add operation and not an
> > update operation.* I tested the bulk indexing with this change and it
> > finished successfully without any issues in a shorter time period!
> >
> > It will be very helpful to know what is the difference between
> > A: When we add a document with children when collection does not already
> > have the same document
> > B: When we add a document with children when collection already has the
> > same document without children
> >
> > I understand that *update *takes place in B but how can we explain such a
> > difference in performance between A and B.
> >
> > Please note that we use RxJava and call solrClient.add() in parallel
> > threads with a set of Article documents and the socketTimeout issue seems
> > to pop up after we have already indexed about 90% of the documents.
> >
> > Some more clarity on what could be happening will be very useful.
> >
> > Thanks
> >
> > On Fri, Jul 23, 2021 at 2:31 PM Pratik Patel 
> wrote:
> >
> > > Hi All,
> > >
> > > *tl;dr* : running into long GC pauses and solr client socket timeouts
> > > when indexing bulk of documents into solr. Commit strategy in essence
> is to
> > > do hard commits at the interval of 50k documents (maxDocs=50k) and
> disable
> > > soft commit altogether during bulk indexing. Simple solr cloud set up
> with
> > > one node and one shard.
> > >
> > > *Details*:
> > > We have about 6 million documents which we are trying to index into
> solr.
> > > From these, about 500k documents have a text field which holds
> Abstracts of
> > > scientific papers/Articles. We extract keywords from these Abstracts
> and we
> > > index these keywords as well into solr.
> > >
> > > We have a many to many kind of relationship between Articles and
> keywords.
> > > To store this, we have following structure.
> > >
> > > Article documents
> > > Keyword documents
> > > Article-Keyword Join documents
> > >
> > > We use block join to index Articles with "Article-Keyword" join
> documents
> > > and Keyword documents are indexed independently.
> > >
> > > In other words, we have blocks of "Article + Article-Keyword Joins"
> and we
> > > have Keyword documents(they hold some additional metadata about
> keyword ).
> > >
> > > We have a bulk processing operation which creates these documents and
> > > indexes them into solr. During this bulk indexing, we don't need
> documents
> > > to be searchable. We need to search against them only after ALL the
> > > documents are indexed.
> > >
> > > *Based on this, this is our current strategy. *
> > > Soft commits are disabled and Hard commits are done at an interval of
> 50k
> > > documents with openSearcher=false. Our code triggers explicit comm

Quick Query Question: "body":""

2021-07-28 Thread mtn search
Hello,

Some documents in my collection have an empty body field.
"body":"",

I am looking for a query to find docs with a body field with this "empty"
value.

Normally, I might run with a filter
fq=-body:*

On this large set of shards that I am querying it times out with the
wildcard.  On smaller sets, it shows when there is no body field.

Other attempts at representing a value of "" in the filter query have not
worked.

Any tips?

Thanks,
Matthew


Re: Quick Query Question: "body":""

2021-07-28 Thread Walter Underwood
Search for *:* -body:*

I do this pretty often.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 28, 2021, at 8:48 AM, mtn search  wrote:
> 
> Hello,
> 
> Some documents in my collection have an empty body field.
>"body":"",
> 
> I am looking for a query to find docs with a body field with this "empty"
> value.
> 
> Normally, I might run with a filter
>fq=-body:*
> 
> On this large set of shards that I am querying it times out with the
> wildcard.  On smaller sets, it shows when there is no body field.
> 
> Other attempts at representing a value of "" in the filter query have not
> worked.
> 
> Any tips?
> 
> Thanks,
> Matthew



Re: Quick Query Question: "body":""

2021-07-28 Thread Alexandre Rafalovitch
This (present/absent) condition is there during the indexing time. So,
apply the optimization at the indexing rather than at the search time.

Have a custom UpdateRequestProcessor chain that will create a boolean
flag that matches "body" presence/absence.

You can do that with copy, default-value and regex I think, though
there may be more elegant combinations. The goal is that by the time
that copied field hits the schema, it matches a boolean field
definition for most efficiency.

Regards,
   Alex.
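
A minimal client-side sketch of the same idea, for cases where the flag is easier to compute before the documents reach Solr than in an UpdateRequestProcessor chain. It is not a URP configuration; the flag name `has_body` is hypothetical and would need a matching boolean field in the schema.

```python
def with_body_flag(doc, source="body", flag="has_body"):
    """Return a copy of doc with a boolean flag recording whether source has text."""
    value = doc.get(source)
    enriched = dict(doc)
    # A missing field, None, and "" (or whitespace only) all count as "no body".
    enriched[flag] = isinstance(value, str) and value.strip() != ""
    return enriched


docs = [
    {"id": "1", "body": "some text"},
    {"id": "2", "body": ""},
    {"id": "3"},
]
flagged = [with_body_flag(d) for d in docs]
print([(d["id"], d["has_body"]) for d in flagged])
```

With such a field in place, the search side becomes a plain term filter such as `fq=has_body:false` instead of a negated wildcard.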

On Wed, 28 Jul 2021 at 11:55, mtn search  wrote:
>
> Hello,
>
> Some documents in my collection have an empty body field.
> "body":"",
>
> I am looking for a query to find docs with a body field with this "empty"
> value.
>
> Normally, I might run with a filter
> fq=-body:*
>
> On this large set of shards that I am querying it times out with the
> wildcard.  On smaller sets, it shows when there is no body field.
>
> Other attempts at representing a value of "" in the filter query have not
> worked.
>
> Any tips?
>
> Thanks,
> Matthew


Re: Help with unsubscribe because automated didn't work

2021-07-28 Thread Carlos Rocha
Same thing is happening to me.  Could you please also unsubscribe
car...@rocha.cc

On Tue, Jul 27, 2021 at 5:26 PM Anshum Gupta  wrote:

> Done.
>
> On Tue, Jul 27, 2021 at 1:36 PM Jagpreet Mahajan <
> jagpreetkhan...@gmail.com>
> wrote:
>
> > Can the same be done for my email address jagpreetkhan...@gmail.com as
> > well please
> >
> > Thanks
> >
> > Sent from my iPhone
> >
> > On Jul 27, 2021, at 20:36, Anshum Gupta  wrote:
> >
> > Hi Katie,
> >
> > I've unsubscribed those three addresses from the users@solr mailing
> list.
> > Please reach out if you continue to receive emails.
> >
> > On Tue, Jul 27, 2021 at 12:17 PM kmccork 
> wrote:
> >
> > > Hello,
> > >
> > > I would like to unsubscribe kmcc...@u.washington.edu, kmcc...@uw.edu,
> > > kmcc...@uw.cse.edu from the listserv. Any email with "kmccok@".
> > >
> > > Hello,
> > >
> > > I am no longer able to send an email out from those emails, but I would
> > > like to leave the forwarding turned on, however the solr listserv is
> > > forwarding. I have been getting the solr emails since 2014 and I am no
> > > longer interested.
> > >
> > > I tried unsubscribing from this email by emailing
> > > users-unsubscr...@solr.apache.org, however that did not work.
> > >
> > > Thank you so much,
> > > Katie
> > >
> >
>
>
> --
> Anshum Gupta
>


Re: Configuring Solr JSON logs output to file using JsonLayout in log4j2.xml file

2021-07-28 Thread Alex Bulygin

Hi, Alex.
I think you should first check the log with the property log4j.debug=true. Does Solr's 
log4j see your JSON layout plugin? 
Make sure that your layout is on Solr's classpath (in WEB-INF/lib or 
solr/server/ext/lib).
Tuesday, 27 July 2021, 18:27 +03:00, from Alexey Murz Korepov  mur...@gmail.com :

>Hello, does anyone have a working Solr instance with JSON log format, using
>JsonLayout in log4j2.xml? Please share your configuration!
>
>I have jackson-core and other jackson packages inside the Solr folder:
>./server/solr-webapp/webapp/WEB-INF/lib/jackson-core-2.11.2.jar
>./server/solr-webapp/webapp/WEB-INF/lib/jackson-databind-2.11.2.jar
>./server/solr-webapp/webapp/WEB-INF/lib/jackson-annotations-2.11.2.jar
>./server/solr-webapp/webapp/WEB-INF/lib/jackson-dataformat-smile-2.11.2.jar
>
>But the log file isn't even created, and I don't even see any errors about
>this in the output!
>
>If I switch back from JsonLayout to PatternLayout, everything works well.
>
>I have found a similar problem in the mail list here
>https://www.mail-archive.com/solr-user@lucene.apache.org/msg152191.html but
>it is still without a solution.
>
>Can anybody help me with this? Maybe I need to add some dependencies
>manually in some Solr config file, or copy libraries files to some
>other folder? Thanks!
>
>-- 
>Best regards,
>Alexey Murz Korepov.
>E-mail:  mur...@gmail.com
>Messengers: Matrix -  https://matrix.to/#/@murz:ru-matrix.org Telegram -
>@MurzNN


Re: Quick Query Question: "body":""

2021-07-28 Thread mtn search
Thanks Walter, Alex!

Yes, I regularly use: Search for *:* -body:* . With the size of the
Master/Slave deployment and the number of shards, in this case the wildcard
query times out... I plan to add some additional fqs to narrow the scope.

I do not have an immediate option to change the indexing process/schema,
but that is a good idea Alex.

Matthew


On Wed, Jul 28, 2021 at 10:10 AM Alexandre Rafalovitch 
wrote:

> This (present/absent) condition is there during the indexing time. So,
> apply the optimization at the indexing rather than at the search time.
>
> Have a custom UpdateRequestProcessor chain that will create a boolean
> flag that matches "body" presence/absence.
>
> You can do that with copy, default-value and regex I think, though
> there may be more elegant combinations. The goal is that by the time
> that copied field hits the schema, it matches a boolean field
> definition for most efficiency.
>
> Regards,
>Alex.
>
> On Wed, 28 Jul 2021 at 11:55, mtn search  wrote:
> >
> > Hello,
> >
> > Some documents in my collection have an empty body field.
> > "body":"",
> >
> > I am looking for a query to find docs with a body field with this "empty"
> > value.
> >
> > Normally, I might run with a filter
> > fq=-body:*
> >
> > On this large set of shards that I am querying it times out with the
> > wildcard.  On smaller sets, it shows when there is no body field.
> >
> > Other attempts at representing a value of "" in the filter query have not
> > worked.
> >
> > Any tips?
> >
> > Thanks,
> > Matthew
>


Re: Quick Query Question: "body":""

2021-07-28 Thread Alexandre Rafalovitch
I wonder if by adding boolean docvalue to schema, you could use
in-place updates to add that information post-indexing by basically
running batch checks on documents that don't have that new flag at all
and then updating it to be 0/1 to indicate body. The in-place update
would avoid having to reindex the data from scratch.

Just brainstorming some other ways to the same "make it boolean"
target, never tried it.

Regards,
   Alex.
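
A rough sketch of the batch step described above, building atomic-update documents that set a 0/1 flag per document. The field name `has_body_i` and the sample documents are assumptions. Whether Solr actually applies these as in-place updates depends on how the flag field is defined; as far as I know the reference guide limits that path to single-valued, non-indexed, non-stored numeric docValues fields, which would match the "0/1" suggestion rather than a true boolean type.

```python
import json


def flag_updates(docs, source="body", flag="has_body_i"):
    """Yield one atomic-update document per input doc, setting flag to 0 or 1."""
    for doc in docs:
        has_body = 1 if (doc.get(source) or "").strip() else 0
        # {"set": value} is Solr's atomic-update syntax for replacing a field value.
        yield {"id": doc["id"], flag: {"set": has_body}}


batch = list(flag_updates([
    {"id": "1", "body": "some text"},
    {"id": "2", "body": ""},
    {"id": "3"},
]))
print(json.dumps(batch))
```

The resulting JSON array would be posted to the `/update` handler in batches, selecting documents that do not yet carry the flag.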

On Wed, 28 Jul 2021 at 13:49, mtn search  wrote:
>
> Thanks Walter, Alex!
>
> Yes, I regularly use: Search for *:* -body:* . With the size of the
> Master/Slave deployment and the number of shards, in this case the wildcard
> query times out... I plan to add some additional fqs to narrow the scope.
>
> I do not have an immediate option to change the indexing process/schema,
> but that is a good idea Alex.
>
> Matthew
>
>
> On Wed, Jul 28, 2021 at 10:10 AM Alexandre Rafalovitch 
> wrote:
>
> > This (present/absent) condition is there during the indexing time. So,
> > apply the optimization at the indexing rather than at the search time.
> >
> > Have a custom UpdateRequestProcessor chain that will create a boolean
> > flag that matches "body" presence/absence.
> >
> > You can do that with copy, default-value and regex I think, though
> > there may be more elegant combinations. The goal is that by the time
> > that copied field hits the schema, it matches a boolean field
> > definition for most efficiency.
> >
> > Regards,
> >Alex.
> >
> > On Wed, 28 Jul 2021 at 11:55, mtn search  wrote:
> > >
> > > Hello,
> > >
> > > Some documents in my collection have an empty body field.
> > > "body":"",
> > >
> > > I am looking for a query to find docs with a body field with this "empty"
> > > value.
> > >
> > > Normally, I might run with a filter
> > > fq=-body:*
> > >
> > > On this large set of shards that I am querying it times out with the
> > > wildcard.  On smaller sets, it shows when there is no body field.
> > >
> > > Other attempts at representing a value of "" in the filter query have not
> > > worked.
> > >
> > > Any tips?
> > >
> > > Thanks,
> > > Matthew
> >


Re: MultipleAdditiveTreeModel

2021-07-28 Thread Spyros Kapnissis
Hi Alessandro, Roopa, I created the ticket here:
https://issues.apache.org/jira/browse/SOLR-15569 . I don't think I have
permission to add people though, so please tag whomever you feel is
necessary.
Pls let me know if you need any more info, thanks!
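
For anyone skimming the thread, here is a small sketch of the effect being discussed. It reproduces the `setThreshold` arithmetic quoted further down (threshold plus `NODE_SPLIT_SLACK = 1E-6f`) in 32-bit floats. Two things are assumptions rather than facts from the thread: that a tree node sends a document left when `feature <= threshold`, and the threshold value `0.9999995`, which is made up for the illustration.

```python
import struct


def f32(x):
    """Round a Python float to the nearest 32-bit float, as a Java float stores it."""
    return struct.unpack("f", struct.pack("f", x))[0]


NODE_SPLIT_SLACK = f32(1e-6)


def goes_left(feature_value, model_threshold):
    # Mirrors setThreshold(): the stored threshold is the model's value plus the slack.
    stored = f32(f32(model_threshold) + NODE_SPLIT_SLACK)
    # Assumption: the node goes left when feature <= stored threshold.
    return f32(feature_value) <= stored


hypothetical_threshold = 0.9999995  # just below 1.0, made up for this sketch

# Without the slack, a feature value of 1.0 is above the threshold ...
print(f32(1.0) <= f32(hypothetical_threshold))
# ... but with the slack added, 1.0 and 0.0 both take the left branch.
print(goes_left(1.0, hypothetical_threshold), goes_left(0.0, hypothetical_threshold))
# Moving the split point down by 0.5 separates the two values again.
print(goes_left(1.0, hypothetical_threshold - 0.5),
      goes_left(0.0, hypothetical_threshold - 0.5))
```

The last line corresponds to the workaround mentioned in the thread of subtracting 0.5 from thresholds on integer-valued features.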

On Tue, Jul 27, 2021 at 1:00 PM Alessandro Benedetti 
wrote:

> Hi Spyros, Roopa,
> if you can create the Jira ticket with all the details you gathered, that
> would be much appreciated.
> If you tag me, Christine Poerschke, and Diego Ceccarelli at least, we'll
> take over from there!
> Thanks!
> --
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Mon, 26 Jul 2021 at 21:29, Spyros Kapnissis  wrote:
>
> > Hi Alessandro, Roopa, I also agree that this issue should be further
> > investigated and fixed. Please let me know if you need any help opening
> the
> > Jira ticket and provide more details.
> >
> > On Mon, Jul 26, 2021, 21:04 Roopa Rao  wrote:
> >
> > > Hi Alessandro,
> > > I haven't created JIRA for this, we solved this the similar way that
> > Spyros
> > > described, by changing the threshold in the model.
> > > Ya it would be good to understand why there is the SLACK added.
> > >
> > > Thanks,
> > > Roopa
> > >
> > > On Mon, Jul 26, 2021 at 10:52 AM Alessandro Benedetti <
> > > a.benede...@sease.io>
> > > wrote:
> > >
> > > > I didn't get any additional notification (or maybe I missed it).
> > > > Has the Jira been created yet?
> > > > Boolean features are quite common around Learning To Rank use cases.
> > > > I do believe this contribution can be useful.
> > > > If you don't have time to create the Jira or contribute the pull
> > request,
> > > > no worries, just let us know and we (committers) will organize to do
> > it.
> > > > > Thanks for your help. Without the effort of our users, Apache Solr
> > > wouldn't
> > > > be the same.
> > > > Cheers
> > > > --
> > > > Alessandro Benedetti
> > > > Apache Lucene/Solr Committer
> > > > Director, R&D Software Engineer, Search Consultant
> > > >
> > > > www.sease.io
> > > >
> > > >
> > > > On Fri, 16 Jul 2021 at 20:29, Roopa Rao  wrote:
> > > >
> > > > > Spyros, thank you for verifying this, we are planning to do
> something
> > > > > similar.
> > > > >
> > > > > Thanks,
> > > > > Roopa
> > > > >
> > > > > On Fri, Jul 16, 2021 at 12:09 PM Spyros Kapnissis <
> ska...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Just to verify this, we had come across the exact same issue when
> > > > > > > converting an XGBoost model to MultipleAdditiveTrees. This was an
> > > issue
> > > > > > specifically with the categorical features that take on integer
> > > values.
> > > > > We
> > > > > > ended up subtracting 0.5 from the threshold value on any such
> split
> > > > point
> > > > > > on the converted model, so that it would output the same score as
> > the
> > > > > input
> > > > > > model.
> > > > > >
> > > > > > On Fri, Jul 16, 2021, 18:19 Roopa Rao  wrote:
> > > > > >
> > > > > > > Okay, thank you for the input
> > > > > > >
> > > > > > > Roopa
> > > > > > >
> > > > > > > On Fri, Jul 16, 2021 at 5:55 AM Alessandro Benedetti <
> > > > > > a.benede...@sease.io
> > > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Roopa,
> > > > > > > > I was not able to find why that slack was added.
> > > > > > > > I am not sure why we would like to change the threshold.
> > > > > > > > I would recommend creating a Jira issue and tag at least
> > myself,
> > > > > > > Christine
> > > > > > > > Poerschke and Diego Ceccarelli, so we can discuss and
> > potentially
> > > > > open
> > > > > > a
> > > > > > > > pull request.
> > > > > > > >
> > > > > > > > Cheers
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Alessandro Benedetti
> > > > > > > > Apache Lucene/Solr Committer
> > > > > > > > Director, R&D Software Engineer, Search Consultant
> > > > > > > >
> > > > > > > > www.sease.io
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, 15 Jul 2021 at 22:24, Roopa Rao 
> > > wrote:
> > > > > > > >
> > > > > > > > > Hi All,
> > > > > > > > >
> > > > > > > > > In LTR for MultipleAdditiveTreeModel what is the purpose of
> > > > adding
> > > > > > > > > NODE_SPLIT_SLACK
> > > > > > > > > to the threshold?
> > > > > > > > >
> > > > > > > > > Reference:
> > org.apache.solr.ltr.model.MultipleAdditiveTreesModel
> > > > > > > > >
> > > > > > > > > private static final float NODE_SPLIT_SLACK = 1E-6f;
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > public void setThreshold(float threshold) { this.threshold
> =
> > > > > > threshold
> > > > > > > +
> > > > > > > > > NODE_SPLIT_SLACK; }
> > > > > > > > >
> > > > > > > > > We have a feature which can return 0.0 or 1.0
> > > > > > > > >
> > > > > > > > > And model with this tree:
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> is_xyz_feature,threshold=0.999

Re: CVE-2021-27905 Apache Solr ReplicationHandler/SSRF vulnerability

2021-07-28 Thread Rahul Goswami
Digging out this old thread since I am looking for an answer to the same
question.
To Matthew's response above, since the /replication is an implicit handler,
even if removed from solrconfig.xml, it would still work.
I looked around (aka Googled) to find a way in which someone exploited this
vulnerability, but couldn't find it. That would help us get an idea about
patching it. If anyone knows more about this CVE or can point me to JIRA
for the same, that would be great.

Thanks,
Rahul


On Fri, Jun 18, 2021 at 9:47 AM matthew sporleder 
wrote:

> I believe these are all related to exposed api/admin endpoints so your
> network is probably protecting you, but poor input sanitization could
> expose you, of course, e.g.
> /myappsearch?search=../../replication?evilpayload (classic SQL-injection
> style)
>
> If you have, literally, removed the handlers for those url endpoints
> from your config I think you are pretty safe.
>
> On Fri, Jun 18, 2021 at 6:54 AM Anchal Sharma2 
> wrote:
> >
> > Hi All,
> >
> > We are currently using Solr Cloud (Solr version 8.6.3) in our application.
> > Since it doesn't use the master-slave approach, we do not have a replication
> > handler (to replicate master to slave) set up on any of our Solr nodes.
> > Could someone please confirm whether the following vulnerability is still
> > applicable to us?
> >
> > CVE-2021-27905 Apache Solr ReplicationHandler/SSRF vulnerability
> > Description: A critical vulnerability was found in Apache Solr up to
> > 8.8.1 (CVSS 9.8). Affected by this vulnerability is an unknown code block
> > of the file /replication; the manipulation of the argument
> > masterUrl/leaderUrl with an unknown input can lead to a privilege
> > escalation vulnerability.  *Note: There are now POCs targeting
> > CVE-2021-27905 (Apache Solr <= 8.8.1 SSRF), CVE-2017-12629 (Remote Code
> > Execution via SSRF), and CVE-2019-0193 (DataImportHandler). There are also
> > Metasploit modules for the Apache Solr Velocity RCE, and two Apache OFBiz
> > vulnerabilities. Given the number of vulnerabilities, severity, and
> > availability of POCs, it is highly recommended that any vulnerable systems
> > be patched as soon as possible.
> >
> > Thanks
> > Anchal Sharma
>


Re: MultipleAdditiveTreeModel

2021-07-28 Thread Roopa Rao
Thank you, Spyros.

Roopa

On Wed, Jul 28, 2021 at 3:00 PM Spyros Kapnissis  wrote:

> Hi Alessandro, Roopa, I created the ticket here:
> https://issues.apache.org/jira/browse/SOLR-15569 . I don't think I have
> permission to add people though, so please tag whomever you feel is
> necessary.
> Pls let me know if you need any more info, thanks!
>
> On Tue, Jul 27, 2021 at 1:00 PM Alessandro Benedetti  >
> wrote:
>
> > Hi Spyros, Roopa,
> > if you can create the Jira ticket with all the details you gathered, that
> > would be much appreciated.
> > If you tag me, Christine Poerschke, and Diego Ceccarelli at least, we'll
> > take over from there!
> > Thanks!
> > --
> > Alessandro Benedetti
> > Apache Lucene/Solr Committer
> > Director, R&D Software Engineer, Search Consultant
> >
> > www.sease.io
> >
> >
> > On Mon, 26 Jul 2021 at 21:29, Spyros Kapnissis  wrote:
> >
> > > Hi Alessandro, Roopa, I also agree that this issue should be further
> > > investigated and fixed. Please let me know if you need any help opening
> > the
> > > Jira ticket and provide more details.
> > >
> > > On Mon, Jul 26, 2021, 21:04 Roopa Rao  wrote:
> > >
> > > > Hi Alessandro,
> > > > I haven't created JIRA for this, we solved this the similar way that
> > > Spyros
> > > > described, by changing the threshold in the model.
> > > > Ya it would be good to understand why there is the SLACK added.
> > > >
> > > > Thanks,
> > > > Roopa
> > > >
> > > > On Mon, Jul 26, 2021 at 10:52 AM Alessandro Benedetti <
> > > > a.benede...@sease.io>
> > > > wrote:
> > > >
> > > > > I didn't get any additional notification (or maybe I missed it).
> > > > > Has the Jira been created yet?
> > > > > Boolean features are quite common around Learning To Rank use
> cases.
> > > > > I do believe this contribution can be useful.
> > > > > If you don't have time to create the Jira or contribute the pull
> > > request,
> > > > > no worries, just let us know and we (committers) will organize to
> do
> > > it.
> > > > > > Thanks for your help. Without the effort of our users, Apache Solr
> > > > wouldn't
> > > > > be the same.
> > > > > Cheers
> > > > > --
> > > > > Alessandro Benedetti
> > > > > Apache Lucene/Solr Committer
> > > > > Director, R&D Software Engineer, Search Consultant
> > > > >
> > > > > www.sease.io
> > > > >
> > > > >
> > > > > On Fri, 16 Jul 2021 at 20:29, Roopa Rao  wrote:
> > > > >
> > > > > > Spyros, thank you for verifying this, we are planning to do
> > something
> > > > > > similar.
> > > > > >
> > > > > > Thanks,
> > > > > > Roopa
> > > > > >
> > > > > > On Fri, Jul 16, 2021 at 12:09 PM Spyros Kapnissis <
> > ska...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > Just to verify this, we had come across the exact same issue
> when
> > > > > > > > converting an XGBoost model to MultipleAdditiveTrees. This was
> an
> > > > issue
> > > > > > > specifically with the categorical features that take on integer
> > > > values.
> > > > > > We
> > > > > > > ended up subtracting 0.5 from the threshold value on any such
> > split
> > > > > point
> > > > > > > on the converted model, so that it would output the same score
> as
> > > the
> > > > > > input
> > > > > > > model.
> > > > > > >
> > > > > > > On Fri, Jul 16, 2021, 18:19 Roopa Rao 
> wrote:
> > > > > > >
> > > > > > > > Okay, thank you for the input
> > > > > > > >
> > > > > > > > Roopa
> > > > > > > >
> > > > > > > > On Fri, Jul 16, 2021 at 5:55 AM Alessandro Benedetti <
> > > > > > > a.benede...@sease.io
> > > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Roopa,
> > > > > > > > > I was not able to find why that slack was added.
> > > > > > > > > I am not sure why we would like to change the threshold.
> > > > > > > > > I would recommend creating a Jira issue and tag at least
> > > myself,
> > > > > > > > Christine
> > > > > > > > > Poerschke and Diego Ceccarelli, so we can discuss and
> > > potentially
> > > > > > open
> > > > > > > a
> > > > > > > > > pull request.
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Alessandro Benedetti
> > > > > > > > > Apache Lucene/Solr Committer
> > > > > > > > > Director, R&D Software Engineer, Search Consultant
> > > > > > > > >
> > > > > > > > > www.sease.io
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, 15 Jul 2021 at 22:24, Roopa Rao  >
> > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi All,
> > > > > > > > > >
> > > > > > > > > > In LTR for MultipleAdditiveTreeModel what is the purpose
> of
> > > > > adding
> > > > > > > > > > NODE_SPLIT_SLACK
> > > > > > > > > > to the threshold?
> > > > > > > > > >
> > > > > > > > > > Reference:
> > > org.apache.solr.ltr.model.MultipleAdditiveTreesModel
> > > > > > > > > >
> > > > > > > > > > private static final float NODE_SPLIT_SLACK = 1E-6f;
> > > > > > > > > >
> > > > > > > > > >
> > > > > 

[jira] [Commented] (SOLR-15072) Support building and testing Solr on ARM64 architecture

2021-07-28 Thread Ganesh Raju (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389081#comment-17389081
 ] 

Ganesh Raju commented on SOLR-15072:


Any update?

> Support building and testing Solr on ARM64 architecture
> ---
>
> Key: SOLR-15072
> URL: https://issues.apache.org/jira/browse/SOLR-15072
> Project: Solr
>  Issue Type: Improvement
>Reporter: liusheng
>Priority: Major
>
> Currently, more and more software projects support running on the ARM64 
> platform. For example, Hadoop has published ARM64-specific packages:
> [https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.3.0/hadoop-3.3.0-aarch64.tar.gz]
>  
> and also has an ARM-specific CI job configured:
> [https://ci-hadoop.apache.org/job/Hive-trunk-linux-ARM/]
>  
> Other projects, such as Spark, Kudu and HBase, now also have ARM 
> support and ARM CI builds. It would be good if Solr were also regularly 
> tested on ARM64.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Quick Query Question: "body":""

2021-07-28 Thread Shawn Heisey

On 7/28/2021 11:48 AM, mtn search wrote:

> Thanks Walter, Alex!
>
> Yes I regularly use -  Search for *:* -body:*


That syntax, while it works for finding docs where the body field is 
entirely missing, is not the best option.  You'll likely find that this 
syntax is MUCH faster, and returns identical results:


*:* -body:[* TO *]

The reason it's faster is that * by itself is a wildcard query, and 
unless the cardinality of the field is very low, those are extremely 
inefficient.  I would expect a field named "body" to have cardinality in 
the millions or billions, for sufficiently large indexes.


Note that neither of these queries will find docs where the value is the 
empty string, because I believe that IS matched by either a wildcard 
query or a range query.


Thanks,
Shawn
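
For readers who want to try the two query forms from the message above, here is a minimal sketch that builds each query string and URL-encodes it for Solr's /select handler. The host, port and collection name ("mycollection") are placeholders assumed for illustration, and the helper names are hypothetical; only the query strings themselves come from the thread.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: adjust host, port and collection to your setup.
SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"


def missing_field_query(field, use_range=True):
    """Build a query for docs that have no indexed value in `field`.

    use_range=True gives the range form recommended in the thread,
    use_range=False gives the wildcard form it advises against.
    """
    clause = f"{field}:[* TO *]" if use_range else f"{field}:*"
    return f"*:* -{clause}"


def select_url(query, rows=10):
    """Return a /select URL with the query safely URL-encoded."""
    return SOLR_SELECT + "?" + urlencode({"q": query, "rows": rows})


if __name__ == "__main__":
    print(missing_field_query("body"))                    # range form
    print(missing_field_query("body", use_range=False))   # wildcard form
    print(select_url(missing_field_query("body")))
```

Encoding the query through urlencode avoids hand-escaping the spaces, brackets and asterisks when the URL is pasted into a browser or passed to an HTTP client.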


Re: Quick Query Question: "body":""

2021-07-28 Thread Rahul Goswami
If ‘body’ field is indexed=true, Shawn’s query should give you results
where body=“” as well as where body field doesn’t exist at all.
Also, I agree that the format body:[* TO *] is much faster for high
cardinality fields (which most likely “body” is).

-Rahul

On Wed, Jul 28, 2021 at 7:46 PM Shawn Heisey  wrote:

> On 7/28/2021 11:48 AM, mtn search wrote:
> > Thanks Walter, Alex!
> >
> > Yes I regularly use -  Search for *:* -body:*
>
> That syntax, while it works for finding docs where the body field is
> entirely missing, is not the best option.  You'll likely find that this
> syntax is MUCH faster, and returns identical results:
>
> *:* -body:[* TO *]
>
> The reason it's faster is that * by itself is a wildcard query, and
> unless the cardinality of the field is very low, those are extremely
> inefficient.  I would expect a field named "body" to have cardinality in
> the millions or billions, for sufficiently large indexes.
>
> Note that neither of these queries will find docs where the value is the
> empty string, because I believe that IS matched by either a wildcard
> query or a range query.
>
> Thanks,
> Shawn
>


Re: Quick Query Question: "body":""

2021-07-28 Thread Rahul Goswami
Minor edit: *if the “body” field is indexed=true AND analyzed (i.e. some text
type, not of type “string”).


On Wed, Jul 28, 2021 at 9:53 PM Rahul Goswami  wrote:

> If ‘body’ field is indexed=true, Shawn’s query should give you results
> where body=“” as well as where body field doesn’t exist at all.
> Also, I agree that the format body:[* TO *] is much faster for high
> cardinality fields (which most likely “body” is).
>
> -Rahul
>
> On Wed, Jul 28, 2021 at 7:46 PM Shawn Heisey  wrote:
>
>> On 7/28/2021 11:48 AM, mtn search wrote:
>> > Thanks Walter, Alex!
>> >
>> > Yes I regularly use -  Search for *:* -body:*
>>
>> That syntax, while it works for finding docs where the body field is
>> entirely missing, is not the best option.  You'll likely find that this
>> syntax is MUCH faster, and returns identical results:
>>
>> *:* -body:[* TO *]
>>
>> The reason it's faster is that * by itself is a wildcard query, and
>> unless the cardinality of the field is very low, those are extremely
>> inefficient.  I would expect a field named "body" to have cardinality in
>> the millions or billions, for sufficiently large indexes.
>>
>> Note that neither of these queries will find docs where the value is the
>> empty string, because I believe that IS matched by either a wildcard
>> query or a range query.
>>
>> Thanks,
>> Shawn
>>
>
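
The distinction Rahul draws above (analyzed text field versus "string" field) can be illustrated with a toy model. This is only a sketch of the reasoning in the thread, not Solr's actual implementation: it assumes an analyzed field is tokenized on whitespace, so an empty string yields no terms, while a string field indexes the whole value as a single term, even when that value is empty. All function names are hypothetical.

```python
def indexed_terms(value, analyzed):
    """Toy stand-in for indexing: return the terms stored for one field value."""
    if value is None:          # field absent from the document
        return []
    if analyzed:               # text type: whitespace tokenizer as a stand-in
        return value.split()   # "" produces no tokens at all
    return [value]             # string type: whole value is one term, even ""


def returned_by_missing_query(value, analyzed):
    """True if the doc would be returned by *:* -body:[* TO *] in this toy model.

    The range clause matches any doc with at least one indexed term,
    and the leading minus excludes those docs.
    """
    return len(indexed_terms(value, analyzed)) == 0


if __name__ == "__main__":
    docs = {"field missing": None, "empty string": "", "has text": "hello world"}
    for label, value in docs.items():
        print(label,
              "| text field:", returned_by_missing_query(value, analyzed=True),
              "| string field:", returned_by_missing_query(value, analyzed=False))
```

Under these assumptions the empty-string document is returned only when the field is analyzed, which is consistent with both Shawn's caveat and Rahul's correction.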