Re: Commit strategy for Heavy Bulk Indexing into solr
There were some big changes related to child indexing in solr 8.8, under this ticket: https://issues.apache.org/jira/browse/SOLR-14923

It's worth updating solr to latest 8.8 and trying again, perhaps your indexing issue has already been fixed.

On 2021/07/27 19:44:13, Pratik Patel wrote:
> So it looks like I have narrowed down where the problem is and have also
> found a workaround but I would like to understand more.
>
> As I had mentioned, we have two stages in our bulk indexing operation.
>
> stage 1 : index Article documents [A1, A2.An]
> stage 2 : index Article documents with children [A1 with children, A2 with
> children..An with children]
>
> We were always running into issues in stage 2.
> After some time in stage 2, *solrClient.add( ,
> commitWithin )* starts to timeout and then these timeouts happen
> consistently. Even the socketTimeout of 30 mins was exceeded by add call
> and we got socketTimeoutException.
>
> We have set commitWithin to be 6 hours to avoid unnecessary soft commits.
> Auto commit interval is 1 min with openSearcher=false and autoSoftCommit
> interval is 5 min.
>
> As mentioned above, we first index just the Articles in stage 1 and then in
> stage 2, the same set of Articles are indexed with children (block join). I
> had a suspicion that the huge amount of time taken by *solrClient.add* call
> can have something to do with the *block join updates *that take place in
> stage 2. Adding fresh joins of Articles with children on an empty
> collection was much faster and ran without SocketTimeout. So I modified our
> indexing pipeline to be as follows.
>
> 1. stage 1 : index Article documents [A1, A2.An]
> 2. delete all the Article documents
> 3. stage 2 : index Article documents with children [A1 with children, A2
> with children..An with children]
>
> With this change, stage 2 would be a simple *add operation and not an
> update operation.* I tested the bulk indexing with this change and it
> finished successfully without any issues in a shorter time period!
>
> It will be very helpful to know what is the difference between
> A: When we add a document with children when collection does not already
> have the same document
> B: When we add a document with children when collection already has the
> same document without children
>
> I understand that *update *takes place in B but how can we explain such a
> difference in performance between A and B.
>
> Please note that we use RxJava and call solrClient.add() in parallel
> threads with a set of Article documents and the socketTimeout issue seems
> to pop up after we have already indexed about 90% of the documents.
>
> Some more clarity on what could be happening will be very useful.
>
> Thanks
>
> On Fri, Jul 23, 2021 at 2:31 PM Pratik Patel wrote:
>
> > Hi All,
> >
> > *tl;dr* : running into long GC pauses and solr client socket timeouts
> > when indexing bulk of documents into solr. Commit strategy in essence is to
> > do hard commits at the interval of 50k documents (maxDocs=50k) and disable
> > soft commit altogether during bulk indexing. Simple solr cloud set up with
> > one node and one shard.
> >
> > *Details*:
> > We have about 6 million documents which we are trying to index into solr.
> > From these, about 500k documents have a text field which holds Abstracts of
> > scientific papers/Articles. We extract keywords from these Abstracts and we
> > index these keywords as well into solr.
> >
> > We have a many to many kind of relationship between Articles and keywords.
> > To store this, we have following structure.
> >
> > Article documents
> > Keyword documents
> > Article-Keyword Join documents
> >
> > We use block join to index Articles with "Article-Keyword" join documents
> > and Keyword documents are indexed independently.
> >
> > In other words, we have blocks of "Article + Article-Keyword Joins" and we
> > have Keyword documents(they hold some additional metadata about keyword ).
> >
> > We have a bulk processing operation which creates these documents and
> > indexes them into solr. During this bulk indexing, we don't need documents
> > to be searchable. We need to search against them only after ALL the
> > documents are indexed.
> >
> > *Based on this, this is our current strategy. *
> > Soft commits are disabled and Hard commits are done at an interval of 50k
> > documents with openSearcher=false. Our code triggers explicit commits 4
> > times after various stages of bulk indexing. Transaction logs are enabled
> > and have default settings.
> >
> > <autoCommit>
> >   <maxTime>${solr.autoCommit.maxTime:-1}</maxTime>
> >   <maxDocs>${solr.autoCommit.maxDocs:5}</maxDocs>
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> >
> > <autoSoftCommit>
> >   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> > </autoSoftCommit>
> >
> > Other Environmental Details:
> > Xms=8g and Xmx=14g, solr client socketTimeout=7 minutes and
> > zkClienttimeout=2 mins
> > Our indexing operation triggers many "add" operations in parallel using
> > RxJava (15 to 30 threads) each "add" operation is p
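A note on the `${solr.autoCommit.maxTime:-1}` style values quoted above: in solrconfig.xml the text before the colon is a system property name and the text after it is the default used when that property is not set, so this value resolves to -1 unless the property is supplied at startup. The sketch below only mimics that lookup in plain Java to make the rule concrete. The class name, the method name and the choice to throw when there is neither a value nor a default are assumptions made for illustration; this is not Solr's own code.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderSketch {
    // Matches ${name} or ${name:default}; group 1 is the name, group 2 the default (may be null).
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([^:}]+)(?::([^}]*))?\\}");

    // Hypothetical helper: replaces every placeholder with the property value or its default.
    static String resolve(String text, Map<String, String> props) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = props.get(m.group(1));
            if (value == null) {
                value = m.group(2); // fall back to the default after the colon
            }
            if (value == null) {
                // Choice made for this sketch only: fail loudly when nothing is available.
                throw new IllegalArgumentException("no value or default for " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Property not set: the default after the colon is used.
        System.out.println(resolve("${solr.autoCommit.maxTime:-1}", Map.of()));
        // Property set (for example via -Dsolr.autoCommit.maxTime=60000): the set value wins.
        System.out.println(resolve("${solr.autoCommit.maxTime:-1}",
                Map.of("solr.autoCommit.maxTime", "60000")));
    }
}
```

The practical consequence for the thread is that the effective commit settings depend on both the config file and any -D properties passed to the node, so both need checking when the observed commit behaviour does not match the file.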
Re: Commit strategy for Heavy Bulk Indexing into solr
Thanks Endika! https://issues.apache.org/jira/browse/SOLR-14923 @DavidSmiley do you think this could be related to the issue I have described? I will certainly update our solr image but it will be good to know the root cause of the issue. Your comment on this would be very helpful. Thanks On Wed, Jul 28, 2021 at 7:16 AM Endika Posadas wrote: > There were some big changes related to child indexing in solr 8.8, under > this ticket: https://issues.apache.org/jira/browse/SOLR-14923 > It's worth updating solr to latest 8.8 and trying again, perhaps your > indexing issue has already been fixed. > > On 2021/07/27 19:44:13, Pratik Patel wrote: > > So it looks like I have narrowed down where the problem is and have also > > found a workaround but I would like to understand more. > > > > As I had mentioned, we have two stages in our bulk indexing operation. > > > > stage 1 : index Article documents [A1, A2.An] > > stage 2 : index Article documents with children [A1 with children, A2 > with > > children..An with children] > > > > We were always running into issues in stage 2. > > After some time in stage 2, *solrClient.add( , > > commitWithin )* starts to timeout and then these timeouts happen > > consistently. Even the socketTimeout of 30 mins was exceeded by add call > > and we got socketTimeoutException. > > > > We have set commitWithin to be 6 hours to avoid unnecessary soft commits. > > Auto commit interval is 1 min with openSearcher=false and autoSoftCommit > > interval is 5 min. > > > > As mentioned above, we first index just the Articles in stage 1 and then > in > > stage 2, the same set of Articles are indexed with children (block > join). I > > had a suspicion that the huge amount of time taken by *solrClient.add* > call > > can have something to do with the *block join updates *that take place in > > stage 2. Adding fresh joins of Articles with children on an empty > > collection was much faster and ran without SocketTimeout. 
So I modified > our > > indexing pipeline to be as follows. > > > > 1. stage 1 : index Article documents [A1, A2.An] > > 2. delete all the Article documents > > 3. stage 2 : index Article documents with children [A1 with children, A2 > > with children..An with children] > > > > With this change, stage 2 would be a simple *add operation and not an > > update operation.* I tested the bulk indexing with this change and it > > finished successfully without any issues in a shorter time period! > > > > It will be very helpful to know what is the difference between > > A: When we add a document with children when collection does not already > > have the same document > > B: When we add a document with children when collection already has the > > same document without children > > > > I understand that *update *takes place in B but how can we explain such a > > difference in performance between A and B. > > > > Please note that we use RxJava and call solrClient.add() in parallel > > threads with a set of Article documents and the socketTimeout issue seems > > to pop up after we have already indexed about 90% of the documents. > > > > Some more clarity on what could be happening will be very useful. > > > > Thanks > > > > On Fri, Jul 23, 2021 at 2:31 PM Pratik Patel > wrote: > > > > > Hi All, > > > > > > *tl;dr* : running into long GC pauses and solr client socket timeouts > > > when indexing bulk of documents into solr. Commit strategy in essence > is to > > > do hard commits at the interval of 50k documents (maxDocs=50k) and > disable > > > soft commit altogether during bulk indexing. Simple solr cloud set up > with > > > one node and one shard. > > > > > > *Details*: > > > We have about 6 million documents which we are trying to index into > solr. > > > From these, about 500k documents have a text field which holds > Abstracts of > > > scientific papers/Articles. We extract keywords from these Abstracts > and we > > > index these keywords as well into solr. 
> > > > > > We have a many to many kind of relationship between Articles and > keywords. > > > To store this, we have following structure. > > > > > > Article documents > > > Keyword documents > > > Article-Keyword Join documents > > > > > > We use block join to index Articles with "Article-Keyword" join > documents > > > and Keyword documents are indexed independently. > > > > > > In other words, we have blocks of "Article + Article-Keyword Joins" > and we > > > have Keyword documents(they hold some additional metadata about > keyword ). > > > > > > We have a bulk processing operation which creates these documents and > > > indexes them into solr. During this bulk indexing, we don't need > documents > > > to be searchable. We need to search against them only after ALL the > > > documents are indexed. > > > > > > *Based on this, this is our current strategy. * > > > Soft commits are disabled and Hard commits are done at an interval of > 50k > > > documents with openSearcher=false. Our code triggers explicit comm
Quick Query Question: "body":""
Hello,

Some documents in my collection have an empty body field.

    "body":"",

I am looking for a query to find docs with a body field with this "empty" value.

Normally, I might run with a filter

    fq=-body:*

On this large set of shards that I am querying it times out with the wildcard. On smaller sets, it shows when there is no body field.

Other attempts at representing a value of "" in the filter query have not worked.

Any tips?

Thanks,
Matthew
Re: Quick Query Question: "body":""
Search for *:* -body:*

I do this pretty often.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Jul 28, 2021, at 8:48 AM, mtn search wrote:
>
> Hello,
>
> Some documents in my collection have an empty body field.
>    "body":"",
>
> I am looking for a query to find docs with a body field with this "empty"
> value.
>
> Normally, I might run with a filter
>    fq=-body:*
>
> On this large set of shards that I am querying it times out with the
> wildcard. On smaller sets, it shows when there is no body field.
>
> Other attempts at representing a value of "" in the filter query have not
> worked.
>
> Any tips?
>
> Thanks,
> Matthew
Re: Quick Query Question: "body":""
This (present/absent) condition is there during the indexing time. So, apply the optimization at the indexing rather than at the search time.

Have a custom UpdateRequestProcessor chain that will create a boolean flag that matches "body" presence/absence.

You can do that with copy, default-value and regex I think, though there may be more elegant combinations. The goal is that by the time that copied field hits the schema, it matches a boolean field definition for most efficiency.

Regards,
   Alex.

On Wed, 28 Jul 2021 at 11:55, mtn search wrote:
>
> Hello,
>
> Some documents in my collection have an empty body field.
>    "body":"",
>
> I am looking for a query to find docs with a body field with this "empty"
> value.
>
> Normally, I might run with a filter
>    fq=-body:*
>
> On this large set of shards that I am querying it times out with the
> wildcard. On smaller sets, it shows when there is no body field.
>
> Other attempts at representing a value of "" in the filter query have not
> worked.
>
> Any tips?
>
> Thanks,
> Matthew
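To make the index-time idea concrete, here is a minimal sketch of the logic such a processor chain would need to produce. It is plain Java operating on a map and does not use any Solr API. The field name `has_body` and the class and method names are hypothetical, and treating only a missing value or the empty string as "no body" is an assumption about what the original poster wants.

```java
import java.util.HashMap;
import java.util.Map;

public class BodyFlagSketch {
    // Hypothetical name for the boolean flag field; it would need a matching schema entry.
    static final String FLAG_FIELD = "has_body";

    // Returns a copy of the document with the flag set from the presence of a non-empty "body".
    static Map<String, Object> withBodyFlag(Map<String, Object> doc) {
        Map<String, Object> copy = new HashMap<>(doc);
        Object body = copy.get("body");
        boolean hasBody = body != null && !body.toString().isEmpty();
        copy.put(FLAG_FIELD, hasBody);
        return copy;
    }

    public static void main(String[] args) {
        Map<String, Object> emptyBody = new HashMap<>();
        emptyBody.put("id", "doc1");
        emptyBody.put("body", "");
        // The flag is false both for "body":"" and for a missing body field.
        System.out.println(withBodyFlag(emptyBody).get(FLAG_FIELD));
    }
}
```

With a flag like this in the index, the search side becomes a plain term filter such as fq=has_body:false instead of a negated wildcard over the whole body field.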
Re: Help with unsubscribe because automated didn't work
Same thing is happening to me. Could you please also unsubscribe car...@rocha.cc On Tue, Jul 27, 2021 at 5:26 PM Anshum Gupta wrote: > Done. > > On Tue, Jul 27, 2021 at 1:36 PM Jagpreet Mahajan < > jagpreetkhan...@gmail.com> > wrote: > > > Can the same be done for my email address jagpreetkhan...@gmail.com as > > well please > > > > Thanks > > > > Sent from my iPhone > > > > On Jul 27, 2021, at 20:36, Anshum Gupta wrote: > > > > Hi Katie, > > > > I've unsubscribed those three addresses from the users@solr mailing > list. > > Please reach out if you continue to receive emails. > > > > On Tue, Jul 27, 2021 at 12:17 PM kmccork > wrote: > > > > > Hello, > > > > > > I would like to unsubscribe kmcc...@u.washington.edu, kmcc...@uw.edu, > > > kmcc...@uw.cse.edu from the listserv. Any email with "kmccok@". > > > > > > Hello, > > > > > > I am no longer able to send an email out from those emails, but I would > > > like to leave the forwarding turned on, however the solr listserv is > > > forwarding. I have been getting the solr emails since 2014 and I am no > > > longer interested. > > > > > > I tried unsubscribing from this email by emailing > > > users-unsubscr...@solr.apache.org, however that did not work. > > > > > > Thank you so much, > > > Katie > > > > > > > > -- > Anshum Gupta >
Re: Configuring Solr JSON logs output to file using JsonLayout in log4j2.xml file
Hi, Alex.

I think, first of all, check the log with the property log4j.debug=true. Does Solr's log4j see your JSON layout plugin? Make sure that your layout is on the Solr classpath (in WEB-INF/lib or solr/server/ext/lib).

Tuesday, 27 July 2021, 18:27 +03:00, from Alexey Murz Korepov mur...@gmail.com:
>Hello, does anyone have a working Solr instance with JSON log format, using
>JsonLayout in log4j2.xml? Please share your configuration!
>
>I have jackson-core and other jackson packages inside the Solr folder:
>./server/solr-webapp/webapp/WEB-INF/lib/jackson-core-2.11.2.jar
>./server/solr-webapp/webapp/WEB-INF/lib/jackson-databind-2.11.2.jar
>./server/solr-webapp/webapp/WEB-INF/lib/jackson-annotations-2.11.2.jar
>./server/solr-webapp/webapp/WEB-INF/lib/jackson-dataformat-smile-2.11.2.jar
>
>But the log file isn't even created, and I don't even see any errors about
>this in the output!
>
>If I replace back JsonLayout to PatternLayout - all becomes work well.
>
>I have found a similar problem in the mail list here
>https://www.mail-archive.com/solr-user@lucene.apache.org/msg152191.html but
>it still without a solution.
>
>Can anybody help me with this? Maybe I need to add some dependencies
>manually in some Solr config file, or copy libraries files to some
>other folder? Thanks!
>
>--
>Best regards,
>Alexey Murz Korepov.
>E-mail: mur...@gmail.com
>Messengers: Matrix - https://matrix.to/#/@murz:ru-matrix.org Telegram -
>@MurzNN
Re: Quick Query Question: "body":""
Thanks Walter, Alex!

Yes, I regularly use - Search for *:* -body:* . With the size of the Master/Slave deployment and the number of shards, in this case the wildcard query times out... I plan to add some additional fqs to narrow the scope.

I do not have an immediate option to change the indexing process/schema, but that is a good idea, Alex.

Matthew

On Wed, Jul 28, 2021 at 10:10 AM Alexandre Rafalovitch wrote:
> This (present/absent) condition is there during the indexing time. So,
> apply the optimization at the indexing rather than at the search time.
>
> Have a custom UpdateRequestProcessor chain that will create a boolean
> flag that matches "body" presence/absence.
>
> You can do that with copy, default-value and regex I think, though
> there may be more elegant combinations. The goal is that by the time
> that copied field hits the schema, it matches a boolean field
> definition for most efficiency.
>
> Regards,
>    Alex.
>
> On Wed, 28 Jul 2021 at 11:55, mtn search wrote:
> >
> > Hello,
> >
> > Some documents in my collection have an empty body field.
> >    "body":"",
> >
> > I am looking for a query to find docs with a body field with this "empty"
> > value.
> >
> > Normally, I might run with a filter
> >    fq=-body:*
> >
> > On this large set of shards that I am querying it times out with the
> > wildcard. On smaller sets, it shows when there is no body field.
> >
> > Other attempts at representing a value of "" in the filter query have not
> > worked.
> >
> > Any tips?
> >
> > Thanks,
> > Matthew
Re: Quick Query Question: "body":""
I wonder if by adding boolean docvalue to schema, you could use in-place updates to add that information post-indexing by basically running batch checks on documents that don't have that new flag at all and then updating it to be 0/1 to indicate body.

The in-place update would avoid having to reindex the data from scratch.

Just brainstorming some other ways to the same "make it boolean" target, never tried it.

Regards,
   Alex.

On Wed, 28 Jul 2021 at 13:49, mtn search wrote:
>
> Thanks Walter, Alex!
>
> Yes I regularly use - Search for *:* -body:* . With the size of the
> Master/Slave deployment and number of shards, in this case the wildcard
> query timesout... I plan to add some additional fqs, to narrow the scope.
>
> I do not have an immediate option to change the indexing process/schema,
> but that is a good idea Alex.
>
> Matthew
>
> On Wed, Jul 28, 2021 at 10:10 AM Alexandre Rafalovitch wrote:
>
> > This (present/absent) condition is there during the indexing time. So,
> > apply the optimization at the indexing rather than at the search time.
> >
> > Have a custom UpdateRequestProcessor chain that will create a boolean
> > flag that matches "body" presence/absence.
> >
> > You can do that with copy, default-value and regex I think, though
> > there may be more elegant combinations. The goal is that by the time
> > that copied field hits the schema, it matches a boolean field
> > definition for most efficiency.
> >
> > Regards,
> >    Alex.
> >
> > On Wed, 28 Jul 2021 at 11:55, mtn search wrote:
> > >
> > > Hello,
> > >
> > > Some documents in my collection have an empty body field.
> > >    "body":"",
> > >
> > > I am looking for a query to find docs with a body field with this "empty"
> > > value.
> > >
> > > Normally, I might run with a filter
> > >    fq=-body:*
> > >
> > > On this large set of shards that I am querying it times out with the
> > > wildcard. On smaller sets, it shows when there is no body field.
> > >
> > > Other attempts at representing a value of "" in the filter query have not
> > > worked.
> > >
> > > Any tips?
> > >
> > > Thanks,
> > > Matthew
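As a rough illustration of the "update it to be 0/1" step, the sketch below builds the JSON body of an atomic "set" update for one document, using Solr's documented JSON atomic-update shape. Everything specific in it is an assumption: the flag field name `has_body_flag` is made up, and whether Solr performs the update in place depends on how that field is defined (as far as the Solr Reference Guide describes it, in-place updates apply to single-valued, non-indexed, non-stored numeric docValues fields, which would explain the 0/1 suggestion). The code only builds a string and does not talk to Solr.

```java
public class InPlaceUpdateSketch {
    // Minimal JSON string escaping for the values used here (backslash and double quote only).
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds a one-document atomic update that sets a numeric flag field to the given value.
    static String setFlagJson(String id, String flagField, int value) {
        return "[{\"id\":\"" + escape(id) + "\",\""
                + escape(flagField) + "\":{\"set\":" + value + "}}]";
    }

    public static void main(String[] args) {
        // "has_body_flag" is a hypothetical field name.
        System.out.println(setFlagJson("doc1", "has_body_flag", 0));
    }
}
```

A batch job along the lines Alex describes would page through documents lacking the flag, decide 0 or 1 for each, and post payloads of this shape to the collection's update handler.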
Re: MultipleAdditiveTreeModel
Hi Alessandro, Roopa, I created the ticket here: https://issues.apache.org/jira/browse/SOLR-15569 . I don't think I have permission to add people though, so please tag whomever you feel is necessary. Pls let me know if you need any more info, thanks! On Tue, Jul 27, 2021 at 1:00 PM Alessandro Benedetti wrote: > Hi Spyros, Roopa, > if you can create the Jira ticket with all the details you gathered, that > would be much appreciated. > If you tag me, Christine Poerschke, and Diego Ceccarelli at least, we'll > take over from there! > Thanks! > -- > Alessandro Benedetti > Apache Lucene/Solr Committer > Director, R&D Software Engineer, Search Consultant > > www.sease.io > > > On Mon, 26 Jul 2021 at 21:29, Spyros Kapnissis wrote: > > > Hi Alessandro, Roopa, I also agree that this issue should be further > > investigated and fixed. Please let me know if you need any help opening > the > > Jira ticket and provide more details. > > > > On Mon, Jul 26, 2021, 21:04 Roopa Rao wrote: > > > > > Hi Alessandro, > > > I haven't created JIRA for this, we solved this the similar way that > > Spyros > > > described, by changing the threshold in the model. > > > Ya it would be good to understand why there is the SLACK added. > > > > > > Thanks, > > > Roopa > > > > > > On Mon, Jul 26, 2021 at 10:52 AM Alessandro Benedetti < > > > a.benede...@sease.io> > > > wrote: > > > > > > > I didn't get any additional notification (or maybe I missed it). > > > > Has the Jira been created yet? > > > > Boolean features are quite common around Learning To Rank use cases. > > > > I do believe this contribution can be useful. > > > > If you don't have time to create the Jira or contribute the pull > > request, > > > > no worries, just let us know and we (committers) will organize to do > > it. > > > > Thanks for your help. without the effort of our users, Apache Solr > > > wouldn't > > > > be the same. 
> > > > Cheers > > > > -- > > > > Alessandro Benedetti > > > > Apache Lucene/Solr Committer > > > > Director, R&D Software Engineer, Search Consultant > > > > > > > > www.sease.io > > > > > > > > > > > > On Fri, 16 Jul 2021 at 20:29, Roopa Rao wrote: > > > > > > > > > Spyros, thank you for verifying this, we are planning to do > something > > > > > similar. > > > > > > > > > > Thanks, > > > > > Roopa > > > > > > > > > > On Fri, Jul 16, 2021 at 12:09 PM Spyros Kapnissis < > ska...@gmail.com> > > > > > wrote: > > > > > > > > > > > Hello, > > > > > > > > > > > > Just to verify this, we had come across the exact same issue when > > > > > > converting an XGBoost model to MUltipleAdditiveTrees. This was an > > > issue > > > > > > specifically with the categorical features that take on integer > > > values. > > > > > We > > > > > > ended up subtracting 0.5 from the threshold value on any such > split > > > > point > > > > > > on the converted model, so that it would output the same score as > > the > > > > > input > > > > > > model. > > > > > > > > > > > > On Fri, Jul 16, 2021, 18:19 Roopa Rao wrote: > > > > > > > > > > > > > Okay, thank you for the input > > > > > > > > > > > > > > Roopa > > > > > > > > > > > > > > On Fri, Jul 16, 2021 at 5:55 AM Alessandro Benedetti < > > > > > > a.benede...@sease.io > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Hi Roopa, > > > > > > > > I was not able to find why that slack was added. > > > > > > > > I am not sure why we would like to change the threshold. > > > > > > > > I would recommend creating a Jira issue and tag at least > > myself, > > > > > > > Christine > > > > > > > > Poerschke and Diego Ceccarelli, so we can discuss and > > potentially > > > > > open > > > > > > a > > > > > > > > pull request. 
> > > > > > > > > > > > > > > > Cheers > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Alessandro Benedetti > > > > > > > > Apache Lucene/Solr Committer > > > > > > > > Director, R&D Software Engineer, Search Consultant > > > > > > > > > > > > > > > > www.sease.io > > > > > > > > > > > > > > > > > > > > > > > > On Thu, 15 Jul 2021 at 22:24, Roopa Rao > > > wrote: > > > > > > > > > > > > > > > > > Hi All, > > > > > > > > > > > > > > > > > > In LTR for MultipleAdditiveTreeModel what is the purpose of > > > > adding > > > > > > > > > NODE_SPLIT_SLACK > > > > > > > > > to the threshold? > > > > > > > > > > > > > > > > > > Reference: > > org.apache.solr.ltr.model.MultipleAdditiveTreesModel > > > > > > > > > > > > > > > > > > private static final float NODE_SPLIT_SLACK = 1E-6f; > > > > > > > > > > > > > > > > > > > > > > > > > > > public void setThreshold(float threshold) { this.threshold > = > > > > > > threshold > > > > > > > + > > > > > > > > > NODE_SPLIT_SLACK; } > > > > > > > > > > > > > > > > > > We have a feature which can return 0.0 or 1.0 > > > > > > > > > > > > > > > > > > And model with this tree: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > is_xyz_feature,threshold=0.999
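The interaction between the slack in the quoted setThreshold and the 0.5 adjustment Spyros describes can be illustrated with a small stand-alone sketch. Two things in it are assumptions rather than facts taken from the thread: that the Solr tree node sends a document left when `value <= threshold`, and that the source XGBoost model sends it left when `value < splitValue`. Only the `threshold + NODE_SPLIT_SLACK` line mirrors the quoted code; the class and method names are made up.

```java
public class SplitSlackSketch {
    private static final float NODE_SPLIT_SLACK = 1E-6f;

    // Assumed Solr-side rule: the stored threshold includes the slack, and the comparison is <=.
    static boolean solrGoesLeft(float value, float modelThreshold) {
        float threshold = modelThreshold + NODE_SPLIT_SLACK; // mirrors the quoted setThreshold
        return value <= threshold;
    }

    // Assumed XGBoost-side rule: strictly less than the split value goes left.
    static boolean xgboostGoesLeft(float value, float splitValue) {
        return value < splitValue;
    }

    public static void main(String[] args) {
        float split = 1.0f; // a split on an integer-valued (0/1) feature
        // Feature value 1: the two rules disagree when the threshold is copied as-is.
        System.out.println(xgboostGoesLeft(1.0f, split) + " vs " + solrGoesLeft(1.0f, split));
        // After subtracting 0.5 from the converted threshold, both rules agree for 0 and 1.
        System.out.println(xgboostGoesLeft(1.0f, split) + " vs " + solrGoesLeft(1.0f, split - 0.5f));
        System.out.println(xgboostGoesLeft(0.0f, split) + " vs " + solrGoesLeft(0.0f, split - 0.5f));
    }
}
```

Under these assumptions the mismatch only appears when a feature value lands exactly on a split point, which is why integer-valued categorical and boolean features are the ones affected, and why moving the threshold halfway between the two integer values removes it.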
Re: CVE-2021-27905 Apache Solr ReplicationHandler/SSRF vulnerability
Digging out this old thread since I am looking for an answer to the same question. To Matthew's response above, since the /replication is an implicit handler, even if removed from solrconfig.xml, it would still work.

I looked around (aka Googled) to find a way in which someone exploited this vulnerability, but couldn't find it. That would help us get an idea about patching it. If anyone knows more about this CVE or can point me to JIRA for the same, that would be great.

Thanks,
Rahul

On Fri, Jun 18, 2021 at 9:47 AM matthew sporleder wrote:
> I believe these are all related to exposed api/admin endpoints so your
> network is probably protecting you but poor input sanitation could
> expose you, of course- like
> /myappsearch?search=../../replication?evilpayload (classic sql-style
> injection style)
>
> If you have, literally, removed the handlers for those url endpoints
> from your config I think you are pretty safe.
>
> On Fri, Jun 18, 2021 at 6:54 AM Anchal Sharma2 wrote:
> >
> > Hi All,
> >
> > We are currently using Solr Cloud(solr version 8.6.3) in our application .Since it doesn't use master-slave solr approach we do not have replication handler set up (to replicate master to slave)set up on any of our solr nodes.
> > Could some one please confirm ,if following vulnerability is still applicable for us?
> >
> > CVE-2021-27905 Apache Solr ReplicationHandler/SSRF vulnerability
> > Description: A critical vulnerability was found in Apache Solr up to 8.8.1 (CVSS 9.8). Affected by this vulnerability is an unknown code block of the file /replication; the manipulation of the argument masterUrl/leaderUrl with an unknown input can lead to a privilege escalation vulnerability. *Note: There are now POCs targeting CVE-2021-27905 (Apache Solr <= 8.8.1 SSRF), CVE-2017-12629 (Remote Code Execution via SSRF), and CVE-2019-0193 (DataImportHandler). There are also Metasploit modules for the Apache Solr Velocity RCE, and two Apache OFBiz vulnerabilities. Given the number of vulnerabilities, severity, and availability of POCs, it is highly recommended that any vulnerable systems be patched as soon as possible.
> >
> > Thanks
> > Anchal Sharma
Re: MultipleAdditiveTreeModel
Thank you, Spyros. Roopa On Wed, Jul 28, 2021 at 3:00 PM Spyros Kapnissis wrote: > Hi Alessandro, Roopa, I created the ticket here: > https://issues.apache.org/jira/browse/SOLR-15569 . I don't think I have > permission to add people though, so please tag whomever you feel is > necessary. > Pls let me know if you need any more info, thanks! > > On Tue, Jul 27, 2021 at 1:00 PM Alessandro Benedetti > > wrote: > > > Hi Spyros, Roopa, > > if you can create the Jira ticket with all the details you gathered, that > > would be much appreciated. > > If you tag me, Christine Poerschke, and Diego Ceccarelli at least, we'll > > take over from there! > > Thanks! > > -- > > Alessandro Benedetti > > Apache Lucene/Solr Committer > > Director, R&D Software Engineer, Search Consultant > > > > www.sease.io > > > > > > On Mon, 26 Jul 2021 at 21:29, Spyros Kapnissis wrote: > > > > > Hi Alessandro, Roopa, I also agree that this issue should be further > > > investigated and fixed. Please let me know if you need any help opening > > the > > > Jira ticket and provide more details. > > > > > > On Mon, Jul 26, 2021, 21:04 Roopa Rao wrote: > > > > > > > Hi Alessandro, > > > > I haven't created JIRA for this, we solved this the similar way that > > > Spyros > > > > described, by changing the threshold in the model. > > > > Ya it would be good to understand why there is the SLACK added. > > > > > > > > Thanks, > > > > Roopa > > > > > > > > On Mon, Jul 26, 2021 at 10:52 AM Alessandro Benedetti < > > > > a.benede...@sease.io> > > > > wrote: > > > > > > > > > I didn't get any additional notification (or maybe I missed it). > > > > > Has the Jira been created yet? > > > > > Boolean features are quite common around Learning To Rank use > cases. > > > > > I do believe this contribution can be useful. > > > > > If you don't have time to create the Jira or contribute the pull > > > request, > > > > > no worries, just let us know and we (committers) will organize to > do > > > it. 
> > > > > Thanks for your help. without the effort of our users, Apache Solr > > > > wouldn't > > > > > be the same. > > > > > Cheers > > > > > -- > > > > > Alessandro Benedetti > > > > > Apache Lucene/Solr Committer > > > > > Director, R&D Software Engineer, Search Consultant > > > > > > > > > > www.sease.io > > > > > > > > > > > > > > > On Fri, 16 Jul 2021 at 20:29, Roopa Rao wrote: > > > > > > > > > > > Spyros, thank you for verifying this, we are planning to do > > something > > > > > > similar. > > > > > > > > > > > > Thanks, > > > > > > Roopa > > > > > > > > > > > > On Fri, Jul 16, 2021 at 12:09 PM Spyros Kapnissis < > > ska...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > Just to verify this, we had come across the exact same issue > when > > > > > > > converting an XGBoost model to MUltipleAdditiveTrees. This was > an > > > > issue > > > > > > > specifically with the categorical features that take on integer > > > > values. > > > > > > We > > > > > > > ended up subtracting 0.5 from the threshold value on any such > > split > > > > > point > > > > > > > on the converted model, so that it would output the same score > as > > > the > > > > > > input > > > > > > > model. > > > > > > > > > > > > > > On Fri, Jul 16, 2021, 18:19 Roopa Rao > wrote: > > > > > > > > > > > > > > > Okay, thank you for the input > > > > > > > > > > > > > > > > Roopa > > > > > > > > > > > > > > > > On Fri, Jul 16, 2021 at 5:55 AM Alessandro Benedetti < > > > > > > > a.benede...@sease.io > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi Roopa, > > > > > > > > > I was not able to find why that slack was added. > > > > > > > > > I am not sure why we would like to change the threshold. 
> > > > > > > > > I would recommend creating a Jira issue and tag at least > > > myself, > > > > > > > > Christine > > > > > > > > > Poerschke and Diego Ceccarelli, so we can discuss and > > > potentially > > > > > > open > > > > > > > a > > > > > > > > > pull request. > > > > > > > > > > > > > > > > > > Cheers > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Alessandro Benedetti > > > > > > > > > Apache Lucene/Solr Committer > > > > > > > > > Director, R&D Software Engineer, Search Consultant > > > > > > > > > > > > > > > > > > www.sease.io > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, 15 Jul 2021 at 22:24, Roopa Rao > > > > > wrote: > > > > > > > > > > > > > > > > > > > Hi All, > > > > > > > > > > > > > > > > > > > > In LTR for MultipleAdditiveTreeModel what is the purpose > of > > > > > adding > > > > > > > > > > NODE_SPLIT_SLACK > > > > > > > > > > to the threshold? > > > > > > > > > > > > > > > > > > > > Reference: > > > org.apache.solr.ltr.model.MultipleAdditiveTreesModel > > > > > > > > > > > > > > > > > > > > private static final float NODE_SPLIT_SLACK = 1E-6f; > > > > > > > > > > > > > > > > > > > > > > > > >
[jira] [Commented] (SOLR-15072) Support building and testing Solr on ARM64 architecture
[ https://issues.apache.org/jira/browse/SOLR-15072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389081#comment-17389081 ]

Ganesh Raju commented on SOLR-15072:

Any update?

> Support building and testing Solr on ARM64 architecture
> ---
>
> Key: SOLR-15072
> URL: https://issues.apache.org/jira/browse/SOLR-15072
> Project: Solr
> Issue Type: Improvement
> Reporter: liusheng
> Priority: Major
>
> Currently, more and more softwares started to support running on ARM64
> platform. For an example, Hadoop has published ARM64 platform specific
> packages:
> [https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.3.0/hadoop-3.3.0-aarch64.tar.gz]
>
> and also have ARM specific CI job configured:
> [https://ci-hadoop.apache.org/job/Hive-trunk-linux-ARM/]
>
> There are also other projects, such as Spark, Kudu, Hbase.etc now have ARM
> support and ARM CI built. It would be good if Solr also is being regularly
> tested on ARM64.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Quick Query Question: "body":""
On 7/28/2021 11:48 AM, mtn search wrote:
> Thanks Walter, Alex!
>
> Yes I regularly use - Search for *:* -body:*

That syntax, while it works for finding docs where the body field is entirely missing, is not the best option. You'll likely find that this syntax is MUCH faster, and returns identical results:

    *:* -body:[* TO *]

The reason it's faster is that * by itself is a wildcard query, and unless the cardinality of the field is very low, those are extremely inefficient. I would expect a field named "body" to have cardinality in the millions or billions, for sufficiently large indexes.

Note that neither of these queries will find docs where the value is the empty string, because I believe that IS matched by either a wildcard query or a range query.

Thanks,
Shawn
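For anyone sending this filter from client code, the range form contains characters (colon, brackets, spaces) that need URL encoding when it is placed in a request URL by hand. The following sketch builds the query string Shawn suggests and encodes it as an fq parameter using only the JDK. The class and method names are made up for illustration, and the field name is not escaped or validated.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class MissingFieldQuerySketch {
    // Builds the "all docs minus docs that have any value in the field" query from the thread.
    static String missingFieldQuery(String field) {
        return "*:* -" + field + ":[* TO *]";
    }

    // Encodes the query for use as an fq parameter in a hand-built URL.
    static String asFqParam(String field) {
        return "fq=" + URLEncoder.encode(missingFieldQuery(field), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(missingFieldQuery("body")); // *:* -body:[* TO *]
        System.out.println(asFqParam("body"));         // fq=*%3A*+-body%3A%5B*+TO+*%5D
    }
}
```

Client libraries such as SolrJ normally handle the encoding themselves, so the second method only matters when the URL is assembled as a string.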
Re: Quick Query Question: "body":""
If ‘body’ field is indexed=true, Shawn’s query should give you results where body=“” as well as where body field doesn’t exist at all.
Also, I agree that the format body:[* TO *] is much faster for high cardinality fields (which most likely “body” is).

-Rahul

On Wed, Jul 28, 2021 at 7:46 PM Shawn Heisey wrote:
> On 7/28/2021 11:48 AM, mtn search wrote:
> > Thanks Walter, Alex!
> >
> > Yes I regularly use - Search for *:* -body:*
>
> That syntax, while it works for finding docs where the body field is
> entirely missing, is not the best option. You'll likely find that this
> syntax is MUCH faster, and returns identical results:
>
> *:* -body:[* TO *]
>
> The reason it's faster is that * by itself is a wildcard query, and
> unless the cardinality of the field is very low, those are extremely
> inefficient. I would expect a field named "body" to have cardinality in
> the millions or billions, for sufficiently large indexes.
>
> Note that neither of these queries will find docs where the value is the
> empty string, because I believe that IS matched by either a wildcard
> query or a range query.
>
> Thanks,
> Shawn
Re: Quick Query Question: "body":""
Minor edit:
*if “body” field is indexed=true AND analyzed (i.e. Some text type; not of type “string”).

On Wed, Jul 28, 2021 at 9:53 PM Rahul Goswami wrote:
> If ‘body’ field is indexed=true, Shawn’s query should give you results
> where body=“” as well as where body field doesn’t exist at all.
> Also, I agree that the format body:[* TO *] is much faster for high
> cardinality fields (which most likely “body” is).
>
> -Rahul
>
> On Wed, Jul 28, 2021 at 7:46 PM Shawn Heisey wrote:
>
>> On 7/28/2021 11:48 AM, mtn search wrote:
>> > Thanks Walter, Alex!
>> >
>> > Yes I regularly use - Search for *:* -body:*
>>
>> That syntax, while it works for finding docs where the body field is
>> entirely missing, is not the best option. You'll likely find that this
>> syntax is MUCH faster, and returns identical results:
>>
>> *:* -body:[* TO *]
>>
>> The reason it's faster is that * by itself is a wildcard query, and
>> unless the cardinality of the field is very low, those are extremely
>> inefficient. I would expect a field named "body" to have cardinality in
>> the millions or billions, for sufficiently large indexes.
>>
>> Note that neither of these queries will find docs where the value is the
>> empty string, because I believe that IS matched by either a wildcard
>> query or a range query.
>>
>> Thanks,
>> Shawn
>>