[jira] [Created] (SOLR-16360) Atomic update on boolean fields doesn't reflect when value starts with "1", "t" or "T"
Rahul Goswami created SOLR-16360:

Summary: Atomic update on boolean fields doesn't reflect when value starts with "1", "t" or "T"
Key: SOLR-16360
URL: https://issues.apache.org/jira/browse/SOLR-16360
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Affects Versions: 8.11
Reporter: Rahul Goswami

I am running Solr 8.11. As per the Solr documentation, any value starting with "1", "t" or "T" for a boolean field is interpreted as true. [https://solr.apache.org/guide/8_11/field-types-included-with-solr.html#recommended-field-types]

However, I hit a potential Solr bug where if the String value "1", "t" or "T" is passed in an atomic update, it is treated as false.

E.g., the document below is indexed first => query returns "inStock" as true (as expected):
{code:java}
{
  "id":"test",
  "inStock":"true"
}
{code}
Follow the above with the atomic update below and commit => inStock becomes false in the query result:
{code:java}
{
  "id":"test",
  "inStock":{"set":"1"}
}
{code}
This doesn't happen, though, if the value "1" is passed in a regular update. E.g., the update below reflects the value of inStock as true when queried:
{code:java}
{
  "id":"test",
  "inStock":"1"
}
{code}

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
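For reference, the documented first-character rule can be sketched as below. This is an illustrative stand-in, not Solr's actual BoolField implementation; the method name parseSolrBool is made up for the example.

```java
// Illustrative sketch of the documented rule for boolean fields:
// a value is true iff its first character is '1', 't' or 'T'.
// This is NOT Solr's actual BoolField code.
public class BoolParseSketch {

    static boolean parseSolrBool(String value) {
        if (value == null || value.isEmpty()) {
            return false;
        }
        char c = value.charAt(0);
        return c == '1' || c == 't' || c == 'T';
    }

    public static void main(String[] args) {
        // Per the documented rule all of these are true; the reported bug is
        // that an atomic update {"set":"1"} ends up stored as false.
        System.out.println(parseSolrBool("true")); // true
        System.out.println(parseSolrBool("1"));    // true
        System.out.println(parseSolrBool("T"));    // true
        System.out.println(parseSolrBool("0"));    // false
    }
}
```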
[jira] [Commented] (SOLR-17038) /admin/segments handler: Expose the term count
[ https://issues.apache.org/jira/browse/SOLR-17038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812137#comment-17812137 ]

Rahul Goswami commented on SOLR-17038:
--
I am working on this.

> /admin/segments handler: Expose the term count
> --
>
> Key: SOLR-17038
> URL: https://issues.apache.org/jira/browse/SOLR-17038
> Project: Solr
> Issue Type: Improvement
> Reporter: David Smiley
> Priority: Minor
> Labels: newdev
>
> The term count for a field is not exposed for diagnostic purposes. Strangely
> enough, more obscure statistics like sumDocFreq and sumTotalTermFreq are.
> Just need to add a line like:
> {quote}fieldFlags.add("termCount", terms.size());{quote}
> to SegmentsInfoRequestHandler next to [where those other stats are
> gathered|https://github.com/apache/solr/blob/releases/solr/9.4.1/solr/core/src/java/org/apache/solr/handler/admin/SegmentsInfoRequestHandler.java#L371-L372].
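As a rough illustration of how small the proposed change is, the sketch below uses a plain map as a stand-in for Solr's NamedList and Lucene's Terms; the stat values are dummies and the class/method names are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: the segments handler already reports per-field stats such
// as sumDocFreq and sumTotalTermFreq; the proposal adds one more entry,
// "termCount". A plain map stands in for Solr's NamedList here.
public class TermCountSketch {

    static Map<String, Long> segmentFieldStats(long sumDocFreq, long sumTotalTermFreq, long termCount) {
        Map<String, Long> fieldFlags = new LinkedHashMap<>();
        fieldFlags.put("sumDocFreq", sumDocFreq);
        fieldFlags.put("sumTotalTermFreq", sumTotalTermFreq);
        // The proposed one-line addition, analogous to
        // fieldFlags.add("termCount", terms.size()) in SegmentsInfoRequestHandler:
        fieldFlags.put("termCount", termCount);
        return fieldFlags;
    }

    public static void main(String[] args) {
        // Dummy numbers, purely to show the shape of the per-field response.
        System.out.println(segmentFieldStats(42, 57, 13));
    }
}
```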
[jira] [Comment Edited] (SOLR-17038) /admin/segments handler: Expose the term count
[ https://issues.apache.org/jira/browse/SOLR-17038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812137#comment-17812137 ]

Rahul Goswami edited comment on SOLR-17038 at 1/30/24 4:23 AM:
--
I am working on this. Any other stats apart from "termCount" that could be useful?

was (Author: rahul196...@gmail.com): I am working on this.
[jira] [Created] (SOLR-17186) Streaming query breaks if token contains backtick
Rahul Goswami created SOLR-17186:

Summary: Streaming query breaks if token contains backtick
Key: SOLR-17186
URL: https://issues.apache.org/jira/browse/SOLR-17186
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: streaming expressions
Affects Versions: 8.5
Reporter: Rahul Goswami

Streaming searches break when the data contains the backtick character (`). E.g.:

http://host-name:8983/solr/MyCollection/stream?expr=search(MyCollection,q="My_Field:Foto`s",fl="field1",qt="/export")

The same search works fine if called directly with /export or /select.
[jira] [Commented] (SOLR-17186) Streaming query breaks if token contains backtick
[ https://issues.apache.org/jira/browse/SOLR-17186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821399#comment-17821399 ]

Rahul Goswami commented on SOLR-17186:
--
Root cause seems to be the replacement of ` with " in StreamExpressionParser, introduced in Solr 8.5 (https://issues.apache.org/jira/browse/SOLR-14139):
https://github.com/apache/solr/blob/main/solr/solrj-streaming/src/java/org/apache/solr/client/solrj/io/stream/expr/StreamExpressionParser.java#L138

Will submit a PR.
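A minimal sketch of the failure mode (simplified; the real StreamExpressionParser logic is more involved, and rewriteBackticks is just a stand-in name for the substitution at the linked line):

```java
// Simplified sketch of the suspected root cause: substituting backticks with
// double quotes corrupts a legitimate backtick inside a query token.
public class BacktickSketch {

    static String rewriteBackticks(String expr) {
        // Stand-in for the backtick-to-quote substitution introduced in SOLR-14139.
        return expr.replace('`', '"');
    }

    public static void main(String[] args) {
        String expr = "search(MyCollection,q=\"My_Field:Foto`s\",fl=\"field1\")";
        // The backtick inside Foto`s becomes a double quote, unbalancing the
        // expression's quoting and breaking the streaming query.
        System.out.println(rewriteBackticks(expr));
    }
}
```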
[jira] [Commented] (SOLR-16703) Clearing all documents of an index should delete traces of a previous Lucene version
[ https://issues.apache.org/jira/browse/SOLR-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845135#comment-17845135 ]

Rahul Goswami commented on SOLR-16703:
--
I have done some work in this area and am happy to take this up. Tied up for the next month, but will get to this by end of June/early July 2024.

> Clearing all documents of an index should delete traces of a previous Lucene version
>
> Key: SOLR-16703
> URL: https://issues.apache.org/jira/browse/SOLR-16703
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 7.6, 8.11.2, 9.1.1
> Reporter: Gaël Jourdan
> Priority: Major
>
> _This is a ticket following a discussion on Slack with_ [~elyograg] _and_ [~wunder] _especially._
> h1. High level scenario
> Assume you're starting from a current Solr server in version 7.x and want to upgrade to 8.x, then 9.x.
> Upgrading from 7.x to 8.x works fine. Indexes of 7.x can still be read with Solr 8.x.
> On a regular basis, you clear* the index to start fresh, assuming this will recreate the index in version 8.x.
> This runs nicely for some time. Then you want to upgrade to 9.x. When starting, you get an error saying that the index is still 7.x and cannot be read by 9.x.
>
> *This is surprising because you'd expect that starting from a fresh index in 8.x would have removed any trace of 7.x.*
>
> _* : when I say "clear", I mean "delete by query {{*:*}} (all docs)" and then commit + optionally optimize._
> h1. What I'd like to see
> Clearing an index when running Solr version N should delete any trace of Lucene version N-1.
> Otherwise this forces users to delete an index (core / collection) and recreate it rather than just clearing it.
> h1. Detailed scenario to reproduce
> The following steps reproduce the issue with a standalone Solr instance running in Docker, but I experienced the issue in SolrCloud mode running on VMs and/or bare metal.
>
> Also note that for personal troubleshooting I used the tool "luceneupgrader" available at [https://github.com/hakanai/luceneupgrader], but it's not necessary to reproduce the issue.
>
> 1. Create a directory for data
> {code:java}
> $ mkdir solrdata
> $ chmod -R a+rwx solrdata {code}
>
> 2. Start a Solr 7.x server, create a core and push some docs
> {code:java}
> $ docker run -d -v "$PWD/solrdata:/opt/solr/server/solr/mycores:rw" -p 8983:8983 --name my_solr_7 solr:7.6.0 solr-precreate gettingstarted
> $ docker exec -it my_solr_7 post -c gettingstarted example/exampledocs/manufacturers.xml
> $ curl -s 'http://localhost:8983/solr/gettingstarted/select?q=*:*' | jq .response.numFound
> 11{code}
>
> 3. Look at the index files and check version
> {code:java}
> $ ll solrdata/gettingstarted/data/index
> total 40K
> -rw-r--r--. 1 8983 8983 718 16 mars 17:37 _0.fdt
> -rw-r--r--. 1 8983 8983 84 16 mars 17:37 _0.fdx
> -rw-r--r--. 1 8983 8983 656 16 mars 17:37 _0.fnm
> -rw-r--r--. 1 8983 8983 112 16 mars 17:37 _0_Lucene50_0.doc
> -rw-r--r--. 1 8983 8983 1,1K 16 mars 17:37 _0_Lucene50_0.tim
> -rw-r--r--. 1 8983 8983 145 16 mars 17:37 _0_Lucene50_0.tip
> -rw-r--r--. 1 8983 8983 767 16 mars 17:37 _0_Lucene70_0.dvd
> -rw-r--r--. 1 8983 8983 730 16 mars 17:37 _0_Lucene70_0.dvm
> -rw-r--r--. 1 8983 8983 478 16 mars 17:37 _0.si
> -rw-r--r--. 1 8983 8983 203 16 mars 17:37 segments_2
> -rw-r--r--. 1 8983 8983 0 16 mars 17:36 write.lock
> $ java -jar luceneupgrader-0.6.0.jar info solrdata/gettingstarted/data/index
> Lucene index version: 7
> {code}
>
> 4. Stop Solr 7, update solrconfig.xml for Solr 8 and start a Solr 8 server
> {code:java}
> $ docker stop my_solr_7
> $ vim solrdata/gettingstarted/conf/solrconfig.xml
> $ cat solrdata/gettingstarted/conf/solrconfig.xml | grep luceneMatchVersion
> 8.11.2
> $ docker run -d -v "$PWD/solrdata:/var/solr/data:rw" -p 8983:8983 --name my_solr_8 solr:8.11.2{code}
>
> 5. Check index is loaded ok and docs are still there
> {code:java}
> $ curl -s 'http://localhost:8983/solr/gettingstarted/select?q=*:*' | jq .response.numFound
> 11 {code}
>
> 6. Clear the index and check index files / version
> {code:java}
> $ curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/gettingstarted/update?commit=true' -d '{ "delete": {"query":"*:*"} }'
> $ ll solrdata/gettingstarted/data/index
> total 4,0K
> -rw-r--r--. 1 8983 8983 135 16 mars 17:45 segments_5
> -rw-r--r--. 1 8983 8983 0 16 mars 17:36 write.lock
> $ java -jar luceneupgrader-0.6.0.jar info solrdata/gettingstarted/data/index
> Lucene index version: 7
> $ curl 'http://localhost:8983/solr/gettingstarted/update?optimize=true' >
[jira] [Comment Edited] (SOLR-16703) Clearing all documents of an index should delete traces of a previous Lucene version
[ https://issues.apache.org/jira/browse/SOLR-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845135#comment-17845135 ]

Rahul Goswami edited comment on SOLR-16703 at 5/9/24 10:01 PM:
--
I have done a good bit of work in this area and am happy to take this up. Tied up for the next month, but will get to this by end of June/early July 2024.

was (Author: rahul196...@gmail.com): I have done some work in this area and happy to take this up. Tied up for the next one month, but will get to this by end of June/early July 2024.
[jira] [Created] (SOLR-16838) Atomic updates too slow in Solr 8 vs Solr 7
Rahul Goswami created SOLR-16838:

Summary: Atomic updates too slow in Solr 8 vs Solr 7
Key: SOLR-16838
URL: https://issues.apache.org/jira/browse/SOLR-16838
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: SearchComponents - other
Affects Versions: 8.11.1
Reporter: Rahul Goswami

Started experiencing slowness with updates in production after upgrading from Solr 7.7.2 to 8.11.1. Upon comparing the performance, it turns out that indexing 20 million docs via atomic updates through the same client program (running 15 parallel threads indexing in batches of 1000) takes the following times:

Solr 7: 78 mins
Solr 8: 370 mins

Environment details:
- Java 11 on Windows server
- Xms1536m Xmx3072m
- Indexing client code running 15 parallel threads indexing in batches of 1000
- Using SimpleFSDirectoryFactory (since MMap doesn't quite work well on Windows for our index sizes, which commonly run north of 1 TB)

Looking at the thread dump, the bottleneck seems to be RealTimeGet, and I can see that Solr 7 takes a different code path than Solr 8. Note that the performance of regular updates (non-atomic) is still pretty good on Solr 8, completing in < 1 hour for the same 20 million doc data set.

Sharing the indexing code, solrconfig, schema and thread dumps in the link below:
https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing
[jira] [Commented] (SOLR-16838) Atomic updates too slow in Solr 8 vs Solr 7
[ https://issues.apache.org/jira/browse/SOLR-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729551#comment-17729551 ]

Rahul Goswami commented on SOLR-16838:
--
I ran the test to index 5 million docs (batches of 1000 docs in 15 parallel threads). To eliminate the network overhead and get as accurate a benchmark as possible, I used an AtomicLong to measure the time around the RTG call in DistributedUpdateProcessor across all calls ([https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.7.2/solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java#L1416]). Did this for both Solr 7.7.2 and Solr 8.11.1 and built the solr-core.jar to replace it in the solr webapp lib. RTG in Solr 8.x is ~10x slower. Here are the numbers (times are in milliseconds):

*+Solr 7.7.2+*:
2023-06-01 15:39:48.272 WARN (qtp1034094674-24) [ x:techproducts] o.a.s.u.p.LogUpdateProcessorFactory *+Total rtg time: 7293486+*

*+Solr 8.11.1+*:
2023-06-01 04:46:24.758 WARN (qtp391506011-71) [ x:techproducts] o.a.s.u.p.LogUpdateProcessorFactory *+Total rtg time: 72029877+*
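The measurement approach described in the comment above can be sketched as follows. This is illustrative only, not actual Solr code; the class and method names are made up for the example.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the benchmarking approach: an AtomicLong accumulates time spent
// in a hot call across many indexing threads, so a single running total can
// be logged instead of per-call timings.
public class RtgTimingSketch {

    static final AtomicLong totalRtgNanos = new AtomicLong();

    static void timedLookup(Runnable realTimeGetCall) {
        long start = System.nanoTime();
        realTimeGetCall.run(); // stand-in for the RTG / getFirstMatch() call
        totalRtgNanos.addAndGet(System.nanoTime() - start);
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    timedLookup(() -> { });
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("Total rtg time (ns): " + totalRtgNanos.get());
    }
}
```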
[jira] [Commented] (SOLR-16838) Atomic updates too slow in Solr 8 vs Solr 7
[ https://issues.apache.org/jira/browse/SOLR-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729558#comment-17729558 ]

Rahul Goswami commented on SOLR-16838:
--
Running further benchmarks reveals that the slowness is in the searcher.getFirstMatch() call inside getInputDocument(). The call eventually ends up in Lucene's SegmentTermsEnum.seekExact(), which is where the regression seems to be.

*+Solr 7.7.2+*
2023-06-01 21:17:34.492 WARN (qtp1034094674-41) [ x:techproducts] o.a.s.u.p.LogUpdateProcessorFactory RTG timing stats:: tlogFetchTime: 508053 ; *searcherFetchTime: 3229011*

*+Solr 8+*
2023-06-01 20:43:31.767 WARN (qtp391506011-56) [ x:techproducts] o.a.s.u.p.LogUpdateProcessorFactory RTG timing stats:: tlogFetchTime: 410873 ; *searcherFetchTime: 33296008*
[jira] [Comment Edited] (SOLR-16838) Atomic updates too slow in Solr 8 vs Solr 7
[ https://issues.apache.org/jira/browse/SOLR-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729558#comment-17729558 ]

Rahul Goswami edited comment on SOLR-16838 at 6/6/23 3:51 AM:
--
Running further benchmarks (this time for 3 million docs) reveals that the slowness is in the searcher.getFirstMatch() call inside getInputDocument(). The call eventually ends up in Lucene's SegmentTermsEnum.seekExact(), which is where the regression seems to be.

*+Solr 7.7.2+*
2023-06-01 21:17:34.492 WARN (qtp1034094674-41) [ x:techproducts] o.a.s.u.p.LogUpdateProcessorFactory RTG timing stats:: tlogFetchTime: 508053 ; *searcherFetchTime: 3229011*

*+Solr 8+*
2023-06-01 20:43:31.767 WARN (qtp391506011-56) [ x:techproducts] o.a.s.u.p.LogUpdateProcessorFactory RTG timing stats:: tlogFetchTime: 410873 ; *searcherFetchTime: 33296008*

was (Author: rahul196...@gmail.com): Running further benchmarks reveals that the slowness is in the searcher.getFirstMatch() call inside getInputDocument(). The call eventually ends up in Lucene's SegmentTermsEnum.seekExact() which is where the regression seems to be. (Same timing logs as above.)
[jira] [Commented] (SOLR-16838) Atomic updates too slow in Solr 8 vs Solr 7
[ https://issues.apache.org/jira/browse/SOLR-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729921#comment-17729921 ]

Rahul Goswami commented on SOLR-16838:
--
Yes, it has always been commented out. For reproducing the issue, the index has also been deleted multiple times and rebuilt against the same schema. Also made sure the dynamic field "*" doesn't exist either, to eliminate the possibility of \_root\_ being treated as a dynamic field.
[jira] [Comment Edited] (SOLR-16838) Atomic updates too slow in Solr 8 vs Solr 7
[ https://issues.apache.org/jira/browse/SOLR-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729921#comment-17729921 ]

Rahul Goswami edited comment on SOLR-16838 at 6/7/23 4:04 AM:
--
Yes, it has always been commented out. For reproducing the issue, the index has also been deleted multiple times and rebuilt against the same schema. Also made sure the dynamic field "*" doesn't exist either to eliminate the possibility of _root_ being treated as a dynamic field.

was (Author: rahul196...@gmail.com): Yes, it has always been commented out. For reproducing the issue, the index has also been deleted multiple times and rebuilt against the same schema. Also made sure the dynamic field "*" doesn't exist either to eliminate the possibility of _{_}root_{_} being treated as a dynamic field.
[jira] [Comment Edited] (SOLR-16838) Atomic updates too slow in Solr 8 vs Solr 7
[ https://issues.apache.org/jira/browse/SOLR-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729921#comment-17729921 ]

Rahul Goswami edited comment on SOLR-16838 at 6/7/23 4:05 AM:
--
Yes, it has always been commented out. For reproducing the issue, the index has also been deleted multiple times and rebuilt against the same schema. Also made sure the dynamic field "*" doesn't exist either to eliminate the possibility of _root_ being treated as a dynamic field.

was (Author: rahul196...@gmail.com): Yes, it has always been commented out. For reproducing the issue, the index has also been deleted multiple times and rebuilt against the same schema. Also made sure the dynamic field "*" doesn't exist either to eliminate the possibility of _root_ being treated as a dynamic field.
[jira] [Commented] (SOLR-16838) Atomic updates too slow in Solr 8 vs Solr 7
[ https://issues.apache.org/jira/browse/SOLR-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730247#comment-17730247 ] Rahul Goswami commented on SOLR-16838: -- The regression seems to be in the Lucene layer. Quoting the discussion on this issue on the Lucene list: " - 8.0 moved the terms index off-heap for non-PK fields with MMapDirectory. [https://github.com/apache/lucene/issues/9681] - Then in 8.6 the FST was moved off-heap all the time. [https://github.com/apache/lucene/issues/10297]"; So now the terms index is off-heap, and due to Lucene's FST reading bytes backwards readByte() call causes disk access for every single byte . The below tickets have been opened by Adrien Grand on the issue for further discussion: [https://github.com/apache/lucene/issues/12355] and [https://github.com/apache/lucene/issues/12356]. > Atomic updates too slow in Solr 8 vs Solr 7 > --- > > Key: SOLR-16838 > URL: https://issues.apache.org/jira/browse/SOLR-16838 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SearchComponents - other >Affects Versions: 8.11.1 >Reporter: Rahul Goswami >Priority: Major > Labels: RTG, RealTimeGet, atomicupdate > > Started experiencing slowness with updates in production after upgrading from > Solr 7.7.2 to 8.11.1. 
Upon comparing the performance it turns out that > indexing 20 million docs via atomic updates through the same client program > (running 15 parallel threads indexing in batches of 1000) takes below time: > > Solr 7 : 78 mins > Solr 8: 370 mins > > Environment details: > - Java 11 on Windows server > - Xms1536m Xmx3072m > - Indexing client code running 15 parallel threads indexing in batches of 1000 > - using SimpleFSDirectoryFactory (since Mmap doesn't quite work well on > Windows for our index sizes which commonly run north of 1 TB) > > Looking at the thread dump, the bottleneck seems to be RealTimeGet and I can > see that Solr 7 takes a different code path than Solr 8. Note that the > performance of regular updates (non-atomic) is still pretty good on Solr 8 > completing in < 1 hour for the same 20 million data set. > > Sharing the indexing code, solrconfig, schema and thread dumps in the link > below: > [https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
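The regression described above hinges on Lucene's FSTs being read backwards through their byte array, which defeats a forward-filling read buffer once the terms index lives off-heap. A minimal plain-Java sketch of that access pattern (this is illustrative only, not Lucene source; the 1 KB buffer, the refill counter as a stand-in for disk accesses, and all names are assumptions):

```java
// Illustrative sketch: a forward-filling 1 KB buffer serving byte reads.
// A forward scan mostly hits the buffer; a backward scan (as in the FST
// traversal discussed above) misses on every byte, forcing a refill
// ("disk access") per byte. Not Lucene code.
public class BackwardReadDemo {
    static final int BUF = 1024;
    final byte[] file;          // simulated on-disk bytes
    final byte[] buf = new byte[BUF];
    long bufStart = -1;         // file offset the buffer currently covers
    int refills = 0;            // proxy for disk accesses

    BackwardReadDemo(int size) { file = new byte[size]; }

    byte readByte(long pos) {
        if (bufStart < 0 || pos < bufStart || pos >= bufStart + BUF) {
            bufStart = pos;     // buffer fills forward starting at pos
            int len = (int) Math.min((long) BUF, file.length - pos);
            System.arraycopy(file, (int) pos, buf, 0, len);
            refills++;
        }
        return buf[(int) (pos - bufStart)];
    }

    public static void main(String[] args) {
        BackwardReadDemo fwd = new BackwardReadDemo(8192);
        for (long p = 0; p < 8192; p++) fwd.readByte(p);   // forward scan
        BackwardReadDemo bwd = new BackwardReadDemo(8192);
        for (long p = 8191; p >= 0; p--) bwd.readByte(p);  // backward scan
        System.out.println("forward refills=" + fwd.refills
                + " backward refills=" + bwd.refills);     // 8 vs 8192
    }
}
```

With a forward-filling buffer the backward scan pays one refill per byte (8192 refills for 8 KB) versus 8 for the forward scan, which is the shape of the slowdown the linked Lucene issues (12355/12356) discuss.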
[jira] [Comment Edited] (SOLR-16838) Atomic updates too slow in Solr 8 vs Solr 7
[ https://issues.apache.org/jira/browse/SOLR-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730247#comment-17730247 ] Rahul Goswami edited comment on SOLR-16838 at 6/7/23 6:41 PM: -- The regression seems to be in the Lucene layer. Quoting the discussion on this issue on the Lucene list: " - 8.0 moved the terms index off-heap for non-PK fields with MMapDirectory. [https://github.com/apache/lucene/issues/9681] - Then in 8.6 the FST was moved off-heap all the time. [https://github.com/apache/lucene/issues/10297]"; So now the terms index is off-heap, and due to Lucene's FST reading bytes backwards readByte() call causes disk access for every 1kB of buffer. The below tickets have been opened by Adrien Grand on the issue for further discussion: [https://github.com/apache/lucene/issues/12355] and [https://github.com/apache/lucene/issues/12356]. was (Author: rahul196...@gmail.com): The regression seems to be in the Lucene layer. Quoting the discussion on this issue on the Lucene list: " - 8.0 moved the terms index off-heap for non-PK fields with MMapDirectory. [https://github.com/apache/lucene/issues/9681] - Then in 8.6 the FST was moved off-heap all the time. [https://github.com/apache/lucene/issues/10297]"; So now the terms index is off-heap, and due to Lucene's FST reading bytes backwards readByte() call causes disk access for every single byte . The below tickets have been opened by Adrien Grand on the issue for further discussion: [https://github.com/apache/lucene/issues/12355] and [https://github.com/apache/lucene/issues/12356]. > Atomic updates too slow in Solr 8 vs Solr 7 > --- > > Key: SOLR-16838 > URL: https://issues.apache.org/jira/browse/SOLR-16838 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. 
Issues are Public) > Components: SearchComponents - other >Affects Versions: 8.11.1 >Reporter: Rahul Goswami >Priority: Major > Labels: RTG, RealTimeGet, atomicupdate > > Started experiencing slowness with updates in production after upgrading from > Solr 7.7.2 to 8.11.1. Upon comparing the performance it turns out that > indexing 20 million docs via atomic updates through the same client program > (running 15 parallel threads indexing in batches of 1000) takes below time: > > Solr 7 : 78 mins > Solr 8: 370 mins > > Environment details: > - Java 11 on Windows server > - Xms1536m Xmx3072m > - Indexing client code running 15 parallel threads indexing in batches of 1000 > - using SimpleFSDirectoryFactory (since Mmap doesn't quite work well on > Windows for our index sizes which commonly run north of 1 TB) > > Looking at the thread dump, the bottleneck seems to be RealTimeGet and I can > see that Solr 7 takes a different code path than Solr 8. Note that the > performance of regular updates (non-atomic) is still pretty good on Solr 8 > completing in < 1 hour for the same 20 million data set. > > Sharing the indexing code, solrconfig, schema and thread dumps in the link > below: > [https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Comment Edited] (SOLR-16838) Atomic updates too slow in Solr 8 vs Solr 7
[ https://issues.apache.org/jira/browse/SOLR-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730247#comment-17730247 ] Rahul Goswami edited comment on SOLR-16838 at 6/7/23 6:42 PM: -- The regression seems to be in the Lucene layer. Quoting the discussion on this issue on the Lucene list: " - 8.0 moved the terms index off-heap for non-PK fields with MMapDirectory. [https://github.com/apache/lucene/issues/9681] - Then in 8.6 the FST was moved off-heap all the time. [https://github.com/apache/lucene/issues/10297]"; So now the terms index is off-heap, and due to Lucene's FST reading bytes backwards readByte() call causes disk access for every single byte read. The below tickets have been opened by Adrien Grand on the issue for further discussion: [https://github.com/apache/lucene/issues/12355] and [https://github.com/apache/lucene/issues/12356]. was (Author: rahul196...@gmail.com): The regression seems to be in the Lucene layer. Quoting the discussion on this issue on the Lucene list: " - 8.0 moved the terms index off-heap for non-PK fields with MMapDirectory. [https://github.com/apache/lucene/issues/9681] - Then in 8.6 the FST was moved off-heap all the time. [https://github.com/apache/lucene/issues/10297]"; So now the terms index is off-heap, and due to Lucene's FST reading bytes backwards readByte() call causes disk access for every 1kB of buffer. The below tickets have been opened by Adrien Grand on the issue for further discussion: [https://github.com/apache/lucene/issues/12355] and [https://github.com/apache/lucene/issues/12356]. > Atomic updates too slow in Solr 8 vs Solr 7 > --- > > Key: SOLR-16838 > URL: https://issues.apache.org/jira/browse/SOLR-16838 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. 
Issues are Public) > Components: SearchComponents - other >Affects Versions: 8.11.1 >Reporter: Rahul Goswami >Priority: Major > Labels: RTG, RealTimeGet, atomicupdate > > Started experiencing slowness with updates in production after upgrading from > Solr 7.7.2 to 8.11.1. Upon comparing the performance it turns out that > indexing 20 million docs via atomic updates through the same client program > (running 15 parallel threads indexing in batches of 1000) takes below time: > > Solr 7 : 78 mins > Solr 8: 370 mins > > Environment details: > - Java 11 on Windows server > - Xms1536m Xmx3072m > - Indexing client code running 15 parallel threads indexing in batches of 1000 > - using SimpleFSDirectoryFactory (since Mmap doesn't quite work well on > Windows for our index sizes which commonly run north of 1 TB) > > Looking at the thread dump, the bottleneck seems to be RealTimeGet and I can > see that Solr 7 takes a different code path than Solr 8. Note that the > performance of regular updates (non-atomic) is still pretty good on Solr 8 > completing in < 1 hour for the same 20 million data set. > > Sharing the indexing code, solrconfig, schema and thread dumps in the link > below: > [https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Comment Edited] (SOLR-16838) Atomic updates too slow in Solr 8 vs Solr 7
[ https://issues.apache.org/jira/browse/SOLR-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730247#comment-17730247 ] Rahul Goswami edited comment on SOLR-16838 at 6/7/23 9:32 PM: -- The regression seems to be in the Lucene layer. Quoting the discussion on this issue on the Lucene list (https://lists.apache.org/thread/1fskhmz84pp60o41txsxj2193vt9txod): " - 8.0 moved the terms index off-heap for non-PK fields with MMapDirectory. [https://github.com/apache/lucene/issues/9681] - Then in 8.6 the FST was moved off-heap all the time. [https://github.com/apache/lucene/issues/10297]"; So now the terms index is off-heap, and due to Lucene's FST reading bytes backwards readByte() call causes disk access for every single byte read. The below tickets have been opened by Adrien Grand on the issue for further discussion: [https://github.com/apache/lucene/issues/12355] and [https://github.com/apache/lucene/issues/12356]. was (Author: rahul196...@gmail.com): The regression seems to be in the Lucene layer. Quoting the discussion on this issue on the Lucene list: " - 8.0 moved the terms index off-heap for non-PK fields with MMapDirectory. [https://github.com/apache/lucene/issues/9681] - Then in 8.6 the FST was moved off-heap all the time. [https://github.com/apache/lucene/issues/10297]"; So now the terms index is off-heap, and due to Lucene's FST reading bytes backwards readByte() call causes disk access for every single byte read. The below tickets have been opened by Adrien Grand on the issue for further discussion: [https://github.com/apache/lucene/issues/12355] and [https://github.com/apache/lucene/issues/12356]. > Atomic updates too slow in Solr 8 vs Solr 7 > --- > > Key: SOLR-16838 > URL: https://issues.apache.org/jira/browse/SOLR-16838 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. 
Issues are Public) > Components: SearchComponents - other >Affects Versions: 8.11.1 >Reporter: Rahul Goswami >Priority: Major > Labels: RTG, RealTimeGet, atomicupdate > > Started experiencing slowness with updates in production after upgrading from > Solr 7.7.2 to 8.11.1. Upon comparing the performance it turns out that > indexing 20 million docs via atomic updates through the same client program > (running 15 parallel threads indexing in batches of 1000) takes below time: > > Solr 7 : 78 mins > Solr 8: 370 mins > > Environment details: > - Java 11 on Windows server > - Xms1536m Xmx3072m > - Indexing client code running 15 parallel threads indexing in batches of 1000 > - using SimpleFSDirectoryFactory (since Mmap doesn't quite work well on > Windows for our index sizes which commonly run north of 1 TB) > > Looking at the thread dump, the bottleneck seems to be RealTimeGet and I can > see that Solr 7 takes a different code path than Solr 8. Note that the > performance of regular updates (non-atomic) is still pretty good on Solr 8 > completing in < 1 hour for the same 20 million data set. > > Sharing the indexing code, solrconfig, schema and thread dumps in the link > below: > [https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-16838) Atomic updates too slow in Solr 8 vs Solr 7
[ https://issues.apache.org/jira/browse/SOLR-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730680#comment-17730680 ] Rahul Goswami commented on SOLR-16838: -- Not sure about Mmap, but NIOFSDirectory also has similar regression. > Atomic updates too slow in Solr 8 vs Solr 7 > --- > > Key: SOLR-16838 > URL: https://issues.apache.org/jira/browse/SOLR-16838 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SearchComponents - other >Affects Versions: 8.11.1 >Reporter: Rahul Goswami >Priority: Major > Labels: RTG, RealTimeGet, atomicupdate > > Started experiencing slowness with updates in production after upgrading from > Solr 7.7.2 to 8.11.1. Upon comparing the performance it turns out that > indexing 20 million docs via atomic updates through the same client program > (running 15 parallel threads indexing in batches of 1000) takes below time: > > Solr 7 : 78 mins > Solr 8: 370 mins > > Environment details: > - Java 11 on Windows server > - Xms1536m Xmx3072m > - Indexing client code running 15 parallel threads indexing in batches of 1000 > - using SimpleFSDirectoryFactory (since Mmap doesn't quite work well on > Windows for our index sizes which commonly run north of 1 TB) > > Looking at the thread dump, the bottleneck seems to be RealTimeGet and I can > see that Solr 7 takes a different code path than Solr 8. Note that the > performance of regular updates (non-atomic) is still pretty good on Solr 8 > completing in < 1 hour for the same 20 million data set. > > Sharing the indexing code, solrconfig, schema and thread dumps in the link > below: > [https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-16838) Atomic updates too slow in Solr 8 vs Solr 7
[ https://issues.apache.org/jira/browse/SOLR-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742498#comment-17742498 ] Rahul Goswami commented on SOLR-16838: -- Missed the last couple of comments, sorry! [~janhoy] I backported the Lucene fix to 8.11.1 which is the version I have been testing on and found a dramatic improvement in performance. For a 20 million dataset, indexing in 15 parallel threads in batches of 1000, here are the before and after fix times: Before fix: 370 mins After fix: 65 mins Note that this performance on an average is still tad slower than 7.7.2 across multiple runs, but I guess that can be attributed to the fact that the terms index is no longer loaded on-heap as of Lucene 8.6 (https://github.com/apache/lucene/issues/10297). > Atomic updates too slow in Solr 8 vs Solr 7 > --- > > Key: SOLR-16838 > URL: https://issues.apache.org/jira/browse/SOLR-16838 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SearchComponents - other >Affects Versions: 8.11.1 >Reporter: Rahul Goswami >Priority: Major > Labels: RTG, RealTimeGet, atomicupdate > > Started experiencing slowness with updates in production after upgrading from > Solr 7.7.2 to 8.11.1. Upon comparing the performance it turns out that > indexing 20 million docs via atomic updates through the same client program > (running 15 parallel threads indexing in batches of 1000) takes below time: > > Solr 7 : 78 mins > Solr 8: 370 mins > > Environment details: > - Java 11 on Windows server > - Xms1536m Xmx3072m > - Indexing client code running 15 parallel threads indexing in batches of 1000 > - using SimpleFSDirectoryFactory (since Mmap doesn't quite work well on > Windows for our index sizes which commonly run north of 1 TB) > > Looking at the thread dump, the bottleneck seems to be RealTimeGet and I can > see that Solr 7 takes a different code path than Solr 8. 
Note that the > performance of regular updates (non-atomic) is still pretty good on Solr 8 > completing in < 1 hour for the same 20 million data set. > > Sharing the indexing code, solrconfig, schema and thread dumps in the link > below: > [https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-16838) Atomic updates too slow in Solr 8 vs Solr 7
[ https://issues.apache.org/jira/browse/SOLR-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742504#comment-17742504 ] Rahul Goswami commented on SOLR-16838: -- [~elyograg] In some scenarios, the problem with Mmap becomes more operational than technical. For a deployment in a customer setting, the customer hits cost concerns with providing enough RAM for MMap (on multiple nodes) to work effectively. With SimpleFS/NIOFs with sufficient optimizations, we are able to run multiple TB indexes effectively on a 64 GB box with 31 GB heap. Even though I agree that MMap works more efficiently on Linux than Windows, it would still not work efficiently under similar memory constraints. > Atomic updates too slow in Solr 8 vs Solr 7 > --- > > Key: SOLR-16838 > URL: https://issues.apache.org/jira/browse/SOLR-16838 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SearchComponents - other >Affects Versions: 8.11.1 >Reporter: Rahul Goswami >Priority: Major > Labels: RTG, RealTimeGet, atomicupdate > > Started experiencing slowness with updates in production after upgrading from > Solr 7.7.2 to 8.11.1. Upon comparing the performance it turns out that > indexing 20 million docs via atomic updates through the same client program > (running 15 parallel threads indexing in batches of 1000) takes below time: > > Solr 7 : 78 mins > Solr 8: 370 mins > > Environment details: > - Java 11 on Windows server > - Xms1536m Xmx3072m > - Indexing client code running 15 parallel threads indexing in batches of 1000 > - using SimpleFSDirectoryFactory (since Mmap doesn't quite work well on > Windows for our index sizes which commonly run north of 1 TB) > > Looking at the thread dump, the bottleneck seems to be RealTimeGet and I can > see that Solr 7 takes a different code path than Solr 8. 
Note that the > performance of regular updates (non-atomic) is still pretty good on Solr 8 > completing in < 1 hour for the same 20 million data set. > > Sharing the indexing code, solrconfig, schema and thread dumps in the link > below: > [https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-16360) Atomic update on boolean fields doesn't reflect when value starts with "1", "t" or "T"
[ https://issues.apache.org/jira/browse/SOLR-16360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750535#comment-17750535 ] Rahul Goswami commented on SOLR-16360: -- RCA: during a regular (non-atomic) update, the toInternal() method gets called to interpret the value and hence the documented behavior is observed. However during atomic update, the toNativeType() method gets called which doesn't check for the first character of value, thereby breaking the behavior. > Atomic update on boolean fields doesn't reflect when value starts with "1", > "t" or "T" > -- > > Key: SOLR-16360 > URL: https://issues.apache.org/jira/browse/SOLR-16360 > Project: Solr > Issue Type: Bug >Affects Versions: 8.11 >Reporter: Rahul Goswami >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > I am running Solr 8.11. As per the Solr documentation, any value starting > with "1","t" or "T" for a boolean field is interpreted as true. > > [https://solr.apache.org/guide/8_11/field-types-included-with-solr.html#recommended-field-types] > > However, I hit a potential Solr bug where if the String value "1","t" or "T" > is passed in an atomic update, it is treated as false. > > //Eg:Below document is indexed first => query returns "inStock" as true (as > expected) > { > "id":"test", > "inStock":"true" > } > > //Follow above update with below atomic update and commit. => inStock becomes > false in query result > { > "id":"test", > "inStock":\{"set":"1"} > } > > This doesn't happen though if value "1" is passed in a regular update. > Eg:Below update reflects the value of inStock as true when queried. > { > "id":"test", > "inStock":"1" > } -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
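The RCA above can be illustrated with simplified stand-ins for the two code paths (these are not the actual org.apache.solr.schema.BoolField sources; the method names and logic here are assumptions modeling the described behavior): a toInternal()-style rule that inspects only the first character, versus a toNativeType()-style path that defers to Boolean.parseBoolean(), which accepts only a case-insensitive "true" and therefore maps "1", "t", and "T" to false.

```java
// Simplified stand-ins for the two parsing paths described in the RCA
// (illustrative only; not the real BoolField implementation).
public class BoolParseDemo {
    // Regular-update style: documented rule, first char '1'/'t'/'T' => true
    static boolean firstCharRule(String val) {
        char c = val.isEmpty() ? ' ' : val.charAt(0);
        return c == '1' || c == 't' || c == 'T';
    }

    // Atomic-update style: defers to Boolean.parseBoolean, which returns
    // true only for a case-insensitive "true"
    static boolean parseBooleanRule(String val) {
        return Boolean.parseBoolean(val);
    }

    public static void main(String[] args) {
        System.out.println("firstCharRule(\"1\")    = " + firstCharRule("1"));    // true
        System.out.println("parseBooleanRule(\"1\") = " + parseBooleanRule("1")); // false
    }
}
```

The divergence for the value "1" matches the reported symptom: a regular update indexes inStock as true while the atomic {"set":"1"} flips it to false.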
[jira] [Commented] (SOLR-17359) Make SolrCLI handle arg parsing of zk sub commands
[ https://issues.apache.org/jira/browse/SOLR-17359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873592#comment-17873592 ] Rahul Goswami commented on SOLR-17359: -- Thanks for your work on this Eric. This was a tedious effort! > Make SolrCLI handle arg parsing of zk sub commands > -- > > Key: SOLR-17359 > URL: https://issues.apache.org/jira/browse/SOLR-17359 > Project: Solr > Issue Type: Sub-task > Components: scripts and tools >Reporter: Jan Høydahl >Priority: Major > Labels: pull-request-available > Fix For: 9.7 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > Both bin/solr and bin/solr.cmd have lots of shell code to parse the zk sub > commands, and to print the usage text. We have both a short zk uage text and > the full one. > {code:java} > Usage: solr zk upconfig|downconfig -d -n [-z zkHost] > [-s solrUrl]" >solr zk cp [-r] [-z zkHost] [-s solrUrl]" >solr zk rm [-r] [-z zkHost] [-s solrUrl]" >solr zk mv [-z zkHost] [-s solrUrl]" >solr zk ls [-r] [-z zkHost] [-s solrUrl]" >solr zk mkroot [-z zkHost] [-s solrUrl]" >solr zk linkconfig --conf-name -c [-z zkHost] > [-s solrUrl]" >solr zk updateacls [-z zkHost] [-s solrUrl]" {code} > Extend SolrCLI and tools API to handle sub commands more natively so that > doing {{solr zk -h}} shows a list of sub commands, while `solr zk cp -h` > shows usage for that sub command. > I think commons-cli does not have native subcommand support like e.g. > picocli, but it should be possible to implement.. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-7962) Passing additional arguments to solr.cmd using "-a" does not work on Windows
[ https://issues.apache.org/jira/browse/SOLR-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17905719#comment-17905719 ] Rahul Goswami commented on SOLR-7962: - I’ll try this and report back > Passing additional arguments to solr.cmd using "-a" does not work on Windows > > > Key: SOLR-7962 > URL: https://issues.apache.org/jira/browse/SOLR-7962 > Project: Solr > Issue Type: Bug >Affects Versions: 5.3 >Reporter: Dawid Weiss >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Updated] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rahul Goswami updated SOLR-17725: - Description: Today upgrading from Solr version X to X+2 requires complete reingestion of data from source. This comes from Lucene's constraint which only guarantees index compatibility between the version the index was created in and the immediate next version. This reindexing usually comes with added downtime and/or cost. Especially in case of deployments which are in customer environments and not completely in control of the vendor, this proposition of having to completely reindex the data can become a hard sell. We at Commvault have developed a way which achieves this reindexing in-place on the same index. Also, the process automatically keeps "upgrading" the indexes over multiple subsequent Solr upgrades without needing manual intervention. It comes with the following limitations: i) All _source_ fields need to be either stored=true or docValues=true. Any copyField destination fields can be stored=false of course, just that the source fields (or more precisely, the source fields you care about preserving) should be either stored or docValues true. ii) The datatype of an existing field in schema.xml shouldn't change upon Solr upgrade. Introducing new fields is fine. For indexes where this limitation is not a problem (it wasn't for us!), the tool can reindex in-place on the same core with zero downtime and legitimately "upgrade" the index. This can remove a lot of operational headaches, especially in environments with hundreds/thousands of very large indexes. was: Today upgrading from Solr version X to X+2 requires complete reingestion of data from source. This comes from Lucene's constraint which only guarantees index compatibility between the version the index was created in and the immediate next version. This reindexing usually comes with added downtime and/or cost. 
Especially in case of deployments which are in customer environments and not completely in control of the vendor, this proposition of having to completely reindex the data can become a hard sell. I have developed a way which achieves this reindexing in-place on the same index. Also, the process automatically keeps "upgrading" the indexes over multiple subsequent Solr upgrades without needing manual intervention. It comes with the following limitations: i) All _source_ fields need to be either stored=true or docValues=true. Any copyField destination fields can be stored=false of course, just that the source fields (or more precisely, the source fields you care about preserving) should be either stored or docValues true. ii) The datatype of an existing field in schema.xml shouldn't change upon Solr upgrade. Introducing new fields is fine. For indexes where this limitation is not a problem (it wasn't for us!), the tool can reindex in-place on the same core with zero downtime and legitimately "upgrade" the index. This can remove a lot of operational headaches, especially in environments with hundreds/thousands of very large indexes. > Automatically upgrade Solr indexes without needing to reindex from source > - > > Key: SOLR-17725 > URL: https://issues.apache.org/jira/browse/SOLR-17725 > Project: Solr > Issue Type: Improvement >Reporter: Rahul Goswami >Priority: Major > > Today upgrading from Solr version X to X+2 requires complete reingestion of > data from source. This comes from Lucene's constraint which only guarantees > index compatibility between the version the index was created in and the > immediate next version. > This reindexing usually comes with added downtime and/or cost. Especially in > case of deployments which are in customer environments and not completely in > control of the vendor, this proposition of having to completely reindex the > data can become a hard sell. 
> We at Commvault have developed a way which achieves this reindexing in-place > on the same index. Also, the process automatically keeps "upgrading" the > indexes over multiple subsequent Solr upgrades without needing manual > intervention. > It comes with the following limitations: > i) All _source_ fields need to be either stored=true or docValues=true. Any > copyField destination fields can be stored=false of course, just that the > source fields (or more precisely, the source fields you care about > preserving) should be either stored or docValues true. > ii) The datatype of an existing field in schema.xml shouldn't change upon > Solr upgrade. Introducing new fields is fine. > For indexes where this limitation is not a problem (it wasn't for us!), the > tool can reindex in-place on the same core with zero downtime and > legitimately "upgrade" the index. This can remove a lot of operational > headaches, especially in environments with hundreds/thousands of very large > indexes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Updated] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rahul Goswami updated SOLR-17725: - Attachment: High Level Design.png > Automatically upgrade Solr indexes without needing to reindex from source > - > > Key: SOLR-17725 > URL: https://issues.apache.org/jira/browse/SOLR-17725 > Project: Solr > Issue Type: Improvement >Reporter: Rahul Goswami >Priority: Major > Attachments: High Level Design.png > > > Today upgrading from Solr version X to X+2 requires complete reingestion of > data from source. This comes from Lucene's constraint which only guarantees > index compatibility between the version the index was created in and the > immediate next version. > This reindexing usually comes with added downtime and/or cost. Especially in > case of deployments which are in customer environments and not completely in > control of the vendor, this proposition of having to completely reindex the > data can become a hard sell. > I, on behalf of my employer, Commvault, have developed a way which achieves > this reindexing in-place on the same index. Also, the process automatically > keeps "upgrading" the indexes over multiple subsequent Solr upgrades without > needing manual intervention. > It comes with the following limitations: > i) All _source_ fields need to be either stored=true or docValues=true. Any > copyField destination fields can be stored=false of course, just that the > source fields (or more precisely, the source fields you care about > preserving) should be either stored or docValues true. > ii) The datatype of an existing field in schema.xml shouldn't change upon > Solr upgrade. Introducing new fields is fine. > For indexes where this limitation is not a problem (it wasn't for us!), the > tool can reindex in-place on the same core with zero downtime and > legitimately "upgrade" the index. 
This can remove a lot of operational > headaches, especially in environments with hundreds/thousands of very large > indexes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
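Limitation (i) above can be illustrated with a hypothetical schema.xml fragment (the field names here are invented for illustration only): source fields keep stored=true or docValues=true so their values can be recovered for reindexing, while copyField destinations may drop stored since they are rebuilt during the in-place reindex.

```xml
<!-- Hypothetical fragment illustrating limitation (i); field names are invented. -->
<!-- Source fields: values must be recoverable, so stored=true or docValues=true. -->
<field name="title" type="string"  indexed="true" stored="true"/>
<field name="price" type="pdouble" indexed="true" stored="false" docValues="true"/>

<!-- copyField destination: stored="false" is fine, it is repopulated on reindex. -->
<field name="all_text" type="text_general" indexed="true" stored="false"/>
<copyField source="title" dest="all_text"/>
```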
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941691#comment-17941691 ] Rahul Goswami commented on SOLR-17725: -- [~ab] For those running SolrCloud AND having enough capacity in terms of infrastructure and budget, the REINDEXCOLLECTION command is a good option. I see that it reindexes onto a parallel collection, so for clusters with hundreds/thousands of large indexes that cost can be substantial. Also, the source collection is put in read-only mode while the reindexing happens, so it can be a point of contention in environments that are more update-heavy than search-heavy (e.g., for us at Commvault). By means of this Jira I am attempting to overcome the Lucene limitation which forces you to reindex from source when you really don't HAVE to. At least I would like to offer that option to users who are more cost-sensitive or operationally sensitive (e.g., solutions which package Solr as part of the application and are installed/deployed on customer sites; it can be awkward to reason with customers as to why a solution upgrade may need a downtime if it involves a Solr upgrade). The proposed solution reindexes into the same core, can be easily adapted to work with both standalone Solr and SolrCloud, and allows both updates and searches to be served while doing so. This also helps remove additional operational overhead, since users can focus on just the Solr upgrade without having to worry about index compatibility.
[jira] [Created] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
Rahul Goswami created SOLR-17725: Summary: Automatically upgrade Solr indexes without needing to reindex from source Key: SOLR-17725 URL: https://issues.apache.org/jira/browse/SOLR-17725 Project: Solr Issue Type: Improvement Reporter: Rahul Goswami Today upgrading from Solr version X to X+2 requires complete reingestion of data from source. This comes from Lucene's constraint which only guarantees index compatibility between the version the index was created in and the immediate next version. This reindexing usually comes with added downtime and/or cost. Especially in case of deployments which are in customer environments and not completely in control of the vendor, this proposition of having to completely reindex the data can become a hard sell. I have developed a way which achieves this reindexing in-place on the same index. Also, the process automatically keeps "upgrading" the indexes over multiple subsequent Solr upgrades without needing manual intervention. It comes with the following limitations: i) All _source_ fields need to be either stored=true or docValues=true. Any copyField destination fields can be stored=false of course, just that the source fields (or more precisely, the source fields you care about preserving) should be either stored or docValues true. ii) The datatype of an existing field in schema.xml shouldn't change upon Solr upgrade. Introducing new fields is fine. For indexes where this limitation is not a problem (it wasn't for us!), the tool can reindex in-place on the same core with zero downtime and legitimately "upgrade" the index. This can remove a lot of operational headaches, especially in environments with hundreds/thousands of very large indexes.
[jira] [Updated] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rahul Goswami updated SOLR-17725: - Description: Today upgrading from Solr version X to X+2 requires complete reingestion of data from source. This comes from Lucene's constraint which only guarantees index compatibility between the version the index was created in and the immediate next version. This reindexing usually comes with added downtime and/or cost. Especially in case of deployments which are in customer environments and not completely in control of the vendor, this proposition of having to completely reindex the data can become a hard sell. I, on behalf of my employer, Commvault, have developed a way which achieves this reindexing in-place on the same index. Also, the process automatically keeps "upgrading" the indexes over multiple subsequent Solr upgrades without needing manual intervention. It comes with the following limitations: i) All _source_ fields need to be either stored=true or docValues=true. Any copyField destination fields can be stored=false of course, just that the source fields (or more precisely, the source fields you care about preserving) should be either stored or docValues true. ii) The datatype of an existing field in schema.xml shouldn't change upon Solr upgrade. Introducing new fields is fine. For indexes where this limitation is not a problem (it wasn't for us!), the tool can reindex in-place on the same core with zero downtime and legitimately "upgrade" the index. This can remove a lot of operational headaches, especially in environments with hundreds/thousands of very large indexes. was: Today upgrading from Solr version X to X+2 requires complete reingestion of data from source. This comes from Lucene's constraint which only guarantees index compatibility between the version the index was created in and the immediate next version. This reindexing usually comes with added downtime and/or cost. 
Especially in case of deployments which are in customer environments and not completely in control of the vendor, this proposition of having to completely reindex the data can become a hard sell. We at Commvault have developed a way which achieves this reindexing in-place on the same index. Also, the process automatically keeps "upgrading" the indexes over multiple subsequent Solr upgrades without needing manual intervention. It comes with the following limitations: i) All _source_ fields need to be either stored=true or docValues=true. Any copyField destination fields can be stored=false of course, just that the source fields (or more precisely, the source fields you care about preserving) should be either stored or docValues true. ii) The datatype of an existing field in schema.xml shouldn't change upon Solr upgrade. Introducing new fields is fine. For indexes where this limitation is not a problem (it wasn't for us!), the tool can reindex in-place on the same core with zero downtime and legitimately "upgrade" the index. This can remove a lot of operational headaches, especially in environments with hundreds/thousands of very large indexes.
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17940243#comment-17940243 ] Rahul Goswami commented on SOLR-17725: -- The attached document outlines an example where the upgrade tool works on an index originally created in Solr 7.x, AFTER an upgrade to Solr 8.x. Key points: 1) Lucene version X can read an index created in version X-1. New segments are written with the latest version's codec. 2) When a segment merge happens, the merged segment maintains a version stamp, "minVersion", which is the lowest version among the segments participating in the merge. 3) The segments_* file in a Lucene index records the Lucene version in which the index was first created. The design doc outlines the process of converting all segments to the new version. It's sort of a pull model where you first upgrade and then "pull" the index to the current version. By the end of the process outlined in the doc, all segments are converted to the new version and the index is in all respects an "upgraded" index. The only missing piece is updating the index creation version in the commit point. I did this by exposing a method in Lucene's CommitInfos which validates the version of all segments and updates the creation version stamp in the commit point (we might need to request an API from Lucene here). When this index is later opened in Solr 9.x, it can be read (thanks to point #1) and the same process repeats to make the index ready for Solr 10.x.
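The version-"pull" model described in points 1-3 can be sketched as a toy simulation. This is purely illustrative Python, not Solr or Lucene code; every name here (Segment, Index, the functions) is invented for the sketch, and the compatibility rule is the one stated in point 1.

```python
# Toy model of the in-place upgrade loop: all names are invented for
# illustration; this is NOT Lucene or Solr code.
from dataclasses import dataclass

@dataclass
class Segment:
    version: int       # major Lucene version whose codec wrote the segment
    min_version: int   # lowest version among segments merged into it

@dataclass
class Index:
    created_major: int # creation-version stamp kept in the segments_* file
    segments: list

def can_open(index: Index, lucene_major: int) -> bool:
    # Point 1: Lucene only guarantees reading indexes created in the
    # immediately previous major version.
    return lucene_major - index.created_major <= 1

def upgrade_in_place(index: Index, lucene_major: int) -> Index:
    if not can_open(index, lucene_major):
        raise ValueError("must upgrade through intermediate major versions")
    # Reindex every segment so it is rewritten with the current codec ...
    index.segments = [Segment(lucene_major, lucene_major) for _ in index.segments]
    # ... and only once EVERY segment is fully at the new version, bump the
    # creation stamp in the commit point (the step that needs a Lucene API).
    if all(s.version == s.min_version == lucene_major for s in index.segments):
        index.created_major = lucene_major
    return index
```

Repeating `upgrade_in_place` at each major version keeps the index perpetually one hop behind the newest Lucene, which is exactly what makes the X → X+1 → X+2 path automatic.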
[jira] [Comment Edited] (SOLR-16703) Clearing all documents of an index should delete traces of a previous Lucene version
[ https://issues.apache.org/jira/browse/SOLR-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941685#comment-17941685 ] Rahul Goswami edited comment on SOLR-16703 at 4/7/25 6:43 PM: -- [~gjourdan] The effort is underway as part of https://issues.apache.org/jira/browse/SOLR-17725. The solution for the specific requirement in this Jira requires a change from Lucene folks to update the version in CommitInfos. We'll request an API to that effect as part of the above mentioned JIRA. was (Author: rahul196...@gmail.com): [~gjourdan] The effort is underway on https://issues.apache.org/jira/browse/SOLR-17725 > Clearing all documents of an index should delete traces of a previous Lucene > version > > > Key: SOLR-16703 > URL: https://issues.apache.org/jira/browse/SOLR-16703 > Project: Solr > Issue Type: Improvement >Affects Versions: 7.6, 8.11.2, 9.1.1 >Reporter: Gaël Jourdan >Priority: Major > > _This is a ticket following a discussion on Slack with_ [~elyograg] _and_ > [~wunder] _especially._ > h1. High level scenario > Assume you're starting from a current Solr server in version 7.x and want to > upgrade to 8.x then 9.x. > Upgrading from 7.x to 8.x works fine. Indexes of 7.x can still be read with > Solr 8.x. > On a regular basis, you clear* the index to start fresh, assuming this will > recreate the index in version 8.x. > This ran nicely for some time. Then you want to upgrade to 9.x. When > starting, you get an error saying that the index is still 7.x and cannot be > read by 9.x. > > *This is surprising because you'd expect that starting from a fresh index in > 8.x would have removed any trace of 7.x.* > > _* : when I say "clear", I mean "delete by query {{*:*}} (all docs)" and > then commit + optionally optimize._ > h1. What I'd like to see > Clearing an index when running Solr version N should delete any trace of > Lucene version N-1. 
> Otherwise this forces users to delete an index (core / collection) and > recreate it rather than just clearing it. > h1. Detailed scenario to reproduce > The following steps reproduces the issue with a standalone Solr instance > running in Docker but I experienced the issue in SolrCloud mode running on > VMs and/or bare-metal. > > Also note that for personal troubleshooting I used the tool "luceneupgrader" > available at [https://github.com/hakanai/luceneupgrader] but it's not > necessary to reproduce the issue. > > 1. Create a directory for data > {code:java} > $ mkdir solrdata > $ chmod -R a+rwx solrdata {code} > > 2. Start a Solr 7.x server, create a core and push some docs > {code:java} > $ docker run -d -v "$PWD/solrdata:/opt/solr/server/solr/mycores:rw" -p > 8983:8983 --name my_solr_7 solr:7.6.0 solr-precreate gettingstarted > $ docker exec -it my_solr_7 post -c gettingstarted > example/exampledocs/manufacturers.xml > $ curl -s 'http://localhost:8983/solr/gettingstarted/select?q=*:*' | jq > .response.numFound > 11{code} > > 3. Look at the index files and check version > {code:java} > $ ll solrdata/gettingstarted/data/index > > total 40K > -rw-r--r--. 1 8983 8983 718 16 mars 17:37 _0.fdt > -rw-r--r--. 1 8983 8983 84 16 mars 17:37 _0.fdx > -rw-r--r--. 1 8983 8983 656 16 mars 17:37 _0.fnm > -rw-r--r--. 1 8983 8983 112 16 mars 17:37 _0_Lucene50_0.doc > -rw-r--r--. 1 8983 8983 1,1K 16 mars 17:37 _0_Lucene50_0.tim > -rw-r--r--. 1 8983 8983 145 16 mars 17:37 _0_Lucene50_0.tip > -rw-r--r--. 1 8983 8983 767 16 mars 17:37 _0_Lucene70_0.dvd > -rw-r--r--. 1 8983 8983 730 16 mars 17:37 _0_Lucene70_0.dvm > -rw-r--r--. 1 8983 8983 478 16 mars 17:37 _0.si > -rw-r--r--. 1 8983 8983 203 16 mars 17:37 segments_2 > -rw-r--r--. 1 8983 8983 0 16 mars 17:36 write.lock > $ java -jar luceneupgrader-0.6.0.jar info solrdata/gettingstarted/data/index > Lucene index version: 7 > {code} > > 4. 
Stop Solr 7, update solrconfig.xml for Solr 8 and start a Solr 8 server > {code:java} > $ docker stop my_solr_7 > $ vim solrdata/gettingstarted/conf/solrconfig.xml > $ cat solrdata/gettingstarted/conf/solrconfig.xml | grep luceneMatchVersion > 8.11.2 > $ docker run -d -v "$PWD/solrdata:/var/solr/data:rw" -p 8983:8983 --name > my_solr_8 solr:8.11.2{code} > > 5. Check index is loaded ok and docs are still there > {code:java} > $ curl -s 'http://localhost:8983/solr/gettingstarted/select?q=*:*' | jq > .response.numFound > 11 {code} > > 6. Clear the index and check index files / version > {code:java} > $ curl -X POST -H 'Content-Type: application/json' > 'http://localhost:8983/solr/gettingstarted/update?commit=true' -d '{ > "delete": {"query":"*:*"} }' > $ ll solrdata/gettingstarted/data/index
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941691#comment-17941691 ] Rahul Goswami edited comment on SOLR-17725 at 4/7/25 7:18 PM: -- [~ab] For those running SolrCloud AND having enough capacity in terms of infrastructure and budget, the REINDEXCOLLECTION command is a good option. I see that it reindexes onto a parallel collection, so for clusters with hundreds/thousands of large indexes that cost can be substantial. Also, the source collection is put in read-only mode while the reindexing happens, so it can be a point of contention in environments that are more update-heavy than search-heavy (e.g., for us at Commvault). By means of this Jira I am attempting to overcome the Lucene limitation which forces you to reindex from source when you really don't HAVE to. At least I would like to offer that option to users who are more cost-sensitive or operationally sensitive (e.g., solutions which package Solr as part of the application and are installed/deployed on customer sites; it can be awkward to reason with customers as to why a solution upgrade may need a downtime/additional infra capacity if it involves a Solr upgrade). The proposed solution reindexes into the same core, can be easily adapted to work with both standalone Solr and SolrCloud, and allows both updates and searches to be served while doing so. This also helps remove additional operational overhead, since users can focus on just the Solr upgrade without having to worry about index compatibility.
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704 ] Rahul Goswami commented on SOLR-17725: -- [~janhoy] Thanks for taking the time to review the JIRA. Please find my thoughts on your questions below: 1) Do you intend for this to be a new Solr API, if so what is the proposed API? Or a CLI utility tool to run on a cold index folder? > The implementation needs to run on a hot index for it to be lossless. Indexing calls happen using Solr APIs, so Solr will need to be running. In our custom implementation I have hooked the process into SolrDispatchFilter load() so that the process can start upon server start for the least operational overhead. As a generic solution I am thinking we can expose it as an action (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for trackability. This way users can hook the command into their shell/cmd scripts after Solr starts. Open to suggestions here. 2) Is one of your design goals to avoid the need for 2-3x disk space during the reindex, since you work on segment level and do merges? > Reducing infrastructure costs is a major design goal here. Also removing the operational overhead of index upgrade during a Solr upgrade when possible. 3) Requiring a Lucene API change is a potential blocker; I'd not be surprised if the Lucene project rejects making the "created-version" property writable, so such a discussion with them would come early. > I agree. I am hopeful(!!) this will not be rejected though, since they can implement guardrails around changing the "created-version" property for added security. In my implementation I added the change in CommitInfos to check all the segments in a commit and ensure they are at the new version in every aspect before setting the created-version property. This already happens in a synchronized block, so in my (limited) opinion it should be safe. The API they give us can do all required internal validations and fail gracefully without any harm to the index. I can get a discussion started with the Lucene folks once we agree on the basics of this implementation. Or do you suggest I do that right away? 4) Obviously a new Solr API needs to play well with SolrCloud as well as other features such as shard split / move etc. Have you thought about locking / conflicts? > SolrCloud challenges are not factored into the current implementation. But given that the process works at the core level, agnostic of the mode, I am optimistic we can adapt the solution for SolrCloud through PR discussions. We might have to block certain operations like splitshard while this process is underway on a collection. 5) A reindex-collection API is probably wanted, however it could be acceptable to implement a "core-level" API first and later add a "collection-level" API on top of it. > Agreed 6) Challenge the assumption that "in-place" segment level is the best choice for this feature. Re-indexing into a new collection due to major schema changes is also a common use case that this will not address. > I would refer to my answer to your second question in defense of the "in-place" implementation. Segment-level processing gives us the ability to limit pollution of the index due to merges as we reindex, and also restartability. Agreed this is not a substitute for when a field data type changes. This is intended to be a substitute for the index upgrade when you upgrade Solr, so as to overcome the X --> X+1 --> X+2 version upgrade path limitation which exists today despite no schema changes. Of course, users are free to add new fields and should still be able to use this utility. 
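The core-admin invocation proposed in point 1) might look like the sketch below. To be clear: neither the UPGRADEINDEXES action nor these parameter names exist in Solr today; everything here is an assumption taken from the proposal in the comment above.

```python
# Hypothetical sketch of the PROPOSED UPGRADEINDEXES core-admin action.
# The action and parameter names are assumptions from the Jira comment,
# not an existing Solr API.
from typing import Optional
from urllib.parse import urlencode

def upgrade_indexes_url(base: str, core: str, async_id: Optional[str] = None) -> str:
    """Build the core-admin URL for the proposed in-place index upgrade."""
    params = {"action": "UPGRADEINDEXES", "core": core}
    if async_id is not None:
        # An "async" request id would let callers poll for progress,
        # as with other asynchronous admin commands.
        params["async"] = async_id
    return f"{base}/solr/admin/cores?{urlencode(params)}"

print(upgrade_indexes_url("http://localhost:8983", "techproducts", "upg-1"))
```

A wrapper like this is the sort of thing users could drop into their post-startup shell/cmd scripts, per the comment's suggestion.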
> Automatically upgrade Solr indexes without needing to reindex from source > - > > Key: SOLR-17725 > URL: https://issues.apache.org/jira/browse/SOLR-17725 > Project: Solr > Issue Type: Improvement >Reporter: Rahul Goswami >Priority: Major > Attachments: High Level Design.png > > > Today upgrading from Solr version X to X+2 requires complete reingestion of > data from source. This comes from Lucene's constraint which only guarantees > index compatibility between the version the index was created in and the > immediate next version. > This reindexing usually comes with added downtime and/or cost. Especially in > case of deployments which are in customer environments and not completely in > control of the vendor, this proposition of having to completely reindex the > data can become a hard sell. > I, on behalf of my employer, Commvault, have developed a way which achieves > this reindexing in-place on the same index. Also, the process automatically > keeps "upgrading" the indexes over multipl
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704 ] Rahul Goswami edited comment on SOLR-17725 at 4/7/25 8:11 PM: -- [~janhoy] Thanks for taking the time to review the JIRA. Please find my thoughts on your questions below: 1) Do you intend for this to be a new Solr API, if so what is the proposed API? or a CLI utility tool to run on a cold index folder? > The implementation needs to run on a hot index for it to be lossless. > Indexing calls happen using Solr APIs so Solr will need to be running. In our > custom implementation I have hooked the process into SolrDispatchFilter > load() so that the process can start upon server start for least operational > overhead. As a generic solution I am thinking we can expose it as an action > (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for > trackability. This way users can hook up the command into their shell/cmd > scripts after Solr starts. Open to suggestions here, 2) Is one of your design goals to avoid the need for 2-3x disk space during the reindex, since you work on segment level and do merges? > Reducing infrastructure costs is a major design goal here. Also removing the > operational overhead of index uprgade during Solr uprgade when possible. 3) Requring Lucene API change is a potential blocker, I'd not be surprised if the Lucene project rejects making the "created-version" property writable, so such a discussion with them would come early > I agree. I am hopeful(!!) this will not be rejected though since they can > implement guardrails around changing the "created-version" property for added > security. In my implementation I added the change in CommitInfos to check for > all the segments in a commit and ensure they are the new version in every > aspect before setting the created-version property. 
This already happens in a synchronized block upon commit, so in my (limited) opinion it should be safe. The API they give us can do all required internal validations and fail gracefully without any harm to the index. I can get a discussion started with the Lucene folks once we agree on the basics of this implementation. Or do you suggest I do that right away?

4) Obviously a new Solr API needs to play well with SolrCloud as well as other features such as shard split / move etc. Have you thought about locking / conflicts?

> SolrCloud challenges are not factored into the current implementation. But given that the process works at the core level and is agnostic of the mode, I am optimistic we can adapt the solution for SolrCloud through PR discussions. We might have to block certain operations like splitshard while this process is underway on a collection.

5) A reindex-collection API is probably wanted; however, it could be acceptable to implement a "core-level" API first and later add a "collection-level" API on top of it.

> Agreed.

6) Challenge the assumption that "in-place" segment level is the best choice for this feature. Re-indexing into a new collection due to major schema changes is also a common use case that this will not address.

> I would refer back to my answer to your second question in defense of the "in-place" implementation. Segment-level processing gives us the ability to restrict pollution of the index due to merges as we reindex, and also restartability. Agreed, this is not a substitute for when a field data type changes. This is intended to be a substitute for index upgrade when you upgrade Solr, so as to overcome the X --> X+1 --> X+2 version upgrade path limitation which exists today despite no schema changes. Of course, users are free to add new fields and should still be able to use this utility.
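The X --> X+1 --> X+2 limitation discussed in answer 6 follows from Lucene's read-compatibility guarantee: a major version can only open indexes created in the current or the immediately preceding major version. A toy sketch of that rule (illustrative helper, not a Lucene API):

```python
def can_open(index_created_major: int, lucene_major: int) -> bool:
    """Lucene guarantees read compatibility only for indexes created in
    the current or the immediately preceding major version."""
    return lucene_major - 1 <= index_created_major <= lucene_major

# An index created on 7.x survives the upgrade to 8.x...
assert can_open(7, 8)
# ...but 9.x refuses it, despite no schema changes.
assert not can_open(7, 9)
# Once the in-place reindex stamps the index as 8.x, 9.x can open it.
assert can_open(8, 9)
```

This is why the process has to be repeated at each major upgrade rather than jumping versions.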
[jira] [Commented] (SOLR-16703) Clearing all documents of an index should delete traces of a previous Lucene version
[ https://issues.apache.org/jira/browse/SOLR-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945586#comment-17945586 ] Rahul Goswami commented on SOLR-16703: -- [~gjourdan] Just curious. Since you are ok with reindexing from source, what prevents you from physically deleting the "index" directory for each core/replica instead? That way reindexing will again populate the index without any trace of the previous Solr/Lucene version, and without you having to recreate the collection. The fix for your exact issue requires an API from Lucene which I am going to request anyway, but I expect them to ask the same question.

> Clearing all documents of an index should delete traces of a previous Lucene version
> ------------------------------------------------------------------------------------
>
> Key: SOLR-16703
> URL: https://issues.apache.org/jira/browse/SOLR-16703
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 7.6, 8.11.2, 9.1.1
> Reporter: Gaël Jourdan
> Priority: Major
>
> _This is a ticket following a discussion on Slack with_ [~elyograg] _and_ [~wunder] _especially._
> h1. High level scenario
> Assume you're starting from a current Solr server in version 7.x and want to upgrade to 8.x then 9.x.
> Upgrading from 7.x to 8.x works fine. Indexes of 7.x can still be read with Solr 8.x.
> On a regular basis, you clear* the index to start fresh, assuming this will recreate the index in version 8.x.
> This ran nicely for some time. Then you want to upgrade to 9.x. When starting, you get an error saying that the index is still 7.x and cannot be read by 9.x.
>
> *This is surprising because you'd expect that starting from a fresh index in 8.x would have removed any trace of 7.x.*
>
> _* : when I say "clear", I mean "delete by query {{*:*}} (all docs)" and then commit + optionally optimize._
> h1. What I'd like to see
> Clearing an index when running Solr version N should delete any trace of Lucene version N-1.
> Otherwise this forces users to delete an index (core / collection) and recreate it rather than just clearing it.
> h1. Detailed scenario to reproduce
> The following steps reproduce the issue with a standalone Solr instance running in Docker, but I experienced the issue in SolrCloud mode running on VMs and/or bare-metal.
>
> Also note that for personal troubleshooting I used the tool "luceneupgrader" available at [https://github.com/hakanai/luceneupgrader] but it's not necessary to reproduce the issue.
>
> 1. Create a directory for data
> {code:java}
> $ mkdir solrdata
> $ chmod -R a+rwx solrdata {code}
>
> 2. Start a Solr 7.x server, create a core and push some docs
> {code:java}
> $ docker run -d -v "$PWD/solrdata:/opt/solr/server/solr/mycores:rw" -p 8983:8983 --name my_solr_7 solr:7.6.0 solr-precreate gettingstarted
> $ docker exec -it my_solr_7 post -c gettingstarted example/exampledocs/manufacturers.xml
> $ curl -s 'http://localhost:8983/solr/gettingstarted/select?q=*:*' | jq .response.numFound
> 11{code}
>
> 3. Look at the index files and check version
> {code:java}
> $ ll solrdata/gettingstarted/data/index
> total 40K
> -rw-r--r--. 1 8983 8983 718 16 mars 17:37 _0.fdt
> -rw-r--r--. 1 8983 8983 84 16 mars 17:37 _0.fdx
> -rw-r--r--. 1 8983 8983 656 16 mars 17:37 _0.fnm
> -rw-r--r--. 1 8983 8983 112 16 mars 17:37 _0_Lucene50_0.doc
> -rw-r--r--. 1 8983 8983 1,1K 16 mars 17:37 _0_Lucene50_0.tim
> -rw-r--r--. 1 8983 8983 145 16 mars 17:37 _0_Lucene50_0.tip
> -rw-r--r--. 1 8983 8983 767 16 mars 17:37 _0_Lucene70_0.dvd
> -rw-r--r--. 1 8983 8983 730 16 mars 17:37 _0_Lucene70_0.dvm
> -rw-r--r--. 1 8983 8983 478 16 mars 17:37 _0.si
> -rw-r--r--. 1 8983 8983 203 16 mars 17:37 segments_2
> -rw-r--r--. 1 8983 8983 0 16 mars 17:36 write.lock
> $ java -jar luceneupgrader-0.6.0.jar info solrdata/gettingstarted/data/index
> Lucene index version: 7
> {code}
>
> 4. Stop Solr 7, update solrconfig.xml for Solr 8 and start a Solr 8 server
> {code:java}
> $ docker stop my_solr_7
> $ vim solrdata/gettingstarted/conf/solrconfig.xml
> $ cat solrdata/gettingstarted/conf/solrconfig.xml | grep luceneMatchVersion
> 8.11.2
> $ docker run -d -v "$PWD/solrdata:/var/solr/data:rw" -p 8983:8983 --name my_solr_8 solr:8.11.2{code}
>
> 5. Check index is loaded ok and docs are still there
> {code:java}
> $ curl -s 'http://localhost:8983/solr/gettingstarted/select?q=*:*' | jq .response.numFound
> 11 {code}
>
> 6. Clear the index and check index files / version
> {code:java}
> $ curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/gettingstarted/update?commit=true' -d '{ "delete": {"query":"*:*"} }'
> $ ll solrdata/gettingstarted/data/index
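The surprise in the report above boils down to one fact: a delete-by-query empties the documents but leaves the index metadata, including the version the index was first created in, untouched; only physically recreating the index directory resets it. A toy model of that behavior (plain Python, not Lucene code):

```python
class ToyIndex:
    """Toy stand-in for a Lucene index directory: documents plus the
    created-version stamp kept in the segments_N file."""
    def __init__(self, created_major: int):
        self.created_major = created_major
        self.docs = []

    def add(self, doc):
        self.docs.append(doc)

    def delete_all_and_commit(self):
        # Mirrors "delete by query *:* + commit": docs go away, but the
        # created-version stamp in the index metadata does not change.
        self.docs.clear()

def recreate(current_major: int) -> ToyIndex:
    # Mirrors deleting the "index" directory and reindexing: the brand-new
    # index is stamped with the running version.
    return ToyIndex(created_major=current_major)

idx = ToyIndex(created_major=7)        # index first created on 7.x
idx.add({"id": "1"})
idx.delete_all_and_commit()            # "fresh" index while running 8.x...
assert idx.docs == []
assert idx.created_major == 7          # ...still carries the 7.x stamp
idx = recreate(current_major=8)
assert idx.created_major == 8          # physical recreation resets it
```

This is why step 6 above leaves an index that 9.x still refuses, and why deleting the index directory sidesteps the problem.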
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17940243#comment-17940243 ] Rahul Goswami edited comment on SOLR-17725 at 4/19/25 12:00 AM: The attached document outlines an example where the upgrade tool works on an index originally created in Solr 7.x, AFTER an upgrade to Solr 8.x. Key points:

1) Lucene version X can read an index created in version X-1. Writing of new segments happens with the latest version codec.

2) When a segment merge happens, the merged segment maintains a version stamp "minVersion", which is the least version among the segments participating in the merge.

3) The segments_* file in a Lucene index maintains the Lucene version where the index was first created.

The design doc outlines the process of converting all segments to the new version. It's sort of a pull model where you first upgrade and then "pull" the index to the current version. By the end of the process outlined in the doc, all segments get converted to the new version and the index is in all respects an "upgraded" index. The only missing piece is to update the index creation version in the commit point. I did this by exposing a method in Lucene's IndexWriter which validates the version of all segments and updates the creation version stamp in the commit point (we might need to request an API from Lucene here). When this index is opened in Solr 9.x, Solr 9.x can read it (thanks to point #1), and the same process repeats to make the index ready for Solr 10.x.

> Automatically upgrade Solr indexes without needing to reindex from source
> -------------------------------------------------------------------------
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
> Issue Type: Improvement
> Reporter: Rahul Goswami
> Priority: Major
> Attachments: High Level Design.png
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of data from source. This comes from Lucene's constraint which only guarantees index compatibility between the version the index was created in and the immediate next version.
> This reindexing usually comes with added downtime and/or cost. Especially in case of deployments which are in customer environments and not completely in control of the vendor, this proposition of having to completely reindex the data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves this reindexing in-place on the same index.
Also, the process automatically keeps "upgrading" the indexes over multiple subsequent Solr upgrades without needing manual intervention.
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any copyField destination fields can be stored=false of course, just that the source fields (or more precisely, the source fields you care about preserving) should be either stored or docValues true.
> ii) The datatype of an existing field in schema.xml shouldn't change upon Solr upgrade. Introducing new fields is fine.
> For indexes where this limitation is not a problem (it wasn't for us!), the tool can reindex in-place on the same core with zero downtime and legitimately "upgrade" the index. This can remove a lot of operational headaches, especially in environments with hundreds/thousands of very large indexes.
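Key points 1-3 above can be modeled compactly: each segment carries a minVersion, a merge keeps the minimum across its inputs, and the index-level created-version lives in the segments_* file and may only be moved forward once nothing older remains. A rough simulation (plain Python, not Lucene's actual classes; the guarded stamping method is the proposed API, not an existing one):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    min_version: int  # lowest Lucene major that wrote into this segment

def merge(segments):
    # Point 2: a merged segment's minVersion is the least version among
    # the segments participating in the merge.
    return Segment(min_version=min(s.min_version for s in segments))

@dataclass
class Commit:
    created_version: int   # point 3: kept in the segments_* file
    segments: list

def try_stamp_created_version(commit: Commit, target: int) -> bool:
    # The guarded API sketched above: only move created-version forward
    # once every live segment is fully on the new version.
    if all(s.min_version >= target for s in commit.segments):
        commit.created_version = target
        return True
    return False

commit = Commit(created_version=7, segments=[Segment(7), Segment(8)])
merged = merge(commit.segments)
assert merged.min_version == 7                   # old data taints the merge
commit.segments = [merged]
assert not try_stamp_created_version(commit, 8)  # validation refuses
commit.segments = [Segment(8)]                   # after the in-place reindex
assert try_stamp_created_version(commit, 8)
assert commit.created_version == 8
```

The refusal branch is what makes the "pull model" safe: the stamp can never run ahead of the segment-by-segment conversion.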
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944396#comment-17944396 ] Rahul Goswami edited comment on SOLR-17725 at 4/14/25 3:44 PM: --- Will do [~dsmiley], thanks. [~gus] As far as I can see, the current implementation doesn't run the risk of corruption. The status is maintained in two ways:

1) At the core level -> to keep track of which core was being processed when the service went down / was killed. A file autoupgrade_status.csv is maintained, which is written each time a core is picked up for processing and a status is set for it. Each time the process resumes, it picks up the core with status "REINDEXING_ACTIVE", if any. For SolrCloud, this file can be housed in Zookeeper. This is an implementation detail I am happy to discuss further, but in our (Commvault's) implementation we recognize the following statuses: DEFAULT, REINDEXING_ACTIVE, REINDEXING_PAUSED, PROCESSED, ERROR, CORRECTVERSION.

2) At the segment level -> This is where we piggyback on Lucene's design, and it's beautiful! As we iterate over each segment, we read the live docs out of the segment, create a SolrInputDocument out of each and reindex using Solr's API. This helps achieve two things:

i) A reindexed doc helps mark an existing (old) doc as deleted (when auto-commit kicks in). This way if the service goes down, we don't need to reprocess the already processed docs of the segment. And if the service goes down before a commit could be processed, the small penalty is reprocessing the docs of only that segment.

ii) When a segment is fully processed, Lucene's DeletionPolicy deletes it, reclaiming space in the process. Hence we never process the same segment again.

Note that as we do this, we are in no way interfering with Lucene's index structure directly and are only interacting by means of APIs.
A combination of these factors helps maintain continuity in the processing of a core despite failures, without running the risk of corruption.
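The core-level bookkeeping in point 1 amounts to a tiny status file plus a resume rule. The file name and status values below follow the comment; everything else (the CSV layout and helper names) is illustrative:

```python
import csv
import io

# Statuses recognized by the described implementation.
STATUSES = ("DEFAULT", "REINDEXING_ACTIVE", "REINDEXING_PAUSED",
            "PROCESSED", "ERROR", "CORRECTVERSION")

def write_status(buf, rows):
    # autoupgrade_status.csv: one (core, status) row per core, rewritten
    # each time a core is picked up for processing.
    w = csv.writer(buf)
    w.writerows(rows)

def core_to_resume(buf):
    # On restart, pick up the core that was mid-flight, if any.
    buf.seek(0)
    for core, status in csv.reader(buf):
        if status == "REINDEXING_ACTIVE":
            return core
    return None

buf = io.StringIO()  # stands in for the file (or a Zookeeper node in SolrCloud)
write_status(buf, [("core1", "PROCESSED"),
                   ("core2", "REINDEXING_ACTIVE"),
                   ("core3", "DEFAULT")])
assert core_to_resume(buf) == "core2"
```

The segment-level half of the scheme needs no bookkeeping file at all, since a fully reindexed segment simply disappears via the DeletionPolicy.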
> Automatically upgrade Solr indexes without needing to reindex from source > - > > Key: SOLR-17725 > URL: https://issues.apache.org/jira/browse/SOLR-17725 > Project: Solr > Issue Type: Improvement >Reporter: Rahul Goswami >Priority: Major > Attachments: High Level Design.png > > > Today upgrading from Solr version X to X+2 requires complete reingestion of > data from source. This comes from Lucene's constraint which only guarantees > index compatibility between the version the index was created in and the > immediate next version. > This reindexing usually comes with added downtime and/or cost. Especially in > case of deployments which are in customer environments and not compl
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17943738#comment-17943738 ] Rahul Goswami commented on SOLR-17725: -- [~janhoy] How do you recommend we proceed here? If you need me to elaborate on any part of the design, I am happy to do so (either here or a discussion over video chat or whatever is the norm with a new feature). If we need a wider audience to take a look at this, I am also happy to float this on the dev list. > Automatically upgrade Solr indexes without needing to reindex from source > - > > Key: SOLR-17725 > URL: https://issues.apache.org/jira/browse/SOLR-17725 > Project: Solr > Issue Type: Improvement >Reporter: Rahul Goswami >Priority: Major > Attachments: High Level Design.png > > > Today upgrading from Solr version X to X+2 requires complete reingestion of > data from source. This comes from Lucene's constraint which only guarantees > index compatibility between the version the index was created in and the > immediate next version. > This reindexing usually comes with added downtime and/or cost. Especially in > case of deployments which are in customer environments and not completely in > control of the vendor, this proposition of having to completely reindex the > data can become a hard sell. > I, on behalf of my employer, Commvault, have developed a way which achieves > this reindexing in-place on the same index. Also, the process automatically > keeps "upgrading" the indexes over multiple subsequent Solr upgrades without > needing manual intervention. > It comes with the following limitations: > i) All _source_ fields need to be either stored=true or docValues=true. Any > copyField destination fields can be stored=false of course, just that the > source fields (or more precisely, the source fields you care about > preserving) should be either stored or docValues true. 
> ii) The datatype of an existing field in schema.xml shouldn't change upon > Solr upgrade. Introducing new fields is fine. > For indexes where this limitation is not a problem (it wasn't for us!), the > tool can reindex in-place on the same core with zero downtime and > legitimately "upgrade" the index. This can remove a lot of operational > headaches, especially in environments with hundreds/thousands of very large > indexes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944396#comment-17944396 ] Rahul Goswami edited comment on SOLR-17725 at 4/25/25 5:36 AM: --- Will do [~dsmiley] Thanks. [~gus] As far as I can see, the current implementation doesn't run the risk of corruption. The status is maintained in two ways: 1) At the core level -> to keep track of which core was being processed when the service went down or was killed. A file autoupgrade_status.csv is maintained and written each time a core is picked up for processing, with a status set for that core. Each time the process resumes, it picks up the core with status "REINDEXING_ACTIVE", if any. For SolrCloud, this file can be housed in ZooKeeper. This is an implementation detail I am happy to discuss further, but in our (Commvault's) implementation we recognize the following statuses: DEFAULT, REINDEXING_ACTIVE, REINDEXING_PAUSED, PROCESSED, ERROR, CORRECTVERSION. 2) At the segment level -> This is where we piggyback on Lucene's design, and it's beautiful! As we iterate over each segment, we read the live docs out of the segment, create a SolrInputDocument from each, and reindex using Solr's API. This helps achieve two things: i) A reindexed doc marks the existing (old) doc as deleted (when auto-commit kicks in). This way, if the service goes down, we don't need to reprocess the already processed docs of the segment. And if the service goes down before a commit could be processed, the small penalty is reprocessing the docs of only that segment. ii) When a segment is fully processed, Lucene's DeletionPolicy deletes it, reclaiming space in the process. Hence we never process the same segment again. Note that as we do this, we are in no way interfering with Lucene's index structure directly; we only interact by means of APIs. 
A combination of these factors helps maintain continuity in the processing of a core despite failures, without running the risk of corruption. 
> Automatically upgrade Solr indexes without needing to reindex from source > - > > Key: SOLR-17725 > URL: https://issues.apache.org/jira/browse/SOLR-17725 > Project: Solr > Issue Type: Improvement >Reporter: Rahul Goswami >Priority: Major > Attachments: High Level Design.png > > > Today upgrading from Solr version X to X+2 requires complete reingestion of > data from source. This comes from Lucene's constraint which only guarantees > index compatibility between the version the index was created in and the > immediate next version. > This reindexing usually comes with added downtime and/or cost. Especially in > case of deployments which are in customer environments and not completely
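The core-level resume logic described in the comment above lends itself to a small sketch. The following is an illustrative Python sketch only, not the actual Commvault or Solr code: the file name autoupgrade_status.csv and the status names come from the comment, while the two-column CSV layout and the helper pick_next_core are hypothetical.

```python
import csv
import io

# Statuses named in the comment above.
STATUSES = {"DEFAULT", "REINDEXING_ACTIVE", "REINDEXING_PAUSED",
            "PROCESSED", "ERROR", "CORRECTVERSION"}

def pick_next_core(status_csv_text):
    """Return the core to process on (re)start.

    A core left in REINDEXING_ACTIVE (the service died mid-core) is
    resumed first; otherwise the first DEFAULT core is picked.
    """
    rows = list(csv.reader(io.StringIO(status_csv_text)))
    for core, status in rows:
        if status not in STATUSES:
            raise ValueError(f"unknown status for {core}: {status}")
    for core, status in rows:
        if status == "REINDEXING_ACTIVE":
            return core  # resume the interrupted core
    for core, status in rows:
        if status == "DEFAULT":
            return core  # nothing interrupted: start the next unprocessed core
    return None  # every core is in a terminal state
```

Under these assumptions, a crash while core2 was active means core2 is picked up again on restart, ahead of any unprocessed cores.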
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17947233#comment-17947233 ] Rahul Goswami commented on SOLR-17725: -- Requested the API from Lucene a few days back and the discussion is underway at [https://lists.apache.org/thread/gk3kwplon73llz356szz1mn3myn3nnm3] . Was trying to avoid cross posting , but now thinking it might be ok to copy d...@solr.apache.org on the discussion(?) > Automatically upgrade Solr indexes without needing to reindex from source > - > > Key: SOLR-17725 > URL: https://issues.apache.org/jira/browse/SOLR-17725 > Project: Solr > Issue Type: Improvement >Reporter: Rahul Goswami >Priority: Major > Attachments: High Level Design.png > > > Today upgrading from Solr version X to X+2 requires complete reingestion of > data from source. This comes from Lucene's constraint which only guarantees > index compatibility between the version the index was created in and the > immediate next version. > This reindexing usually comes with added downtime and/or cost. Especially in > case of deployments which are in customer environments and not completely in > control of the vendor, this proposition of having to completely reindex the > data can become a hard sell. > I, on behalf of my employer, Commvault, have developed a way which achieves > this reindexing in-place on the same index. Also, the process automatically > keeps "upgrading" the indexes over multiple subsequent Solr upgrades without > needing manual intervention. > It comes with the following limitations: > i) All _source_ fields need to be either stored=true or docValues=true. Any > copyField destination fields can be stored=false of course, just that the > source fields (or more precisely, the source fields you care about > preserving) should be either stored or docValues true. > ii) The datatype of an existing field in schema.xml shouldn't change upon > Solr upgrade. Introducing new fields is fine. 
> For indexes where this limitation is not a problem (it wasn't for us!), the > tool can reindex in-place on the same core with zero downtime and > legitimately "upgrade" the index. This can remove a lot of operational > headaches, especially in environments with hundreds/thousands of very large > indexes.
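The segment-level mechanism described in the comments (read the live docs of each segment, rebuild a SolrInputDocument, re-add it through the normal update API so the old copy is marked deleted, and let the deletion policy reclaim fully processed segments) can be simulated in miniature. This Python sketch is purely illustrative: Segment is an in-memory stand-in for a Lucene segment and reindex_in_place a hypothetical helper, not the actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    docs: dict                        # id -> stored fields (live docs only)
    deleted: set = field(default_factory=set)

def reindex_in_place(segments, add_document):
    """Re-add every live doc of every old segment via the normal API.

    Re-adding a doc with the same id supersedes (deletes) the old copy,
    so a crash mid-segment only costs reprocessing that one segment.
    """
    fully_processed = []
    for seg in segments:
        for doc_id, fields in seg.docs.items():
            if doc_id in seg.deleted:
                continue                            # already superseded
            add_document({"id": doc_id, **fields})  # normal update API
            seg.deleted.add(doc_id)                 # old copy is now a delete
        if set(seg.docs) <= seg.deleted:
            fully_processed.append(seg)             # reclaimable segment
    return fully_processed
```

A fully processed segment (all docs deleted) is exactly the state in which Lucene's deletion policy can drop it, which is why the same segment is never revisited.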
[jira] [Commented] (SOLR-16703) Clearing all documents of an index should delete traces of a previous Lucene version
[ https://issues.apache.org/jira/browse/SOLR-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941685#comment-17941685 ] Rahul Goswami commented on SOLR-16703: -- [~gjourdan] The effort is underway on https://issues.apache.org/jira/browse/SOLR-17725 > Clearing all documents of an index should delete traces of a previous Lucene > version > > > Key: SOLR-16703 > URL: https://issues.apache.org/jira/browse/SOLR-16703 > Project: Solr > Issue Type: Improvement >Affects Versions: 7.6, 8.11.2, 9.1.1 >Reporter: Gaël Jourdan >Priority: Major > > _This is a ticket following a discussion on Slack with_ [~elyograg] _and_ > [~wunder] _especially._ > h1. High level scenario > Assume you're starting from a current Solr server in version 7.x and want to > upgrade to 8.x then 9.x. > Upgrading from 7.x to 8.x works fine. Indexes of 7.x can still be read with > Solr 8.x. > On a regular basis, you clear* the index to start fresh, assuming this will > recreate the index in version 8.x. > This runs nicely for some time. Then you want to upgrade to 9.x. When > starting, you get an error saying that the index is still 7.x and cannot be > read by 9.x. > > *This is surprising because you'd expect that starting from a fresh index in > 8.x would have removed any trace of 7.x.* > > _* : when I say "clear", I mean "delete by query \{{* : * }}all docs" and > then commit + optionally optimize._ > h1. What I'd like to see > Clearing an index when running Solr version N should delete any trace of > Lucene version N-1. > Otherwise this forces users to delete an index (core / collection) and > recreate it rather than just clearing it. > h1. Detailed scenario to reproduce > The following steps reproduce the issue with a standalone Solr instance > running in Docker but I experienced the issue in SolrCloud mode running on > VMs and/or bare-metal. 
> > Also note that for personal troubleshooting I used the tool "luceneupgrader" > available at [https://github.com/hakanai/luceneupgrader] but it's not > necessary to reproduce the issue. > > 1. Create a directory for data > {code:java} > $ mkdir solrdata > $ chmod -R a+rwx solrdata {code} > > 2. Start a Solr 7.x server, create a core and push some docs > {code:java} > $ docker run -d -v "$PWD/solrdata:/opt/solr/server/solr/mycores:rw" -p > 8983:8983 --name my_solr_7 solr:7.6.0 solr-precreate gettingstarted > $ docker exec -it my_solr_7 post -c gettingstarted > example/exampledocs/manufacturers.xml > $ curl -s 'http://localhost:8983/solr/gettingstarted/select?q=*:*' | jq > .response.numFound > 11{code} > > 3. Look at the index files and check version > {code:java} > $ ll solrdata/gettingstarted/data/index > > total 40K > -rw-r--r--. 1 8983 8983 718 16 mars 17:37 _0.fdt > -rw-r--r--. 1 8983 8983 84 16 mars 17:37 _0.fdx > -rw-r--r--. 1 8983 8983 656 16 mars 17:37 _0.fnm > -rw-r--r--. 1 8983 8983 112 16 mars 17:37 _0_Lucene50_0.doc > -rw-r--r--. 1 8983 8983 1,1K 16 mars 17:37 _0_Lucene50_0.tim > -rw-r--r--. 1 8983 8983 145 16 mars 17:37 _0_Lucene50_0.tip > -rw-r--r--. 1 8983 8983 767 16 mars 17:37 _0_Lucene70_0.dvd > -rw-r--r--. 1 8983 8983 730 16 mars 17:37 _0_Lucene70_0.dvm > -rw-r--r--. 1 8983 8983 478 16 mars 17:37 _0.si > -rw-r--r--. 1 8983 8983 203 16 mars 17:37 segments_2 > -rw-r--r--. 1 8983 8983 0 16 mars 17:36 write.lock > $ java -jar luceneupgrader-0.6.0.jar info solrdata/gettingstarted/data/index > Lucene index version: 7 > {code} > > 4. Stop Solr 7, update solrconfig.xml for Solr 8 and start a Solr 8 server > {code:java} > $ docker stop my_solr_7 > $ vim solrdata/gettingstarted/conf/solrconfig.xml > $ cat solrdata/gettingstarted/conf/solrconfig.xml | grep luceneMatchVersion > 8.11.2 > $ docker run -d -v "$PWD/solrdata:/var/solr/data:rw" -p 8983:8983 --name > my_solr_8 solr:8.11.2{code} > > 5. 
Check index is loaded ok and docs are still there > {code:java} > $ curl -s 'http://localhost:8983/solr/gettingstarted/select?q=*:*' | jq > .response.numFound > 11 {code} > > 6. Clear the index and check index files / version > {code:java} > $ curl -X POST -H 'Content-Type: application/json' > 'http://localhost:8983/solr/gettingstarted/update?commit=true' -d '{ > "delete": {"query":"*:*"} }' > $ ll solrdata/gettingstarted/data/index > total 4,0K > -rw-r--r--. 1 8983 8983 135 16 mars 17:45 segments_5 > -rw-r--r--. 1 8983 8983 0 16 mars 17:36 write.lock > $ java -jar luceneupgrader-0.6.0.jar info solrdata/gettingstarted/data/index > Lucene index version: 7 > $ curl 'http://localhost:8983/solr/gettingstarted/update?optimize=true' > $ ll solrdata/gettingstarted/data/index
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17949234#comment-17949234 ] Rahul Goswami commented on SOLR-17725: -- Submitted pull request for the Lucene API change. Fingers crossed! [https://github.com/apache/lucene/pull/14607] > Automatically upgrade Solr indexes without needing to reindex from source > - > > Key: SOLR-17725 > URL: https://issues.apache.org/jira/browse/SOLR-17725 > Project: Solr > Issue Type: Improvement >Reporter: Rahul Goswami >Priority: Major > Attachments: High Level Design.png > > > Today upgrading from Solr version X to X+2 requires complete reingestion of > data from source. This comes from Lucene's constraint which only guarantees > index compatibility between the version the index was created in and the > immediate next version. > This reindexing usually comes with added downtime and/or cost. Especially in > case of deployments which are in customer environments and not completely in > control of the vendor, this proposition of having to completely reindex the > data can become a hard sell. > I, on behalf of my employer, Commvault, have developed a way which achieves > this reindexing in-place on the same index. Also, the process automatically > keeps "upgrading" the indexes over multiple subsequent Solr upgrades without > needing manual intervention. > It comes with the following limitations: > i) All _source_ fields need to be either stored=true or docValues=true. Any > copyField destination fields can be stored=false of course, just that the > source fields (or more precisely, the source fields you care about > preserving) should be either stored or docValues true. > ii) The datatype of an existing field in schema.xml shouldn't change upon > Solr upgrade. Introducing new fields is fine. 
> For indexes where this limitation is not a problem (it wasn't for us!), the > tool can reindex in-place on the same core with zero downtime and > legitimately "upgrade" the index. This can remove a lot of operational > headaches, especially in environments with hundreds/thousands of very large > indexes.
[jira] [Commented] (SOLR-17758) NumFieldLimiting URP "warnOnly" mode broken
[ https://issues.apache.org/jira/browse/SOLR-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950600#comment-17950600 ] Rahul Goswami commented on SOLR-17758: -- Thanks for creating the JIRA, Jason. Although I do see that, for the reason you mentioned, the chain would get terminated irrespective of whether warnOnly is true or false, since the user complained of getting a 400 error (SolrException.ErrorCode.BAD_REQUEST), the real culprit here seems to be this '>' check in init(). I believe it should be ">=". https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessorFactory.java#L72 > NumFieldLimiting URP "warnOnly" mode broken > --- > > Key: SOLR-17758 > URL: https://issues.apache.org/jira/browse/SOLR-17758 > Project: Solr > Issue Type: Bug > Components: UpdateRequestProcessors >Affects Versions: 9.8.1 >Reporter: Jason Gerlowski >Priority: Minor > > NumFieldLimitingUpdateProcessorFactory (introduced in SOLR-17192) aims to > offer a "warnOnly" mode that logs a warning when the maximum number of fields > is exceeded. > But the "warnOnly" code path doesn't trigger any subsequent processors in the > chain. So in effect, both modes will prevent new documents from being added > once the limit has been exceeded. > We should rework this logic so that the warnOnly=true codepath allows > documents to be indexed as expected.
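The '>' versus '>=' question raised in the comment is a pure boundary condition, which a small hedged sketch can illustrate (illustrative names only, not the actual factory code): the two checks disagree only when the field count sits exactly at the configured limit.

```python
def exceeds(field_count, max_fields, inclusive):
    """Boundary check: '>' and '>=' differ only when field_count == max_fields."""
    return field_count >= max_fields if inclusive else field_count > max_fields
```

So a core sitting exactly at the limit is treated as over the limit by '>=' but not by '>', which is the kind of off-by-one the comment points at.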
[jira] [Commented] (SOLR-7962) Passing additional arguments to solr.cmd using "-a" does not work on Windows
[ https://issues.apache.org/jira/browse/SOLR-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950433#comment-17950433 ] Rahul Goswami commented on SOLR-7962: - Sorry for dropping the ball on this. I am able to reproduce this on Windows. Passed --jvm-opts "-Dsolr.somerandomproperty=true" and I don't see it in the Java properties in Solr Admin UI. Same with --jvm-opts "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=18983". Working on a fix and a PR. > Passing additional arguments to solr.cmd using "-a" does not work on Windows > > > Key: SOLR-7962 > URL: https://issues.apache.org/jira/browse/SOLR-7962 > Project: Solr > Issue Type: Bug >Affects Versions: 5.3 >Reporter: Dawid Weiss >Priority: Major >
[jira] [Comment Edited] (SOLR-7962) Passing additional arguments to solr.cmd using "-a" does not work on Windows
[ https://issues.apache.org/jira/browse/SOLR-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950433#comment-17950433 ] Rahul Goswami edited comment on SOLR-7962 at 5/9/25 4:52 AM: - Sorry for dropping the ball on this. I am able to reproduce this on Windows. Tried solr start -e techproducts --jvm-opts "-Dsolr.somerandomproperty=true" and I don't see it in the Java properties in Solr Admin UI. Same with solr start -e techproducts --jvm-opts "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=18983". Working on a fix and a PR. > Passing additional arguments to solr.cmd using "-a" does not work on Windows > > > Key: SOLR-7962 > URL: https://issues.apache.org/jira/browse/SOLR-7962 > Project: Solr > Issue Type: Bug >Affects Versions: 5.3 >Reporter: Dawid Weiss >Priority: Major >
[jira] [Comment Edited] (SOLR-7962) Passing additional arguments to solr.cmd using "-a" does not work on Windows
[ https://issues.apache.org/jira/browse/SOLR-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950745#comment-17950745 ] Rahul Goswami edited comment on SOLR-7962 at 5/11/25 7:19 AM: -- Thanks for offering to review [~epugh]. The pull request is ready for review. Will add the tests next. "-e" now works with "--jvm-opts" on Windows. Also fixed an edge case issue where passing a -D system property with --jvm-opts would break parsing. So now passing something like --jvm-opts "-Dsolr.myprops.custom=hello" works. Also tested passing multiple args like --jvm-opts "-Dsolr.myprops.custom=hello -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:18983" and that works too. > Passing additional arguments to solr.cmd using "-a" does not work on Windows > > > Key: SOLR-7962 > URL: https://issues.apache.org/jira/browse/SOLR-7962 > Project: Solr > Issue Type: Bug >Affects Versions: 5.3 >Reporter: Dawid Weiss >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h >
[jira] [Comment Edited] (SOLR-7962) Passing additional arguments to solr.cmd using "-a" does not work on Windows
[ https://issues.apache.org/jira/browse/SOLR-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950745#comment-17950745 ] Rahul Goswami edited comment on SOLR-7962 at 5/11/25 7:20 AM: -- Thanks for offering to review [~epugh]. The pull request is ready for review. Will add the tests next. "-e" now works with "--jvm-opts" on Windows. Also fixed an edge case issue where passing a -D system property with --jvm-opts would break parsing. So now passing something like --jvm-opts "-Dsolr.myprops.custom=hello" works. Also tested passing multiple args like --jvm-opts "-Dsolr.myprops.custom=hello -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:18983" and that works too. > Passing additional arguments to solr.cmd using "-a" does not work on Windows > > > Key: SOLR-7962 > URL: https://issues.apache.org/jira/browse/SOLR-7962 > Project: Solr > Issue Type: Bug >Affects Versions: 5.3 >Reporter: Dawid Weiss >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h >
[jira] [Commented] (SOLR-7962) Passing additional arguments to solr.cmd using "-a" does not work on Windows
[ https://issues.apache.org/jira/browse/SOLR-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950745#comment-17950745 ] Rahul Goswami commented on SOLR-7962: - Thanks for offering to review [~epugh]. The pull request is ready for review. Will add the tests next. "-e" now works with "--jvm-opts" on Windows. Also fixed an edge case issue where passing a -D system property with --jvm-opts would break parsing. So now passing something like --jvm-opts "-Dsolr.myprops.custom=hello" works. Also tested passing multiple args like --jvm-opts "-Dsolr.myprops.custom=hello -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:18983" and that works too. > Passing additional arguments to solr.cmd using "-a" does not work on Windows > > > Key: SOLR-7962 > URL: https://issues.apache.org/jira/browse/SOLR-7962 > Project: Solr > Issue Type: Bug >Affects Versions: 5.3 >Reporter: Dawid Weiss >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h >
[jira] [Comment Edited] (SOLR-7962) Passing additional arguments to solr.cmd using "-a" does not work on Windows
[ https://issues.apache.org/jira/browse/SOLR-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950745#comment-17950745 ] Rahul Goswami edited comment on SOLR-7962 at 5/11/25 7:32 AM: -- Thanks for offering to review [~epugh]. The pull request is ready for review. Will add the tests next. "-e" now works with "--jvm-opts" on Windows. For the specific case of remote debug config (-agentlib:jdwp=transport=...), cmd.exe was not playing well with commons-exec's default way of parsing, passing incorrect/incomplete values to start.cmd. Also fixed an edge case issue where passing a -D system property as a value for --jvm-opts would cause the command to bail. So now passing something like --jvm-opts "-Dsolr.myprops.custom=hello" works. Also tested passing multiple args like --jvm-opts "-Dsolr.myprops.custom=hello -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:18983" and that works too. 
> Passing additional arguments to solr.cmd using "-a" does not work on Windows > > > Key: SOLR-7962 > URL: https://issues.apache.org/jira/browse/SOLR-7962 > Project: Solr > Issue Type: Bug >Affects Versions: 5.3 >Reporter: Dawid Weiss >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h >
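For context on the parsing being discussed above: splitting one quoted --jvm-opts value into individual JVM arguments, without mangling the '=' , ':' and ',' characters inside each argument, can be sketched with Python's shlex. This is only an illustration of quote-aware splitting; SolrCLI actually goes through commons-exec, and cmd.exe quoting adds further wrinkles, so split_jvm_opts is a hypothetical helper, not the Solr code.

```python
import shlex

def split_jvm_opts(raw):
    """Split one --jvm-opts string into separate JVM arguments,
    keeping '=', ':' and ',' intact inside each argument."""
    return shlex.split(raw)

opts = split_jvm_opts(
    "-Dsolr.myprops.custom=hello "
    "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:18983"
)
```

Splitting on raw whitespace would give the same result here; shlex matters once a single JVM argument itself contains quoted spaces.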
[jira] [Comment Edited] (SOLR-7962) Passing additional arguments to solr.cmd using "-a" does not work on Windows
[ https://issues.apache.org/jira/browse/SOLR-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950745#comment-17950745 ] Rahul Goswami edited comment on SOLR-7962 at 5/11/25 7:30 AM: -- Thanks for offering to review [~epugh]. The pull request is ready for review. Will add the tests next. "-e" now works with "--jvm-opts" on Windows. For the specific case of remote debug config (-agentlib:jdwp=transport=...), cmd.exe was not playing well with commons-exec's default way of parsing, passing incorrect/incomplete values to start.cmd. Also fixed an edge case issue where passing a -D system property as a value for --jvm-opts would cause the command to bail. So now passing something like --jvm-opts "-Dsolr.myprops.custom=hello" works. Also tested passing multiple args like --jvm-opts "-Dsolr.myprops.custom=hello -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:18983" and that works too. > Passing additional arguments to solr.cmd using "-a" does not work on Windows > > > Key: SOLR-7962 > URL: https://issues.apache.org/jira/browse/SOLR-7962 > Project: Solr > Issue Type: Bug >Affects Versions: 5.3 >Reporter: Dawid Weiss >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h >
[jira] [Comment Edited] (SOLR-7962) Passing additional arguments to solr.cmd using "-a" does not work on Windows
[ https://issues.apache.org/jira/browse/SOLR-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17951086#comment-17951086 ] Rahul Goswami edited comment on SOLR-7962 at 5/13/25 1:42 PM: -- Interesting find while running (main branch) on Linux. Passing multiple args in --jvm-opts as " -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:18983 -Dsolr.myprops.custom=hello" doesn't work. Regression? Note that this now works on Windows with this fix. Might look into making this work for Linux when I get a chance. > Passing additional arguments to solr.cmd using "-a" does not work on Windows > > > Key: SOLR-7962 > URL: https://issues.apache.org/jira/browse/SOLR-7962 > Project: Solr > Issue Type: Bug >Affects Versions: 5.3 >Reporter: Dawid Weiss >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h >
[jira] [Commented] (SOLR-7962) Passing additional arguments to solr.cmd using "-a" does not work on Windows
[ https://issues.apache.org/jira/browse/SOLR-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17951086#comment-17951086 ] Rahul Goswami commented on SOLR-7962: - Interesting find while running (main branch) on Linux. Passing multiple args in --jvm-opts as " -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:18983 -Dsolr.myprops.custom=hello" doesn't work. Regression? Note that this now works on Windows with this fix. Might look into making this work for Linux when I get a chance. > Passing additional arguments to solr.cmd using "-a" does not work on Windows > > > Key: SOLR-7962 > URL: https://issues.apache.org/jira/browse/SOLR-7962 > Project: Solr > Issue Type: Bug >Affects Versions: 5.3 >Reporter: Dawid Weiss >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h >
[jira] [Comment Edited] (SOLR-7962) Passing additional arguments to solr.cmd using "-a" does not work on Windows
[ https://issues.apache.org/jira/browse/SOLR-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17951086#comment-17951086 ] Rahul Goswami edited comment on SOLR-7962 at 5/14/25 1:55 PM: -- Interesting find while running (main branch) on Linux. Passing multiple args in --jvm-opts as " -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:18983 -Dsolr.myprops.custom=hello" doesn't work. Maybe a regression introduced at some point in the past? Note that this now works on Windows with this fix. Might look into making this work for Linux when I get a chance. > Passing additional arguments to solr.cmd using "-a" does not work on Windows > > > Key: SOLR-7962 > URL: https://issues.apache.org/jira/browse/SOLR-7962 > Project: Solr > Issue Type: Bug >Affects Versions: 5.3 >Reporter: Dawid Weiss >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h >
[jira] [Created] (SOLR-17772) Tests for examples failing on Windows
Rahul Goswami created SOLR-17772: Summary: Tests for examples failing on Windows Key: SOLR-17772 URL: https://issues.apache.org/jira/browse/SOLR-17772 Project: Solr Issue Type: Bug Components: cli Reporter: Rahul Goswami This change only impacts _*tests*_ on Windows. After the fix for jvm-opts, command-line execution runs fine. The start flow via solr.cmd passes a "--script" parameter (which our tests don't) and uses a different executor inside RunExampleTool from what the tests use (RunExampleExecutor). Prior to the recently merged fix for jvm-opts, for these reasons, the tests on Windows would also try to prepare a command line with bin/solr (instead of bin/solr.cmd). Hence those tests would pass by getting into the "if" block in this PR, although in an unintended way.
[jira] [Commented] (SOLR-17746) bin/solr always fails if you attempt to use --jettyconfig (aka "-j")
[ https://issues.apache.org/jira/browse/SOLR-17746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17986145#comment-17986145 ] Rahul Goswami commented on SOLR-17746: -- [~hossman] FWIW passing multiple space-separated args in --jvm-opts as shown below does work on Windows post the fix in https://issues.apache.org/jira/browse/SOLR-7962 --jvm-opts " -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:18983 -Dsolr.myprops.custom=hello" I remember it not working on Linux since the parsing in SolrCLI is different, but might need to check again. > bin/solr always fails if you attempt to use --jettyconfig (aka "-j") > > > Key: SOLR-17746 > URL: https://issues.apache.org/jira/browse/SOLR-17746 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > The "-jettyconfig" option (aka "-j") is documented as... > {noformat} > -j Additional parameters to pass to Jetty when starting > Solr. > For example, to add a configuration folder that jetty > should read > you could pass: -j > "--include-jetty-dir=/etc/jetty/custom/server/" > In most cases, you should wrap the additional > parameters in double quotes. > {noformat} > ...but if you actually attempt to use that example option, you will get > an error... > {noformat} > ./bin/solr start ... -j "--include-jetty-dir=/etc/jetty/custom/server/" > ERROR: Jetty config is required when using the -j option! > {noformat} > IIUC this is because the bash code for parsing this option requires that it > not start with a "{{\-}}" character; but by definition any option you want to > pass to jetty will start with "{{\--}}". > Attempting to work around this problem by using two sets of quotes doesn't > seem to work -- the inner quotes are passed verbatim to jetty which seems to > prevent jetty from recognizing it as a valid option. 
> A workaround that *does* seem to work (in my limited testing) is to include a > leading space character _inside_ the quotes... > {noformat} > ./bin/solr start ... -j " --include-jetty-dir=/etc/jetty/custom/server/" > {noformat} > ...because for some reason that does *NOT* seem to be passed verbatim.
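The diagnosis quoted above (the bash launcher rejects any -j value that starts with "-", while a leading space sneaks past the check) can be sketched with a hypothetical stand-in for that check; parse_jetty_config is illustrative, not the actual bin/solr code:

```shell
# Hypothetical stand-in for the value check the report describes: any value
# beginning with '-' is treated as a missing jetty config.
parse_jetty_config() {
  case "$1" in
    -*) echo "ERROR: Jetty config is required when using the -j option!" ;;
    *)  echo "jetty config: $1" ;;
  esac
}

parse_jetty_config "--include-jetty-dir=/etc/jetty/custom/server/"   # rejected: starts with '-'
parse_jetty_config " --include-jetty-dir=/etc/jetty/custom/server/"  # accepted: leading space defeats the check
```

This also explains why the leading-space workaround behaves differently: the value no longer begins with "-", so the guard never fires.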
[jira] [Comment Edited] (SOLR-17746) bin/solr always fails if you attempt to use --jettyconfig (aka "-j")
[ https://issues.apache.org/jira/browse/SOLR-17746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17986145#comment-17986145 ] Rahul Goswami edited comment on SOLR-17746 at 6/25/25 1:57 PM: --- [~hossman] FWIW passing multiple space-separated args in --jvm-opts as shown below **does** work on Windows post the fix in https://issues.apache.org/jira/browse/SOLR-7962 --jvm-opts " -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:18983 -Dsolr.myprops.custom=hello" I remember it not working on Linux since the parsing in SolrCLI is different, but might need to check again. was (Author: rahul196...@gmail.com): [~hossman] FWIW passing multiple space-separated args in --jvm-opts as shown below does work on Windows post the fix in https://issues.apache.org/jira/browse/SOLR-7962 --jvm-opts " -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:18983 -Dsolr.myprops.custom=hello" I remember it not working on Linux since the parsing in SolrCLI is different, but might need to check again. > bin/solr always fails if you attempt to use --jettyconfig (aka "-j") > > > Key: SOLR-17746 > URL: https://issues.apache.org/jira/browse/SOLR-17746 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > The "-jettyconfig" option (aka "-j") is documented as... > {noformat} > -j Additional parameters to pass to Jetty when starting > Solr. > For example, to add a configuration folder that jetty > should read > you could pass: -j > "--include-jetty-dir=/etc/jetty/custom/server/" > In most cases, you should wrap the additional > parameters in double quotes. > {noformat} > ...but if you actually attempt to use that example option, you will get > an error... > {noformat} > ./bin/solr start ... -j "--include-jetty-dir=/etc/jetty/custom/server/" > ERROR: Jetty config is required when using the -j option! 
> {noformat} > IIUC this is because the bash code for parsing this option requires that it > not start with a "{{\-}}" character; but by definition any option you want to > pass to jetty will start with "{{\--}}". > Attempting to work around this problem by using two sets of quotes doesn't > seem to work -- the inner quotes are passed verbatim to jetty which seems to > prevent jetty from recognizing it as a valid option. > A workaround that *does* seem to work (in my limited testing) is to include a > leading space character _inside_ the quotes... > {noformat} > ./bin/solr start ... -j " --include-jetty-dir=/etc/jetty/custom/server/" > {noformat} > ...because for some reason that does *NOT* seem to be passed verbatim.
[jira] [Commented] (SOLR-17772) Tests for examples failing on Windows
[ https://issues.apache.org/jira/browse/SOLR-17772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17966778#comment-17966778 ] Rahul Goswami commented on SOLR-17772: -- [~dsmiley] Yes, this can be marked Resolved for 10. > Tests for examples failing on Windows > - > > Key: SOLR-17772 > URL: https://issues.apache.org/jira/browse/SOLR-17772 > Project: Solr > Issue Type: Bug > Components: cli >Reporter: Rahul Goswami >Priority: Minor > Labels: pull-request-available, windows > Time Spent: 10m > Remaining Estimate: 0h > > This change only impacts _*tests*_ on Windows. After the fix for jvm-opts, > command-line execution runs fine. > The start flow via solr.cmd passes a "--script" parameter (which our tests > don't) and uses a different executor inside RunExampleTool from what the > tests use (RunExampleExecutor). Prior to the recently merged fix for jvm-opts, > for these reasons, the tests on Windows would also try to prepare a > command line with bin/solr (instead of bin/solr.cmd). Hence those tests would > pass by getting into the "if" block in this PR, although in an unintended way.
[jira] [Commented] (SOLR-17813) Add support for SeededKnnVectorQuery
[ https://issues.apache.org/jira/browse/SOLR-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18009044#comment-18009044 ] Rahul Goswami commented on SOLR-17813: -- I am working on this. Thanks for the initial draft [~cpoerschke]. I have built a good understanding of HNSW and am poking around the current Solr KnnQParser and Lucene SeededKnnVectorQuery to continue this effort. > Add support for SeededKnnVectorQuery > > > Key: SOLR-17813 > URL: https://issues.apache.org/jira/browse/SOLR-17813 > Project: Solr > Issue Type: New Feature > Components: vector-search >Reporter: Alessandro Benedetti >Priority: Major > > Apache Lucene implemented a version of knn vector query that provides a query > seed to initiate the vector search (entry points in the HNSW graph > exploration). > See "Lexically-Accelerated Dense Retrieval" (Hrishikesh Kulkarni, Sean > MacAvaney, Nazli Goharian, Ophir Frieder). > From SIGIR '23: https://arxiv.org/abs/2307.16779 > With this task, we aim to add to Solr this new query, probably as an > additional parameter of the current KNN query parser. > The only relevant parameter is the query seed. > The Weight seedWeight is added when rewriting the query, so no special > care should be needed there (see > org.apache.lucene.search.SeededKnnVectorQuery#rewrite and > org.apache.lucene.search.SeededKnnVectorQuery#createSeedWeight)
[jira] [Comment Edited] (SOLR-17813) Add support for SeededKnnVectorQuery
[ https://issues.apache.org/jira/browse/SOLR-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18009044#comment-18009044 ] Rahul Goswami edited comment on SOLR-17813 at 7/22/25 4:43 PM: --- I am working on this. Thanks for the initial draft [~cpoerschke]. I have built a good understanding of HNSW and LADR, and am poking around the current Solr KnnQParser and Lucene SeededKnnVectorQuery to continue this effort. was (Author: rahul196...@gmail.com): I am working on this. Thanks for the initial draft [~cpoerschke]. I have built a good understanding of HNSW and am poking around the current Solr KnnQParser and Lucene SeededKnnVectorQuery to continue this effort. > Add support for SeededKnnVectorQuery > > > Key: SOLR-17813 > URL: https://issues.apache.org/jira/browse/SOLR-17813 > Project: Solr > Issue Type: New Feature > Components: vector-search >Reporter: Alessandro Benedetti >Priority: Major > > Apache Lucene implemented a version of knn vector query that provides a query > seed to initiate the vector search (entry points in the HNSW graph > exploration). > See "Lexically-Accelerated Dense Retrieval" (Hrishikesh Kulkarni, Sean > MacAvaney, Nazli Goharian, Ophir Frieder). > From SIGIR '23: https://arxiv.org/abs/2307.16779 > With this task, we aim to add to Solr this new query, probably as an > additional parameter of the current KNN query parser. > The only relevant parameter is the query seed. > The Weight seedWeight is added when rewriting the query, so no special > care should be needed there (see > org.apache.lucene.search.SeededKnnVectorQuery#rewrite and > org.apache.lucene.search.SeededKnnVectorQuery#createSeedWeight)
[jira] [Commented] (SOLR-17813) Add support for SeededKnnVectorQuery
[ https://issues.apache.org/jira/browse/SOLR-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18010082#comment-18010082 ] Rahul Goswami commented on SOLR-17813: -- Waiting for Solr main to upgrade to Lucene 10.2.x, where support for SeededKnnVectorQuery was introduced. > Add support for SeededKnnVectorQuery > > > Key: SOLR-17813 > URL: https://issues.apache.org/jira/browse/SOLR-17813 > Project: Solr > Issue Type: New Feature > Components: vector-search >Reporter: Alessandro Benedetti >Priority: Major > > Apache Lucene implemented a version of knn vector query that provides a query > seed to initiate the vector search (entry points in the HNSW graph > exploration). > See "Lexically-Accelerated Dense Retrieval" (Hrishikesh Kulkarni, Sean > MacAvaney, Nazli Goharian, Ophir Frieder). > From SIGIR '23: https://arxiv.org/abs/2307.16779 > With this task, we aim to add to Solr this new query, probably as an > additional parameter of the current KNN query parser. > The only relevant parameter is the query seed. > The Weight seedWeight is added when rewriting the query, so no special > care should be needed there (see > org.apache.lucene.search.SeededKnnVectorQuery#rewrite and > org.apache.lucene.search.SeededKnnVectorQuery#createSeedWeight)
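For context, the shape of today's (unseeded) Solr knn query parser string is sketched below; the field name, topK, and vector values are illustrative, and the seeded variant proposed in this issue would presumably add some form of seed-query parameter (name and shape TBD):

```shell
# Build today's (unseeded) knn query string; "vector" and the values are
# illustrative, not from this issue.
knn_query='{!knn f=vector topK=10}[1.0,2.0,3.0,4.0]'
printf '%s\n' "$knn_query"
# Against a live Solr instance it would be sent as, e.g.:
#   curl "http://localhost:8983/solr/my_vectors/select" --data-urlencode "q=$knn_query"
```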