It might be worth looking at this issue: https://issues.apache.org/jira/browse/SOLR-16989
The irony is that this issue was supposed to help with slowness in cases similar to what you describe. Can you send a full stack trace for a representative call to `DocValuesIteratorCache.newEntry(String)`?

On Tue, Feb 20, 2024 at 11:36 PM Rahul Goswami <rahul196...@gmail.com> wrote:
>
> Did you happen to change the DirectoryFactory in solrconfig to
> SimpleFSDirectoryFactory or NIOFSDirectoryFactory by any chance? Default is
> Mmap which is much more performant for atomic updates (and also practical,
> especially given the small(ish) size of your index).
>
> -Rahul
>
> On Tue, Feb 20, 2024 at 4:52 AM Christine Poerschke (BLOOMBERG/ LONDON) <
> cpoersc...@bloomberg.net> wrote:
>
> > Hello Calvin,
> >
> > Thank you for this wonderful issue write-up!
> >
> > You mention upgrading from Solr 6 to 9.5 versions and I wonder if it might
> > be practical or insightful to assess for some versions in between too e.g.
> > 9.5/9.4/9.3 going backwards or 6/7/8/9.0 going forward or some sort of
> > binary search variant.
> >
> > Best wishes,
> > Christine
> >
> > From: users@solr.apache.org
> > At: 02/16/24 17:59:59 UTC
> > To: solr-u...@lucene.apache.org
> > Subject: Partial update slowness with a stored="false" dynamic field and
> > lots of distinct field names
> >
> > Hi solr users,
> >
> > While tracking down a severe performance regression doing partial updates
> > when upgrading from Solr 6 to Solr 9.5.0, I discovered the following
> > unexpected behavior.
> >
> > In my schema.xml file, I have the following fields (among many others):
> >
> > <field name="id" type="string" indexed="true" stored="true" required="true"/>
> > <field name="_version_" type="long" indexed="false" stored="false"/>
> > <field name="name" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
> >
> > <dynamicField name="playlist_index_*" type="int" .../>
> > <!-- The int field type has docValues="true" -->
> >
> > The unexpected impact on partial update performance depends on whether the
> > dynamic field above is stored (with `docValues="true"`). The index in
> > question contains about 30 million documents and is 15GB in size, and there
> > are a large number of distinct `playlist_index_*` field names (more than
> > 100K). The purpose of that dynamic field is to support sorting, with
> > queries sometimes specifying a sort like `playlist_index_3141 asc` to sort
> > the results by that field.
> >
> > For all the cases below, I added to the existing index the following 25 new
> > documents:
> >
> > [
> > {"id": "12345678901234567891", "name": "."},
> > {"id": "12345678901234567892", "name": "."},
> > {"id": "12345678901234567893", "name": "."},
> > {"id": "12345678901234567894", "name": "."},
> > {"id": "12345678901234567895", "name": "."},
> > {"id": "12345678901234567896", "name": "."},
> > {"id": "12345678901234567897", "name": "."},
> > {"id": "12345678901234567898", "name": "."},
> > {"id": "12345678901234567899", "name": "."},
> > {"id": "12345678901234567900", "name": "."},
> > {"id": "12345678901234567901", "name": "."},
> > {"id": "12345678901234567902", "name": "."},
> > {"id": "12345678901234567903", "name": "."},
> > {"id": "12345678901234567904", "name": "."},
> > {"id": "12345678901234567995", "name": "."},
> > {"id": "12345678901234567996", "name": "."},
> > {"id": "12345678901234567997", "name": "."},
> > {"id": "12345678901234567998", "name": "."},
> > {"id": "12345678901234567999", "name": "."},
> > {"id": "12345678901234568000", "name": "."},
> > {"id": "12345678901234568001", "name": "."},
> > {"id": "12345678901234568002", "name": "."},
> > {"id": "12345678901234568003", "name": "."},
> > {"id": "12345678901234568004", "name": "."},
> > {"id": "12345678901234568005", "name": "."}
> > ]
> >
> > and did both a soft and hard commit before continuing.
> >
> > Then I did a partial update with `commitWithin=600000` to update each of
> > those documents:
> >
> > [
> > {"id": "12345678901234567891", "name": {"set": "1"}},
> > {"id": "12345678901234567892", "name": {"set": "2"}},
> > {"id": "12345678901234567893", "name": {"set": "3"}},
> > {"id": "12345678901234567894", "name": {"set": "4"}},
> > {"id": "12345678901234567895", "name": {"set": "5"}},
> > {"id": "12345678901234567896", "name": {"set": "6"}},
> > {"id": "12345678901234567897", "name": {"set": "7"}},
> > {"id": "12345678901234567898", "name": {"set": "8"}},
> > {"id": "12345678901234567899", "name": {"set": "9"}},
> > {"id": "12345678901234567900", "name": {"set": "10"}},
> > {"id": "12345678901234567901", "name": {"set": "11"}},
> > {"id": "12345678901234567902", "name": {"set": "12"}},
> > {"id": "12345678901234567903", "name": {"set": "13"}},
> > {"id": "12345678901234567904", "name": {"set": "14"}},
> > {"id": "12345678901234567995", "name": {"set": "15"}},
> > {"id": "12345678901234567996", "name": {"set": "16"}},
> > {"id": "12345678901234567997", "name": {"set": "17"}},
> > {"id": "12345678901234567998", "name": {"set": "18"}},
> > {"id": "12345678901234567999", "name": {"set": "19"}},
> > {"id": "12345678901234568000", "name": {"set": "20"}},
> > {"id": "12345678901234568001", "name": {"set": "21"}},
> > {"id": "12345678901234568002", "name": {"set": "22"}},
> > {"id": "12345678901234568003", "name": {"set": "23"}},
> > {"id": "12345678901234568004", "name": {"set": "24"}},
> > {"id": "12345678901234568005", "name": {"set": "25"}}
> > ]
> >
> > The time it takes to perform the update varies drastically depending on
> > whether the `playlist_index_*` dynamic field is stored:
> >
> > - 0.017s: <dynamicField name="playlist_index_*" type="int" indexed="true" stored="true" docValues="true"/>
> > - 0.016s: <dynamicField name="playlist_index_*" type="int" indexed="false" stored="true" docValues="true"/>
> > - 8.850s: <dynamicField name="playlist_index_*" type="int" indexed="false" stored="false" docValues="true"/>
> > - 8.867s: <dynamicField name="playlist_index_*" type="int" indexed="true" stored="false" docValues="true"/>
> >
> > The surprise is the poor performance of the last two, when stored is false
> > and so the partial update may need to use the docValues to generate the new
> > doc, compared to the first two when stored is true.
> >
> > When I profiled the code for the last of the four settings above, I saw
> > that the majority of the time was spent in
> >
> > `org.apache.solr.search.SolrDocumentFetcher.decorateDocValueFields(SolrDocumentBase, int, Set, DocValuesIteratorCache)`
> >
> > and further down in the callees under that hot spot there is a call to
> > `org.apache.solr.search.DocValuesIteratorCache.newEntry(String)` that I
> > added some print statements to in order to see which fields `newEntry` was
> > being called for. There were many thousands of calls for the
> > `playlist_index_*` fields, with each call being a different field name, and
> > no other fields.
> >
> > My schema does have other dynamic fields, but none of them resulted in
> > calls to `newEntry`. What is distinct about this dynamic field is that it's
> > the only one that is a numeric field type and the only one that has more
> > than a few hundred or so distinct field names.
> >
> > Does this seem like expected behavior that Solr has to make so many
> > `newEntry` calls for that dynamic field to perform the update that it
> > seriously impacts update performance?
> >
> > Thanks for your time,
> > Calvin
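
For anyone who wants to repeat the measurement across versions or schema settings, the steps quoted above (index 25 documents, commit, then send atomic `set` updates with `commitWithin=600000`) could be scripted roughly as follows. This is only a sketch: the host, port and core name in `SOLR_UPDATE_URL` are assumptions, the helper names are made up for illustration, and a single `commit=true` stands in for the separate soft and hard commits in the original write-up.

```python
import json
import time
import urllib.request

# Hypothetical endpoint: host, port, and core name are assumptions.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"


def build_initial_docs(ids):
    """Full documents with a placeholder name, as in the quoted repro."""
    return [{"id": doc_id, "name": "."} for doc_id in ids]


def build_atomic_updates(ids):
    """Atomic 'set' updates; names are "1", "2", ... in input order."""
    return [
        {"id": doc_id, "name": {"set": str(n)}}
        for n, doc_id in enumerate(ids, start=1)
    ]


def post_json(url, payload):
    """POST a JSON payload and return (HTTP status, elapsed seconds)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        response.read()
        status = response.status
    return status, time.perf_counter() - start


def run_repro(ids, base_url=SOLR_UPDATE_URL):
    """Index the docs with a commit, then time the partial update."""
    post_json(base_url + "?commit=true", build_initial_docs(ids))
    _, elapsed = post_json(
        base_url + "?commitWithin=600000", build_atomic_updates(ids)
    )
    return elapsed
```

Calling `run_repro([...])` with the 25 ids from the message, once per `playlist_index_*` definition (or once per Solr version when bisecting), should give timings comparable to the four listed above, assuming the same index is in place.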