Did you happen to change the DirectoryFactory in solrconfig.xml to SimpleFSDirectoryFactory or NIOFSDirectoryFactory by any chance? The default is Mmap, which is much more performant for atomic updates (and also practical, especially given the small(ish) size of your index).
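For reference, a rough sketch of the element to look for. This assumes the stock solrconfig.xml that ships with Solr; the `solr.directoryFactory` property name and the NRTCachingDirectoryFactory default come from that stock config, not from your message, so your file may differ. On a 64-bit JVM that default wraps StandardDirectoryFactory, which should end up on MMapDirectory.

```xml
<!-- Stock default (assumption: unmodified default configset).
     NRTCachingDirectoryFactory wraps StandardDirectoryFactory. -->
<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>

<!-- The kind of explicit override I am asking about (illustrative only):
<directoryFactory name="DirectoryFactory" class="solr.NIOFSDirectoryFactory"/>
-->
```

If the element looks like the first form, it is also worth checking whether `-Dsolr.directoryFactory=...` is being passed at startup, since that would override the default class.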
-Rahul

On Tue, Feb 20, 2024 at 4:52 AM Christine Poerschke (BLOOMBERG/ LONDON) <cpoersc...@bloomberg.net> wrote:

> Hello Calvin,
>
> Thank you for this wonderful issue write-up!
>
> You mention upgrading from Solr 6 to 9.5 versions and I wonder if it might
> be practical or insightful to assess for some versions in between too e.g.
> 9.5/9.4/9.3 going backwards or 6/7/8/9.0 going forward or some sort of
> binary search variant.
>
> Best wishes,
> Christine
>
> From: users@solr.apache.org At: 02/16/24 17:59:59 UTC
> To: solr-u...@lucene.apache.org
> Subject: Partial update slowness with a stored="false" dynamic field and
> lots of distinct field names
>
> Hi solr users,
>
> While tracking down a severe performance regression doing partial updates
> when upgrading from Solr 6 to solr 9.5.0, I discovered the following
> unexpected behavior.
>
> In my schema.xml file, I have the following fields (among many others):
>
> <field name="id" type="string" indexed="true" stored="true" required="true"/>
> <field name="_version_" type="long" indexed="false" stored="false"/>
> <field name="name" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
>
> <dynamicField name="playlist_index_*" type="int" .../> <!-- The int field type has docValues="true" -->
>
> The unexpected impact on partial update performance depends on whether the
> dynamic field above is stored (with `docValues="true"`). The index in
> question contains about 30 million documents and is 15GB in size, and there
> are a large number of distinct `playlist_index_*` field names (more than
> 100K). The purpose of that dynamic field is to support sorting, with
> queries sometimes specifying a sort like `playlist_index_3141 asc` to sort
> the results by that field.
>
> For all the cases below, I added to the existing index the following 25 new
> documents:
>
> [
> {"id": "12345678901234567891", "name": "."},
> {"id": "12345678901234567892", "name": "."},
> {"id": "12345678901234567893", "name": "."},
> {"id": "12345678901234567894", "name": "."},
> {"id": "12345678901234567895", "name": "."},
> {"id": "12345678901234567896", "name": "."},
> {"id": "12345678901234567897", "name": "."},
> {"id": "12345678901234567898", "name": "."},
> {"id": "12345678901234567899", "name": "."},
> {"id": "12345678901234567900", "name": "."},
> {"id": "12345678901234567901", "name": "."},
> {"id": "12345678901234567902", "name": "."},
> {"id": "12345678901234567903", "name": "."},
> {"id": "12345678901234567904", "name": "."},
> {"id": "12345678901234567995", "name": "."},
> {"id": "12345678901234567996", "name": "."},
> {"id": "12345678901234567997", "name": "."},
> {"id": "12345678901234567998", "name": "."},
> {"id": "12345678901234567999", "name": "."},
> {"id": "12345678901234568000", "name": "."},
> {"id": "12345678901234568001", "name": "."},
> {"id": "12345678901234568002", "name": "."},
> {"id": "12345678901234568003", "name": "."},
> {"id": "12345678901234568004", "name": "."},
> {"id": "12345678901234568005", "name": "."}
> ]
>
> and did both a soft and hard commit before continuing.
>
> Then I did a partial update with `commitWithin=600000` to update each of
> those documents:
>
> [
> {"id": "12345678901234567891", "name": {"set": "1"}},
> {"id": "12345678901234567892", "name": {"set": "2"}},
> {"id": "12345678901234567893", "name": {"set": "3"}},
> {"id": "12345678901234567894", "name": {"set": "4"}},
> {"id": "12345678901234567895", "name": {"set": "5"}},
> {"id": "12345678901234567896", "name": {"set": "6"}},
> {"id": "12345678901234567897", "name": {"set": "7"}},
> {"id": "12345678901234567898", "name": {"set": "8"}},
> {"id": "12345678901234567899", "name": {"set": "9"}},
> {"id": "12345678901234567900", "name": {"set": "10"}},
> {"id": "12345678901234567901", "name": {"set": "11"}},
> {"id": "12345678901234567902", "name": {"set": "12"}},
> {"id": "12345678901234567903", "name": {"set": "13"}},
> {"id": "12345678901234567904", "name": {"set": "14"}},
> {"id": "12345678901234567995", "name": {"set": "15"}},
> {"id": "12345678901234567996", "name": {"set": "16"}},
> {"id": "12345678901234567997", "name": {"set": "17"}},
> {"id": "12345678901234567998", "name": {"set": "18"}},
> {"id": "12345678901234567999", "name": {"set": "19"}},
> {"id": "12345678901234568000", "name": {"set": "20"}},
> {"id": "12345678901234568001", "name": {"set": "21"}},
> {"id": "12345678901234568002", "name": {"set": "22"}},
> {"id": "12345678901234568003", "name": {"set": "23"}},
> {"id": "12345678901234568004", "name": {"set": "24"}},
> {"id": "12345678901234568005", "name": {"set": "25"}}
> ]
>
> The time it takes to perform the update varies drastically depending on
> whether the `playlist_index_*` dynamic field is stored:
>
> - 0.017s: <dynamicField name="playlist_index_*" type="int" indexed="true" stored="true" docValues="true"/>
> - 0.016s: <dynamicField name="playlist_index_*" type="int" indexed="false" stored="true" docValues="true"/>
> - 8.850s: <dynamicField name="playlist_index_*" type="int" indexed="false" stored="false" docValues="true"/>
> - 8.867s: <dynamicField name="playlist_index_*" type="int" indexed="true" stored="false" docValues="true"/>
>
> The surprise is the poor performance of the last two, when stored is false
> and so the partial update may need to use the docValues to generate the new
> doc, compared to the first two when stored is true.
>
> When I profiled the code for the last of the four settings above, I saw
> that the majority of the time was spent in
> `org.apache.solr.search.SolrDocumentFetcher.decorateDocValueFields(SolrDocumentBase, int, Set, DocValuesIteratorCache)`
> and further down in the callees under
> that hot spot there is a call to
> `org.apache.solr.search.DocValuesIteratorCache.newEntry(String)` that I
> added some print statements to in order to see which fields `newEntry` was
> being called for. There were many thousands of calls to the
> `playlist_index_*` fields, with each call being a different field name, and
> no other fields.
>
> My schema does have other dynamic fields, but none of them resulted in
> calls to `newEntry`. What is distinct about this dynamic field is that it's
> the only one that is a numeric field type and the only one that has more
> than a few hundred or so distinct field names.
>
> Does this seem like expected behavior that solr has to make so many
> `newEntry` calls for that dynamic field to perform the update that it
> seriously impacts update performance?
>
> Thanks for your time,
> Calvin