Also, you don't mention it explicitly, but just to be sure: in your Solr 6 deployment, were the `playlist_index_*` fields stored? Some performance difference between stored=true and stored=false is expected, because retrieving dynamic fields with many distinct field names is definitely problematic when they are docValues-only. SOLR-16989 was in fact designed to mitigate this to some extent, but with 100k separate fields this is definitely going to be a problem, and I suspect it would have been _more_ of a problem before SOLR-16989 ... but I don't want to jump to conclusions!
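
To illustrate why the number of distinct field names matters for docValues-only retrieval, here is a toy model in plain Python. It is not Solr or Lucene code, and every name in it (`make_index`, `fetch_stored`, `fetch_from_columns`) is made up for illustration. It only assumes the general shape of the two layouts: stored fields keep one record per document, while docValues keep one column per field name, so rebuilding a document from columns means probing each candidate column.

```python
# Toy model only: NOT Solr or Lucene code. It contrasts a row-oriented
# ("stored") layout with a column-oriented ("docValues") layout when
# fetching all fields of a single document.


def make_index(num_fields, doc_id, doc_fields):
    """Build both layouts for an index that knows `num_fields` field names."""
    # Row layout: one record per document, listing only the fields it has.
    stored = {doc_id: dict(doc_fields)}
    # Column layout: one column per field name, mapping doc id -> value.
    columns = {f"playlist_index_{i}": {} for i in range(num_fields)}
    for name, value in doc_fields.items():
        columns.setdefault(name, {})[doc_id] = value
    return stored, columns


def fetch_stored(stored, doc_id):
    """A single record lookup returns every field of the document."""
    return dict(stored[doc_id]), 1


def fetch_from_columns(columns, doc_id):
    """Every column is probed, because any of them might hold a value."""
    doc, probes = {}, 0
    for name, column in columns.items():
        probes += 1
        if doc_id in column:
            doc[name] = column[doc_id]
    return doc, probes


if __name__ == "__main__":
    stored, columns = make_index(100_000, "doc1", {"playlist_index_3141": 7})
    print("stored lookups:", fetch_stored(stored, "doc1")[1])
    print("column probes:", fetch_from_columns(columns, "doc1")[1])
```

In this model the per-document cost of the column path grows with the number of field names in the index, not with the number of fields the document actually has, which is the general pattern I'd expect to matter here.
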
On Wed, Feb 21, 2024 at 12:43 PM Michael Gibney <mich...@michaelgibney.net> wrote:
>
> It might be worth looking at this issue:
> https://issues.apache.org/jira/browse/SOLR-16989
>
> The irony is that this issue was supposed to help with slowness in
> cases similar to what you describe. Can you send a full stack trace
> for a representative call to
> `DocValuesIteratorCache.newEntry(String)`?
>
> On Tue, Feb 20, 2024 at 11:36 PM Rahul Goswami <rahul196...@gmail.com> wrote:
> >
> > Did you happen to change the DirectoryFactory in solrconfig to
> > SimpleFSDirectoryFactory or NIOFSDirectoryFactory by any chance? Default is
> > Mmap which is much more performant for atomic updates (and also practical,
> > especially given the small(ish) size of your index).
> >
> > -Rahul
> >
> > On Tue, Feb 20, 2024 at 4:52 AM Christine Poerschke (BLOOMBERG/ LONDON) <
> > cpoersc...@bloomberg.net> wrote:
> >
> > > Hello Calvin,
> > >
> > > Thank you for this wonderful issue write-up!
> > >
> > > You mention upgrading from Solr 6 to 9.5 versions and I wonder if it might
> > > be practical or insightful to assess for some versions in between too e.g.
> > > 9.5/9.4/9.3 going backwards or 6/7/8/9.0 going forward or some sort of
> > > binary search variant.
> > >
> > > Best wishes,
> > > Christine
> > >
> > > From: users@solr.apache.org At: 02/16/24 17:59:59 UTC To:
> > > solr-u...@lucene.apache.org
> > > Subject: Partial update slowness with a stored="false" dynamic field and
> > > lots of distinct field names
> > >
> > > Hi solr users,
> > >
> > > While tracking down a severe performance regression doing partial updates
> > > when upgrading from Solr 6 to Solr 9.5.0, I discovered the following
> > > unexpected behavior.
> > >
> > > In my schema.xml file, I have the following fields (among many others):
> > >
> > > <field name="id" type="string" indexed="true" stored="true" required="true"/>
> > > <field name="_version_" type="long" indexed="false" stored="false"/>
> > > <field name="name" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
> > >
> > > <dynamicField name="playlist_index_*" type="int" .../> <!-- The int field type has docValues="true" -->
> > >
> > > The unexpected impact on partial update performance depends on whether the
> > > dynamic field above is stored (with `docValues="true"`). The index in
> > > question contains about 30 million documents and is 15GB in size, and there
> > > are a large number of distinct `playlist_index_*` field names (more than
> > > 100K). The purpose of that dynamic field is to support sorting, with
> > > queries sometimes specifying a sort like `playlist_index_3141 asc` to sort
> > > the results by that field.
> > >
> > > For all the cases below, I added to the existing index the following 25 new
> > > documents:
> > >
> > > [
> > > {"id": "12345678901234567891", "name": "."},
> > > {"id": "12345678901234567892", "name": "."},
> > > {"id": "12345678901234567893", "name": "."},
> > > {"id": "12345678901234567894", "name": "."},
> > > {"id": "12345678901234567895", "name": "."},
> > > {"id": "12345678901234567896", "name": "."},
> > > {"id": "12345678901234567897", "name": "."},
> > > {"id": "12345678901234567898", "name": "."},
> > > {"id": "12345678901234567899", "name": "."},
> > > {"id": "12345678901234567900", "name": "."},
> > > {"id": "12345678901234567901", "name": "."},
> > > {"id": "12345678901234567902", "name": "."},
> > > {"id": "12345678901234567903", "name": "."},
> > > {"id": "12345678901234567904", "name": "."},
> > > {"id": "12345678901234567995", "name": "."},
> > > {"id": "12345678901234567996", "name": "."},
> > > {"id": "12345678901234567997", "name": "."},
> > > {"id": "12345678901234567998", "name": "."},
> > > {"id": "12345678901234567999", "name": "."},
> > > {"id": "12345678901234568000", "name": "."},
> > > {"id": "12345678901234568001", "name": "."},
> > > {"id": "12345678901234568002", "name": "."},
> > > {"id": "12345678901234568003", "name": "."},
> > > {"id": "12345678901234568004", "name": "."},
> > > {"id": "12345678901234568005", "name": "."}
> > > ]
> > >
> > > and did both a soft and hard commit before continuing.
> > >
> > > Then I did a partial update with `commitWithin=600000` to update each of
> > > those documents:
> > >
> > > [
> > > {"id": "12345678901234567891", "name": {"set": "1"}},
> > > {"id": "12345678901234567892", "name": {"set": "2"}},
> > > {"id": "12345678901234567893", "name": {"set": "3"}},
> > > {"id": "12345678901234567894", "name": {"set": "4"}},
> > > {"id": "12345678901234567895", "name": {"set": "5"}},
> > > {"id": "12345678901234567896", "name": {"set": "6"}},
> > > {"id": "12345678901234567897", "name": {"set": "7"}},
> > > {"id": "12345678901234567898", "name": {"set": "8"}},
> > > {"id": "12345678901234567899", "name": {"set": "9"}},
> > > {"id": "12345678901234567900", "name": {"set": "10"}},
> > > {"id": "12345678901234567901", "name": {"set": "11"}},
> > > {"id": "12345678901234567902", "name": {"set": "12"}},
> > > {"id": "12345678901234567903", "name": {"set": "13"}},
> > > {"id": "12345678901234567904", "name": {"set": "14"}},
> > > {"id": "12345678901234567995", "name": {"set": "15"}},
> > > {"id": "12345678901234567996", "name": {"set": "16"}},
> > > {"id": "12345678901234567997", "name": {"set": "17"}},
> > > {"id": "12345678901234567998", "name": {"set": "18"}},
> > > {"id": "12345678901234567999", "name": {"set": "19"}},
> > > {"id": "12345678901234568000", "name": {"set": "20"}},
> > > {"id": "12345678901234568001", "name": {"set": "21"}},
> > > {"id": "12345678901234568002", "name": {"set": "22"}},
> > > {"id": "12345678901234568003", "name": {"set": "23"}},
> > > {"id": "12345678901234568004", "name": {"set": "24"}},
> > > {"id": "12345678901234568005", "name": {"set": "25"}}
> > > ]
> > >
> > > The time it takes to perform the update varies drastically depending on
> > > whether the `playlist_index_*` dynamic field is stored:
> > >
> > > - 0.017s: <dynamicField name="playlist_index_*" type="int" indexed="true" stored="true" docValues="true"/>
> > > - 0.016s: <dynamicField name="playlist_index_*" type="int" indexed="false" stored="true" docValues="true"/>
> > > - 8.850s: <dynamicField name="playlist_index_*" type="int" indexed="false" stored="false" docValues="true"/>
> > > - 8.867s: <dynamicField name="playlist_index_*" type="int" indexed="true" stored="false" docValues="true"/>
> > >
> > > The surprise is the poor performance of the last two, when stored is false
> > > and so the partial update may need to use the docValues to generate the new
> > > doc, compared to the first two when stored is true.
> > >
> > > When I profiled the code for the last of the four settings above, I saw
> > > that the majority of the time was spent in
> > > `org.apache.solr.search.SolrDocumentFetcher.decorateDocValueFields(SolrDocumentBase, int, Set, DocValuesIteratorCache)`
> > > and further down in the callees under that hot spot there is a call to
> > > `org.apache.solr.search.DocValuesIteratorCache.newEntry(String)` that I
> > > added some print statements to in order to see which fields `newEntry` was
> > > being called for. There were many thousands of calls for
> > > `playlist_index_*` fields, each with a different field name, and none for
> > > other fields.
> > >
> > > My schema does have other dynamic fields, but none of them resulted in
> > > calls to `newEntry`. What is distinct about this dynamic field is that it's
> > > the only one that is a numeric field type and the only one that has more
> > > than a few hundred or so distinct field names.
> > >
> > > Does this seem like expected behavior that Solr has to make so many
> > > `newEntry` calls for that dynamic field to perform the update that it
> > > seriously impacts update performance?
> > >
> > > Thanks for your time,
> > > Calvin