Also, you don't mention it explicitly, so just to be sure: in your Solr
6 deployment, were the `playlist_index_*` fields stored? Some
performance difference between stored=true and stored=false is expected,
because retrieving dynamic fields with many distinct field names is
definitely problematic for docValues-only fields. SOLR-16989 was in fact
designed to mitigate this to some extent, but with 100k separate fields
this is definitely going to be a problem, and I suspect it would have
been _more_ of a problem before SOLR-16989 ... but I don't want to jump
to conclusions!
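
To make the scaling concrete, here is a rough back-of-the-envelope
model in Python. It is not Solr code: the function names are
hypothetical, and it assumes that rebuilding a document from
docValues-only fields probes every candidate field name once per
document, whereas stored fields come back from a single
stored-document read.

```python
# Toy cost model (an assumption for illustration, not Solr internals).

def stored_reads(num_docs: int) -> int:
    """Stored fields: assume one stored-document read per updated doc."""
    return num_docs


def docvalues_only_probes(num_docs: int, num_field_names: int) -> int:
    """docValues-only: assume every candidate field name is probed per doc."""
    return num_docs * num_field_names


# Numbers from the original report: 25 updated docs, >100K field names.
print(stored_reads(25))                     # 25
print(docvalues_only_probes(25, 100_000))   # 2500000
```

Under those assumptions the docValues-only path does several orders
of magnitude more lookups for the same 25-document update, which is
at least the right direction for the gap reported.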

On Wed, Feb 21, 2024 at 12:43 PM Michael Gibney
<mich...@michaelgibney.net> wrote:
>
> It might be worth looking at this issue:
> https://issues.apache.org/jira/browse/SOLR-16989
>
> The irony is that this issue was supposed to help with slowness in
> cases similar to what you describe. Can you send a full stack trace
> for a representative call to
> `DocValuesIteratorCache.newEntry(String)`?
>
> On Tue, Feb 20, 2024 at 11:36 PM Rahul Goswami <rahul196...@gmail.com> wrote:
> >
> > Did you happen to change the DirectoryFactory in solrconfig to
> > SimpleFSDirectoryFactory or NIOFSDirectoryFactory by any chance? Default is
> > Mmap which is much more performant for atomic updates (and also practical,
> > especially given the small(ish) size of your index).
> >
> > -Rahul
> >
> > On Tue, Feb 20, 2024 at 4:52 AM Christine Poerschke (BLOOMBERG/ LONDON) <
> > cpoersc...@bloomberg.net> wrote:
> >
> > > Hello Calvin,
> > >
> > > Thank you for this wonderful issue write-up!
> > >
> > > You mention upgrading from Solr 6 to 9.5 and I wonder if it might
> > > be practical or insightful to also test some versions in between, e.g.
> > > 9.5/9.4/9.3 going backwards or 6/7/8/9.0 going forward, or some sort of
> > > binary search variant.
> > >
> > > Best wishes,
> > > Christine
> > >
> > > From: users@solr.apache.org At: 02/16/24 17:59:59 UTC
> > > To: solr-u...@lucene.apache.org
> > > Subject: Partial update slowness with a stored="false" dynamic field and
> > > lots of distinct field names
> > >
> > > Hi solr users,
> > >
> > > While tracking down a severe performance regression doing partial updates
> > > when upgrading from Solr 6 to Solr 9.5.0, I discovered the following
> > > unexpected behavior.
> > >
> > > In my schema.xml file, I have the following fields (among many others):
> > >
> > > <field name="id" type="string" indexed="true" stored="true"
> > > required="true"/>
> > > <field name="_version_" type="long" indexed="false" stored="false"/>
> > > <field name="name" type="text_en_splitting_tight" indexed="true"
> > > stored="true" omitNorms="true"/>
> > >
> > > <dynamicField name="playlist_index_*" type="int" .../> <!-- The int field
> > > type has docValues="true" -->
> > >
> > > The unexpected impact on partial update performance depends on whether the
> > > dynamic field above is stored (with `docValues="true"`). The index in
> > > question contains about 30 million documents and is 15GB in size, and
> > > there are a large number of distinct `playlist_index_*` field names
> > > (more than 100K). The purpose of that dynamic field is to support
> > > sorting, with queries sometimes specifying a sort like
> > > `playlist_index_3141 asc` to sort the results by that field.
> > >
> > > For all the cases below, I added to the existing index the following 25
> > > new documents:
> > >
> > > [
> > >   {"id": "12345678901234567891", "name": "."},
> > >   {"id": "12345678901234567892", "name": "."},
> > >   {"id": "12345678901234567893", "name": "."},
> > >   {"id": "12345678901234567894", "name": "."},
> > >   {"id": "12345678901234567895", "name": "."},
> > >   {"id": "12345678901234567896", "name": "."},
> > >   {"id": "12345678901234567897", "name": "."},
> > >   {"id": "12345678901234567898", "name": "."},
> > >   {"id": "12345678901234567899", "name": "."},
> > >   {"id": "12345678901234567900", "name": "."},
> > >   {"id": "12345678901234567901", "name": "."},
> > >   {"id": "12345678901234567902", "name": "."},
> > >   {"id": "12345678901234567903", "name": "."},
> > >   {"id": "12345678901234567904", "name": "."},
> > >   {"id": "12345678901234567995", "name": "."},
> > >   {"id": "12345678901234567996", "name": "."},
> > >   {"id": "12345678901234567997", "name": "."},
> > >   {"id": "12345678901234567998", "name": "."},
> > >   {"id": "12345678901234567999", "name": "."},
> > >   {"id": "12345678901234568000", "name": "."},
> > >   {"id": "12345678901234568001", "name": "."},
> > >   {"id": "12345678901234568002", "name": "."},
> > >   {"id": "12345678901234568003", "name": "."},
> > >   {"id": "12345678901234568004", "name": "."},
> > >   {"id": "12345678901234568005", "name": "."}
> > > ]
> > >
> > > and did both a soft and hard commit before continuing.
> > >
> > > Then I did a partial update with `commitWithin=600000` to update each of
> > > those documents:
> > >
> > > [
> > >   {"id": "12345678901234567891", "name": {"set": "1"}},
> > >   {"id": "12345678901234567892", "name": {"set": "2"}},
> > >   {"id": "12345678901234567893", "name": {"set": "3"}},
> > >   {"id": "12345678901234567894", "name": {"set": "4"}},
> > >   {"id": "12345678901234567895", "name": {"set": "5"}},
> > >   {"id": "12345678901234567896", "name": {"set": "6"}},
> > >   {"id": "12345678901234567897", "name": {"set": "7"}},
> > >   {"id": "12345678901234567898", "name": {"set": "8"}},
> > >   {"id": "12345678901234567899", "name": {"set": "9"}},
> > >   {"id": "12345678901234567900", "name": {"set": "10"}},
> > >   {"id": "12345678901234567901", "name": {"set": "11"}},
> > >   {"id": "12345678901234567902", "name": {"set": "12"}},
> > >   {"id": "12345678901234567903", "name": {"set": "13"}},
> > >   {"id": "12345678901234567904", "name": {"set": "14"}},
> > >   {"id": "12345678901234567995", "name": {"set": "15"}},
> > >   {"id": "12345678901234567996", "name": {"set": "16"}},
> > >   {"id": "12345678901234567997", "name": {"set": "17"}},
> > >   {"id": "12345678901234567998", "name": {"set": "18"}},
> > >   {"id": "12345678901234567999", "name": {"set": "19"}},
> > >   {"id": "12345678901234568000", "name": {"set": "20"}},
> > >   {"id": "12345678901234568001", "name": {"set": "21"}},
> > >   {"id": "12345678901234568002", "name": {"set": "22"}},
> > >   {"id": "12345678901234568003", "name": {"set": "23"}},
> > >   {"id": "12345678901234568004", "name": {"set": "24"}},
> > >   {"id": "12345678901234568005", "name": {"set": "25"}}
> > > ]
> > >
> > > The time it takes to perform the update varies drastically depending on
> > > whether the `playlist_index_*` dynamic field is stored:
> > >
> > > - 0.017s: <dynamicField name="playlist_index_*" type="int" indexed="true"
> > >  stored="true"  docValues="true"/>
> > > - 0.016s: <dynamicField name="playlist_index_*" type="int" indexed="false"
> > > stored="true"  docValues="true"/>
> > > - 8.850s: <dynamicField name="playlist_index_*" type="int" indexed="false"
> > > stored="false" docValues="true"/>
> > > - 8.867s: <dynamicField name="playlist_index_*" type="int" indexed="true"
> > >  stored="false" docValues="true"/>
> > >
> > > The surprise is the poor performance of the last two, when stored is false
> > > and so the partial update may need to use the docValues to generate the
> > > new doc, compared to the first two when stored is true.
> > >
> > > When I profiled the code for the last of the four settings above, I saw
> > > that the majority of the time was spent in
> > > `org.apache.solr.search.SolrDocumentFetcher.decorateDocValueFields(SolrDocumentBase, int, Set, DocValuesIteratorCache)`
> > > and further down in the callees under that hot spot there is a call to
> > > `org.apache.solr.search.DocValuesIteratorCache.newEntry(String)`. I added
> > > some print statements to it in order to see which fields `newEntry` was
> > > being called for. There were many thousands of calls for the
> > > `playlist_index_*` fields, each call with a different field name, and
> > > none for any other fields.
> > >
> > > My schema does have other dynamic fields, but none of them resulted in
> > > calls to `newEntry`. What is distinct about this dynamic field is that
> > > it's the only one that is a numeric field type and the only one that has
> > > more than a few hundred or so distinct field names.
> > >
> > > Is it expected behavior that Solr has to make so many `newEntry`
> > > calls for that dynamic field to perform the update that it
> > > seriously impacts update performance?
> > >
> > > Thanks for your time,
> > > Calvin
> > >
> > >
> > >
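
For anyone who wants to repeat the measurement described in the
original report, the two payloads can be generated rather than typed
out. The following is only a sketch in Python: the IDs and values
are copied from the report, but the helper names are hypothetical,
and `timed_post` assumes a reachable Solr instance whose host and
collection name you would have to supply yourself.

```python
import json
import time
import urllib.request

# IDs from the report: two runs of consecutive values, 25 in total.
IDS = [str(n) for n in range(12345678901234567891, 12345678901234567905)] + [
    str(n) for n in range(12345678901234567995, 12345678901234568006)
]


def initial_docs(ids):
    """Documents added (and committed) before the timed partial update."""
    return [{"id": doc_id, "name": "."} for doc_id in ids]


def atomic_updates(ids):
    """Partial updates that set `name` to "1", "2", ... in ID order."""
    return [
        {"id": doc_id, "name": {"set": str(i)}}
        for i, doc_id in enumerate(ids, start=1)
    ]


def timed_post(url, docs):
    """POST docs as a JSON array and return the elapsed wall-clock seconds."""
    body = json.dumps(docs).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        response.read()
    return time.perf_counter() - start
```

With a hypothetical URL such as
`http://localhost:8983/solr/mycollection/update?commitWithin=600000`,
calling `timed_post(url, atomic_updates(IDS))` once per schema
variant would give numbers comparable to the four timings listed in
the report.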
