Hi, sorry for the delay in replying. After some more digging, I noticed the following in the schema (which I didn't originally create and which works without apparent issues in Solr 7.5):

    <dynamicField name="*" type="string" indexed="false" stored="false"/>

I think that was intended as a catch-all for fields in the input data that are not found in the schema. Removing that dynamic field stops Solr from producing duplicate documents with the same unique id. Without removing it, adding the following also prevents the creation of duplicates:

    <field name="_root_" type="string" indexed="true" stored="false" docValues="false"/>

So, to clarify:

In *Solr 7.5*, with the following in the schema:

    <dynamicField name="*" type="string" indexed="false" stored="false"/>

(and no <field name="_root_" type="string" indexed="true" stored="false" docValues="false"/>), running:

    curl -X POST -H 'Content-type:application/json' \
      'http://localhost:8983/solr/test-dup/update?commit=true' \
      --data "{'add': {'doc':{'id': '28a6a45a-5f81...', 'title': 'hello'}}}"

    curl -X POST -H 'Content-type:application/json' \
      'http://localhost:8983/solr/test-dup/update?commit=true' \
      --data "{'add': {'doc':{'id': '28a6a45a-5f81...', 'title': 'bye'}}}"

results in a single doc:

    {"id": "28a6a45a-5f81...", "title": "bye"}

In *Solr 8.11*, with the same dynamic field in the schema:

    <dynamicField name="*" type="string" indexed="false" stored="false"/>

(and no <field name="_root_" type="string" indexed="true" stored="false" docValues="false"/>), running the same two commands:

    curl -X POST -H 'Content-type:application/json' \
      'http://localhost:8983/solr/test-dup/update?commit=true' \
      --data "{'add': {'doc':{'id': '28a6a45a-5f81...', 'title': 'hello'}}}"

    curl -X POST -H 'Content-type:application/json' \
      'http://localhost:8983/solr/test-dup/update?commit=true' \
      --data "{'add': {'doc':{'id': '28a6a45a-5f81...', 'title': 'bye'}}}"

results in two docs:

    {"id": "28a6a45a-5f81...", "title": "hello"}
    {"id": "28a6a45a-5f81...", "title": "bye"}

As I mentioned above, removing the dynamic field or adding the _root_ field produces the expected behaviour, with the document getting updated instead of duplicated.
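In case it helps anyone who would rather script the check than run curl by hand, here is a minimal Python sketch of the same two-update reproduction. It is only a sketch under assumptions: the core name test-dup and the update URL come from the curl commands above, while the /select handler being enabled, the placeholder id "example-id-1" (the real id is elided above), and the helper names build_add_payload, duplicated_ids, post_doc, fetch_docs and reproduce are all hypothetical and not part of Solr.

```python
import json
import urllib.request
from collections import Counter

# Assumptions: local Solr, core "test-dup" (as in the curl commands),
# and a /select handler enabled in solrconfig.xml.
UPDATE_URL = "http://localhost:8983/solr/test-dup/update?commit=true"
SELECT_URL = "http://localhost:8983/solr/test-dup/select?q=*:*&wt=json&rows=100"


def build_add_payload(doc):
    """Return the JSON request body for a single-document 'add' command."""
    return json.dumps({"add": {"doc": doc}}).encode("utf-8")


def duplicated_ids(docs, key="id"):
    """Return the unique-key values that appear more than once in docs."""
    counts = Counter(d.get(key) for d in docs)
    return sorted(k for k, n in counts.items() if k is not None and n > 1)


def post_doc(doc, url=UPDATE_URL):
    """POST one document to the update handler and return the parsed reply."""
    req = urllib.request.Request(
        url,
        data=build_add_payload(doc),
        headers={"Content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def fetch_docs(url=SELECT_URL):
    """Return the stored documents from a match-all query."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["response"]["docs"]


def reproduce(doc_id="example-id-1"):
    """Index the same id twice, then report ids that ended up duplicated.

    doc_id is a hypothetical placeholder, not the id from the report.
    """
    post_doc({"id": doc_id, "title": "hello"})
    post_doc({"id": doc_id, "title": "bye"})
    return duplicated_ids(fetch_docs())
```

Calling reproduce() against a running instance would be expected to return an empty list where the second update overwrites the first (the 7.5 behaviour described above) and a list containing the shared id where both documents survive (the 8.11 behaviour).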
I have uploaded a very stripped-down version of the schema.xml <https://pastebin.com/raw/bxBBF2tP> plus the solrconfig.xml <https://pastebin.com/raw/aUsh9g2z> that reproduce the duplicating behaviour.

Thanks!
Eduardo

On Fri, Dec 9, 2022 at 1:53 PM Jan Høydahl <[email protected]> wrote:

> No no. The schema still has ONE uniqueId field.
> The _root_ field is used as a parent pointer for child documents, it will hold the ID of the parent.
> Thus you should not need _root_ if you don't use parent/child. But this thread suggests that _root_ may be needed in some other code paths as well.
>
> I suspect perhaps this JIRA https://issues.apache.org/jira/browse/SOLR-12638 may be related in some way (have not looked at any of that code though, see https://github.com/apache/solr/search?q=SOLR-12638&type=commits)
>
> Jan
>
> > On 9 Dec 2022, at 14:32, Dave <[email protected]> wrote:
> >
> > So it was a decision to remove the unique field id and replace it with root? This seems, bad. You can’t have two documents with the same id/unique field.
> >
> >> On Dec 9, 2022, at 7:57 AM, Jan Høydahl <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> So to be clear - you have a working fix by adding the _root_ field to your schema?
> >>
> >> I suppose most 8.x users already have a _root_ field, so the thing you are seeing could very well be some bug related to atomic update.
> >>
> >> Can I propose that you create a minimal reproduction of this issue and upload somewhere?
> >> It could e.g. be a set of curl commands that, starting from a newly installed Solr 8.11 (or even better 9.1) reproduce the issue.
> >> Hint: You can create a collection with default schema: `solr create -c test` and then remove the _root_ field by issuing a delete-field command as described here https://solr.apache.org/guide/solr/latest/indexing-guide/schema-api.html#delete-a-field
> >>
> >> Jan
> >>
> >>>> On 8 Dec 2022, at 15:30, Eduardo Gomez <[email protected]> wrote:
> >>>>
> >>>> At first it wasn't clear to me what the problem you're having actually is. Then I glanced back at the message subject ... it is the only place you mention it.
> >>>
> >>> Sorry Shawn, you are right, I didn't explain very clearly. So basically, in Solr 8.11.1, I can see that updating an existing document, e.g. {"id": "22468d41-3b...", "title": "Old title"}:
> >>>
> >>> curl -X POST -H 'Content-type:application/json' 'http://localhost:8983/solr/clients_main/update?commit=true' --data "{'add': {'doc':{'id': '22468d41-3b...', 'title': 'New title'}}}"
> >>>
> >>> I get two docs with the same id and different titles in the index. That is different to the behaviour I see using Solr 7.5, which is a single document with the updated title. To get that with the same schema in Solr 8.11.1, I have to add this to the schema:
> >>>
> >>> <field name="_root_" type="string" indexed="true" stored="false"/>
> >>>
> >>> So without the _root_ definition, the behaviour is as expected in Solr 7.5 but produces duplicate documents in Solr 8.11. I haven't noticed Solr complaining if the _root_ field is not defined.
> >>>
> >>> So my question was if that is expected, as that field seems to be related to parent-child documents, which I don't use at all.
> >>>
> >>> The definition for the id field in my schema.xml is similar to the one you posted:
> >>>
> >>> <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
> >>> <field name="id" type="string" indexed="true" stored="true" required="true" docValues="false"/>
> >>> <uniqueKey>id</uniqueKey>
> >>>
> >>> Eduardo
> >>>
> >>>> On Thu, Dec 8, 2022 at 1:11 PM Mikhail Khludnev <[email protected]> wrote:
> >>>>
> >>>> Right, Shawn. That's how it works
> >>>> https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#updateDocuments-org.apache.lucene.index.Term-java.lang.Iterable-
> >>>> And it's really fast in query time.
> >>>>
> >>>>> On Thu, Dec 8, 2022 at 4:06 PM Shawn Heisey <[email protected]> wrote:
> >>>>>
> >>>>> On 12/8/22 05:58, Shawn Heisey wrote:
> >>>>>> So you can't just update a child document, you have to update all the children and all the parents at the same time, so the new documents are all in the same segment.
> >>>>>
> >>>>> That's a little unclear and sounds like a draconian requirement. :) I meant that all children must be in the same segment as their parent. I think Solr might support the idea of multiple nesting levels ... if so, then the ultimate parent document and all its descendants need to be in the same segment.
> >>>>>
> >>>>> Thanks,
> >>>>> Shawn
> >>>>
> >>>> --
> >>>> Sincerely yours
> >>>> Mikhail Khludnev
> >>>
> >>> --
> >>> Mintel Group Ltd | Mintel House, 4 Playhouse Yard | London | EC4V 5EX
> >>> Registered in England: Number 1475918. | VAT Number: GB 232 9342 72
> >>>
> >>> Contact details for our other offices can be found at http://www.mintel.com/office-locations <http://www.mintel.com/office-locations>.
> >>>
> >>> This email and any attachments may include content that is confidential, privileged or otherwise protected under applicable law. Unauthorised disclosure, copying, distribution or use of the contents is prohibited and may be unlawful. If you have received this email in error, including without appropriate authorisation, then please reply to the sender about the error and delete this email and any attachments.
