Re: Solr throws errors on empty fields on ingestion

Thomas Corthals Wed, 19 Mar 2025 07:58:37 -0700

This actually looks like a request to the /extract handler.

Can you open an issue at https://github.com/solariumphp/solarium/issues
with the code that causes this behaviour?


Thomas

On Wed, 19 Mar 2025 at 15:49, Colvin Cowie <colvin.cowie....@gmail.com> wrote:

> Hello,
>
> Regarding the "400 OK": I don't see that happening locally. I get the
> correct "Bad Request" status line when making requests directly to the
> /update handler.
> Perhaps it's an issue in Solarium-PHP?
>
>
> On Wed, 19 Mar 2025 at 13:26, Ehrenleitner Robert Harald <
> robert.ehrenleit...@plus.ac.at> wrote:
>
> > Hi,
> >
> > that was fast.
> >
> > Actually, I see that the documents which do not have a title are also
> > missing in the index of the older Solr version which is still fed by the
> > older version of Solarium-PHP. So, probably the newer version of
> > Solarium-PHP exposes an error which was there before but was not logged. I
> > don't want to check this now.
> >
> > As a side note: It seems like Solr responds with HTTP status "400 OK",
> > which is not a good idea. It should be "400 Invalid request".
> >
> > Thanks for the advice about the filename, that's a good idea. I will modify
> > the crawler to fall back to the slug (special term from WordPress) or to the
> > filename if the title is empty.
> >
> > Kind regards,
> >
> >
> >
> >
> > Mag.phil. Robert Ehrenleitner, BEng.
> > --
> >
> > Web-Developer
> >
> > IT-Services | Application & Digitalization Services
> >
> > Hellbrunner Straße 34 | 5020 Salzburg | Austria
> >
> > Tel.: +43/(0)662/8044 - 6778
> >
> > *www.plus.ac.at <http://www.plus.ac.at>*
> >
> >
> >
> > ------------------------------
> > *From:* Colvin Cowie <colvin.cowie....@gmail.com>
> > *Sent:* Wednesday, 19 March 2025 11:51
> > *To:* users@solr.apache.org <users@solr.apache.org>
> > *Subject:* Re: Solr throws errors on empty fields on ingestion
> >
> > Required fields need non-empty values; as far as I know, there are no
> > exceptions to that.
> >
> > Take this from the UX/end user perspective. If a document has no title, or
> > an empty title, what does a user expect to see and do with that?
> > If they expect to see *something* then yes I think you should insert a
> > suitable default or a fallback value like the file name or url.
> > If they don't expect to see something (and you can't always provide a
> > title), then the title shouldn't be marked as required.
> >
> > On Wed, 19 Mar 2025 at 10:03, Ehrenleitner Robert Harald <
> > robert.ehrenleit...@plus.ac.at> wrote:
> >
> > >
> > >
> > > Hi all,
> > >
> > > we have a self-built crawler based on Solarium-PHP which feeds documents
> > > into Solr. Since I upgraded from 9.6.1 to 9.8.0, I see errors in the log of
> > > the crawler. It tells me that Solr complains that the field "title" is
> > > missing. Actually, it is part of the request, but it's just empty.
> > >
> > > This is a snippet of the request body (for this to be output, I have
> > > inserted a var_dump() in an appropriate place of Solarium-PHP):
> > >
> > > Content-Disposition: form-data; name="literal.publishDate"
> > > Content-Type: text/plain;charset=UTF-8
> > >
> > > 2023-01-12T10:25:06Z
> > > --00000000000002800000000000000000
> > > Content-Disposition: form-data; name="literal.title"
> > > Content-Type: text/plain;charset=UTF-8
> > >
> > >
> > > --00000000000002800000000000000000
> > > Content-Disposition: form-data; name="literal.number"
> > >
> > > And this is the response:
> > >
> > > Error indexing document 14935: wp-content/uploads/loremipsum.pdf: Solr
> > > HTTP error: OK (400)
> > > {
> > >   "responseHeader":{
> > >     "status":400,
> > >     "QTime":121
> > >   },
> > >   "error":{
> > >     "metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],
> > >     "msg":"[doc=141396] missing required field: title",
> > >     "code":400
> > >   }
> > > }
> > >
> > > I cannot fix the PDF file having no title (for various non-technical
> > > reasons). Nevertheless, it was working fine before the upgrade.
> > >
> > > The schema was created with this JSON data, especially its title field:
> > > {
> > > /* something left out here */
> > >         {
> > >             "name": "title",
> > >             "type": "text_general",
> > >             "stored": true,
> > >             "indexed": true,
> > >             "multiValued": false,
> > >             "required": true
> > >         },
> > > /* something left out here */
> > > }
> > >
> > > The document is not being indexed.
> > >
> > > How can I fix this? Is there perhaps something in the schema (JSON data)
> > > I have to change? Or is it better to replace empty titles with some
> > > constant non-empty string (this can be done in the crawler)?
> > >
> > > I have noticed that in the documentation regarding the field option
> > > "required", it says:
> > >
> > > Instructs Solr to reject any attempts to add a document which does not
> > > have a value for this field. This property defaults to false.
> > >
> > > This is ambiguous to me. What is meant by "does not have a value"?
> > > Well, the value is present but it is an empty string.
> > >
> > > Kind regards,
> > >
> > > Mag.phil. Robert Ehrenleitner, BEng.
> > >
> > >
> > >
> >
>
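The fallback discussed in the thread (use the title if present, otherwise the
WordPress slug, otherwise the filename) could be sketched roughly as follows.
This is only an illustration in Python, not the crawler's actual PHP code: the
function name pick_title, its parameters, and the "Untitled" default are all
hypothetical. It assumes that a title consisting only of whitespace should be
treated the same as a missing one.

```python
import posixpath


def pick_title(title, slug=None, path=None, default="Untitled"):
    """Return the first non-blank candidate: title, slug, file name, default.

    All names here are hypothetical; the real crawler is written in PHP.
    """
    # Derive the file name from a forward-slash path such as
    # "wp-content/uploads/loremipsum.pdf".
    filename = posixpath.basename(path) if path else None

    for candidate in (title, slug, filename):
        # Skip None, empty strings, and whitespace-only strings.
        if candidate and candidate.strip():
            return candidate.strip()

    # Nothing usable was found, so fall back to a constant non-empty string.
    return default


if __name__ == "__main__":
    print(pick_title("", None, "wp-content/uploads/loremipsum.pdf"))
```

With a value chosen this way, the literal.title part of the request is never
empty, so a field marked "required" in the schema should always receive a value.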
