Thank you all. I hope this reply reaches the right thread
("Ignore unknown fields when indexing PDFs").

I tried each of these lines in my schema (as suggested by Jeremy Buckley):
<dynamicField name="*" type="ignored" />
or
<dynamicField name="*" type="ignored" multiValued="true" />

But an error still occurs during the indexing process:

C:\SOLR\solr-9.2.1>java -Dc=mycore -Dauto -jar c:\solr\solr-9.2.1\example\exampledocs\post.jar C:\SOLR\test\_error.pdf
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/mycore/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file _error.pdf (application/pdf) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8983/solr/mycore/update/extract?resource.name=C%3A%5CSOLR%5Ctest%5C_error.pdf&literal.id=C%3A%5CSOLR%5Ctest%5C_error.pdf
SimplePostTool: WARNING: Response: {
  "responseHeader":{
    "status":400,
    "QTime":28},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","java.lang.NumberFormatException"],
    "msg":"ERROR: [doc=C:\\SOLR\\test\\_error.pdf] Error adding field 'pdf_docinfo_custom_fs'='FormScape Software 32678' msg=For input string: \"FormScape Software 32678\"",
    "code":400}}
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/mycore/update/extract?resource.name=C%3A%5CSOLR%5Ctest%5C_error.pdf&literal.id=C%3A%5CSOLR%5Ctest%5C_error.pdf
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/mycore/update...
Time spent: 0:00:00.118

The PDF is not indexed, even though the tool reports "1 files indexed."

Extracting the metadata of _error.pdf shows, among other things:
"pdf:docinfo:custom:FS",["FormScape Software 32678"]

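The field name in the error message differs from the metadata key, presumably because Solr Cell's lowernames option lowercases names and replaces special characters with underscores. A minimal Python sketch of that mapping, for illustration only: the function name is hypothetical, and it only approximates the behaviour for ASCII names.

```python
import re


def lowernames(field_name: str) -> str:
    """Approximate the lowernames=true mapping (ASCII-only assumption):
    lowercase the name, then replace every character that is not a
    lowercase letter, digit or underscore with an underscore."""
    return re.sub(r"[^a-z0-9_]", "_", field_name.lower())


# Metadata keys taken from this thread.
for key in ("pdf:docinfo:custom:FS", "Content-Type", "xmpTPg:NPages"):
    print(key, "->", lowernames(key))
```

Under this mapping "pdf:docinfo:custom:FS" becomes "pdf_docinfo_custom_fs", which is the field name Solr complains about.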
The error can be avoided by explicitly adding the field:
<field name="pdf_docinfo_custom_fs" type="string" indexed="false" stored="false" required="false" />

But I would prefer that unknown fields are simply skipped.
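
A possible explanation, not confirmed in this thread: if the schema still contains a typed suffix pattern such as *_fs (as far as I know, the _default configset declares one with a float type), the name pdf_docinfo_custom_fs would match that pattern instead of the catch-all "*". As I understand it, Solr checks explicit fields first and then tries dynamic-field patterns from longest to shortest. The Python sketch below models only that lookup order; the schema contents and the function name are assumptions made for illustration.

```python
# Hypothetical schema excerpt. The "*_fs" float pattern is an assumption,
# not something shown in this thread.
DYNAMIC_FIELDS = {
    "*_fs": "pfloats",
    "*_s": "string",
    "*": "ignored",
}
EXPLICIT_FIELDS = {"id": "string"}


def matches(pattern: str, name: str) -> bool:
    """Dynamic-field patterns carry a single '*' at the start or the end."""
    if pattern.startswith("*"):
        return name.endswith(pattern[1:])
    if pattern.endswith("*"):
        return name.startswith(pattern[:-1])
    return name == pattern


def resolve_type(name, explicit=EXPLICIT_FIELDS, dynamic=DYNAMIC_FIELDS):
    """Explicit fields win; otherwise the longest matching pattern wins."""
    if name in explicit:
        return explicit[name]
    for pattern in sorted(dynamic, key=len, reverse=True):
        if matches(pattern, name):
            return dynamic[pattern]
    return None


print(resolve_type("pdf_docinfo_custom_fs"))  # caught by "*_fs" in this model
print(resolve_type("pdf_docinfo_title"))      # falls through to "*"
```

If this is what happens, the value "FormScape Software 32678" is parsed as a float and fails, which would fit the NumberFormatException. It would also fit the observation that an explicit string field makes the error disappear. Removing or renaming the typed suffix patterns would then be one thing to try.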

The documentation mentions an uprefix parameter:
https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html#trying-out-solr-cell
But when I index with it, the same error as above appears:
java -Dc=mycore -Dauto -jar c:\solr\solr-9.2.1\example\exampledocs\post.jar C:\SOLR\test\_error.pdf -params "uprefix=ignored_"
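
One thing worth checking is whether uprefix actually reaches Solr: the request URL logged in the first run carries only resource.name and literal.id. As far as I know, post.jar reads extra request parameters from the -Dparams system property, while -params belongs to the bin/post script, so this is an assumption to verify against the tool's help output. The Python sketch below only builds the URL one would expect to see in the log when the parameter is passed through; the helper name is hypothetical.

```python
from urllib.parse import urlencode


def extract_url(base, core, file_path, extra_params=None):
    """Build an /update/extract URL shaped like the one in the log above."""
    params = {"resource.name": file_path, "literal.id": file_path}
    params.update(extra_params or {})
    return f"{base}/{core}/update/extract?{urlencode(params)}"


url = extract_url(
    "http://localhost:8983/solr",
    "mycore",
    r"C:\SOLR\test\_error.pdf",
    {"uprefix": "ignored_"},
)
print(url)
```

If the logged URL does not end in uprefix=ignored_, the parameter was not applied. Note also that uprefix only affects fields the schema does not know, so a field matched by any dynamic-field pattern would not be prefixed.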

Other PDFs are indexed without problems and can be searched for their content.

Any further advice?

> On 04.06.2024 at 19:44 CEST, Jeremy Buckley - IQS-C <jeremy.buck...@gsa.gov.invalid> wrote:
> 
>  
> Try this.  In your schema, explicitly define all the fields that you want
> in your collection.  Then, as the last field entry, add:
> 
> <dynamicField name="*" type="ignored" />
> 
> On Tue, Jun 4, 2024 at 1:06 PM Thomas Corthals <tho...@klascement.net>
> wrote:
> 
> > When you extract text from a PDF with Tika, it includes additional metadata
> > fields. This is the document I get after executing the example from the ref
> > guide at
> >
> > https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html#trying-out-solr-cell
> >
> > {
> >   "responseHeader":{
> >     "status":0,
> >     "QTime":0,
> >     "params":{
> >       "q":"id:doc1"
> >     }
> >   },
> >   "response":{
> >     "numFound":1,
> >     "start":0,
> >     "numFoundExact":true,
> >     "docs":[{
> >       "meta":["date","2008-11-13T13:35:51Z","pdf:docinfo:custom:AAPL:Keywords","solr, word, pdf","pdf:PDFVersion","1.3","pdf:docinfo:title","solr-word","xmp:CreatorTool","Microsoft Word","stream_content_type","application/pdf","pdf:hasXFA","false","access_permission:can_print_degraded","true","subject","solr word","dc:format","application/pdf; version=1.3","pdf:docinfo:creator_tool","Microsoft Word","access_permission:fill_in_form","true","stream_name","myfile","pdf:encrypted","false","dc:title","solr-word","modified","2008-11-13T13:35:51Z","cp:subject","solr word","pdf:docinfo:subject","solr word","pdf:hasMarkedContent","false","pdf:docinfo:creator","Grant Ingersoll","meta:author","Grant Ingersoll","meta:creation-date","2008-11-13T13:35:51Z","stream_source_info","solr-word.pdf","created","2008-11-13T13:35:51Z","access_permission:extract_for_accessibility","true","Creation-Date","2008-11-13T13:35:51Z","Author","Grant Ingersoll","producer","Mac OS X 10.5.5 Quartz PDFContext","pdf:docinfo:producer","Mac OS X 10.5.5 Quartz PDFContext","Keywords","solr, word, pdf","access_permission:modify_annotations","true","AAPL:Keywords","solr, word, pdf","dc:creator","Grant Ingersoll","dcterms:created","2008-11-13T13:35:51Z","Last-Modified","2008-11-13T13:35:51Z","dcterms:modified","2008-11-13T13:35:51Z","Last-Save-Date","2008-11-13T13:35:51Z","pdf:docinfo:keywords","solr, word, pdf","pdf:docinfo:modified","2008-11-13T13:35:51Z","meta:save-date","2008-11-13T13:35:51Z","Content-Type","application/pdf","stream_size","21052","X-Parsed-By","org.apache.tika.parser.DefaultParser","X-Parsed-By","org.apache.tika.parser.pdf.PDFParser","creator","Grant Ingersoll","dc:subject","solr, word, pdf","access_permission:assemble_document","true","xmpTPg:NPages","1","pdf:hasXMP","false","access_permission:extract_content","true","access_permission:can_print","true","meta:keyword","solr, word, pdf","access_permission:can_modify","true","pdf:docinfo:created","2008-11-13T13:35:51Z"],
> >       "div":["page"],
> >       "id":"doc1",
> >       "date":["2008-11-13T13:35:51Z"],
> >       "pdf_docinfo_custom_aapl_keywords":["solr, word, pdf"],
> >       "pdf_pdfversion":[1.3],
> >       "pdf_docinfo_title":["solr-word"],
> >       "xmp_creatortool":["Microsoft Word"],
> >       "stream_content_type":["application/pdf"],
> >       "pdf_hasxfa":[false],
> >       "access_permission_can_print_degraded":[true],
> >       "subject":["solr word"],
> >       "dc_format":["application/pdf; version=1.3"],
> >       "pdf_docinfo_creator_tool":["Microsoft Word"],
> >       "access_permission_fill_in_form":[true],
> >       "stream_name":["myfile"],
> >       "pdf_encrypted":[false],
> >       "dc_title":["solr-word"],
> >       "modified":["2008-11-13T13:35:51Z"],
> >       "cp_subject":["solr word"],
> >       "pdf_docinfo_subject":["solr word"],
> >       "pdf_hasmarkedcontent":[false],
> >       "pdf_docinfo_creator":["Grant Ingersoll"],
> >       "meta_author":["Grant Ingersoll"],
> >       "meta_creation_date":["2008-11-13T13:35:51Z"],
> >       "stream_source_info":["solr-word.pdf"],
> >       "created":["2008-11-13T13:35:51Z"],
> >       "access_permission_extract_for_accessibility":[true],
> >       "creation_date":["2008-11-13T13:35:51Z"],
> >       "author":["Grant Ingersoll"],
> >       "producer":["Mac OS X 10.5.5 Quartz PDFContext"],
> >       "pdf_docinfo_producer":["Mac OS X 10.5.5 Quartz PDFContext"],
> >       "pdf_unmappedunicodecharsperpage":[0],
> >       "keywords":["solr, word, pdf"],
> >       "access_permission_modify_annotations":[true],
> >       "aapl_keywords":["solr, word, pdf"],
> >       "dc_creator":["Grant Ingersoll"],
> >       "dcterms_created":["2008-11-13T13:35:51Z"],
> >       "last_modified":["2008-11-13T13:35:51Z"],
> >       "dcterms_modified":["2008-11-13T13:35:51Z"],
> >       "title":["solr-word"],
> >       "last_save_date":["2008-11-13T13:35:51Z"],
> >       "pdf_docinfo_keywords":["solr, word, pdf"],
> >       "pdf_docinfo_modified":["2008-11-13T13:35:51Z"],
> >       "meta_save_date":["2008-11-13T13:35:51Z"],
> >       "content_type":["application/pdf"],
> >       "stream_size":[21052],
> >       "x_parsed_by":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"],
> >       "creator":["Grant Ingersoll"],
> >       "dc_subject":["solr, word, pdf"],
> >       "access_permission_assemble_document":[true],
> >       "xmptpg_npages":[1],
> >       "pdf_hasxmp":[false],
> >       "pdf_charsperpage":[85],
> >       "access_permission_extract_content":[true],
> >       "access_permission_can_print":[true],
> >       "meta_keyword":["solr, word, pdf"],
> >       "access_permission_can_modify":[true],
> >       "pdf_docinfo_created":["2008-11-13T13:35:51Z"],
> >       "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
> >  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
> >  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
> >  \n  \n  \n  \n solr-word \n \n    \n This is a test of PDF and Word
> > extraction in Solr, it is only a test.  Do not panic.  \n  \n \n  "],
> >       "_version_":1800949864184414208
> >     }]
> >   }
> > }
> >
> > Some of those fields are read from metadata embedded in the PDF file.
> >
> > On Tue, 4 Jun 2024 at 18:15, Walter Underwood <wun...@wunderwood.org> wrote:
> >
> > > PDFs don’t have fields. PDFs are instructions for a monkey with rubber
> > > stamps to make a printed page. They have instructions to move to a
> > > location and put a character there.
> > >
> > > As an XML developer friend said, turning a PDF document into structured
> > > text is like turning hamburger back into a cow.
> > >
> > > I dealt with PDF documents in search for over twenty years. You are lucky
> > > to get searchable text out of them.
> > >
> > > wunder
> > > Walter Underwood
> > > wun...@wunderwood.org
> > > http://observer.wunderwood.org/  (my blog)
> > >
> > > > On Jun 4, 2024, at 8:28 AM, Uwe Amberger <u...@zurreal.de> wrote:
> > > >
> > > > Hello!
> > > >
> > > > Problem description:
> > > > I want to index a wide variety of PDFs whose content I have no
> > > > knowledge of. So I cannot define any fields in advance. Users should
> > > > be able to search for terms, and every PDF containing these terms
> > > > should be found.
> > > >
> > > > I think that a schemaless schema (which adds unknown fields) is not
> > > > the way to go:
> > > > 1. The Apache Solr documentation warns against using a schemaless
> > > > schema in a production environment.
> > > > 2. As can be read here:
> > > > https://solr.apache.org/guide/solr/9_2/indexing-guide/schemaless-mode.html
> > > > "Once a field has been added to the schema, its field type is fixed."
> > > > And it cannot be added again with a different field type.
> > > >
> > > > Question:
> > > > When indexing a PDF, is there a way to ignore its unknown fields and
> > > > still index the PDF?
> > > >
> > > > Possible solution:
> > > > I found the IgnoreFieldUpdateProcessorFactory class, which seems to
> > > > offer this possibility, but how do I configure it in the solrconfig.xml?
> > > >
> > > > Thanks for any help!
> > >
> > >
> >
