Try this. In your schema, explicitly define all the fields that you want in your collection. Then, as the last field entry, add:
<dynamicField name="*" type="ignored" />

On Tue, Jun 4, 2024 at 1:06 PM Thomas Corthals <tho...@klascement.net> wrote:

> When you extract text from a PDF with Tika, it includes additional metadata
> fields. This is the document I get after executing the example from the ref
> guide at
> https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html#trying-out-solr-cell
>
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":0,
>     "params":{
>       "q":"id:doc1"
>     }
>   },
>   "response":{
>     "numFound":1,
>     "start":0,
>     "numFoundExact":true,
>     "docs":[{
>       "meta":[
>         "date","2008-11-13T13:35:51Z",
>         "pdf:docinfo:custom:AAPL:Keywords","solr, word, pdf",
>         "pdf:PDFVersion","1.3",
>         "pdf:docinfo:title","solr-word",
>         "xmp:CreatorTool","Microsoft Word",
>         "stream_content_type","application/pdf",
>         "pdf:hasXFA","false",
>         "access_permission:can_print_degraded","true",
>         "subject","solr word",
>         "dc:format","application/pdf; version=1.3",
>         "pdf:docinfo:creator_tool","Microsoft Word",
>         "access_permission:fill_in_form","true",
>         "stream_name","myfile",
>         "pdf:encrypted","false",
>         "dc:title","solr-word",
>         "modified","2008-11-13T13:35:51Z",
>         "cp:subject","solr word",
>         "pdf:docinfo:subject","solr word",
>         "pdf:hasMarkedContent","false",
>         "pdf:docinfo:creator","Grant Ingersoll",
>         "meta:author","Grant Ingersoll",
>         "meta:creation-date","2008-11-13T13:35:51Z",
>         "stream_source_info","solr-word.pdf",
>         "created","2008-11-13T13:35:51Z",
>         "access_permission:extract_for_accessibility","true",
>         "Creation-Date","2008-11-13T13:35:51Z",
>         "Author","Grant Ingersoll",
>         "producer","Mac OS X 10.5.5 Quartz PDFContext",
>         "pdf:docinfo:producer","Mac OS X 10.5.5 Quartz PDFContext",
>         "Keywords","solr, word, pdf",
>         "access_permission:modify_annotations","true",
>         "AAPL:Keywords","solr, word, pdf",
>         "dc:creator","Grant Ingersoll",
>         "dcterms:created","2008-11-13T13:35:51Z",
>         "Last-Modified","2008-11-13T13:35:51Z",
>         "dcterms:modified","2008-11-13T13:35:51Z",
>         "Last-Save-Date","2008-11-13T13:35:51Z",
>         "pdf:docinfo:keywords","solr, word, pdf",
>         "pdf:docinfo:modified","2008-11-13T13:35:51Z",
>         "meta:save-date","2008-11-13T13:35:51Z",
>         "Content-Type","application/pdf",
>         "stream_size","21052",
>         "X-Parsed-By","org.apache.tika.parser.DefaultParser",
>         "X-Parsed-By","org.apache.tika.parser.pdf.PDFParser",
>         "creator","Grant Ingersoll",
>         "dc:subject","solr, word, pdf",
>         "access_permission:assemble_document","true",
>         "xmpTPg:NPages","1",
>         "pdf:hasXMP","false",
>         "access_permission:extract_content","true",
>         "access_permission:can_print","true",
>         "meta:keyword","solr, word, pdf",
>         "access_permission:can_modify","true",
>         "pdf:docinfo:created","2008-11-13T13:35:51Z"],
>       "div":["page"],
>       "id":"doc1",
>       "date":["2008-11-13T13:35:51Z"],
>       "pdf_docinfo_custom_aapl_keywords":["solr, word, pdf"],
>       "pdf_pdfversion":[1.3],
>       "pdf_docinfo_title":["solr-word"],
>       "xmp_creatortool":["Microsoft Word"],
>       "stream_content_type":["application/pdf"],
>       "pdf_hasxfa":[false],
>       "access_permission_can_print_degraded":[true],
>       "subject":["solr word"],
>       "dc_format":["application/pdf; version=1.3"],
>       "pdf_docinfo_creator_tool":["Microsoft Word"],
>       "access_permission_fill_in_form":[true],
>       "stream_name":["myfile"],
>       "pdf_encrypted":[false],
>       "dc_title":["solr-word"],
>       "modified":["2008-11-13T13:35:51Z"],
>       "cp_subject":["solr word"],
>       "pdf_docinfo_subject":["solr word"],
>       "pdf_hasmarkedcontent":[false],
>       "pdf_docinfo_creator":["Grant Ingersoll"],
>       "meta_author":["Grant Ingersoll"],
>       "meta_creation_date":["2008-11-13T13:35:51Z"],
>       "stream_source_info":["solr-word.pdf"],
>       "created":["2008-11-13T13:35:51Z"],
>       "access_permission_extract_for_accessibility":[true],
>       "creation_date":["2008-11-13T13:35:51Z"],
>       "author":["Grant Ingersoll"],
>       "producer":["Mac OS X 10.5.5 Quartz PDFContext"],
>       "pdf_docinfo_producer":["Mac OS X 10.5.5 Quartz PDFContext"],
>       "pdf_unmappedunicodecharsperpage":[0],
>       "keywords":["solr, word, pdf"],
>       "access_permission_modify_annotations":[true],
>       "aapl_keywords":["solr, word, pdf"],
>       "dc_creator":["Grant Ingersoll"],
>       "dcterms_created":["2008-11-13T13:35:51Z"],
>       "last_modified":["2008-11-13T13:35:51Z"],
>       "dcterms_modified":["2008-11-13T13:35:51Z"],
>       "title":["solr-word"],
>       "last_save_date":["2008-11-13T13:35:51Z"],
>       "pdf_docinfo_keywords":["solr, word, pdf"],
>       "pdf_docinfo_modified":["2008-11-13T13:35:51Z"],
>       "meta_save_date":["2008-11-13T13:35:51Z"],
>       "content_type":["application/pdf"],
>       "stream_size":[21052],
>       "x_parsed_by":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"],
>       "creator":["Grant Ingersoll"],
>       "dc_subject":["solr, word, pdf"],
>       "access_permission_assemble_document":[true],
>       "xmptpg_npages":[1],
>       "pdf_hasxmp":[false],
>       "pdf_charsperpage":[85],
>       "access_permission_extract_content":[true],
>       "access_permission_can_print":[true],
>       "meta_keyword":["solr, word, pdf"],
>       "access_permission_can_modify":[true],
>       "pdf_docinfo_created":["2008-11-13T13:35:51Z"],
>       "content":[" \n \n \n \n \n \n \n \n \n \n \n \n \n \n
> \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
> \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
> \n \n \n \n solr-word \n \n \n This is a test of PDF and Word
> extraction in Solr, it is only a test. Do not panic. \n \n \n "],
>       "_version_":1800949864184414208
>     }]
>   }
> }
>
> Some of those fields are read from metadata embedded in the PDF file.
>
> On Tue, Jun 4, 2024 at 18:15, Walter Underwood <wun...@wunderwood.org> wrote:
>
> > PDFs don’t have fields. PDFs are instructions for a monkey with rubber
> > stamps to make a printed page. They have instructions to move to a
> > location and put a character there.
> >
> > As an XML developer friend said, turning a PDF document into structured
> > text is like turning hamburger back into a cow.
> >
> > I dealt with PDF documents in search for over twenty years. You are lucky
> > to get searchable text out of them.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/ (my blog)
> >
> > > On Jun 4, 2024, at 8:28 AM, Uwe Amberger <u...@zurreal.de> wrote:
> > >
> > > Hello!
> > >
> > > Problem description:
> > > I want to index a wide variety of PDFs whose content I know nothing
> > > about, so I cannot define any fields in advance. Users should be able
> > > to search for terms, and every PDF containing these terms should be
> > > found.
> > >
> > > I think that a schemaless schema (which adds unknown fields) is not the
> > > way to go:
> > > 1. The Apache Solr documentation warns against using a schemaless
> > > schema in a production environment.
> > > 2. As can be read here:
> > > https://solr.apache.org/guide/solr/9_2/indexing-guide/schemaless-mode.html
> > > "Once a field has been added to the schema, its field type is fixed."
> > > And it cannot be added again with a different field type.
> > >
> > > Question:
> > > When indexing a PDF, is there a way to ignore its unknown fields and
> > > still index the PDF?
> > >
> > > Possible solution:
> > > I found the IgnoreFieldUpdateProcessorFactory class, which seems to
> > > offer this possibility, but how do I configure it in solrconfig.xml?
> > >
> > > Thanks for any help!
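
P.S. A rough sketch of the two approaches mentioned in this thread, in case it helps. The first fragment assumes the schema does not already define the "ignored" type that the catch-all dynamicField at the top of this reply refers to; the definition is modelled on the one found in Solr's example schemas. The second fragment is one way the IgnoreFieldUpdateProcessorFactory from the original question might be wired up in solrconfig.xml; the chain name "ignore-unknown-fields" is made up for illustration. Neither fragment has been tested against the Solr Cell example, so treat both as starting points rather than a verified configuration.

```xml
<!-- Fragment 1: managed-schema (or schema.xml).
     Assumption: no "ignored" type is defined yet. -->

<!-- Values sent to fields of this type are neither indexed nor stored,
     so they are silently dropped. -->
<fieldType name="ignored" class="solr.StrField"
           indexed="false" stored="false" multiValued="true" />

<!-- Catch-all rule: a field name that is not matched by any explicitly
     defined field is mapped to the "ignored" type. -->
<dynamicField name="*" type="ignored" multiValued="true" />


<!-- Fragment 2: solrconfig.xml, an alternative to the catch-all rule.
     The chain name "ignore-unknown-fields" is hypothetical. -->
<updateRequestProcessorChain name="ignore-unknown-fields">
  <!-- With no selector configured, this processor removes fields whose
       names match neither a field nor a dynamicField in the schema. -->
  <processor class="solr.IgnoreFieldUpdateProcessorFactory" />
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain would normally be selected per request with the update.chain parameter (for example update.chain=ignore-unknown-fields on the extract request), and it only helps if that chain does not also add unknown fields to the schema the way the schemaless default chain does. With either approach, the extracted body text (the "content" field in the output above) still needs an explicitly defined field to land in, otherwise it is dropped along with the metadata.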