PDFs don’t have fields. PDFs are instructions for a monkey with rubber stamps to make a printed page. They have instructions to move to a location and put a character there.
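To make the rubber-stamp picture concrete, here is a toy sketch in Python. It is not a PDF parser: the `(x, y, glyph)` tuples, the `stamps_to_text` function, and the `char_width` and `line_tolerance` parameters are all made up for illustration. It only shows that "text" has to be guessed back from positions, including where the spaces go.

```python
# Hypothetical, simplified page model: each stamp is (x, y, glyph).
# Real PDF content streams are far messier; this only illustrates the idea.

def stamps_to_text(stamps, char_width=6.0, line_tolerance=2.0):
    """Guess reading order from positions alone.

    Group stamps into lines by y, order each line by x, and insert a space
    wherever the horizontal gap is clearly wider than one glyph.
    """
    lines = []
    # PDF y coordinates grow upward, so the largest y is the first line.
    for x, y, glyph in sorted(stamps, key=lambda s: (-s[1], s[0])):
        if lines and abs(lines[-1]["y"] - y) <= line_tolerance:
            lines[-1]["glyphs"].append((x, glyph))
        else:
            lines.append({"y": y, "glyphs": [(x, glyph)]})

    out = []
    for line in lines:
        glyphs = sorted(line["glyphs"])  # left to right within the line
        text = glyphs[0][1]
        for (prev_x, _), (x, glyph) in zip(glyphs, glyphs[1:]):
            if x - prev_x > 1.5 * char_width:  # assumed gap threshold
                text += " "
            text += glyph
        out.append(text)
    return "\n".join(out)


# Stamps arrive in whatever order the producing program emitted them.
stamps = [
    (18, 700, "y"), (0, 686, "o"), (0, 700, "H"), (30, 700, "u"),
    (6, 686, "k"), (6, 700, "i"), (24, 700, "o"),
]
print(stamps_to_text(stamps))
```

Everything here is a heuristic. Change the font size, add a second column, or rotate the page, and the thresholds stop working, which is why nothing resembling a field survives the trip.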
As an XML developer friend said, turning a PDF document into structured text is like turning hamburger back into a cow. I dealt with PDF documents in search for over twenty years. You are lucky to get searchable text out of them.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 4, 2024, at 8:28 AM, Uwe Amberger <u...@zurreal.de> wrote:
>
> Hallo!
>
> Problem description:
> I want to index a wide variety of PDFs whose content I have no knowledge of.
> So I cannot define any fields in advance. Users should be able to search for
> terms, and every PDF containing these terms should be found.
>
> I think that a schemaless schema (which adds unknown fields) is not the way
> to go:
> 1. Apache solr documentation warns not to use a schemaless schema in a
> production environment.
> 2. As can be read here:
> https://solr.apache.org/guide/solr/9_2/indexing-guide/schemaless-mode.html
> "Once a field has been added to the schema, its field type is fixed." And it
> cannot be added again with a different field type.
>
> Question:
> When indexing a PDF, is there a way to ignore its unknown fields and still
> index the PDF?
>
> Possible solution:
> I found the IgnoreFieldUpdateProcessorFactory class, which seems to offer
> this possibility, but how do I configure it in the solrconfig.xml?
>
> Thanks for any help!