Hello! Problem description: I want to index a wide variety of PDFs whose content I do not know in advance, so I cannot define any fields up front. Users should be able to search for terms, and every PDF containing those terms should be found.
I don't think schemaless mode (which automatically adds unknown fields to the schema) is the way to go, for two reasons:
1. The Apache Solr documentation warns against using schemaless mode in a production environment.
2. As stated at https://solr.apache.org/guide/solr/9_2/indexing-guide/schemaless-mode.html: "Once a field has been added to the schema, its field type is fixed." So the field cannot be added again with a different field type.

Question: When indexing a PDF, is there a way to ignore its unknown fields and still index the PDF?

Possible solution: I found the IgnoreFieldUpdateProcessorFactory class, which seems to offer this possibility, but how do I configure it in solrconfig.xml?

Thanks for any help!
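
For reference, here is my untested guess at what the configuration might look like, based on the class's Javadoc. As far as I understand it, when no selector (such as fieldName or fieldRegex) is given, the processor removes every field that matches neither a field nor a dynamicField in the schema. The chain name "ignore-unknown-fields" is a name I made up, and marking the chain as default="true" is my own assumption about how it would be wired in:

```xml
<!-- Sketch only, untested. The chain name is hypothetical. -->
<updateRequestProcessorChain name="ignore-unknown-fields" default="true">
  <!-- No selectors configured: assumed to drop any field not known to the schema -->
  <processor class="solr.IgnoreFieldUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- Must come last so the filtered document is actually indexed -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

If the chain is not made the default, I assume it would have to be selected per request with the update.chain=ignore-unknown-fields parameter. Is this the right approach?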