Ignore unknown fields when indexing PDFs

Uwe Amberger Tue, 04 Jun 2024 08:28:34 -0700

Hello!

Problem description:
I want to index a wide variety of PDFs whose content I know nothing about, so 
I cannot define any fields in advance. Users should be able to search for 
terms, and every PDF containing those terms should be found.


I do not think that schemaless mode (which automatically adds unknown fields 
to the schema) is the way to go:
1. The Apache Solr documentation warns against using schemaless mode in a 
production environment.
2. As can be read here: 
https://solr.apache.org/guide/solr/9_2/indexing-guide/schemaless-mode.html
"Once a field has been added to the schema, its field type is fixed." The 
field cannot be added again with a different field type.

Question:
When indexing a PDF, is there a way to ignore its unknown fields and still 
index the PDF?

Possible solution:
I found the IgnoreFieldUpdateProcessorFactory class, which seems to do exactly 
this, but how do I configure it in solrconfig.xml?
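
For reference, a minimal sketch of how this factory might be wired into 
solrconfig.xml, going by the class Javadoc. With no parameters, the processor 
is documented to drop any field whose name does not exist in the schema. The 
chain name "ignore-unknown-fields" is a made-up placeholder, and the sketch 
has not been tested against a real PDF indexing setup.

```xml
<!-- Hypothetical chain name; pick any name you like. -->
<updateRequestProcessorChain name="ignore-unknown-fields">
  <!-- Default behaviour (no parameters): remove fields not defined in the schema. -->
  <processor class="solr.IgnoreFieldUpdateProcessorFactory"/>
  <!-- Standard processors that normally close a custom chain. -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain would then presumably be selected per request with the update.chain 
parameter, e.g. update.chain=ignore-unknown-fields on the request that sends 
the PDF (assuming the extracting handler is used for that).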

Thanks for any help!
