Re: Ignore unknown fields when indexing PDFs

Walter Underwood Tue, 04 Jun 2024 09:15:52 -0700

PDFs don’t have fields. PDFs are instructions for a monkey with rubber stamps 
to make a printed page. They have instructions to move to a location and put a 
character there.
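
To make that concrete, here is a toy sketch in Python. It is not a real PDF 
parser: the (x, y, character) tuples and the rebuild_lines helper are invented 
for illustration, and it assumes every glyph on a line shares exactly the same 
y coordinate, which real PDFs do not guarantee.

```python
from collections import defaultdict

# Hypothetical "rubber stamp" instructions: (x, y, character).
# The order is arbitrary, as it can be in a real page description.
stamps = [
    (20, 100, "i"),
    (0, 100, "H"),
    (10, 80, "K"),
    (0, 80, "O"),
]


def rebuild_lines(stamps):
    """Guess the text lines from positioned glyphs.

    Glyphs are grouped by y coordinate, then each group is read left to right.
    """
    rows = defaultdict(list)
    for x, y, ch in stamps:
        rows[y].append((x, ch))
    # PDF y coordinates grow upward, so the largest y is the top line.
    ordered_rows = sorted(rows.items(), reverse=True)
    return ["".join(ch for _, ch in sorted(row)) for _, row in ordered_rows]


print(rebuild_lines(stamps))
```

Even in this toy version, word spaces, columns, and reading order have to be 
guessed from the positions. Nothing in the input says what a word, a 
paragraph, or a field is.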


As an XML developer friend said, turning a PDF document into structured text is 
like turning hamburger back into a cow.

I dealt with PDF documents in search for over twenty years. You are lucky to 
get searchable text out of them.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 4, 2024, at 8:28 AM, Uwe Amberger <u...@zurreal.de> wrote:
> 
> Hello!
> 
> Problem description:
> I want to index a wide variety of PDFs whose content I have no knowledge of. 
> So I cannot define any fields in advance. Users should be able to search for 
> terms, and every PDF containing these terms should be found.
> 
> I think that a schemaless schema (which adds unknown fields) is not the way 
> to go:
> 1. The Apache Solr documentation warns not to use a schemaless schema in a 
> production environment.
> 2. As can be read here: 
> https://solr.apache.org/guide/solr/9_2/indexing-guide/schemaless-mode.html
> "Once a field has been added to the schema, its field type is fixed." And it 
> cannot be added again with a different field type.
> 
> Question:
> When indexing a PDF, is there a way to ignore its unknown fields and still 
> index the PDF?
> 
> Possible solution:
> I found the IgnoreFieldUpdateProcessorFactory class, which seems to offer 
> this possibility, but how do I configure it in the solrconfig.xml?
> 
> Thanks for any help!
