Solr Cell uses Apache Tika to ingest PDF and other files. Because Tika runs inside the Solr process, this has the disadvantage that if Tika dies (as is possible with malformed PDFs) or needs huge resources, so does Solr. It's generally accepted to be a better idea to parse binary files outside Solr and index the resulting XML or plain text using a scripted indexer process. Here's one of our example projects you might find useful, which includes some linked blog posts: https://github.com/o19s/pdf-discovery-demo/tree/master
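
As a rough illustration of the "parse outside Solr, then index the text" approach, here is a minimal Python sketch. It is not taken from the linked project. It assumes a standalone Tika Server is listening on localhost:9998 and that the target collection accepts a text field called content_txt; that field name, the URLs and the function names are all placeholders for illustration.

```python
import json
import urllib.request

# Assumptions for illustration only: adjust to your own setup.
TIKA_URL = "http://localhost:9998/tika"  # standalone Tika Server, separate from Solr
SOLR_UPDATE_URL = "http://localhost:8983/solr/gettingstarted/update?commit=true"


def extract_text(pdf_path, tika_url=TIKA_URL):
    """Send the raw file to Tika Server and return the extracted plain text."""
    with open(pdf_path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")


def build_update_body(doc_id, text):
    """Build the JSON body for Solr's /update handler (a list of documents)."""
    # "content_txt" is a hypothetical field name; use one your schema defines.
    doc = {"id": doc_id, "content_txt": text.strip()}
    return json.dumps([doc]).encode("utf-8")


def index_document(doc_id, text, solr_url=SOLR_UPDATE_URL):
    """POST the already-extracted text to Solr and return the HTTP status."""
    req = urllib.request.Request(
        solr_url,
        data=build_update_body(doc_id, text),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.status
```

A call such as index_document("doc1", extract_text(r"C:\Temp\solr-9.2.1\example\exampledocs\solr-word.pdf")) would then index the file. If Tika fails on a malformed PDF, only the extraction step raises an error and Solr itself is unaffected.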

Cheers

Charlie

On 09/08/2023 11:51, Manahan, Rhoden Mark B. - US wrote:
Hi,

I was exploring and trying out Solr Cell to ingest and index binary files like 
PDF files. I followed the instructions and it worked initially, but when I tried 
to redo it from scratch to add some more parameters to the HTTP POST request, 
it stopped working. The same PDF file no longer seems to be indexed. Any 
thoughts why?

Commands that I have executed:
bin\solr start -e schemaless -Dsolr.modules=extraction
curl -X POST -H "Content-type:application/json" http://localhost:8983/solr/gettingstarted/config -d "{'add-requesthandler': {'name': '/update/extract', 'class': 'solr.extraction.ExtractingRequestHandler','defaults':{'lowernames': 'true','captureAttr':'true'}}}"
curl http://localhost:8983/solr/gettingstarted/update/extract?literal.id=doc1&commit=true -F "myfile=C:\Temp\solr-9.2.1\example\exampledocs\solr-word.pdf"
---------------------------------------------
bin\solr delete -c gettingstarted
bin\solr stop -p 8983
bin\solr start -e schemaless -Dsolr.modules=extraction
curl -X POST -H "Content-type:application/json" http://localhost:8983/solr/gettingstarted/config -d "{'add-requesthandler': {'name': '/update/extract', 'class': 'solr.extraction.ExtractingRequestHandler','defaults':{'lowernames': 'true','captureAttr':'true'}}}"
curl http://localhost:8983/solr/gettingstarted/update/extract?literal.id=doc1&uprefix=ignored_&fmap.last_modified=last_modified_dt&commit=true -F "myfile=C:\Temp\solr-9.2.1\example\exampledocs\solr-word.pdf"

Appreciate your guidance.

Regards,
Rhoden

--
Charlie Hull - Managing Consultant at OpenSource Connections Limited
Founding member of The Search Network and co-author of Searching the Enterprise
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828

OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
Amtsgericht Charlottenburg | HRB 230712 B
Geschäftsführer: John M. Woodell | David E. Pugh
Finanzamt: Berlin Finanzamt für Körperschaften II
