Solr Cell uses Apache Tika, running inside the Solr process, to ingest
PDF and other files. This has the disadvantage that if Tika dies (as is
possible with malformed PDFs) or needs huge resources, so does Solr. It's
generally accepted to be a better idea to parse binary files outside Solr
and index the resulting XML or plain text using a scripted indexer
process. Here's one of our example projects you might find useful, which
includes some linked blog posts:
https://github.com/o19s/pdf-discovery-demo/tree/master
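To give a rough idea of what "parse outside Solr, then index the text" can look like, here is a minimal sketch rather than a recommended implementation. It assumes the `pdftotext` command-line tool (from poppler-utils) is installed, that Solr is running locally with the `gettingstarted` collection from your commands, and it uses a hypothetical field name `content_txt`; swap in whatever extractor and field names suit your setup.

```python
import json
import subprocess
import urllib.request

# Assumption: local Solr with the "gettingstarted" collection, as in the thread.
SOLR_UPDATE_URL = "http://localhost:8983/solr/gettingstarted/update?commit=true"


def extract_text(pdf_path, timeout_s=60):
    """Run the extractor in its own process so a crash or hang cannot hurt Solr.

    Assumption: `pdftotext` (poppler-utils) is on PATH; any extractor that
    writes plain text to stdout could be substituted here.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the extracted text to stdout
        capture_output=True,
        timeout=timeout_s,             # give up on PDFs that make the parser hang
        check=True,                    # raise if the extractor exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace")


def build_update_request(doc_id, text, url=SOLR_UPDATE_URL):
    """Build a JSON update request for one document (nothing is sent yet)."""
    # "content_txt" is a hypothetical field name; use one your schema defines.
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_pdf(doc_id, pdf_path):
    """Extract text outside Solr, then post it to the standard update handler."""
    request = build_update_request(doc_id, extract_text(pdf_path))
    with urllib.request.urlopen(request) as response:
        return response.status
```

With this shape, something like `index_pdf("doc1", "solr-word.pdf")` would extract and index one file, and a failed or timed-out extraction raises an exception in the script instead of affecting Solr.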
Cheers
Charlie
On 09/08/2023 11:51, Manahan, Rhoden Mark B. - US wrote:
Hi,
I was exploring and trying out Solr Cell to ingest and index binary files like
PDF files. I followed the instructions and it worked initially, but when I tried
to redo it from scratch to add some more parameters to the HTTP POST request,
it stopped working. The same PDF file no longer seems to be indexed. Any
thoughts why?
Commands that I have executed:
bin\solr start -e schemaless -Dsolr.modules=extraction
curl -X POST -H "Content-type:application/json"
http://localhost:8983/solr/gettingstarted/config -d "{'add-requesthandler': {'name':
'/update/extract', 'class': 'solr.extraction.ExtractingRequestHandler','defaults':{'lowernames':
'true','captureAttr':'true'}}}"
curl http://localhost:8983/solr/gettingstarted/update/extract?literal.id=doc1&commit=true
-F "myfile=C:\Temp\solr-9.2.1\example\exampledocs\solr-word.pdf"
---------------------------------------------
bin\solr delete -c gettingstarted
bin\solr stop -p 8983
bin\solr start -e schemaless -Dsolr.modules=extraction
curl -X POST -H "Content-type:application/json"
http://localhost:8983/solr/gettingstarted/config -d "{'add-requesthandler': {'name':
'/update/extract', 'class': 'solr.extraction.ExtractingRequestHandler','defaults':{'lowernames':
'true','captureAttr':'true'}}}"
curl
http://localhost:8983/solr/gettingstarted/update/extract?literal.id=doc1&uprefix=ignored_&fmap.last_modified=last_modified_dt&commit=true
-F "myfile=C:\Temp\solr-9.2.1\example\exampledocs\solr-word.pdf"
Appreciate your guidance.
Regards,
Rhoden
--
Charlie Hull - Managing Consultant at OpenSource Connections Limited
Founding member of The Search Network and co-author of Searching the Enterprise