On 8/15/23 06:35, ufuk yılmaz wrote:
> Today I tried to index some pdf files and see how that goes.
> After a few hours of unsuccessful attempts using SolrCloud docker image etc. I
> decided I should follow the exact path in the documentation first.
> I downloaded Solr 8.11 archive from:
> https://www.apache.org/dyn/closer.lua/lucene/solr/8.11.2/solr-8.11.2.zip?action=download

Solr 8.x is close to dead. It is in maintenance mode, which means that
major bugs and security issues are all that will be fixed. When 10.0.0
gets released, 8.x will be entirely end of life and 9.x will move to
maintenance mode. Releases are never scheduled in advance, so I do not
know when 10.0.0 will be released.
The instructions in the documentation appear to be incorrect for 8.11.
The schemaless example does not have the correct handler in its config
for SolrCell.
You should be using 9.x. The instructions in the latest ref guide seem
to be complete. Those instructions are not going to be usable with 8.x
because they use functionality only found in 9.x.
https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html
Using SolrCell in production is strongly discouraged. Tika can be very
unstable. It is known to consume large amounts of memory and even
crash. If this kind of problem occurs when Tika is running inside Solr,
that problem will affect Solr too. It is better to write a separate
program that runs Tika and sends the extracted data to Solr for indexing,
one with infrastructure that can recover from misbehavior like crashing.
This is stated in the ref guide:
https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html#solr-cell-performance-implications
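
A rough sketch of what such a separate program might look like, in Python
with only the standard library. It runs the Tika command-line app in a child
process with a timeout, so a crash or hang only loses that one file, and then
posts the extracted text to Solr's JSON update handler. The jar path, the
Solr URL, the collection name "mycollection", and the field name
"content_txt" are all assumptions for illustration; adjust them to your own
setup and schema.

```python
import json
import subprocess
import urllib.request

# Assumptions for illustration only: jar location, Solr URL, collection name.
TIKA_CMD = ["java", "-jar", "tika-app.jar", "--text"]
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commit=true"


def extract_text(path, cmd=TIKA_CMD, timeout=120):
    """Run the extractor in a child process.

    Returns the extracted text, or None if the process crashed, timed out,
    could not be started, or exited with a non-zero status.
    """
    try:
        result = subprocess.run(
            list(cmd) + [path],
            capture_output=True,
            encoding="utf-8",
            errors="replace",
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build a POST request carrying a JSON array of documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_files(paths, cmd=TIKA_CMD, url=SOLR_UPDATE_URL):
    """Extract each file and send the successes to Solr.

    Returns the list of paths whose extraction failed, so the caller can
    retry or log them. "content_txt" is a hypothetical field name.
    """
    docs, failed = [], []
    for path in paths:
        text = extract_text(path, cmd=cmd)
        if text is None:
            failed.append(path)
        else:
            docs.append({"id": path, "content_txt": text})
    if docs:
        request = build_update_request(docs, url)
        with urllib.request.urlopen(request, timeout=30) as response:
            response.read()
    return failed
```

Calling index_files(["a.pdf", "b.pdf"]) would return the paths that Tika could
not process. Because extraction happens in its own process, a misbehaving
document never takes Solr down with it.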
Thanks,
Shawn