On 8/15/23 06:35, ufuk yılmaz wrote:
Today I tried to index some pdf files and see how that goes.

After a few hours of unsuccessful attempts with the SolrCloud Docker image 
and so on, I decided I should first follow the exact path in the documentation.

I downloaded the Solr 8.11 archive from: 
https://www.apache.org/dyn/closer.lua/lucene/solr/8.11.2/solr-8.11.2.zip?action=download

Solr 8.x is close to dead. It is in maintenance mode, which means that major bugs and security issues are all that will be fixed. When 10.0.0 gets released, 8.x will be entirely end of life and 9.x will move to maintenance mode. Releases are never scheduled in advance, so I do not know when 10.0.0 will be released.

The instructions in the documentation appear to be incorrect for 8.11. The schemaless example does not have the correct handler in its config for SolrCell.

You should be using 9.x. The instructions in the latest ref guide seem to be complete. Those instructions are not going to be usable with 8.x because they use functionality only found in 9.x.

https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html

Using SolrCell in production is strongly discouraged. Tika can be very unstable. It is known to consume large amounts of memory and even crash. If this kind of problem occurs when Tika is running inside Solr, it will take Solr down with it. It is better to write a separate program that runs Tika and sends the extracted data to Solr for indexing, with infrastructure that can recover from misbehavior such as crashing.

This is stated in the ref guide:

https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html#solr-cell-performance-implications
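
As a rough illustration of the "separate program" idea, here is a minimal Python sketch. It assumes a tika-app jar is available locally and is run once per file as a child process, so a Tika crash or hang only costs that one file. The jar path, the collection name "docs", the field name "content", and the timeout values are all placeholders I made up for the example; adjust them to your setup and schema. It is not a production-ready indexer.

```python
import json
import subprocess
import sys
import urllib.request

# Assumptions for illustration only: jar location, collection name, field name.
TIKA_CMD = ["java", "-jar", "tika-app.jar", "--text"]
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commitWithin=10000"


def extract_text(path, cmd=TIKA_CMD, timeout=120):
    """Run the extractor in a child process; return its text, or None on failure."""
    try:
        result = subprocess.run(
            cmd + [path],
            capture_output=True,
            encoding="utf-8",
            errors="replace",
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None  # hung, or could not be started: give up on this file only
    if result.returncode != 0:
        return None  # crashed or ran out of memory: only the child process died
    return result.stdout


def build_update_request(doc_id, text, url=SOLR_UPDATE_URL):
    """Build a JSON update request holding a single document."""
    body = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def index_files(paths):
    """Extract and index each file; return the paths whose extraction failed."""
    failed = []
    for path in paths:
        text = extract_text(path)
        if text is None:
            failed.append(path)
            continue
        request = build_update_request(path, text)
        with urllib.request.urlopen(request, timeout=30) as response:
            response.read()
    return failed


if __name__ == "__main__":
    for path in index_files(sys.argv[1:]):
        print("extraction failed:", path, file=sys.stderr)
```

Run it with the files to index as arguments. Files that Tika could not handle are reported on stderr and skipped, and Solr itself never sees the failure.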

Thanks,
Shawn
