The Data Import Handler is no longer part of Solr, so you may also wish to
ask questions on its discussion board:
https://github.com/SearchScale/dataimporthandler/discussions

Data Import Handler is a reasonable tool for indexing small, uncomplicated
databases, but it does not scale well as system complexity increases (it
copes moderately well with growing corpus size if complexity stays low).
Since Solr accepts the data straight from DIH, any massaging, joining, or
enrichment of the data winds up happening in the database, either via very
complicated queries, stored procedures, or secondary processes that
duplicate the massaged, Solr-ready form into relatively simple tables.
Analysis with Tika inside Solr is another feature that doesn't scale well.
Tika can handle an amazing variety of data, but that processing is not
free, and it produces load during indexing that competes for resources
with serving queries. As systems scale, it's almost always necessary to
move Tika analysis out into a separate precursor process.

If your PDFs are in a database, DIH might be somewhat relevant, and if you
are just experimenting, or you know your data set will stay small for the
long term (i.e. < 1M DB rows, very few joins, not a lot of dates or other
fields to transform), it is potentially useful. I have encountered many
users who started with DIH and grew out of it. Migrating away from the
super-complex DB infrastructure that was supporting it is frequently
costly.

If your PDFs are in a file system, there's really no good reason to use
DIH at all.

(Shameless plug starts here ;) )

Much of the above, and other issues too, motivated me to create JesterJ
<https://github.com/nsoft/jesterj>, a free, open-source framework for
building out search indexing infrastructure. It released 1.0 some time
ago, and I have an example JesterJ project on GitHub
<https://github.com/nsoft/index-solr-ref-guide> that will crawl a local
copy of the Solr reference guide and index it. It contains an example
<https://github.com/nsoft/index-solr-ref-guide/blob/main/src/main/java/org/jesterj/index/refguide/SolrRefguideConfig.java#L153>
of using Tika (though with the default configuration; you will want to
customize). It's just an example, so there's a lot it doesn't do that
would make the search better, but if you follow the instructions there,
you do get results corresponding to the content of the ref guide from a
local Solr. Despite releasing and announcing it, I've had some difficulty
getting folks to try it, probably because almost everyone with a serious
system already has some sort of indexing solution in place. But one of my
specific intentions was to build a framework usable by people just
learning Solr, one that could grow with them until they reached an epic
scale (and, hopefully, were profitable enough to afford transitioning to
a custom high-volume system using Spark/Kafka or whatever suited their
problem space best).

So if you (or any other users here) do try out JesterJ, I'd love to hear
what went well and what didn't. It has a filesystem scanner that will
happily crawl through a directory of documents, feeding them into a
processing plan. The last step of most plans is "send to Solr", of course.
You can think of it as ETL for search data, though search data is somewhat
different from database data, so it won't look quite like traditional ETL.
If you have questions, raise them on the JesterJ GitHub discussion forums
or the Discord channel.
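If the "processing plan" idea is unfamiliar, here is a toy Python sketch
of the general pattern (this is NOT JesterJ's actual API, just an
illustration of documents flowing through an ordered chain of steps,
where the last step would normally send to Solr):

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]
Step = Callable[[Doc], Doc]

def run_plan(docs: List[Doc], steps: List[Step]) -> List[Doc]:
    """Push every document through each step in order. In a real
    framework, the final step would ship the doc to Solr."""
    for step in steps:
        docs = [step(d) for d in docs]
    return docs

# Hypothetical steps, made up for this example:
def add_source(doc: Doc) -> Doc:
    doc["source"] = "filesystem"
    return doc

def lowercase_title(doc: Doc) -> Doc:
    doc["title"] = doc.get("title", "").lower()
    return doc
```

A "send to Solr" step would have the same shape: take a doc, do its work
(an HTTP POST), and pass the doc along.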

-Gus



On Tue, Oct 29, 2024 at 8:32 AM Antonio Gallardo <amgallard...@gmail.com>
wrote:

> Hi:
>
> I'm new to the list and with Apache Solr. I'm trying version 9.7.0 on
> Linux Ubuntu 2204 and I want to index multiple pdf files to analyze them
> with tika.
>
> I've created a CORE from the solr admin panel at the following path:
>
>   *
> "/home/myuser/APPS/solr-9.7.0/server/solr/configsets/CORE"
>
> The source of PDF documents to import is located at the path:
>
>   *
> "/home/myuser/documentos/Doc_solr/"
>
> And I have configured 3 files:
>
>   *
> managed-schema.xml:  definition of PDF metadata fields
>
>   *
> <field name=> Metadata and Text
>
>   *
> solrconfig.xml
>
>   *
> <requestHandler name="/select" class="solr.SearchHandler">
>   *
> <requestHandler name="/dataimport"
> class="org.apache.solr.handler.dataimport.DataImportHandler">
>
>   *
> tika-data-config.xml:
>
>   *
> <entity name="pdf" processor="TikaEntityProcessor"
>   *
> <entity name="file" processor="FileListEntityProcessor"
>
> Do I need to create another file for the dataimporter.xml configuration?
> What should I include?
>
> Thanks
>


-- 
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
