The Data Import Handler is no longer part of Solr, so you may also wish to ask questions on its discussion board: https://github.com/SearchScale/dataimporthandler/discussions
Data Import Handler is a reasonable tool for indexing small, uncomplicated databases, but it does not scale well as system complexity increases (it holds up moderately well as corpus size grows, so long as complexity stays low). Since Solr is basically accepting the data straight from DIH, any massaging, joining, or enrichment of the data winds up happening in the database, either via very complicated queries, stored procedures, or secondary processes that duplicate the massaged, Solr-ready form into relatively simple tables.

Analysis with Tika inside Solr is also a feature that doesn't scale well. Tika can handle an amazing variety of data, but that processing is not free, and the load it produces during indexing competes for resources with serving queries. As systems scale, it's almost always necessary to move Tika analysis out into a separate precursor process.

If your PDFs are in a database, DIH might be somewhat relevant, and if you are just experimenting, or you know your data set will stay small for the long term (i.e. < 1M DB rows, very few joins, not a lot of dates or other fields to transform), it is potentially useful. I have encountered many users who started with DIH and grew out of it; migrating away from the super-complex DB infrastructure that was supporting it is frequently costly. If your PDFs are in a file system, there's really no good reason to use DIH at all.

(Shameless plug starts here ;) ) Much of the above, and other issues too, motivated me to create JesterJ <https://github.com/nsoft/jesterj>, a free, open source framework for building out search indexing infrastructure. It released 1.0 some time ago, and I have an example JesterJ project on GitHub <https://github.com/nsoft/index-solr-ref-guide> that will crawl a local copy of the Solr Reference Guide and index it.
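To make the "separate precursor process" idea concrete, here's a rough sketch of the shape such a pipeline takes: Tika runs somewhere else (tika-server, the tika-app jar, whatever), and your process just maps the extracted text and metadata onto flat, Solr-ready documents and POSTs them. The field names, core name, and URL below are all placeholders, not anything from the original question:

```python
import json
import urllib.request

def build_solr_doc(doc_id, text, metadata):
    """Flatten already-extracted text plus Tika metadata into a Solr-ready doc.

    Field names (content_txt, meta_*_s) are placeholders -- adjust to your schema.
    """
    doc = {"id": doc_id, "content_txt": text}
    for key, value in metadata.items():
        # A *_s dynamic-field suffix is a common convention for string fields.
        doc["meta_" + key.lower().replace(" ", "_") + "_s"] = value
    return doc

def post_to_solr(docs, solr_url="http://localhost:8983/solr/mycore/update?commit=true"):
    """POST a batch of pre-extracted docs to Solr (core name/URL are placeholders)."""
    body = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(
        solr_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    # In a real pipeline, text and metadata come from a standalone Tika step,
    # so the extraction load never competes with Solr's query serving.
    doc = build_solr_doc("docs/report.pdf", "extracted body text", {"Author": "Someone"})
    print(doc["id"])
```

The point isn't the code itself, it's the separation: by the time anything reaches Solr, the expensive extraction has already happened on hardware that isn't serving queries.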
It does contain an example <https://github.com/nsoft/index-solr-ref-guide/blob/main/src/main/java/org/jesterj/index/refguide/SolrRefguideConfig.java#L153> of using Tika (though in the default configuration; you will want to customize). It's just an example, so there's a lot it doesn't do that would make the search better, but if you follow the instructions there, you do get results from a local Solr corresponding to the content of the ref guide.

Despite releasing and announcing it, I've had some difficulty getting folks to try it, probably because almost everyone with a serious system already has some sort of indexing solution in place. But one of my specific intentions was to build a framework usable by people just learning Solr, one that could grow with them until they reached an epic scale (and hopefully were profitable enough to afford transitioning to a custom high-volume system using Spark/Kafka or whatever suited their problem space best). So if you (or any other users here) do try out JesterJ, I'd love to hear what went well and what didn't.

It has a filesystem scanner that will happily crawl through a directory of documents, feeding them into a processing plan; the last step of most plans is "send to Solr", of course. You can think of it as ETL for search data, though search data is somewhat different from database data, so it won't look quite like traditional ETL. If you have questions, raise them on the JesterJ discussion forums on GitHub or the Discord channel.

-Gus

On Tue, Oct 29, 2024 at 8:32 AM Antonio Gallardo <amgallard...@gmail.com> wrote:

> Hi:
>
> I'm new to the list and to Apache Solr. I'm trying version 9.7.0 on
> Linux Ubuntu 22.04, and I want to index multiple PDF files to analyze
> them with Tika.
> I've created a CORE from the Solr admin panel at the following path:
>
> * "/home/myuser/APPS/solr-9.7.0/server/solr/configsets/CORE"
>
> The source of PDF documents to import is located at the path:
>
> * "/home/myuser/documentos/Doc_solr/"
>
> And I have configured 3 files:
>
> * managed-schema.xml: definition of PDF metadata fields
>   * <field name=> metadata and text
> * solrconfig.xml
>   * <requestHandler name="/select" class="solr.SearchHandler">
>   * <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
> * tika-data-config.xml:
>   * <entity name="pdf" processor="TikaEntityProcessor"
>   * <entity name="file" processor="FileListEntityProcessor"
>
> Do I need to create another file for the dataimporter.xml configuration?
> What should I include?
>
> Thanks

-- 
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
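P.S. To the question at the end of the quote: no separate dataimporter.xml should be needed; the /dataimport requestHandler in solrconfig.xml just points at the tika-data-config.xml via its "config" parameter. For directory crawling, the usual pattern is to nest the TikaEntityProcessor entity inside the FileListEntityProcessor entity, roughly like the sketch below. Take this as an outline from memory rather than a tested config, and check attribute names against the DIH docs; the baseDir is taken from your message, the field names are placeholders:

```xml
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- Outer entity lists the files; rootEntity="false" so each PDF,
         not each directory listing row, becomes a Solr document. -->
    <entity name="file" processor="FileListEntityProcessor"
            baseDir="/home/myuser/documentos/Doc_solr/"
            fileName=".*\.pdf" recursive="true"
            rootEntity="false" dataSource="null">
      <!-- Inner entity runs Tika on each file found by the outer one. -->
      <entity name="pdf" processor="TikaEntityProcessor"
              url="${file.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="text" name="content"/>
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

But per everything above: for PDFs that live on a file system, I'd skip DIH entirely and run the extraction outside Solr.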