Hi ufuk, I was thinking along the same lines about broadening the tooling for handling the delta load. Flume looks like an interesting option.
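Here is the skeleton I'm eyeing -- the source/channel/sink wiring follows the
Flume user guide, but the keedio source class name and the morphline file path
are assumptions I still need to verify against the linked docs:

    # agent: SQL source -> memory channel -> MorphlineSolrSink
    agent.sources  = sql-source
    agent.channels = mem-channel
    agent.sinks    = solr-sink

    # keedio plugin (unverified class name; check the plugin README)
    agent.sources.sql-source.type = org.keedio.flume.source.SQLSource
    agent.sources.sql-source.channels = mem-channel

    agent.channels.mem-channel.type = memory
    agent.channels.mem-channel.capacity = 10000

    agent.sinks.solr-sink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
    agent.sinks.solr-sink.channel = mem-channel
    # morphline config that maps Flume events to Solr fields
    agent.sinks.solr-sink.morphlineFile = /etc/flume/conf/morphline.conf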
I'm so blessed to be working with so many smart and kind people on this mailing list. Thank you. Happy Friday.

On Fri, Dec 8, 2023 at 1:48 AM ufuk yılmaz <uyil...@vivaldi.net.invalid> wrote:

> Hi Vince,
>
> It shouldn't take too much time to write a simple loop in your favorite
> language which fetches rows from the db and sends them to Solr over http
> to the /update handler. IMO it's easier than trying to figure out DIH's
> particularities, especially if, in the future, you need to modify the
> documents based on some logical conditions before indexing.
>
> If you don't mind learning yet another tool, we used Apache Flume to
> index data to Solr. It supports moving data from various sources into
> various destinations. For your use case, maybe you can use SQL as the
> source and MorphlineSolrSink as the destination (ctrl+f here:
> https://flume.apache.org/releases/content/1.11.0/FlumeUserGuide.html).
> There is an SQL source plugin here which looks a bit old but may work:
> https://github.com/keedio/flume-ng-sql-source
> You can also write your own source plugin. Flume just helps with
> guaranteed delivery, if you understand its way of working.
>
> I don't know your business case, but I'd prefer the first option most of
> the time.
>
> -ufuk yilmaz
>
> —
>
> > On 8 Dec 2023, at 02:22, Vince McMahon <sippingonesandze...@gmail.com> wrote:
> >
> > Thanks, Shawn.
> >
> > DIH full-import, by itself, works very well. It is a bummer that my
> > incremental load runs into the millions. When batchSize is specified
> > on the data source, delta-import honors that batch size only once, for
> > the first fetch, then loops through the rest at a few hundred rows per
> > second. That doesn't get all the indexing done within a day, which is
> > what I need.
> >
> > I hope this finding helps the maintainers of the code improve it. It
> > took me days to realize it.
> >
> > Thanks, again.
> >
> > On Thu, Dec 7, 2023, 4:49 PM Shawn Heisey <apa...@elyograg.org.invalid>
> > wrote:
> >
> >>> On 12/7/23 07:56, Vince McMahon wrote:
> >>> {
> >>>   "responseHeader": {
> >>>     "status": 0,
> >>>     "QTime": 0
> >>>   },
> >>>   "initArgs": [
> >>>     "defaults",
> >>>     [
> >>>       "config",
> >>>       "db-data-config.xml"
> >>>     ]
> >>>   ],
> >>>   "command": "status",
> >>>   "status": "idle",
> >>>   "importResponse": "",
> >>>   "statusMessages": {
> >>>     "Total Requests made to DataSource": "1",
> >>>     "Total Rows Fetched": "915000",
> >>>     "Total Documents Processed": "915000",
> >>>     "Total Documents Skipped": "0",
> >>>     "Full Dump Started": "2023-12-07 02:54:29",
> >>>     "": "Indexing completed. Added/Updated: 915000 documents. Deleted 0 documents.",
> >>>     "Committed": "2023-12-07 02:54:51",
> >>>     "Time taken": "0:0:21.831"
> >>>   }
> >>> }
> >>
> >> There's no way Solr can index 915000 docs in 21 seconds without a LOT
> >> of threads in the indexing program, and DIH is single-threaded. As
> >> you've already noted, it didn't actually index most of the documents.
> >> I don't have an answer as to why it didn't work.
> >>
> >> DIH lacks decent logging, error handling, and multi-threading. It is
> >> not the most reliable way to index. This is why it was deprecated a
> >> while back and then removed from 9.x. You would be far better off
> >> writing your own indexing program rather than using DIH.
> >>
> >> I have an idea for a multi-threaded database->Solr indexing program,
> >> but I haven't had much time to spend on it. If I can ever get it done,
> >> it will be freely available.
> >>
> >> On the entity, "rows" is not a valid attribute.
> >> To control how many DB rows are fetched at a time, set batchSize on
> >> the dataSource element. The default batchSize is 500.
> >>
> >> Thanks,
> >> Shawn
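P.S. To make Shawn's batchSize point concrete, here is a minimal sketch of
where it goes in db-data-config.xml. The driver, URL, credentials, and query
are made-up placeholders, not my actual config:

    <dataConfig>
      <dataSource type="JdbcDataSource"
                  driver="org.postgresql.Driver"
                  url="jdbc:postgresql://dbhost:5432/mydb"
                  user="solr_reader"
                  password="changeme"
                  batchSize="10000"/>
      <document>
        <!-- no "rows" attribute here; fetch size comes from the dataSource -->
        <entity name="docs" query="SELECT id, title, body FROM docs"/>
      </document>
    </dataConfig>

For JdbcDataSource, batchSize is handed to the JDBC driver as the fetch size,
so the driver has to honor it too.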
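And here is roughly what ufuk's "simple loop" suggestion looks like in Python --
a sketch, assuming a docs table with a modified column for delta detection, a
core named mycore, and the third-party requests library; swap sqlite3 for your
real DB driver:

    import sqlite3           # stand-in for your actual DB driver
    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/mycore/update"
    BATCH_SIZE = 10000
    last_run = "2023-12-01 00:00:00"  # high-water mark from the previous run

    conn = sqlite3.connect("source.db")
    cur = conn.cursor()
    cur.execute("SELECT id, title, body FROM docs WHERE modified > ?",
                (last_run,))

    while True:
        rows = cur.fetchmany(BATCH_SIZE)   # every batch is the size you asked for
        if not rows:
            break
        docs = [{"id": r[0], "title": r[1], "body": r[2]} for r in rows]
        # the /update handler accepts a JSON array of documents
        requests.post(SOLR_UPDATE, json=docs, timeout=120).raise_for_status()

    # one commit at the end instead of one per batch
    requests.post(SOLR_UPDATE, params={"commit": "true"},
                  json=[], timeout=120).raise_for_status()
    conn.close()

Unlike delta-import, every fetch here honors BATCH_SIZE, and you can
parallelize by partitioning the query on id.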