Vince,

Regardless of DIH, LogUpdateProcessorFactory
<https://solr.apache.org/guide/solr/latest/configuration-guide/update-request-processors.html#update-processor-factories-you-should-not-modify-or-remove>
should log the deleteQuery that wiped the docs. You can enable verbose logging and find out what happened.
On Fri, Dec 8, 2023 at 4:29 PM Vince McMahon <sippingonesandze...@gmail.com> wrote:

> Hi, ufuk
>
> I was thinking along the same lines, to broaden the choice of tools
> for handling the delta load. Flume looks like an interesting option.
>
> I'm so blessed to be working with so many smart and kind people on this
> mailing list.
>
> Thank you. Happy Friday.
>
> On Fri, Dec 8, 2023 at 1:48 AM ufuk yılmaz <uyil...@vivaldi.net.invalid> wrote:
>
> > Hi Vince,
> >
> > It shouldn't take too much time to write a simple loop in your favorite
> > language that fetches rows from the DB and sends them to Solr's /update
> > handler over HTTP. IMO it's easier than trying to figure out DIH's
> > particularities, especially if, in the future, you need to modify the
> > documents based on some logical conditions before indexing.
> >
> > If you don't mind learning yet another tool, we used Apache Flume to
> > index data into Solr. It supports moving data from various sources into
> > various destinations. For your use case, maybe you can use SQL as the
> > source and MorphlineSolrSink as the destination (Ctrl+F here:
> > https://flume.apache.org/releases/content/1.11.0/FlumeUserGuide.html).
> > There is an SQL source plugin here which looks a bit old but may work:
> > https://github.com/keedio/flume-ng-sql-source
> > You can also write your own source plugin. Flume just helps with
> > guaranteed delivery, if you understand its way of working.
> >
> > I don't know your business case, but I'd prefer the first option most of
> > the time.
> >
> > -ufuk yilmaz
> >
> > —
> >
> > > On 8 Dec 2023, at 02:22, Vince McMahon <sippingonesandze...@gmail.com> wrote:
> > >
> > > Thanks, Shawn.
> > >
> > > DIH full-import, by itself, works very well. It is a bummer that my
> > > incremental load runs into the millions. When batchSize is specified
> > > on the data source, delta-import honors that batch size only once, for
> > > the first fetch, then loops through the rest at hundreds of rows per
> > > second.
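For what it's worth, the "simple loop" ufuk describes can be sketched roughly as below. This is a minimal, hedged sketch, not a production indexer: the core name, table, query, and column names are placeholders, and commit strategy, retries, and error handling are left out.

```python
import json
import sqlite3  # stand-in for your real DB driver (psycopg2, mysqlclient, ...)
import urllib.request

# Hypothetical Solr core; adjust to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"
BATCH_SIZE = 1000

def chunked(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def rows_to_docs(rows, columns):
    """Map DB row tuples to Solr JSON documents keyed by column name."""
    return [dict(zip(columns, row)) for row in rows]

def post_batch(docs):
    """Send one batch of documents to Solr's /update handler as JSON."""
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def index_table(conn):
    """Stream rows from the DB cursor and index them batch by batch."""
    cur = conn.cursor()
    cur.execute("SELECT id, title FROM my_table")  # placeholder query
    columns = [d[0] for d in cur.description]
    for batch in chunked(cur, BATCH_SIZE):
        post_batch(rows_to_docs(batch, columns))
```

Because the cursor is consumed lazily, memory stays bounded even when the incremental load runs into millions of rows, and you control batch size end to end instead of depending on DIH's delta-import behavior.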
> > > That doesn't help getting all the indexing done within a day, which my
> > > use case needs.
> > >
> > > I hope this finding helps the maintainers of the code improve it. It
> > > took me days to realize it.
> > >
> > > Thanks, again.
> > >
> > > On Thu, Dec 7, 2023, 4:49 PM Shawn Heisey <apa...@elyograg.org.invalid> wrote:
> > >
> > >>> On 12/7/23 07:56, Vince McMahon wrote:
> > >>> {
> > >>>   "responseHeader": {
> > >>>     "status": 0,
> > >>>     "QTime": 0
> > >>>   },
> > >>>   "initArgs": [
> > >>>     "defaults",
> > >>>     [
> > >>>       "config",
> > >>>       "db-data-config.xml"
> > >>>     ]
> > >>>   ],
> > >>>   "command": "status",
> > >>>   "status": "idle",
> > >>>   "importResponse": "",
> > >>>   "statusMessages": {
> > >>>     "Total Requests made to DataSource": "1",
> > >>>     "Total Rows Fetched": "915000",
> > >>>     "Total Documents Processed": "915000",
> > >>>     "Total Documents Skipped": "0",
> > >>>     "Full Dump Started": "2023-12-07 02:54:29",
> > >>>     "": "Indexing completed. Added/Updated: 915000 documents. Deleted 0 documents.",
> > >>>     "Committed": "2023-12-07 02:54:51",
> > >>>     "Time taken": "0:0:21.831"
> > >>>   }
> > >>> }
> > >>
> > >> There's no way Solr can index 915000 docs in 21 seconds without a LOT
> > >> of threads in the indexing program, and DIH is single-threaded. As
> > >> you've already noted, it didn't actually index most of the documents.
> > >> I don't have an answer as to why it didn't work.
> > >>
> > >> DIH lacks decent logging, error handling, and multi-threading. It is
> > >> not the most reliable way to index. This is why it was deprecated a
> > >> while back and then removed from 9.x. You would be far better off
> > >> writing your own indexing program than using DIH.
> > >>
> > >> I have an idea for a multi-threaded database-to-Solr indexing program,
> > >> but I haven't had much time to spend on it. If I can ever get it done,
> > >> it will be freely available.
> > >>
> > >> On the entity, "rows" is not a valid attribute.
> > >> To control how many DB rows are fetched at a time, set batchSize on
> > >> the dataSource element. The default batchSize is 500.
> > >>
> > >> Thanks,
> > >> Shawn

--
Sincerely yours
Mikhail Khludnev
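As a point of reference, the batchSize setting Shawn mentions goes on the dataSource element in DIH's db-data-config.xml. A sketch, where the driver class, JDBC URL, and credentials are placeholders for your environment:

```xml
<dataSource type="JdbcDataSource"
            driver="org.postgresql.Driver"
            url="jdbc:postgresql://localhost:5432/mydb"
            user="solr_reader"
            password="changeme"
            batchSize="10000"/>
```

Note this only controls how many rows the JDBC driver fetches per round trip from the database; as Vince observed, it does not change how delta-import paces its work after the first fetch.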