Thanks, Shawn.

DIH full-import, by itself, works very well.  The trouble is that my
incremental load is itself in the millions of rows.  When batchSize is
specified on the data source, delta-import honors that batch size only
once, for the first fetch, and then works through the rest at only a few
hundred rows per second.  That doesn't help me get all the indexing done
within a day, which is what I need.
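For context, this is roughly how the batch size is declared in my
db-data-config.xml (the driver, connection details, and table/column
names below are placeholders, not my actual setup):

  <dataConfig>
    <dataSource type="JdbcDataSource"
                driver="org.postgresql.Driver"
                url="jdbc:postgresql://localhost:5432/mydb"
                user="solr" password="secret"
                batchSize="10000"/>
    <document>
      <entity name="item"
              query="SELECT id, title FROM items"
              deltaQuery="SELECT id FROM items WHERE last_modified &gt; '${dataimporter.last_index_time}'"
              deltaImportQuery="SELECT id, title FROM items WHERE id = '${dataimporter.delta.id}'">
        <field column="id" name="id"/>
        <field column="title" name="title"/>
      </entity>
    </document>
  </dataConfig>

As far as I can tell, batchSize only helps the initial deltaQuery fetch;
the deltaImportQuery being run once per changed id would explain why
everything after that first fetch crawls along.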

I hope this finding helps whoever maintains the code improve it.  It took
me days to track down.

Thanks, again.



On Thu, Dec 7, 2023, 4:49 PM Shawn Heisey <apa...@elyograg.org.invalid>
wrote:

> On 12/7/23 07:56, Vince McMahon wrote:
> > {
> >    "responseHeader": {
> >      "status": 0,
> >      "QTime": 0
> >    },
> >    "initArgs": [
> >      "defaults",
> >      [
> >        "config",
> >        "db-data-config.xml"
> >      ]
> >    ],
> >    "command": "status",
> >    "status": "idle",
> >    "importResponse": "",
> >    "statusMessages": {
> >      "Total Requests made to DataSource": "1",
> >      "Total Rows Fetched": "915000",
> >      "Total Documents Processed": "915000",
> >      "Total Documents Skipped": "0",
> >      "Full Dump Started": "2023-12-07 02:54:29",
> >      "": "Indexing completed. Added/Updated: 915000 documents. Deleted
> > 0 documents.",
> >      "Committed": "2023-12-07 02:54:51",
> >      "Time taken": "0:0:21.831"
> >    }
> > }
>
> There's no way Solr can index 915000 docs in 21 seconds without a LOT of
> threads in the indexing program, and DIH is single-threaded.  As you've
> already noted, it didn't actually index most of the documents.  I don't
> have an answer as to why it didn't work.
>
> DIH lacks decent logging, error handling, and multi-threading.  It is
> not the most reliable way to index.  This is why it was deprecated a
> while back and then removed from 9.x.  You would be far better off
> writing your own indexing program rather than using DIH.
>
> I have an idea for a multi-threaded database->solr indexing program, but
> haven't had much time to spend on it.  If I can ever get it done, it
> will be freely available.
>
> On the entity, "rows" is not a valid attribute.  To control how many DB
> rows are fetched at a time, set batchSize on the dataSource element.
> The default batchSize is 500.
>
> Thanks,
> Shawn
>
>
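If I do end up replacing DIH with my own indexer as Shawn suggests above,
a bare-bones SolrJ sketch is roughly what I have in mind (the collection
name, JDBC URL, query, and batch size below are placeholders, not a
finished program):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.Http2SolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class JdbcToSolr {
      public static void main(String[] args) throws Exception {
          String jdbcUrl = "jdbc:postgresql://localhost:5432/mydb";   // placeholder
          String solrUrl = "http://localhost:8983/solr";              // placeholder
          String collection = "mycollection";                         // placeholder

          try (Connection con = DriverManager.getConnection(jdbcUrl, "user", "password");
               Http2SolrClient solr = new Http2SolrClient.Builder(solrUrl).build()) {
              con.setAutoCommit(false);        // some JDBC drivers only stream with autocommit off
              Statement st = con.createStatement();
              st.setFetchSize(10000);          // pull rows from the database in large chunks
              ResultSet rs = st.executeQuery("SELECT id, title FROM items");   // placeholder query

              List<SolrInputDocument> batch = new ArrayList<>();
              while (rs.next()) {
                  SolrInputDocument doc = new SolrInputDocument();
                  doc.addField("id", rs.getString("id"));
                  doc.addField("title", rs.getString("title"));
                  batch.add(doc);
                  if (batch.size() >= 10000) {   // send documents to Solr in batches
                      solr.add(collection, batch);
                      batch.clear();
                  }
              }
              if (!batch.isEmpty()) {
                  solr.add(collection, batch);
              }
              solr.commit(collection);
          }
      }
  }

That is still single-threaded, of course, but splitting the SELECT into
key ranges and running one such loop per thread looks straightforward
from there.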
