Vince,
Regardless of DIH, LogUpdateProcessorFactory
<https://solr.apache.org/guide/solr/latest/configuration-guide/update-request-processors.html#update-processor-factories-you-should-not-modify-or-remove>
should log the deleteQuery that wiped the docs. You can enable verbose logging
and find out what happened.
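
If editing log4j2.xml and restarting is inconvenient, the level can probably
be raised at runtime through Solr's logging endpoint. A minimal sketch,
assuming the node answers at http://localhost:8983/solr (a placeholder base
URL) and that the logger is named after the factory class; the helper names
are made up for illustration, and DEBUG is a guess at "verbose enough":

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the logger carries the fully qualified class name of the factory.
LOGGER = "org.apache.solr.update.processor.LogUpdateProcessorFactory"


def logging_url(base_url, logger=LOGGER, level="DEBUG"):
    """Build the admin URL that asks Solr to set one logger's level."""
    query = urlencode({"set": f"{logger}:{level}", "wt": "json"})
    return f"{base_url.rstrip('/')}/admin/info/logging?{query}"


def set_log_level(base_url, logger=LOGGER, level="DEBUG"):
    """Send the request to a live node; returns the HTTP status code."""
    with urlopen(logging_url(base_url, logger, level), timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    # Placeholder base URL; adjust host and port for your install.
    print(logging_url("http://localhost:8983/solr"))
```

A level set this way is not persisted across restarts, so it suits a one-off
investigation. Rerun the delta-import afterwards and look for the delete
entries in solr.log.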

On Fri, Dec 8, 2023 at 4:29 PM Vince McMahon <sippingonesandze...@gmail.com>
wrote:

> Hi, ufuk
>
> I was thinking along the same lines: broadening the choice of tools for
> handling the delta load.  Flume looks like an interesting option.
>
> I'm so blessed to be working with so many smart and kind people in this
> mailing list.
>
> Thank you.  Happy Friday.
>
> On Fri, Dec 8, 2023 at 1:48 AM ufuk yılmaz <uyil...@vivaldi.net.invalid>
> wrote:
>
> > Hi Vince,
> >
> > It shouldn’t take too much time to write a simple loop in your favorite
> > language which fetches rows from the db and sends them to Solr over http
> > to the /update handler. IMO it’s easier than trying to figure out DIH’s
> > particularities, especially if you later need to modify the documents
> > based on some logical conditions before indexing.
> >
> > If you don’t mind learning yet another tool, we used Apache Flume to
> > index data to Solr. It supports moving data from various sources into
> > various destinations. For your use case, maybe you can use sql as source
> > and MorphlineSolrSink as the destination (ctrl+f here:
> > https://flume.apache.org/releases/content/1.11.0/FlumeUserGuide.html).
> > There is an sql source plugin here which looks a bit old but may work:
> > https://github.com/keedio/flume-ng-sql-source
> > You can also write your own source plugin. Flume just helps with
> > guaranteed delivery, if you understand its way of working.
> >
> > I don’t know your business case but I’d prefer the first option most of
> > the time.
> >
> > -ufuk yilmaz
> >
> > —
> >
> > > On 8 Dec 2023, at 02:22, Vince McMahon <sippingonesandze...@gmail.com>
> > > wrote:
> > >
> > > Thanks, Shawn.
> > >
> > > DIH full-import, by itself, works very well.  It is a bummer that my
> > > incremental load alone runs into the millions.  When batchSize is
> > > specified on the data source, the delta-import honors that batch size
> > > once, for the first fetch, then loops through the rest at hundreds per
> > > sec.  That doesn't get all the indexing done in a day, which is what I
> > > need.
> > >
> > > I hope this finding helps the maintainers of the code improve it.  It
> > > took me days to realize it.
> > >
> > > Thanks, again.
> > >
> > >
> > >
> > > On Thu, Dec 7, 2023, 4:49 PM Shawn Heisey <apa...@elyograg.org.invalid>
> > > wrote:
> > >
> > >>> On 12/7/23 07:56, Vince McMahon wrote:
> > >>> {
> > >>>   "responseHeader": {
> > >>>     "status": 0,
> > >>>     "QTime": 0
> > >>>   },
> > >>>   "initArgs": [
> > >>>     "defaults",
> > >>>     [
> > >>>       "config",
> > >>>       "db-data-config.xml"
> > >>>     ]
> > >>>   ],
> > >>>   "command": "status",
> > >>>   "status": "idle",
> > >>>   "importResponse": "",
> > >>>   "statusMessages": {
> > >>>     "Total Requests made to DataSource": "1",
> > >>>     "Total Rows Fetched": "915000",
> > >>>     "Total Documents Processed": "915000",
> > >>>     "Total Documents Skipped": "0",
> > >>>     "Full Dump Started": "2023-12-07 02:54:29",
> > >>>     "": "Indexing completed. Added/Updated: 915000 documents. Deleted 0 documents.",
> > >>>     "Committed": "2023-12-07 02:54:51",
> > >>>     "Time taken": "0:0:21.831"
> > >>>   }
> > >>> }
> > >>
> > >> There's no way Solr can index 915000 docs in 21 seconds without a LOT
> > >> of threads in the indexing program, and DIH is single-threaded.  As
> > >> you've already noted, it didn't actually index most of the documents.
> > >> I don't have an answer as to why it didn't work.
> > >>
> > >> DIH lacks decent logging, error handling, and multi-threading.  It is
> > >> not the most reliable way to index.  This is why it was deprecated a
> > >> while back and then removed from 9.x.  You would be far better off
> > >> writing your own indexing program rather than using DIH.
> > >>
> > >> I have an idea for a multi-threaded database->solr indexing program,
> > >> but haven't had much time to spend on it.  If I can ever get it done,
> > >> it will be freely available.
> > >>
> > >> On the entity, "rows" is not a valid attribute.  To control how many
> > >> DB rows are fetched at a time, set batchSize on the dataSource element.
> > >> The default batchSize is 500.
> > >>
> > >> Thanks,
> > >> Shawn
> > >>
> > >>
> >
>
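
On the "simple loop" idea quoted above: a rough sketch of what it might look
like, not a tested indexer. Assumptions on my side: sqlite3 stands in for the
real database driver, the update URL is a placeholder, documents go out as a
JSON array to the /update handler, and the function names are invented for
illustration.

```python
import json
import sqlite3
from urllib.request import Request, urlopen


def post_batch(update_url, docs):
    """POST one batch of documents as a JSON array to Solr's /update handler."""
    req = Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=60) as resp:
        resp.read()


def index_rows(conn, query, update_url, batch_size=1000, send=post_batch):
    """Fetch rows in fixed-size batches and hand each batch to `send`.

    Every batch holds batch_size rows except possibly the last one.
    Returns the number of rows sent.
    """
    cur = conn.execute(query)
    columns = [d[0] for d in cur.description]  # column names become field names
    total = 0
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        send(update_url, [dict(zip(columns, row)) for row in rows])
        total += len(rows)
    return total


if __name__ == "__main__":
    # Demo with an in-memory table and a stub sender, so nothing hits Solr.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id TEXT, title TEXT)")
    conn.executemany("INSERT INTO docs VALUES (?, ?)",
                     [(str(i), f"title {i}") for i in range(5)])
    sent = index_rows(conn, "SELECT id, title FROM docs",
                      "http://localhost:8983/solr/mycore/update",  # placeholder
                      batch_size=2,
                      send=lambda url, docs: print(len(docs), "docs ->", url))
    print("rows sent:", sent)
```

The sketch never commits; either rely on autoCommit in solrconfig.xml or send
one explicit commit after the last batch. Any per-document logic would go
where the dicts are built.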


-- 
Sincerely yours
Mikhail Khludnev
