Yes, see the detectChangesViaHashing option here:
https://github.com/nsoft/jesterj/wiki/Scanners

In any Lucene index there's not really such a thing as an incremental
update. When you want to do an "update" you send the whole document, and
it's really a delete/insert under the covers (there are some esoteric
exceptions, but generally this is true). So the way to think about this in
search ingestion is "did the document change?" If so, send it again. There
are two general strategies: either consulting a "modifiedTime" field (which
has to be trustworthy, and still requires persistence to handle deletes)
or hashing the document bytes (which always works without schema changes
but can be sensitive to trivial changes).
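To make the hashing strategy concrete, here is a minimal sketch (plain
Java, not JesterJ's internals): hash the raw document bytes, compare
against the hash remembered from the previous scan, and only re-send the
document when the hash differs. The in-memory Map is just to keep the
example self-contained.

  import java.security.MessageDigest;
  import java.util.HashMap;
  import java.util.Map;

  public class HashChangeDetector {
    // In JesterJ this state lives in the embedded Cassandra; a Map is
    // used here only so the example runs on its own.
    private final Map<String, String> priorHashes = new HashMap<>();

    public boolean hasChanged(String docId, byte[] docBytes) throws Exception {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      StringBuilder sb = new StringBuilder();
      for (byte b : md.digest(docBytes)) {
        sb.append(String.format("%02x", b));    // hex-encode the digest
      }
      String hash = sb.toString();
      String previous = priorHashes.put(docId, hash);
      return previous == null || !previous.equals(hash);  // new or modified
    }
  }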

JesterJ has an embedded Cassandra which tracks document processing statuses
and (if you configure it) will also use this embedded Cassandra to remember
the hash of the document bytes (in your case the contents of a row) and
then on subsequent scans check each row against the prior hash. The main
caveat is that the document IDs must contain sufficient information to
reproducibly retrieve the document. By default the concept is a document
URL, though one could conceivably customize things to use some other
scheme. By default a URL like jdbc:mydb:localhost:1234/sometable/42 would
be used. In that example, the table primary key value is '42'. Of course
it takes more CPU to hash the contents, so depending on the size of your
data this may or may not be practical. However, for a few tens of
thousands of documents, like you describe, this should work fine.
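Purely for illustration, such an ID could be assembled from the JDBC
connection details plus table name and primary key, as in the example
above; the exact format JesterJ generates may differ, so treat this as a
sketch of the requirement that the ID must contain everything needed to
re-fetch the row:

  public class DocIdExample {
    // "jdbcBase" stands in for your real JDBC connection string;
    // this is not JesterJ's actual ID-building code.
    static String docId(String jdbcBase, String table, Object primaryKey) {
      return jdbcBase + "/" + table + "/" + primaryKey;
    }

    public static void main(String[] args) {
      System.out.println(docId("jdbc:mydb:localhost:1234", "sometable", 42));
      // -> jdbc:mydb:localhost:1234/sometable/42
    }
  }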

Actual deletes *should* be supported too. (if it's not working let me
know). Order is only guaranteed for a given path through the DAG, so If you
manage to design a system where documents linger a long time in some the
ingest paths but not others (longer than a typical scan interval) then
there will be the possibility of a delete winning the race with an update
or vice-versa, but that maybe is a point at which you should look at your
processing more carefully anyway. Right now everything is scan/pull based,
push based document sources are also possible but hasn't been needed yet,
and thus hasn't been added.
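To make the scan/pull idea concrete, one common way a scanner can infer
deletes is by diffing the set of IDs seen on the current scan against the
set recorded on the previous one. This is a generic sketch of that idea,
not JesterJ's actual code:

  import java.util.HashSet;
  import java.util.Set;

  public class ScanDiff {
    private Set<String> previousIds = new HashSet<>();

    public Set<String> detectDeletes(Set<String> currentIds) {
      Set<String> deleted = new HashSet<>(previousIds);
      deleted.removeAll(currentIds);           // present before, gone now
      previousIds = new HashSet<>(currentIds); // remember for the next scan
      return deleted;
    }
  }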

-Gus


On Fri, Dec 15, 2023 at 8:35 PM Vince McMahon <sippingonesandze...@gmail.com>
wrote:

> I am impressed, Gus.  Does it handle incremental changes from the source db
> tables, such as insert, update, and delete.
>
> On Fri, Dec 15, 2023 at 12:58 PM Gus Heck <gus.h...@gmail.com> wrote:
>
> > Have you considered trying an existing document ingestion framework? I
> > wrote this one: https://github.com/nsoft/jesterj It already has a
> database
> > connector. If you do check it out and find difficulty please let me know
> by
> > leaving bug reports (if bug) or feedback (if confusion) in the
> discussions
> > section here: https://github.com/nsoft/jesterj/discussions
> >
> > As Mikhail noted, it's not easy to build a robust ingestion system from
> > scratch.
> >
> > -Gus
> >
> > On Fri, Dec 15, 2023 at 11:11 AM Dmitri Maziuk <dmitri.maz...@gmail.com>
> > wrote:
> >
> > > On 12/15/23 05:41, Vince McMahon wrote:
> > > > Ishan, you are right.  Doing multithreaded Indexing is going much
> > faster.
> > > > I found out after the remote machine became unresponsive very
> quickly ;
> > > it
> > > > crashed.  lol.
> > > FWIW I got better results posting docs in batches from a single thread.
> > > Work is in a "private org" on gitlab so I can't post the link to the
> > > code, but the basic layout is a DB reader that yields rows and a writer
> > > that does requests.post() of a list of JSON docs. With the DB row ->
> > > JSON doc transformer in-between.
> > >
> > > I played with the size of the batch as well as async/await queue before
> > > leaving it single-threaded w/ batch size of 5K docs: I had no speed
> > > advantage with larger batches in our setup. And it doesn't DDoS the
> > > index. ;)
> > >
> > > Dima
> > >
> > >
> >
> > --
> > http://www.needhamsoftware.com (work)
> > https://a.co/d/b2sZLD9 (my fantasy fiction book)
> >
>


-- 
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
