Thanks, Gus. I wish I could "bookmark" this reply. lol.

On Sat, Dec 16, 2023 at 11:10 PM Gus Heck <gus.h...@gmail.com> wrote:

> Yes, see the detectChangesViaHashing option here:
> https://github.com/nsoft/jesterj/wiki/Scanners
>
> In any Lucene index there's not really such a thing as an incremental
> update. When you want to do an "update" you send the whole document, and
> it's really a delete/insert under the covers (there are some esoteric
> exceptions, but generally this is true). So the way to think about this in
> search ingestion is "did the document change?" If so, send it again. There
> are two general strategies: either consulting a "modifiedTime" field (which
> has to be trustworthy, and still requires persistence to handle deletes),
> or hashing the document bytes (which always works without schema changes
> but can be sensitive to trivial changes).
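>
> To illustrate the hashing strategy, here's a minimal Python sketch (not
> JesterJ's actual code; the dict-based hash store is a stand-in for
> whatever persistence you use):
>
>     import hashlib
>
>     seen_hashes = {}  # doc_id -> hash recorded on the previous scan
>
>     def changed(doc_id, content_bytes):
>         # Hash the raw document bytes; any byte-level change triggers a re-send
>         h = hashlib.sha256(content_bytes).hexdigest()
>         if seen_hashes.get(doc_id) == h:
>             return False  # unchanged since the last scan, skip re-indexing
>         seen_hashes[doc_id] = h
>         return True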
>
> JesterJ has an embedded Cassandra which tracks document processing statuses
> and (if you configure it) will also use this embedded Cassandra to remember
> the hash of the document bytes (in your case the contents of a row) and
> then on subsequent scans check each row against the prior hash. The main
> caveat is that the document IDs must contain sufficient information to
> reproducibly retrieve the document. By default the concept is a document
> URL, though one could conceivably customize things to use some other
> scheme. By default a URL like jdbc:mydb:localhost:1234/sometable/42 would
> be used. In that example, the table primary key value is '42'. Of course
> it takes more CPU to hash the contents, so depending on the size
> of your data this may or may not be practical. However, for a few tens of
> thousands of documents, like you describe, this should work fine.
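>
> In Python terms the ID construction is roughly this (just a sketch; the
> db/host/table names are made up):
>
>     def doc_id(db, host, port, table, pk_value):
>         # Reproducible ID: carries enough info to re-fetch the row later
>         return f"jdbc:{db}:{host}:{port}/{table}/{pk_value}"
>
>     doc_id("mydb", "localhost", 1234, "sometable", 42)
>     # -> 'jdbc:mydb:localhost:1234/sometable/42'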
>
> Actual deletes *should* be supported too (if it's not working, let me
> know). Order is only guaranteed for a given path through the DAG, so if you
> manage to design a system where documents linger a long time in some of the
> ingest paths but not others (longer than a typical scan interval) then
> there will be the possibility of a delete winning the race with an update
> or vice versa, but that is perhaps a point at which you should look at your
> processing more carefully anyway. Right now everything is scan/pull based;
> push-based document sources are also possible but haven't been needed yet,
> and thus haven't been added.
>
> -Gus
>
>
> On Fri, Dec 15, 2023 at 8:35 PM Vince McMahon <
> sippingonesandze...@gmail.com>
> wrote:
>
> > I am impressed, Gus. Does it handle incremental changes from the source
> > db tables, such as insert, update, and delete?
> >
> > On Fri, Dec 15, 2023 at 12:58 PM Gus Heck <gus.h...@gmail.com> wrote:
> >
> > > Have you considered trying an existing document ingestion framework? I
> > > wrote this one: https://github.com/nsoft/jesterj It already has a
> > > database connector. If you do check it out and run into difficulty,
> > > please let me know by leaving bug reports (if bug) or feedback (if
> > > confusion) in the discussions section here:
> > > https://github.com/nsoft/jesterj/discussions
> > >
> > > As Mikhail noted, it's not easy to build a robust ingestion system from
> > > scratch.
> > >
> > > -Gus
> > >
> > > On Fri, Dec 15, 2023 at 11:11 AM Dmitri Maziuk <dmitri.maz...@gmail.com>
> > > wrote:
> > >
> > > > On 12/15/23 05:41, Vince McMahon wrote:
> > > > > Ishan, you are right.  Doing multithreaded indexing is going much
> > > > > faster. I found out after the remote machine became unresponsive
> > > > > very quickly; it crashed. lol.
> > > > FWIW I got better results posting docs in batches from a single
> > > > thread. Work is in a "private org" on GitLab so I can't post the link
> > > > to the code, but the basic layout is a DB reader that yields rows and
> > > > a writer that does requests.post() of a list of JSON docs, with the
> > > > DB row -> JSON doc transformer in between.
> > > >
> > > > I played with the batch size as well as an async/await queue before
> > > > settling on single-threaded with a batch size of 5K docs: larger
> > > > batches gave no speed advantage in our setup. And it doesn't DDoS the
> > > > index. ;)
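> > > >
> > > > Roughly, the layout above looks like this (a sketch only, assuming a
> > > > Solr /update endpoint; SOLR_URL, read_rows(), and the field names are
> > > > stand-ins, not the real code):
> > > >
> > > >     import requests
> > > >
> > > >     SOLR_URL = "http://localhost:8983/solr/mycore/update"
> > > >     BATCH = 5000
> > > >
> > > >     def read_rows():
> > > >         # stand-in for the DB reader; real code yields rows from a cursor
> > > >         yield {"id": "1", "text": "hello"}
> > > >
> > > >     def to_doc(row):
> > > >         # DB row -> JSON doc transformer
> > > >         return {"id": row["id"], "text": row["text"]}
> > > >
> > > >     batch = []
> > > >     for row in read_rows():
> > > >         batch.append(to_doc(row))
> > > >         if len(batch) >= BATCH:
> > > >             requests.post(SOLR_URL, json=batch).raise_for_status()
> > > >             batch = []
> > > >     if batch:
> > > >         requests.post(SOLR_URL, json=batch).raise_for_status()
> > > >     # one commit at the end instead of per batch
> > > >     requests.post(SOLR_URL, params={"commit": "true"}, json=[]).raise_for_status()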
> > > >
> > > > Dima
> > > >
> > > >
> > >
> > > --
> > > http://www.needhamsoftware.com (work)
> > > https://a.co/d/b2sZLD9 (my fantasy fiction book)
> > >
> >
>
>
> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)
>
