You can!
https://lists.apache.org/thread/brw7r0cf0t0m1wltxg5sky6t6d9crgxm

On Sun, Dec 17, 2023 at 3:12 PM Vince McMahon <sippingonesandze...@gmail.com>
wrote:

> Thanks, Gus.  I wish I could "bookmark" this reply. lol.
>
> On Sat, Dec 16, 2023 at 11:10 PM Gus Heck <gus.h...@gmail.com> wrote:
>
> > Yes, see the detectChangesViaHashing option here:
> > https://github.com/nsoft/jesterj/wiki/Scanners
> >
> > In any Lucene index there's not really such a thing as an incremental
> > update. When you want to do an "update" you send the whole document, and
> > it's really a delete/insert under the covers (there are some esoteric
> > exceptions, but generally this is true). So the way to think about this
> > in search ingestion is "did the document change?" If so, send it again.
> > There are two general strategies: either consulting a "modifiedTime"
> > field (which has to be trustworthy, and still requires persistence to
> > handle deletes) or hashing the document bytes (which always works
> > without schema changes but can be sensitive to trivial changes).
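[The hash-based strategy described above can be sketched roughly like this. This is a minimal illustration, not JesterJ's actual implementation; the in-memory dict stands in for whatever persistent store remembers the last-seen hash per document ID.]

```python
import hashlib

# Last-seen content hash per document ID (a real system would persist this).
seen_hashes = {}

def needs_reindex(doc_id, content_bytes):
    """Return True if the document's bytes changed since the last scan."""
    digest = hashlib.sha256(content_bytes).hexdigest()
    if seen_hashes.get(doc_id) == digest:
        return False  # unchanged; skip re-sending to the index
    seen_hashes[doc_id] = digest  # record the new hash, then re-send
    return True
```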
> >
> > JesterJ has an embedded Cassandra which tracks document processing
> > statuses and (if you configure it) will also use this embedded Cassandra
> > to remember the hash of the document bytes (in your case the contents of
> > a row) and then on subsequent scans check each row against the prior
> > hash. The main caveat is that the document IDs must contain sufficient
> > information to reproducibly retrieve the document. By default the
> > concept is a document URL, though one could conceivably customize things
> > to use some other scheme. By default a URL like
> > jdbc:mydb:localhost:1234/sometable/42 would be used. In that example,
> > the table primary key value is '42'. Of course it takes more CPU to hash
> > the contents, so depending on the size of your data this may or may not
> > be practical. However, for a few tens of thousands of documents, like
> > you describe, this should work fine.
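[The document-URL idea above can be illustrated with a small parser. This is a hypothetical helper, not part of JesterJ; it just shows that an ID like jdbc:mydb:localhost:1234/sometable/42 carries everything needed to re-fetch the row.]

```python
def parse_doc_url(url):
    """Split a jdbc-style document URL into the pieces needed to re-query the row."""
    scheme, db, host, rest = url.split(":", 3)   # e.g. jdbc / mydb / localhost / 1234/sometable/42
    port, table, pk = rest.split("/")            # port, table name, primary key value
    return {"db": db, "host": host, "port": int(port),
            "table": table, "pk": pk}
```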
> >
> > Actual deletes *should* be supported too (if it's not working, let me
> > know). Order is only guaranteed for a given path through the DAG, so if
> > you manage to design a system where documents linger a long time in some
> > of the ingest paths but not others (longer than a typical scan
> > interval), then there is the possibility of a delete winning the race
> > with an update or vice versa, but that is maybe the point at which you
> > should look at your processing more carefully anyway. Right now
> > everything is scan/pull based; push-based document sources are also
> > possible but haven't been needed yet, and thus haven't been added.
> >
> > -Gus
> >
> >
> > On Fri, Dec 15, 2023 at 8:35 PM Vince McMahon <
> > sippingonesandze...@gmail.com>
> > wrote:
> >
> > > I am impressed, Gus.  Does it handle incremental changes from the
> > > source db tables, such as insert, update, and delete?
> > >
> > > On Fri, Dec 15, 2023 at 12:58 PM Gus Heck <gus.h...@gmail.com> wrote:
> > >
> > > > Have you considered trying an existing document ingestion framework?
> I
> > > > wrote this one: https://github.com/nsoft/jesterj It already has a
> > > database
> > > > connector. If you do check it out and find difficulty please let me
> > know
> > > by
> > > > leaving bug reports (if bug) or feedback (if confusion) in the
> > > discussions
> > > > section here: https://github.com/nsoft/jesterj/discussions
> > > >
> > > > As Mikhail noted, it's not easy to build a robust ingestion system
> from
> > > > scratch.
> > > >
> > > > -Gus
> > > >
> > > > On Fri, Dec 15, 2023 at 11:11 AM Dmitri Maziuk <
> > dmitri.maz...@gmail.com>
> > > > wrote:
> > > >
> > > > > On 12/15/23 05:41, Vince McMahon wrote:
> > > > > > Ishan, you are right.  Doing multithreaded indexing is going
> > > > > > much faster. I found out after the remote machine became
> > > > > > unresponsive very quickly; it crashed.  lol.
> > > > > FWIW I got better results posting docs in batches from a single
> > > > > thread. Work is in a "private org" on gitlab so I can't post the
> > > > > link to the code, but the basic layout is a DB reader that yields
> > > > > rows and a writer that does requests.post() of a list of JSON
> > > > > docs, with the DB row -> JSON doc transformer in-between.
> > > > >
> > > > > I played with the size of the batch as well as async/await queue
> > > > > before leaving it single-threaded w/ batch size of 5K docs: I had
> > > > > no speed advantage with larger batches in our setup. And it
> > > > > doesn't DDoS the index. ;)
> > > > >
> > > > > Dima
> > > > >
> > > > >
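[The batch-posting layout Dima describes might look roughly like this. A sketch under assumptions: `to_doc` stands in for the row -> JSON transformer, the update URL is illustrative, and `post` is injectable so the sketch doesn't depend on a live index.]

```python
import json

def batches(rows, size=5000):
    """Yield lists of up to `size` rows from an iterable (e.g. a DB cursor)."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

def index_rows(rows, update_url, to_doc, post=None):
    # `to_doc` is the DB-row -> JSON-doc transformer; `post` defaults to
    # requests.post and is injectable for testing without a running index.
    if post is None:
        import requests  # assumed available, as in the setup described
        post = requests.post
    for batch in batches(rows):
        docs = [to_doc(row) for row in batch]
        post(update_url,
             data=json.dumps(docs),
             headers={"Content-Type": "application/json"})
```

One upside of keeping this single-threaded, as noted above, is that batch size becomes the only knob to tune, and the index sees a steady, bounded request rate.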
> > > >
> > > > --
> > > > http://www.needhamsoftware.com (work)
> > > > https://a.co/d/b2sZLD9 (my fantasy fiction book)
> > > >
> > >
> >
> >
> > --
> > http://www.needhamsoftware.com (work)
> > https://a.co/d/b2sZLD9 (my fantasy fiction book)
> >
>
