Yes, see the detectChangesViaHashing option here: https://github.com/nsoft/jesterj/wiki/Scanners
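The gist of that option, as a self-contained toy sketch (not the actual JesterJ code, which persists the hashes in the embedded Cassandra rather than an in-memory map), is roughly:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;

// Toy illustration of hash-based change detection. JesterJ keeps the prior
// hash per document ID in its embedded Cassandra; a plain Map stands in here.
public class HashChangeCheck {

  private final Map<String, String> priorHashes = new HashMap<>();

  // Document IDs must reproducibly identify the source row,
  // e.g. jdbc:mydb:localhost:1234/sometable/42
  static String docId(String jdbcUrl, String table, Object primaryKey) {
    return jdbcUrl + "/" + table + "/" + primaryKey;
  }

  // True if the row is new or its content changed since the last scan.
  boolean needsReindex(String id, byte[] rowBytes) throws Exception {
    MessageDigest sha = MessageDigest.getInstance("SHA-256");
    String hash = HexFormat.of().formatHex(sha.digest(rowBytes));
    String previous = priorHashes.put(id, hash);
    return previous == null || !previous.equals(hash);
  }

  public static void main(String[] args) throws Exception {
    HashChangeCheck check = new HashChangeCheck();
    String id = docId("jdbc:mydb:localhost:1234", "sometable", 42);
    byte[] row = "name=foo,price=42".getBytes(StandardCharsets.UTF_8);
    System.out.println(check.needsReindex(id, row)); // true  -> send the document
    System.out.println(check.needsReindex(id, row)); // false -> unchanged, skip it
  }
}

More background on why that's the right way to think about it: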
In any Lucene index there's not really such a thing as incremental update. When you want to do an "update" you send the whole document, and it's really a delete/insert under the covers (there are some esoteric exceptions, but generally this is true). So the way to think about this in search ingestion is "did the document change?" If so, send it again.

There are two general strategies: either consult a "modifiedTime" field (which has to be trustworthy, and still requires persistence to handle deletes), or hash the document bytes (which always works without schema changes but can be sensitive to trivial changes). JesterJ has an embedded Cassandra that tracks document processing statuses and (if you configure it) will also use that embedded Cassandra to remember the hash of the document bytes (in your case the contents of a row) and then, on subsequent scans, check each row against the prior hash.

The main caveat is that the document IDs must contain sufficient information to reproducibly retrieve the document. By default the concept is a document URL, though one could conceivably customize things to use some other scheme. By default a URL like jdbc:mydb:localhost:1234/sometable/42 would be used; in that example, the table primary key value is '42'.

Of course it takes more CPU to hash the contents, so depending on the size of your data this may or may not be practical. However, for a few tens of thousands of documents, like you describe, this should work fine.

Actual deletes *should* be supported too (if it's not working, let me know). Order is only guaranteed for a given path through the DAG, so if you manage to design a system where documents linger a long time in some of the ingest paths but not others (longer than a typical scan interval), there is the possibility of a delete winning the race with an update or vice versa; but that is perhaps a point at which you should look at your processing more carefully anyway.

Right now everything is scan/pull based. Push-based document sources are also possible, but haven't been needed yet and thus haven't been added.

-Gus

On Fri, Dec 15, 2023 at 8:35 PM Vince McMahon <sippingonesandze...@gmail.com> wrote:

> I am impressed, Gus. Does it handle incremental changes from the source db
> tables, such as insert, update, and delete.
>
> On Fri, Dec 15, 2023 at 12:58 PM Gus Heck <gus.h...@gmail.com> wrote:
>
> > Have you considered trying an existing document ingestion framework? I
> > wrote this one: https://github.com/nsoft/jesterj It already has a
> > database connector. If you do check it out and find difficulty please
> > let me know by leaving bug reports (if bug) or feedback (if confusion)
> > in the discussions section here: https://github.com/nsoft/jesterj/discussions
> >
> > As Mikhail noted, it's not easy to build a robust ingestion system from
> > scratch.
> >
> > -Gus
> >
> > On Fri, Dec 15, 2023 at 11:11 AM Dmitri Maziuk <dmitri.maz...@gmail.com>
> > wrote:
> >
> > > On 12/15/23 05:41, Vince McMahon wrote:
> > > > Ishan, you are right. Doing multithreaded Indexing is going much
> > > > faster. I found out after the remote machine became unresponsive
> > > > very quickly; it crashed. lol.
> > >
> > > FWIW I got better results posting docs in batches from a single thread.
> > > Work is in a "private org" on gitlab so I can't post the link to the
> > > code, but the basic layout is a DB reader that yields rows and a writer
> > > that does requests.post() of a list of JSON docs. With the DB row ->
> > > JSON doc transformer in-between.
> > >
> > > I played with the size of the batch as well as async/await queue before
> > > leaving it single-threaded w/ batch size of 5K docs: I had no speed
> > > advantage with larger batches in our setup. And it doesn't DDoS the
> > > index. ;)
> > >
> > > Dima
> > >
> >
> > --
> > http://www.needhamsoftware.com (work)
> > https://a.co/d/b2sZLD9 (my fantasy fiction book)
>

--
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)