You can! https://lists.apache.org/thread/brw7r0cf0t0m1wltxg5sky6t6d9crgxm
On Sun, Dec 17, 2023 at 3:12 PM Vince McMahon <sippingonesandze...@gmail.com> wrote:

> Thanks, Gus. I wish I could "bookmark" this reply. lol.
>
> On Sat, Dec 16, 2023 at 11:10 PM Gus Heck <gus.h...@gmail.com> wrote:
>
> > Yes, see the detectChangesViaHashing option here:
> > https://github.com/nsoft/jesterj/wiki/Scanners
> >
> > In any Lucene index there's not really such a thing as an incremental
> > update. When you want to do an "update" you send the whole document, and
> > it's really a delete/insert under the covers (there are some esoteric
> > exceptions, but generally this is true). So the way to think about this
> > in search ingestion is "did the document change?" If so, send it again.
> > There are two general strategies: either consulting a "modifiedTime"
> > field (which has to be trustworthy, and still requires persistence to
> > handle deletes) or hashing the document bytes (which always works
> > without schema changes but can be sensitive to trivial changes).
> >
> > JesterJ has an embedded Cassandra which tracks document processing
> > statuses and (if you configure it) will also use this embedded Cassandra
> > to remember the hash of the document bytes (in your case the contents of
> > a row) and then on subsequent scans check each row against the prior
> > hash. The main caveat is that the document IDs must contain sufficient
> > information to reproducibly retrieve the document. By default the
> > concept is a document URL, though one could conceivably customize things
> > to use some other scheme. By default a URL like
> > jdbc:mydb:localhost:1234/sometable/42 would be used. In that example,
> > the table primary key value is '42'. Of course it takes more CPU to
> > hash the contents, so depending on the size of your data this may or
> > may not be practical. However, for a few tens of thousands of documents,
> > like you describe, this should work fine.
> >
> > Actual deletes *should* be supported too (if it's not working, let me
> > know). Order is only guaranteed for a given path through the DAG, so if
> > you manage to design a system where documents linger a long time in some
> > of the ingest paths but not others (longer than a typical scan interval)
> > then there will be the possibility of a delete winning the race with an
> > update or vice versa, but maybe that is a point at which you should look
> > at your processing more carefully anyway. Right now everything is
> > scan/pull based; push-based document sources are also possible, but they
> > haven't been needed yet and thus haven't been added.
> >
> > -Gus
> >
> > On Fri, Dec 15, 2023 at 8:35 PM Vince McMahon <
> > sippingonesandze...@gmail.com> wrote:
> >
> > > I am impressed, Gus. Does it handle incremental changes from the
> > > source db tables, such as insert, update, and delete?
> > >
> > > On Fri, Dec 15, 2023 at 12:58 PM Gus Heck <gus.h...@gmail.com> wrote:
> > >
> > > > Have you considered trying an existing document ingestion
> > > > framework? I wrote this one: https://github.com/nsoft/jesterj
> > > > It already has a database connector. If you do check it out and
> > > > find difficulty, please let me know by leaving bug reports (if bug)
> > > > or feedback (if confusion) in the discussions section here:
> > > > https://github.com/nsoft/jesterj/discussions
> > > >
> > > > As Mikhail noted, it's not easy to build a robust ingestion system
> > > > from scratch.
> > > >
> > > > -Gus
> > > >
> > > > On Fri, Dec 15, 2023 at 11:11 AM Dmitri Maziuk
> > > > <dmitri.maz...@gmail.com> wrote:
> > > >
> > > > > On 12/15/23 05:41, Vince McMahon wrote:
> > > > > > Ishan, you are right. Doing multithreaded indexing is going
> > > > > > much faster. I found out after the remote machine became
> > > > > > unresponsive very quickly; it crashed. lol.
> > > > > FWIW I got better results posting docs in batches from a single
> > > > > thread. Work is in a "private org" on GitLab so I can't post the
> > > > > link to the code, but the basic layout is a DB reader that yields
> > > > > rows and a writer that does requests.post() of a list of JSON
> > > > > docs, with the DB row -> JSON doc transformer in between.
> > > > >
> > > > > I played with the size of the batch as well as the async/await
> > > > > queue before leaving it single-threaded with a batch size of 5K
> > > > > docs: I had no speed advantage with larger batches in our setup.
> > > > > And it doesn't DDoS the index. ;)
> > > > >
> > > > > Dima
> > > >
> > > > --
> > > > http://www.needhamsoftware.com (work)
> > > > https://a.co/d/b2sZLD9 (my fantasy fiction book)
> >
> > --
> > http://www.needhamsoftware.com (work)
> > https://a.co/d/b2sZLD9 (my fantasy fiction book)
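
For anyone finding this thread in the archive: here is a minimal Python
sketch of the hash-based change detection Gus describes above. It is not
JesterJ's actual implementation (JesterJ is Java and persists hashes in its
embedded Cassandra); the connection, table, and column names are made up,
and a plain dict stands in for the hash store.

    import hashlib

    def scan(conn, prior_hashes):
        """Yield ('index', id, row) for new or changed rows, and
        ('delete', id, None) for rows gone since the previous scan."""
        seen = set()
        for row in conn.execute("SELECT id, title, body FROM sometable"):
            # Reproducible document id, analogous to Gus's example of
            # jdbc:mydb:localhost:1234/sometable/42
            doc_id = "jdbc:mydb:localhost:1234/sometable/%s" % row[0]
            seen.add(doc_id)
            digest = hashlib.sha256(repr(row).encode("utf-8")).hexdigest()
            if prior_hashes.get(doc_id) != digest:  # new or changed row
                prior_hashes[doc_id] = digest
                yield ("index", doc_id, row)
        for doc_id in set(prior_hashes) - seen:     # row no longer present
            del prior_hashes[doc_id]
            yield ("delete", doc_id, None)

Note the caveat from the thread: hashing the raw row contents reacts to any
change at all, including trivial ones, but needs no trustworthy
"modifiedTime" column.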
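And a rough sketch of the single-threaded batch layout Dima describes: a DB
reader that yields rows, a row -> JSON doc transformer, and a writer that
posts batches with requests.post(). The Solr URL, collection, and field
names are placeholders, not from the thread.

    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/mycollection/update"
    BATCH_SIZE = 5000  # Dima saw no gain from larger batches in his setup

    def read_rows(conn):
        yield from conn.execute("SELECT id, title, body FROM sometable")

    def to_doc(row):
        # DB row -> JSON doc transformer
        return {"id": row[0], "title": row[1], "body": row[2]}

    def post(docs):
        # Solr's /update handler accepts a JSON array of documents
        requests.post(SOLR_UPDATE, json=docs).raise_for_status()

    def index_all(conn):
        batch = []
        for row in read_rows(conn):
            batch.append(to_doc(row))
            if len(batch) >= BATCH_SIZE:
                post(batch)
                batch = []
        if batch:
            post(batch)  # final partial batch
        # commit once at the end rather than per batch
        requests.get(SOLR_UPDATE, params={"commit": "true"}).raise_for_status()

Batching from a single thread keeps the load on the cluster predictable,
which is presumably why it "doesn't DDoS the index."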