Have you considered trying an existing document ingestion framework? I wrote this one: https://github.com/nsoft/jesterj It already has a database connector. If you do check it out and run into difficulty, please let me know by filing bug reports (for bugs) or leaving feedback (for confusion) in the discussions section here: https://github.com/nsoft/jesterj/discussions
As Mikhail noted, it's not easy to build a robust ingestion system from scratch.

-Gus

On Fri, Dec 15, 2023 at 11:11 AM Dmitri Maziuk <dmitri.maz...@gmail.com> wrote:
> On 12/15/23 05:41, Vince McMahon wrote:
> > Ishan, you are right. Doing multithreaded Indexing is going much faster.
> > I found out after the remote machine became unresponsive very quickly; it
> > crashed. lol.
>
> FWIW I got better results posting docs in batches from a single thread.
> Work is in a "private org" on gitlab so I can't post the link to the
> code, but the basic layout is a DB reader that yields rows and a writer
> that does requests.post() of a list of JSON docs. With the DB row ->
> JSON doc transformer in-between.
>
> I played with the size of the batch as well as async/await queue before
> leaving it single-threaded w/ batch size of 5K docs: I had no speed
> advantage with larger batches in our setup. And it doesn't DDoS the
> index. ;)
>
> Dima

--
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
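For anyone curious what Dima's layout looks like in practice, here is a minimal sketch: a reader that yields DB rows, a row -> JSON doc transformer, and a writer that posts each batch to Solr's update endpoint. All names, the field mapping, and the URL are hypothetical (the actual code is in a private repo); the batch size of 5K matches what Dima reported working well.

```python
import json
from itertools import islice

def batched(rows, size=5000):
    """Yield lists of up to `size` items from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def row_to_doc(row):
    # Hypothetical transform: map DB columns to Solr field names.
    return {"id": row[0], "title_s": row[1]}

def index_all(rows, post, url="http://localhost:8983/solr/core1/update?commit=true",
              batch_size=5000):
    """Single-threaded writer: post one JSON batch at a time.
    `post` is a callable like requests.post (injected so this is testable)."""
    for batch in batched((row_to_doc(r) for r in rows), batch_size):
        post(url, data=json.dumps(batch),
             headers={"Content-Type": "application/json"})
```

With the `requests` library installed you would simply pass `requests.post` as the `post` argument; keeping it as a parameter makes the batching logic testable without a live Solr.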