Have you considered trying an existing document ingestion framework? I
wrote this one: https://github.com/nsoft/jesterj It already has a database
connector. If you check it out and run into difficulty, please let me know by
filing bug reports (for bugs) or leaving feedback (for confusion) in the
discussions section here: https://github.com/nsoft/jesterj/discussions

As Mikhail noted, it's not easy to build a robust ingestion system from
scratch.

-Gus

On Fri, Dec 15, 2023 at 11:11 AM Dmitri Maziuk <dmitri.maz...@gmail.com>
wrote:

> On 12/15/23 05:41, Vince McMahon wrote:
> > Ishan, you are right. Doing multithreaded indexing is going much faster.
> > I found out after the remote machine became unresponsive very quickly;
> > it crashed. lol.
> FWIW I got better results posting docs in batches from a single thread.
> Work is in a "private org" on gitlab so I can't post the link to the
> code, but the basic layout is a DB reader that yields rows and a writer
> that does requests.post() of a list of JSON docs, with the DB row ->
> JSON doc transformer in between.
>
> I played with the size of the batch as well as async/await queue before
> leaving it single-threaded w/ batch size of 5K docs: I had no speed
> advantage with larger batches in our setup. And it doesn't DDoS the
> index. ;)
>
> Dima
>
>
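For anyone who wants to try the single-threaded batched approach Dima describes, here is a rough sketch. All the names here (the update URL, the row schema, the `row_to_doc` mapping) are hypothetical placeholders; the structure is just reader -> transformer -> batched writer, with the batch size as the main tuning knob.

```python
import json
from itertools import islice

# Hypothetical update endpoint; substitute your own search index's URL.
UPDATE_URL = "http://localhost:8983/solr/mycore/update"


def row_to_doc(row):
    """Transformer: map one DB row (a tuple here) to one JSON-serializable doc.

    The (id, title) schema is an assumption for illustration only.
    """
    return {"id": row[0], "title": row[1]}


def batched(iterable, size):
    """Group an iterable into lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def index_all(rows, batch_size=5000, post=None):
    """Writer: post each batch of JSON docs in a single request, single-threaded.

    `rows` is anything iterable (e.g. a DB cursor). `post` is injectable so the
    sketch can be exercised without a live index; by default it falls back to
    requests.post (third-party `requests` library).
    """
    if post is None:
        import requests  # deferred import; only needed when actually posting
        post = requests.post
    for batch in batched(map(row_to_doc, rows), batch_size):
        resp = post(
            UPDATE_URL,
            data=json.dumps(batch),
            headers={"Content-Type": "application/json"},
        )
        resp.raise_for_status()
```

Dima's point about tuning still applies: start with a batch size in the low thousands, measure, and only add concurrency if the index can actually absorb it.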

-- 
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
