I could be wrong, but fundamentally the best approach is for the reader to be
maintained at the "server" (the distributed database) and for each client in the
distributed compute environment to send get requests (either a Flight DoGet or
some RPC/REST call to Next()).
If you don't want to duplicate data, the reader should exist in one place. If
you want minimal overhead, each client should make one request (or as few
requests as possible) for some chunk of data.
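
For a concrete picture, here's a rough pyarrow.flight sketch (untested; get_reader()
is a placeholder for however your database produces the RecordBatchReader). The
reader lives at the server, and each DoGet call hands the caller exactly one batch:

    import threading

    import pyarrow.flight as flight

    class BatchServer(flight.FlightServerBase):
        # Holds the single reader server-side; each DoGet returns one batch.
        def __init__(self, location, reader):
            super().__init__(location)
            self._reader = reader          # the one RecordBatchReader
            self._lock = threading.Lock()  # the reader is stateful, so serialize access

        def do_get(self, context, ticket):
            with self._lock:
                try:
                    batch = self._reader.read_next_batch()
                except StopIteration:
                    batch = None
            batches = [] if batch is None else [batch]
            return flight.GeneratorStream(self._reader.schema, iter(batches))

    server = BatchServer("grpc://0.0.0.0:8815", get_reader())  # get_reader() is hypothetical
    server.serve()

    # On each worker:
    client = flight.connect("grpc://db-host:8815")  # hypothetical address
    table = client.do_get(flight.Ticket(b"next")).read_all()  # empty table => exhausted

Because every DoGet just pulls whichever batch is next, the workers don't need
strict round-robin scheduling; they pull when they're ready and the server doles
out the chunks in order.
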
How that is implemented will depend on your frameworks (ray.io, and how your
database returns something to the reader). My guess is that you would have a
main client, managed by the ray.io framework, that is held in some distributed
shared memory (or serves as the "server" the reader actually lives at), and
each worker would have a client or wrapper that sends requests to that main
client.
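
If you go the Ray route, that "main client" could be an actor, something like
this rough sketch (untested; get_reader() and process() are placeholders for
your database and your per-batch work):

    import ray

    ray.init()

    @ray.remote
    class ReaderActor:
        # The "main client": owns the one reader, hands out batches on request.
        def __init__(self):
            self._reader = get_reader()  # placeholder for your database's reader

        def next_batch(self):
            try:
                return self._reader.read_next_batch()  # RecordBatch is serialized to the caller
            except StopIteration:
                return None  # tells workers the stream is exhausted

    @ray.remote
    def worker(reader_actor):
        while (batch := ray.get(reader_actor.next_batch.remote())) is not None:
            process(batch)  # placeholder for per-worker work

    reader_actor = ReaderActor.remote()
    ray.get([worker.remote(reader_actor) for _ in range(4)])  # e.g. 4 worker tasks

One nice property of actors here: method calls on a single actor run one at a
time, so the reader never sees concurrent access and you don't need a lock.
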
I'm not sure if that's helpful for you, but since it's a framework-dependent
question I would recommend asking the devs of those frameworks (though maybe
they watch this mailing list).
On Mon, Feb 24, 2025 at 14:38, chris snow <chsnow...@gmail.com> wrote:    
I have a distributed database that returns query responses with a
RecordBatchReader.

I'd like to distribute consumption of the query response by iterating the
reader across a distributed compute environment (ray.io), i.e. round-robin
the read_next_batch calls over different nodes of the cluster.

Is this possible? Ideally with pyarrow, but I would be interested in other
languages.

Thanks!
