Re: distributed processing of RecordBatchReader.read_next_batch()?

Weston Pace Mon, 24 Feb 2025 16:55:12 -0800

It sort of depends what your RecordBatchReader is doing under the hood.  If
it is NOT giving up the GIL then you should be fine as long as your
processing is slower than your reading.  However, if read_next_batch does
not give up the GIL and that's your bottleneck, then your Ray app isn't
going to get any parallelism.


On the other hand, if `read_next_batch` IS giving up the GIL, is it doing
so in a thread safe way that can handle someone calling `read_next_batch`
before the previous call to `read_next_batch` completed?  I believe most of
the C++ impls of RecordBatchReader (e.g. dataset scanners, etc.) should be
doing this.  In that case you should be able to call `read_next_batch`
multiple times.  Still, if your data source is the bottleneck you probably
won't get much compute parallelism.

You probably want to implement the Datasource API[1] for your database.

[1]
https://docs.ray.io/en/latest/data/api/doc/ray.data.Datasource.html#ray.data.Datasource

On Mon, Feb 24, 2025 at 3:42 PM Aldrin <octalene....@pm.me> wrote:

> I could be wrong, but fundamentally the best approach is for the reader to
> be maintained at the "server" ("the distributed database") and each client
> in the distributed compute environment to send get requests (either DoGet
> or some RPC/REST call to Next()).
>
> If you dont want to duplicate data, the reader should exist in one place.
> If you want minimal overhead, then each client should make 1 request (or
> minimal requests) for some chunk of data.
>
> How that is implemented is going to depend on your frameworks (ray.io and
> how your database returns something to the reader). I'm going to guess that
> you maybe have a main client that is managed by the ray.io framework that
> is held in some distributed shared memory (or serves as the "server" the
> reader actually lives at) and each worker is going to have a client or
> wrapper that sends requests to the main client.
>
> I'm not sure if that's helpful for you, but since its a framework
> dependent question I would recommend asking devs of the frameworks (though
> maybe they watch this mailing list).
>
> Sent from Proton Mail <https://proton.me/mail/home> for iOS
>
>
> On Mon, Feb 24, 2025 at 14:38, chris snow <chsnow...@gmail.com
> <On+Mon,+Feb+24,+2025+at+14:38,+chris+snow+%3C%3Ca+href=>> wrote:
>
>  I have a distributed database that returns query responses with a
> RecordBatchReader.
>
> I'd like to distribute consumption of the query response by iterating the
> reader across a distributed compute environment ( ray.io).  I.e. round
> robin the calling  read_next_batch over different nodes of the cluster,
>
> Is this possible?  Ideally, with pyarrow, but I would be interested in
> other languages.
>
> Thanks!
>
>

Re: distributed processing of RecordBatchReader.read_next_batch()?

Reply via email to