On Mon, Jul 22, 2019 at 10:49 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> On 18/07/2019 at 00:25, Wes McKinney wrote:
> >
> > * We look ahead in the stream until we find a complete Thrift data
> > page header. This may trigger zero or more Read calls to the
> > underlying "file" handle. In the default case, the data is all
> > actually in memory, so the reads are zero-copy buffer slices.
>
> Even if the file is memory-mapped, that doesn't mean everything is in
> RAM.  Starting to read a page may incur a page fault and unexpected
> blocking I/O.
>
> One way to hide I/O costs could be to use madvise() (in which case
> the background read is done by the kernel without any need for
> user-visible IO threads).  Similarly, on a regular file one can use
> fadvise().  This suggests that the whole question of "how to hide I/O
> for a given source" is stream-specific (for example, if a file is
> S3-backed, perhaps you want to issue an HTTP fetch in the background?).
>
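
To make the header-scanning loop above concrete, the shape of it is
roughly the following (a simplified sketch; InputStream, Peek/Advance,
and TryParsePageHeader are hypothetical stand-ins for the real
parquet-cpp/Arrow interfaces, with error handling elided):

#include <cstdint>
#include <string_view>

// Hypothetical stand-ins for the sketch:
// - Peek(n) returns up to n buffered bytes without consuming them
//   (a zero-copy slice when the file is fully in memory).
// - TryParsePageHeader() attempts to deserialize a Thrift PageHeader,
//   returning false if the bytes don't yet contain a complete header;
//   on success it writes the serialized header length to *size.
struct InputStream {
  std::string_view Peek(int64_t nbytes);
  void Advance(int64_t nbytes);
};
bool TryParsePageHeader(const void* data, uint32_t* size);

// Scan forward until a complete Thrift data page header parses.
uint32_t ReadPageHeader(InputStream* stream) {
  int64_t window = 16 * 1024;  // initial guess at the header's max size
  while (true) {
    std::string_view bytes = stream->Peek(window);
    uint32_t header_size = static_cast<uint32_t>(bytes.size());
    if (TryParsePageHeader(bytes.data(), &header_size)) {
      stream->Advance(header_size);  // consume exactly the header bytes
      return header_size;
    }
    window *= 2;  // incomplete header: widen the window and retry
  }
}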
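
And for reference, the kernel-side prefetch Antoine describes would
look something like this (a minimal sketch assuming Linux/POSIX; error
handling elided):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>

// Hint the kernel to start read-ahead so later accesses don't fault
// and block. For a memory-mapped region:
void PrefetchMapped(void* addr, size_t length) {
  madvise(addr, length, MADV_WILLNEED);  // async read-ahead of the pages
}

// For a regular file descriptor:
void PrefetchFile(int fd, off_t offset, off_t length) {
  posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
}

The S3 analogue would be an HTTP range request issued in the
background, which is indeed an argument for pushing the decision down
into the stream implementation.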

I think we need to design around remote filesystems with unpredictable
latency and throughput. Anyone involved in data warehousing systems in
the cloud will be intimately familiar with these issues -- a system
designed around local disk and memory-mapping generally isn't going to
adapt well to remote filesystems.

> > # Model B (CPU and IO work split into tasks that execute on different
> > thread queues)
> >
> > Pros
> > - Not sure
> >
> > Cons
> > - Could cause performance issues if the IO tasks are mostly free (e.g.
> > due to buffering)
>
> In Model B, the decision of whether to use a background thread or
> some other means of hiding I/O costs could also be pushed down into
> the stream implementation.
>
> > I think we need to investigate some asynchronous C++ programming libraries 
> > like
> >
> > https://github.com/facebook/folly/tree/master/folly/fibers
> >
> > to see how organizations with mature C++ practices are handling these
> > issues from a programming model standpoint.
>
> Well, right now our model is synchronous I/O.  If we want to switch to
> asynchronous I/O we'll have to redesign a lot of APIs.  Also, since C++
> doesn't have a convenient story for asynchronous I/O or coroutines
> (yet), this will make programming significantly more painful, which is
> (IMO) something we'd like to avoid.  And that's not to mention the
> problem of mapping the C++ asynchronous I/O model onto the
> corresponding Python primitives...
>
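
To make Model B concrete, the task split would look roughly like this
(a sketch with hypothetical ThreadPool/RandomAccessFile interfaces,
not Arrow's actual classes):

#include <cstdint>
#include <memory>

// Hypothetical interfaces for the sketch:
class Buffer;
struct RandomAccessFile {
  std::shared_ptr<Buffer> ReadAt(int64_t offset, int64_t nbytes);
};
struct ThreadPool {
  template <typename Task> void Submit(Task&& task);
};
void DecodePages(std::shared_ptr<Buffer> raw);

// Model B: separate queues for IO-bound and CPU-bound work. The IO
// task fetches raw bytes, then schedules its decode continuation on
// the CPU pool, so decode workers never block on reads.
void ScheduleColumnChunk(ThreadPool* io_pool, ThreadPool* cpu_pool,
                         RandomAccessFile* file, int64_t offset,
                         int64_t nbytes) {
  io_pool->Submit([=] {
    std::shared_ptr<Buffer> raw = file->ReadAt(offset, nbytes);  // blocking
    cpu_pool->Submit([=] { DecodePages(raw); });  // CPU-only work
  });
}

The con listed above applies here: if ReadAt is mostly free (e.g. due
to buffering), the hop between queues is pure overhead.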

Right, and the cost of an async redesign is exactly why I'm suggesting
a simpler model that lets threads waiting on IO yield so that other
threads can execute. Currently they simply block.
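
Concretely, one minimal way to get that behavior is a counter of CPU
"slots" on the pool: a task about to issue a blocking read releases
its slot so another queued task can run, and reacquires one before
resuming CPU work. A sketch (hypothetical, not an existing Arrow API):

#include <condition_variable>
#include <mutex>

class CpuSlots {
 public:
  explicit CpuSlots(int capacity) : free_(capacity) {}

  void Acquire() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return free_ > 0; });
    --free_;
  }

  void Release() {
    std::lock_guard<std::mutex> lock(mu_);
    ++free_;
    cv_.notify_one();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int free_;
};

// Inside a pool task, around a blocking read:
//   slots.Release();                // let another task execute
//   buffer = file->Read(nbytes);    // blocking, possibly high latency
//   slots.Acquire();                // rejoin CPU execution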

>
> More generally, I'm wary of significantly complicating our I/O handling
> until we have reliable reproducers of I/O-originated performance issues
> with Arrow.
>

If it helps, I can spend some time implementing Model A as it relates
to reading Parquet files in parallel. If you introduce a small amount
of latency into reads (10-50ms per read call -- such as you would
experience using Amazon S3), the current synchronous approach will
have significant IO-wait-related performance issues.
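
For the experiment, injecting the latency is easy with a wrapper that
sleeps before delegating each read (again a hypothetical sketch;
RandomAccessFile/Buffer are stand-ins, not Arrow's actual classes):

#include <chrono>
#include <cstdint>
#include <memory>
#include <thread>

class Buffer;
struct RandomAccessFile {
  virtual ~RandomAccessFile() = default;
  virtual std::shared_ptr<Buffer> ReadAt(int64_t offset,
                                         int64_t nbytes) = 0;
};

// Emulates a remote store such as S3 by adding a fixed per-call delay.
class SlowFile : public RandomAccessFile {
 public:
  SlowFile(std::shared_ptr<RandomAccessFile> wrapped,
           std::chrono::milliseconds latency)
      : wrapped_(std::move(wrapped)), latency_(latency) {}

  std::shared_ptr<Buffer> ReadAt(int64_t offset, int64_t nbytes) override {
    std::this_thread::sleep_for(latency_);  // simulated network round trip
    return wrapped_->ReadAt(offset, nbytes);
  }

 private:
  std::shared_ptr<RandomAccessFile> wrapped_;
  std::chrono::milliseconds latency_;
};

Wrapping the input with, say, SlowFile(file, milliseconds(25)) should
make the IO-wait stalls show up immediately in a parallel read
benchmark.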

> Regards
>
> Antoine.
