Re: [DISCUSS][C++][Proposal] Threading engine for Arrow

Antoine Pitrou Mon, 22 Jul 2019 08:50:21 -0700


On 18/07/2019 at 00:25, Wes McKinney wrote:
> 
> * We look forward in the stream until we find a complete Thrift data
> page header. This may trigger 0 or more (possibly multiple) Read calls
> to the underlying "file" handle. In the default case, the data is all
> actually in memory so the reads are zero copy buffer slices.


If the file is memory-mapped, it doesn't mean everything is in RAM.
Starting to read a page may incur a page fault and some unexpected
blocking I/O.

The solution to hide I/O costs could be to use madvise() (in which case
the background read is done by the kernel without any need for
user-visible I/O threads).  Similarly, on a regular file one can use
posix_fadvise().  This may mean that the whole issue of "how to hide I/O
for a given source" is stream-specific (for example, if a file is
S3-backed, perhaps you want to issue an HTTP fetch in the background?).

> # Model B (CPU and IO work split into tasks that execute on different
> thread queues)
> 
> Pros
> - Not sure
> 
> Cons
> - Could cause performance issues if the IO tasks are mostly free (e.g.
> due to buffering)

In model B, the decision of whether to use a background thread or
some other means of hiding I/O costs could also be pushed down into the
stream implementation.
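
As a rough illustration of pushing that decision into the stream, here is
a sketch of a source interface with an advisory "will need" hint.  All
names (InputSource, WillNeed, ReadAt, MemorySource, FileSource, ReadPage)
are hypothetical and chosen for this example only; they are not taken
from the Arrow codebase.  The in-memory source ignores the hint, while
the file source forwards it to posix_fadvise() (Linux/POSIX assumed).  An
S3-backed source could instead start a background fetch in the same hook.

```cpp
// Sketch only: hypothetical interface, not an existing Arrow API.
#include <fcntl.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

class InputSource {
 public:
  virtual ~InputSource() = default;
  // Advisory: [offset, offset + length) will be read soon. May be a no-op.
  virtual void WillNeed(int64_t offset, int64_t length) = 0;
  // Blocking read of up to `length` bytes starting at `offset`.
  virtual std::string ReadAt(int64_t offset, int64_t length) = 0;
};

// Data already in memory: reads are cheap, so the hint does nothing.
class MemorySource : public InputSource {
 public:
  explicit MemorySource(std::string data) : data_(std::move(data)) {}
  void WillNeed(int64_t, int64_t) override {}
  std::string ReadAt(int64_t offset, int64_t length) override {
    return data_.substr(static_cast<size_t>(offset),
                        static_cast<size_t>(length));
  }

 private:
  std::string data_;
};

// Regular file: let the kernel do the background read.
class FileSource : public InputSource {
 public:
  explicit FileSource(int fd) : fd_(fd) {}  // does not take ownership of fd
  void WillNeed(int64_t offset, int64_t length) override {
    // Purely advisory, so the return value is deliberately ignored.
    (void)posix_fadvise(fd_, offset, length, POSIX_FADV_WILLNEED);
  }
  std::string ReadAt(int64_t offset, int64_t length) override {
    std::string out(static_cast<size_t>(length), '\0');
    // A single pread() may return fewer bytes than requested; a real
    // implementation would loop. Here the result is simply truncated.
    const ssize_t n = pread(fd_, &out[0], out.size(), offset);
    out.resize(n > 0 ? static_cast<size_t>(n) : 0);
    return out;
  }

 private:
  int fd_;
};

// Caller code is the same whatever the source does with the hint.
std::string ReadPage(InputSource& src, int64_t offset, int64_t length) {
  src.WillNeed(offset, length);
  // CPU work on a previous page could overlap with the I/O here.
  return src.ReadAt(offset, length);
}
```

The point of the design is that the reader stays synchronous and never
learns how (or whether) a given source hides its latency.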

> I think we need to investigate some asynchronous C++ programming
> libraries like
> 
> https://github.com/facebook/folly/tree/master/folly/fibers
> 
> to see how organizations with mature C++ practices are handling these
> issues from a programming model standpoint

Well, right now our model is synchronous I/O.  If we want to switch to
asynchronous I/O we'll have to redesign a lot of APIs.  Also, since C++
doesn't have a convenient story for asynchronous I/O or coroutines
(yet), this will make programming significantly more painful, which is
(IMO) something we'd like to avoid.  And I'm not even mentioning the
problem of mapping the C++ asynchronous I/O model onto the corresponding
Python primitives...


More generally, I'm wary of significantly complicating our I/O handling
until we have reliable reproducers of I/O-originated performance issues
with Arrow.

Regards

Antoine.
