On 18/07/2019 at 00:25, Wes McKinney wrote:
>
> * We look forward in the stream until we find a complete Thrift data
> page header. This may trigger 0 or more (possibly multiple) Read calls
> to the underlying "file" handle. In the default case, the data is all
> actually in memory so the reads are zero copy buffer slices.

If the file is memory-mapped, it doesn't mean everything is in RAM.
Starting to read a page may incur a page fault and some unexpected
blocking I/O.

One way to hide I/O costs could be to use madvise() (in which case the
background read is done by the kernel without any need for user-visible
IO threads). Similarly, on a regular file one can use fadvise().

This may mean that the whole issue of "how to hide I/O for a given
source" is stream-specific (for example, if a file is S3-backed,
perhaps you want to issue an HTTP fetch in the background?).

> # Model B (CPU and IO work split into tasks that execute on different
> thread queues)
>
> Pros
> - Not sure
>
> Cons
> - Could cause performance issues if the IO tasks are mostly free (e.g.
> due to buffering)

In model B, the decision of whether to use a background thread or some
other means of hiding I/O costs could also be pushed down into the
stream implementation.

> I think we need to investigate some asynchronous C++ programming libraries
> like
>
> https://github.com/facebook/folly/tree/master/folly/fibers
>
> to see how organizations with mature C++ practices are handling these
> issues from a programming model standpoint

Well, right now our model is synchronous I/O. If we want to switch to
asynchronous I/O we'll have to redesign a lot of APIs. Also, since C++
doesn't have a convenient story for asynchronous I/O or coroutines
(yet), this will make programming significantly more painful, which is
(IMO) something we'd like to avoid. And I'm not mentioning the problem
of mapping the C++ asynchronous I/O model onto the corresponding Python
primitives...

More generally, I'm wary of significantly complicating our I/O handling
until we have reliable reproducers of I/O-originated performance issues
with Arrow.

Regards

Antoine.
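
To make the madvise() / fadvise() suggestion above concrete, here is a
minimal sketch, assuming a Linux/POSIX system. The helper names
(PrefetchFileRange, PrefetchMappedRange, DemoPrefetch) are hypothetical
and are not part of Arrow or Parquet; they only illustrate how a stream
implementation might ask the kernel to start reading in the background.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: hint that [offset, offset + length) of a regular
// file will be read soon. posix_fadvise() returns 0 on success or an
// error number directly (it does not set errno).
int PrefetchFileRange(int fd, off_t offset, off_t length) {
  return posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
}

// Hypothetical helper: the same hint for a memory-mapped region.
// addr must be page-aligned (e.g. the address returned by mmap()).
// Returns 0 on success or the errno value on failure.
int PrefetchMappedRange(void* addr, size_t length) {
  return madvise(addr, length, MADV_WILLNEED) == 0 ? 0 : errno;
}

// Opens `path`, issues both hints over the whole file, and reports
// whether both calls were accepted. The hints are advisory: success
// means the kernel accepted them, not that the data is already in RAM.
bool DemoPrefetch(const char* path) {
  assert(path != nullptr);
  int fd = open(path, O_RDONLY);
  if (fd < 0) {
    std::perror("open");
    return false;
  }
  struct stat st;
  if (fstat(fd, &st) != 0 || st.st_size == 0) {
    close(fd);
    return false;
  }
  bool ok = PrefetchFileRange(fd, 0, st.st_size) == 0;

  const size_t length = static_cast<size_t>(st.st_size);
  void* addr = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0);
  if (addr == MAP_FAILED) {
    close(fd);
    return false;
  }
  ok = ok && PrefetchMappedRange(addr, length) == 0;

  munmap(addr, length);
  close(fd);
  return ok;
}
```

Calling DemoPrefetch() on an existing non-empty file should return true
on Linux. Whether these hints actually hide any latency depends on the
kernel and the storage, so they would need measuring against a real
workload.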