This is a follow-up to a discussion from last September [3]. I've been investigating Arrow's use of threading and I/O, and I believe there are some improvements that could be made. Arrow currently supports two threading options (a single thread and a "per-core" thread pool). Both of these approaches are hindered if blocking I/O is performed on a CPU worker thread, because that thread sits idle while it waits instead of doing compute work.
This problem is somewhat alleviated by using background threads for I/O (in the readahead iterator), but that implementation is incomplete and does not allow for nested parallelism.

I would like to convert Arrow's I/O operations to an asynchronous model (expanding on the existing futures API). I have already converted the CSV reader in this fashion [2] as a proof of concept.

I have written a more detailed proposal here [1]. Please feel free to suggest improvements or alternate approaches. Also, please let me know if I missed any goals or considerations I should keep in mind.

Also, hello, this email is a bit of an introduction. I have previously made one or two small comments/changes, but I am hoping to be more involved going forward. I've mostly worked on proprietary test and measurement software but have recently joined Ursa Computing, which will allow me more time to work on Arrow.

Thanks,

Weston Pace

[1] https://docs.google.com/document/d/1tO2WwYL-G2cB_MCPqYguKjKkRT7mZ8C2Gc9ONvspfgo/edit?usp=sharing
[2] https://github.com/apache/arrow/pull/9095
[3] https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3CCAJPUwMDmU3rFt6Upyis%3DyXB%3DECkmrjdncgR9xj%3DDFapJt9FfUg%40mail.gmail.com%3E