This is a follow-up to a discussion from last September [3].  I've
been investigating Arrow's use of threading and I/O and I believe
there are some improvements that could be made.  Arrow currently
supports two threading options (a single thread and a "per-core"
thread pool).  Both of these approaches are hindered if blocking I/O
is performed on a CPU worker thread, because that thread then sits
idle waiting on the I/O instead of doing CPU work.

This is somewhat alleviated by using background threads for I/O (in
the readahead iterator), but that implementation is not complete and
does not allow for nested parallelism.  I would like to convert
Arrow's I/O operations to an asynchronous model (expanding on the
existing futures API).  I have already converted the CSV reader in
this fashion [2] as a proof of concept.
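
To make the idea concrete, here is a minimal sketch of the general
pattern (not Arrow's futures API and not the code in [2]).  It uses
Python's asyncio purely for illustration.  The names read_block,
parse_block, read_and_parse and read_all, and the pool sizes, are
hypothetical.  Blocking reads are sent to a dedicated I/O pool, and
parsing is scheduled on a small CPU pool only once the data has
arrived, so no CPU worker is ever parked on a read.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def read_block(i):
    """Hypothetical blocking read; sleep stands in for I/O latency."""
    time.sleep(0.01)
    return f"a{i},b{i}\n".encode()


def parse_block(data):
    """Hypothetical CPU-side work: split one CSV line into fields."""
    return data.decode().strip().split(",")


async def read_and_parse(i, io_pool, cpu_pool):
    loop = asyncio.get_running_loop()
    # The blocking read occupies an I/O thread, never a CPU worker.
    data = await loop.run_in_executor(io_pool, read_block, i)
    # Parsing is only scheduled on the CPU pool once the data is ready.
    return await loop.run_in_executor(cpu_pool, parse_block, data)


async def read_all(n_blocks):
    # Assumed sizes: many I/O threads, few CPU threads.
    with ThreadPoolExecutor(max_workers=8) as io_pool, \
            ThreadPoolExecutor(max_workers=2) as cpu_pool:
        tasks = [read_and_parse(i, io_pool, cpu_pool)
                 for i in range(n_blocks)]
        # gather returns results in the order the tasks were given.
        return await asyncio.gather(*tasks)


rows = asyncio.run(read_all(4))
```

Because each step returns a future rather than blocking, a task that
is itself running in this model can start further reads without
tying up a worker, which is roughly what I mean by nested
parallelism above.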

I have written a more detailed proposal here [1].  Please feel free to
suggest improvements or alternate approaches.  Also, please let me
know if I missed any goals or considerations I should keep in mind.

Also, hello!  This email doubles as a bit of an introduction.  I have
previously made one or two small comments/changes but I am hoping to
be more involved going forward.  I've mostly worked on proprietary
test and measurement software but have recently joined Ursa Computing,
which will allow me more time to work on Arrow.

Thanks,

Weston Pace

[1] https://docs.google.com/document/d/1tO2WwYL-G2cB_MCPqYguKjKkRT7mZ8C2Gc9ONvspfgo/edit?usp=sharing
[2] https://github.com/apache/arrow/pull/9095
[3] https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3CCAJPUwMDmU3rFt6Upyis%3DyXB%3DECkmrjdncgR9xj%3DDFapJt9FfUg%40mail.gmail.com%3E
