I have been working for the last few months on ARROW-7001 [0], which enables nested parallelism by converting the dataset scanning to asynchronous (previously announced here [1] and discussed here [2]). In addition to enabling nested parallelism, this also allows for parallel readahead, which gives significant benefits on parallel filesystems (e.g. S3).
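To make the readahead idea concrete, here is a minimal sketch of a pull-based async generator with eager readahead. This is not Arrow's actual API: the names are illustrative, and std::future stands in for arrow::Future so the example is self-contained.

```cpp
// A pull-based async generator is just a callable that returns a future for
// the next item; an empty optional signals end-of-stream. (std::future stands
// in for arrow::Future; names are hypothetical, not Arrow's API.)
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <optional>

template <typename T>
using AsyncGenerator = std::function<std::future<std::optional<T>>()>;

// Readahead: eagerly pull up to `queue_size` (>= 1) items from `source` so
// that I/O for upcoming batches overlaps with processing of the current one.
template <typename T>
AsyncGenerator<T> MakeReadahead(AsyncGenerator<T> source, std::size_t queue_size) {
  auto queue = std::make_shared<std::deque<std::future<std::optional<T>>>>();
  for (std::size_t i = 0; i < queue_size; ++i) queue->push_back(source());
  return [source, queue]() {
    auto next = std::move(queue->front());
    queue->pop_front();
    queue->push_back(source());  // keep the pipeline full
    return next;
  };
}
```

The consumer still pulls one item at a time, but `queue_size` reads are always in flight behind it.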
The good news: the PR is passing CI (it looks like there is one failure in Python that may or may not be related; the RTools 3.5 failure is expected at the moment). David Li has done some great work investigating the performance benefits. There are substantial benefits on S3 for both IPC [3] and Parquet [4] across a variety of parameters. With additional work [5] the CSV reader could also reap these benefits.

The bad news: the change is significant. Micah has previously expressed some concern about the viral nature of Futures. Also, when I requested review of the pull request, Antoine expressed some concern about the substantial amount of tricky code we are going to have to maintain in src/arrow/util/async_generator.h. The change is also innately error-prone, as it deals with threading. Last week's trouble with the R build was a direct result of some of the precursor work for this feature.

**For the above reasons I am sadly recommending that this feature not target 4.0 as I had previously hoped.**

For discussion:

1) Merging

The change is significant (+1400/-730 at the moment). I have done my best to break it into smaller PRs (most of these have since been merged), but I have reached the limit of how much it can reasonably be split up. The main reason is that I am replacing an existing feature: I cannot regress performance and I cannot regress functionality. I could split the PR into smaller chunks if I could waive either of those constraints. Possible solutions:

- Just go for it. This would need someone to volunteer a substantial chunk of time for review.
- Small chunks via a branch. If someone could create a feature branch, I could break the PR into smaller commits which intentionally break performance and functionality for the sake of piecemeal simplicity.
- Small chunks via a "new API". This would be similar to how the legacy ParquetDataset was converted to the new datasets API. A new Scanner (ScannerV2, AsyncScanner, etc.) would be created that would grow to have the same functionality as the old scanner, and then the higher-level APIs would switch over. If this is the route we want to go, I could potentially have this new API available via C++ for 4.0 for experimental use.

2) Testing

There is very little inter-reader multi-threaded testing in the C++ code base. I am not just talking about reading with "use_threads" set to true or false, but specifically about issuing multiple reads at the same time on potentially overlapping data sources. If the C++ code base is going to support a query engine, then this sort of testing would seem essential. Assuming Ursa is willing, I am proposing to start work on this infrastructure during the 5.0 timeframe. I'll send out an additional email describing this work in more detail sometime after the 4.0 release.

3) Maintenance

src/arrow/util/async_generator.h has grown to nearly 1500 lines of code. The short description of these utilities is that they are a pull-based analogue to rxcpp [6]: the file contains the utilities necessary to work with iterators (pull-driven) in an asynchronous fashion. This code is fairly subtle and could be difficult to maintain, although I am happy to help do so for the foreseeable future. It seems, at the very least, that there should be some Arrow committer willing to take on learning and helping maintain the code.

Other options: although I have grown fond of the async generator code, I probably have some ugly-baby syndrome. There are some other potential options we could consider.

- Just use rxcpp. It's an external library and would likely need some glue to mesh with arrow::util::Future/arrow::util::Result. It is also push-based instead of pull-based, and that may require some adjustment to the scanner.
- Wait and use the Arrow execution engine. There has been discussion of a C++-based execution engine.
From what I can glean of the design docs, this would involve an actor framework of sorts and would probably be compatible with dataset scanning.

[0] https://github.com/apache/arrow/pull/9607
[1] https://mail-archives.apache.org/mod_mbox/arrow-dev/202102.mbox/%3CCAE4AYb18dWZ2DLo-9N-m2bCrk4zDkhUe6-0G%3D%2ByifLt%2B6hZivg%40mail.gmail.com%3E
[2] https://docs.google.com/document/d/1tO2WwYL-G2cB_MCPqYguKjKkRT7mZ8C2Gc9ONvspfgo/edit?usp=sharing
[3] https://github.com/apache/arrow/pull/9656#discussion_r606289729
[4] https://github.com/apache/arrow/pull/9620#issuecomment-814393897
[5] https://issues.apache.org/jira/browse/ARROW-11889
[6] https://github.com/ReactiveX/RxCpp
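P.S. For those who haven't looked at async_generator.h, here is a rough sketch of the kind of combinator the "pull-based analogue to rxcpp" description refers to. The names are hypothetical and std::future stands in for arrow::Future; the real utilities are considerably more involved.

```cpp
// Sketch of a pull-based combinator (hypothetical names, not Arrow's API).
// A generator is a callable returning a future for the next item; an empty
// optional signals end-of-stream.
#include <functional>
#include <future>
#include <optional>
#include <utility>

template <typename T>
using AsyncGenerator = std::function<std::future<std::optional<T>>()>;

// Apply `fn` to every item while preserving the pull-driven shape: each call
// pulls one item from `source` and maps it when the result is retrieved.
template <typename T, typename U>
AsyncGenerator<U> MakeMapped(AsyncGenerator<T> source, std::function<U(T)> fn) {
  return [source, fn]() {
    return std::async(std::launch::deferred,
                      [fut = source(), fn]() mutable -> std::optional<U> {
                        auto item = fut.get();
                        if (!item) return std::nullopt;  // propagate end-of-stream
                        return fn(std::move(*item));
                      });
  };
}
```

The subtlety in the real code comes from chaining many such stages without blocking and without re-entrancy or lifetime bugs, which is exactly the maintenance concern above.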