Hey Weston,

First, thanks for all your work in getting these changes this far. I think it's 
also been a valuable experience in working with async code, and hopefully the 
problems we've run into so far will help inform further work, including with 
the query engine.

If you're not comfortable merging, then we shouldn't try to rush it through, 
regardless of reviewer availability. If we're going to split it up, I would 
prefer the 'ScannerV2' approach: while it'll clutter the API for a bit, it at 
least gives us a fallback if we observe threading issues 'in the wild'.
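
To make the fallback concrete, here's roughly what I have in mind (names like
ScanOptions::use_async and ScannerV2 below are placeholders, not the actual
API in the PR): the higher-level APIs dispatch on an option, so the existing
synchronous path stays reachable until we trust the new one.

    // Sketch only -- placeholder names, not the real proposal.
    #include <memory>

    struct ScanOptions {
      bool use_async = false;  // opt in to the new async-backed scanner
    };

    class Scanner { /* existing synchronous scanner */ };
    class ScannerV2 : public Scanner { /* new async-backed scanner */ };

    std::shared_ptr<Scanner> MakeScanner(const ScanOptions& options) {
      if (options.use_async) {
        // New code path; easy to switch back off if threading issues show up.
        return std::make_shared<ScannerV2>();
      }
      return std::make_shared<Scanner>();
    }

    int main() {
      ScanOptions options;
      options.use_async = true;  // callers opt in explicitly
      auto scanner = MakeScanner(options);
      return scanner != nullptr ? 0 : 1;
    }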

As for rxcpp - I took a quick look and while it doesn't seem abandoned, the 
maintainers are no longer very active. The library looks mature, but there 
are (for example) issues filed for compatibility with future C++ versions 
sitting unreviewed, and the blurb about it targeting only the latest C++ 
compilers might not work so well for us.

I think it may be useful to explicitly align the async generator utilities with 
their rxcpp or ReactiveX equivalents so that there are common concepts we 
can refer back to, especially if we further expand the utilities. While not 
many of us may be familiar with rxcpp already, at least we'd have a reference 
for how our utilities are supposed to work.
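
For example (purely illustrative, not the actual contents of
async_generator.h, and using std::future just to keep the sketch
self-contained where the real utilities use arrow::Future/arrow::Result), a
pull-based 'map' utility could be documented against the ReactiveX operator
it mirrors:

    // Sketch only: a pull-based generator type plus a "map" utility,
    // annotated with its rough ReactiveX equivalent.
    #include <functional>
    #include <future>
    #include <iostream>

    // Pull-based async generator: each call asks ("pulls") for the next item.
    template <typename T>
    using AsyncGenerator = std::function<std::future<T>()>;

    // Rough ReactiveX equivalent: map (a.k.a. Select).
    template <typename T, typename U>
    AsyncGenerator<U> MapGenerator(AsyncGenerator<T> source,
                                   std::function<U(T)> fn) {
      return [source, fn]() {
        return std::async(std::launch::deferred,
                          [source, fn]() { return fn(source().get()); });
      };
    }

    int main() {
      int counter = 0;
      AsyncGenerator<int> ints = [&counter]() {
        return std::async(std::launch::deferred,
                          [&counter]() { return counter++; });
      };
      auto doubled = MapGenerator<int, int>(ints, [](int x) { return x * 2; });
      std::cout << doubled().get() << "\n";  // prints 0
      std::cout << doubled().get() << "\n";  // prints 2
    }

Even just a comment per utility noting its rough rx equivalent would give us
that shared vocabulary.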

Using the framework for query execution is an interesting point - doing so 
might feel like wasted work, but again, hopefully we can apply the lessons 
learned here to the query framework. (For instance, integrating any 
debug/tracing utilities we may have wanted, such as detecting 'abandoned' 
futures in ARROW-12207.)
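
To sketch what I mean by a debug utility for abandoned futures (hypothetical,
not what ARROW-12207 actually proposes, and again using std::future only for
self-containment): a wrapper that complains in its destructor if nothing ever
consumed the result.

    // Hypothetical debug wrapper for flagging "abandoned" futures.
    #include <future>
    #include <iostream>
    #include <utility>

    template <typename T>
    class TracedFuture {
     public:
      explicit TracedFuture(std::future<T> inner) : inner_(std::move(inner)) {}
      TracedFuture(TracedFuture&&) = default;
      TracedFuture& operator=(TracedFuture&&) = default;

      T Get() { return inner_.get(); }  // consuming invalidates inner_

      ~TracedFuture() {
        // A debug build could also record a creation stack trace here.
        if (inner_.valid()) {
          std::cerr << "warning: future destroyed without being consumed\n";
        }
      }

     private:
      std::future<T> inner_;
    };

    int main() {
      TracedFuture<int> used(std::async(std::launch::deferred, [] { return 1; }));
      std::cout << used.Get() << "\n";  // consumed: no warning

      TracedFuture<int> abandoned(std::async(std::launch::deferred, [] { return 2; }));
      // never consumed: its destructor prints the warning at the end of main
      return 0;
    }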

-David

On 2021/04/07 17:18:30, Weston Pace <weston.p...@gmail.com> wrote: 
> I have been working the last few months on ARROW-7001 [0] which
> enables nested parallelism by converting the dataset scanning to
> asynchronous (previously announced here[1] and discussed here[2]).  In
> addition to enabling nested parallelism this also allows for parallel
> readahead which gives significant benefits on parallel filesystems
> (e.g. S3).
> 
> The good news: The PR is passing CI (it looks like there is one
> failure in python that may or may not be related.  The RTools 3.5
> failure is expected at the moment).  David Li has done some great work
> investigating the performance benefits.  There are substantial
> benefits on S3 for both IPC [3] and Parquet [4] across a variety of
> parameters.  With additional work[5] the CSV reader could also reap
> these benefits.
> 
> The bad news: The change is significant.  Previously Micah has
> expressed some concern about the viralness of Futures.  Also, when I
> was requesting a pull review, Antoine expressed some concern about the
> substantial amount of tricky code we are going to have to maintain in
> src/arrow/util/async_generator.h.  The change is also innately
> error-prone as it deals with threading.  Last week's trouble with the
> R build was a direct result of some of the precursor work for this
> feature.
> 
> **For the above reasons I am sadly recommending that this feature not
> target 4.0 as I had previously hoped.**
> 
> For discussion:
> 
> 1) Merging
> 
> The change is significant (+1400/-730 atm).  I have done my best to
> break it into smaller PRs (most of these have since been merged).
> However, I have reached the limit of how much it can reasonably be
> split up.  The main reason for this is that I am replacing an existing
> feature.  I cannot regress performance and I cannot regress
> functionality.  I could split the PR into smaller chunks if I could
> waive either of those constraints.
> 
> Possible solutions:
> 
>  - Just go for it.  This would need someone to volunteer a substantial
> chunk of time for review.
>  - Small chunks via branch.  If someone could create a feature branch
> I could break the PR into smaller commits which intentionally break
> performance and functionality for the sake of piecemeal simplicity.
>  - Small chunks via "new API".  This would be similar to how the
> legacy ParquetDataset was converted to the new datasets API.  A new
> Scanner (ScannerV2, AsyncScanner, etc.) would be created that would
> grow to have the same functionality as the old scanner and then the
> higher level APIs would switch over.  If this is the route we want to
> go I could potentially have this new API available via C++ for 4.0 for
> experimental use.
> 
> 2) Testing
> 
> There is very little inter-reader multi-threaded testing in the C++
> code base.  I am not just talking about reading with "use_threads" set
> to true or false but specifically issuing multiple reads at the same
> time on potentially overlapping data sources.  If the C++ code base is
> going to support a query engine then it would seem this sort of
> testing is essential.  Assuming Ursa is willing, I am proposing to start
> work on this infrastructure during the 5.0 timeframe.  I'll send out an
> additional email describing this work in more detail sometime after
> the 4.0 release.
> 
> 3) Maintenance
> 
> src/arrow/util/async_generator.h has grown to nearly 1500 lines of
> code.  The short description of these utilities is that they are a
> pull-based analogue to rxcpp[6].  The file contains the utilities
> necessary to work with iterators (pull-driven) in an asynchronous
> fashion.  This code is fairly subtle and it could be difficult to
> maintain, although I am happy to help do so for the foreseeable
> future.  It seems, at the very least, that there should be some Arrow
> committer willing to take on learning and helping maintain the code.
> 
> Other options:
> 
> Although I have grown fond of the async generator code I probably have
> some ugly baby syndrome.  There are some other potential options we
> could consider.
> 
>  - Just use rxcpp.  It's an external library and would likely need
> some glue to mesh with arrow::util::Future/arrow::util::Result.  It is
> also push-based instead of pull-based and that may require some
> adjustment to the scanner.
>  - Wait and use the Arrow execution engine.  There has been discussion
> of a C++-based execution engine.  From what I can glean from the design
> docs this would involve an actor framework of sorts and would probably
> be compatible with dataset scanning.
> 
> [0] https://github.com/apache/arrow/pull/9607
> [1] 
> https://mail-archives.apache.org/mod_mbox/arrow-dev/202102.mbox/%3CCAE4AYb18dWZ2DLo-9N-m2bCrk4zDkhUe6-0G%3D%2ByifLt%2B6hZivg%40mail.gmail.com%3E
> [2] 
> https://docs.google.com/document/d/1tO2WwYL-G2cB_MCPqYguKjKkRT7mZ8C2Gc9ONvspfgo/edit?usp=sharing
> [3] https://github.com/apache/arrow/pull/9656#discussion_r606289729
> [4] https://github.com/apache/arrow/pull/9620#issuecomment-814393897
> [5] https://issues.apache.org/jira/browse/ARROW-11889
> [6] https://github.com/ReactiveX/RxCpp
> 
