Three thoughts:

1. Given that lots of prerequisite patches have already merged, and we've
seen some instability as a result of those, I don't think it's obviously
true that holding ARROW-7001 out of 4.0 is lower risk. It could be that the
intermediate state we're in now is higher risk. What do you think?

2. A great way to manage the risk is what David said: ship the new version
as ScannerV2, and let client libraries switch back to the old scanner if
they encounter an issue. Bugs are bad, but if there's a fallback or
workaround, they're less severe. And we're only going to find some of the
issues by putting the code out there in a release and getting real usage
on it.

3. But given (1), I wonder how easy it is at this point to ship a separate
ScannerV2 that leaves the previous implementation working as it was. I
guess the old code is there in the git history.
Neal

On Wed, Apr 7, 2021 at 7:30 AM Adam Lippai <a...@rigo.sk> wrote:
> Hi Weston,
>
> Objective note:
> I'm just a user, but I want to add that so far the Arrow releases are
> pretty good quality, which means you are making good calls.
>
> Personal opinion:
> There were several annoying bugs, where one would have to change a
> parameter between Parquet V1/V2 or threaded/non-threaded, but nothing
> exceptional yet.
> If you feel the work is ready and you are concerned about its unusual
> size, then I'd say go with the merge; my experience is that size is
> not worrying on its own, and there is no need for extra caution.
> If you feel it's under-tested or under-reviewed compared to the
> previous code, then that's a different topic; it should be as good as
> the current *average*.
> You can make it the default behavior; if the bugs are critical,
> everybody can stay on 3.x instead of 4.0 until 4.0.1 arrives, or use
> workarounds (e.g. disable threading).
> Version 4.0 is not a critical bugfix release one would need to upgrade
> to instantly.
> You wouldn't deprive them of the opportunity to lower the risks or
> resolve bugs in production.
>
> Final thought:
> While they are good enough, the releases in this field - like pandas,
> dask, or airflow - can't be compared to how you deliver new major
> versions, so be proud and optimistic :)
>
> Best regards,
> Adam Lippai
>
> On Wed, Apr 7, 2021 at 3:53 PM David Li <lidav...@apache.org> wrote:
>
> > Hey Weston,
> >
> > First, thanks for all your work in getting these changes this far. I
> > think it's also been a valuable experience in working with async
> > code, and hopefully the problems we've run into so far will help
> > inform further work, including with the query engine.
> >
> > If you're not comfortable merging, then we shouldn't try to rush it
> > through, regardless of reviewer availability.
> > If we're going to split it up, I would prefer the 'ScannerV2'
> > approach: while it'll clutter the API for a bit, it at least gives
> > us a fallback if we observe threading issues 'in the wild'.
> >
> > As for rxcpp - I took a quick look, and while it doesn't seem
> > abandoned, the maintainers are at least no longer very active. While
> > the library looks mature, there are (for example) issues filed for
> > compatibility with future C++ versions sitting unreviewed, and the
> > blurb about it targeting only the latest C++ compilers might not
> > work so well for us.
> >
> > I think it may be useful to explicitly align the async generator
> > utilities with their rxcpp or ReactiveX equivalents so that there
> > are some common concepts we can refer back to, especially if we
> > further expand the utilities. While not many of us may be familiar
> > with rxcpp already, at least we'd have a reference for how our
> > utilities are supposed to work.
> >
> > Using the framework for query execution is an interesting point -
> > doing so might feel like wasted work, but again, hopefully we could
> > apply the lessons here towards the query framework. (For instance,
> > integrating any debug/tracing utilities we may have wanted, as with
> > detecting 'abandoned' futures in ARROW-12207.)
> >
> > -David
> >
> > On 2021/04/07 17:18:30, Weston Pace <weston.p...@gmail.com> wrote:
> > > I have been working for the last few months on ARROW-7001 [0],
> > > which enables nested parallelism by converting the dataset
> > > scanning to be asynchronous (previously announced here [1] and
> > > discussed here [2]). In addition to enabling nested parallelism,
> > > this also allows for parallel readahead, which gives significant
> > > benefits on parallel filesystems (e.g. S3).
> > >
> > > The good news: the PR is passing CI (it looks like there is one
> > > failure in Python that may or may not be related; the RTools 3.5
> > > failure is expected at the moment).
> > > David Li has done some great work investigating the performance
> > > benefits. There are substantial benefits on S3 for both IPC [3]
> > > and Parquet [4] across a variety of parameters. With additional
> > > work [5], the CSV reader could also reap these benefits.
> > >
> > > The bad news: the change is significant. Previously, Micah
> > > expressed some concern about the viral nature of Futures. Also,
> > > when I was requesting a pull review, Antoine expressed some
> > > concern about the substantial amount of tricky code we are going
> > > to have to maintain in src/arrow/util/async_generator.h. The
> > > change is also innately error-prone, as it deals with threading.
> > > Last week's trouble with the R build was a direct result of some
> > > of the precursor work for this feature.
> > >
> > > **For the above reasons I am sadly recommending that this feature
> > > not target 4.0 as I had previously hoped.**
> > >
> > > For discussion:
> > >
> > > 1) Merging
> > >
> > > The change is significant (+1400/-730 at the moment). I have done
> > > my best to break it into smaller PRs (most of these have since
> > > been merged). However, I have reached the limit of how much it can
> > > reasonably be split up. The main reason for this is that I am
> > > replacing an existing feature: I cannot regress performance and I
> > > cannot regress functionality. I could split the PR into smaller
> > > chunks if I could waive either of those constraints.
> > >
> > > Possible solutions:
> > >
> > > - Just go for it. This would need someone to volunteer a
> > > substantial chunk of time for review.
> > > - Small chunks via a feature branch. If someone could create a
> > > feature branch, I could break the PR into smaller commits which
> > > intentionally break performance and functionality for the sake of
> > > piecemeal simplicity.
> > > - Small chunks via a "new API". This would be similar to how the
> > > legacy ParquetDataset was converted to the new datasets API.
> > > A new Scanner (ScannerV2, AsyncScanner, etc.) would be created
> > > that would grow to have the same functionality as the old scanner,
> > > and then the higher-level APIs would switch over. If this is the
> > > route we want to go, I could potentially have this new API
> > > available via C++ in 4.0 for experimental use.
> > >
> > > 2) Testing
> > >
> > > There is very little inter-reader multi-threaded testing in the
> > > C++ code base. I am not just talking about reading with
> > > "use_threads" set to true or false, but specifically about issuing
> > > multiple reads at the same time on potentially overlapping data
> > > sources. If the C++ code base is going to support a query engine,
> > > then this sort of testing seems essential. Assuming Ursa is
> > > willing, I am proposing that I start work on this infrastructure
> > > during the 5.0 timeframe. I'll send out an additional email
> > > describing this work in more detail sometime after the 4.0
> > > release.
> > >
> > > 3) Maintenance
> > >
> > > src/arrow/util/async_generator.h has grown to nearly 1500 lines of
> > > code. The short description of these utilities is that they are a
> > > pull-based analogue to rxcpp [6]. The file contains the utilities
> > > necessary to work with iterators (pull-driven) in an asynchronous
> > > fashion. This code is fairly subtle and could be difficult to
> > > maintain, although I am happy to help do so for the foreseeable
> > > future. It seems, at the very least, that there should be some
> > > Arrow committer willing to take on learning and helping maintain
> > > the code.
> > >
> > > Other options:
> > >
> > > Although I have grown fond of the async generator code, I probably
> > > have some ugly-baby syndrome. There are some other potential
> > > options we could consider:
> > >
> > > - Just use rxcpp. It's an external library and would likely need
> > > some glue to mesh with arrow::util::Future/arrow::util::Result.
> > > It is also push-based instead of pull-based, and that may require
> > > some adjustment to the scanner.
> > > - Wait and use the Arrow execution engine. There has been
> > > discussion of a C++-based execution engine. From what I can glean
> > > of the design docs, this would involve an actor framework of sorts
> > > and would probably be compatible with dataset scanning.
> > >
> > > [0] https://github.com/apache/arrow/pull/9607
> > > [1] https://mail-archives.apache.org/mod_mbox/arrow-dev/202102.mbox/%3CCAE4AYb18dWZ2DLo-9N-m2bCrk4zDkhUe6-0G%3D%2ByifLt%2B6hZivg%40mail.gmail.com%3E
> > > [2] https://docs.google.com/document/d/1tO2WwYL-G2cB_MCPqYguKjKkRT7mZ8C2Gc9ONvspfgo/edit?usp=sharing
> > > [3] https://github.com/apache/arrow/pull/9656#discussion_r606289729
> > > [4] https://github.com/apache/arrow/pull/9620#issuecomment-814393897
> > > [5] https://issues.apache.org/jira/browse/ARROW-11889
> > > [6] https://github.com/ReactiveX/RxCpp