I would also lean in the direction of progress, to get user feedback sooner. If our test suite passes stably, then it is probably okay to merge; and if it's possible (without great hardship) to keep a fallback to the non-async version, so there's a workaround if show-stopping bugs turn up, that's even better.
On Wed, Apr 7, 2021 at 1:24 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> 1) Most of the committed changes have been off the main path. The only exception is the streaming CSV reader. Assuming ARROW-12208 is merged (it is close), a stable path would be to revert most of ARROW-12161 and change the legacy scanner to simply wrap each call to the streaming CSV reader with RunInSerialExecutor. This would allow us to keep the streaming CSV reader in, so it isn't zero risk, but that code has been fully reviewed and merged and has had some chance to run CI. This would also make the streaming CSV reader available for a V2 API.
>
> 2) Yes, barring any other feedback, I think I'm leaning in that direction.
>
> 3) If we keep the streaming CSV reader, then I think David's IPC & Parquet changes have been mostly mirror paths or inside the file formats (which we could convert to a mirror path). I think the only challenge would be getting everything split by Friday, but I'll give it a shot.
>
> On Wed, Apr 7, 2021 at 6:33 AM Neal Richardson <neal.p.richard...@gmail.com> wrote:
> >
> > Three thoughts:
> >
> > 1. Given that lots of prerequisite patches have already merged, and we've seen some instability as a result of those, I don't think it's obviously true that holding ARROW-7001 out of 4.0 is lower risk. It could be that the intermediate state we're in now is higher risk. What do you think?
> >
> > 2. A great way to manage the risk is like what David said: ship the new version as ScannerV2, and let client libraries be able to switch back to the old scanner if they encounter an issue. Bugs are bad, but if there's a fallback or workaround, they're less severe. And we're only going to find some of the issues by putting the code out there in a release and getting real usage on it.
> >
> > 3. But given (1), I wonder how easy it is at this point to ship a separate ScannerV2 that leaves the previous implementation working as it was. I guess the old code is there in the git history.
> >
> > Neal
> >
> > On Wed, Apr 7, 2021 at 7:30 AM Adam Lippai <a...@rigo.sk> wrote:
> > >
> > > Hi Weston,
> > >
> > > Objective note: I'm just a user, but I want to add that so far the Arrow releases have been of pretty good quality, which means you are making good calls.
> > >
> > > Personal opinion: There were several annoying bugs, where one would have to change a parameter between Parquet V1/V2 or threaded/non-threaded, but nothing exceptional yet. If you feel the work is ready and you are concerned about its unusual size, then I'd say go with the merge; my experience is that size is not worrying on its own, and there is no need for extra caution. If you feel it's under-tested or under-reviewed compared to the previous code, then that's a different topic; it should be as good as the current *average*. You can make it the default behavior: if the bugs are critical, everybody can stay on 3.x instead of 4.0 until 4.0.1 arrives, or use workarounds (e.g. disable threading). Version 4.0 is not a critical bugfix release one would need to upgrade to instantly, so you wouldn't be depriving anyone of the opportunity to lower the risks or resolve bugs in production.
> > >
> > > Final thought: While they are good enough, the releases in this field - like pandas, dask or airflow - can't be compared to how you deliver new major versions, so be proud and optimistic :)
> > >
> > > Best regards,
> > > Adam Lippai
> > >
> > > On Wed, Apr 7, 2021 at 3:53 PM David Li <lidav...@apache.org> wrote:
> > > >
> > > > Hey Weston,
> > > >
> > > > First, thanks for all your work in getting these changes this far. I think it's also been a valuable experience in working with async code, and hopefully the problems we've run into so far will help inform further work, including with the query engine.
> > > >
> > > > If you're not comfortable merging, then we shouldn't try to rush it through, regardless of reviewer availability. If we're going to split it up, I would prefer the 'ScannerV2' approach: while it'll clutter the API for a bit, it at least gives us a fallback if we observe threading issues 'in the wild'.
> > > >
> > > > As for rxcpp - I took a quick look, and while it doesn't seem abandoned, at least the maintainers are no longer very active. While the library looks mature, there are (for example) issues filed for compatibility with future C++ versions sitting unreviewed, and the blurb about it targeting only the latest C++ compilers might not work so well for us.
> > > >
> > > > I think it may be useful to explicitly align the async generator utilities with their rxcpp or ReactiveX equivalents so that there are some common concepts we can refer back to, especially if we further expand the utilities. While not many of us may be familiar with rxcpp already, at least we'd have a reference for how our utilities are supposed to work.
> > > >
> > > > Using the framework for query execution is an interesting point - doing so might feel like wasted work, but again, hopefully we could apply the lessons here towards the query framework. (For instance, integrating any debug/tracing utilities we may have wanted, as with detecting 'abandoned' futures in ARROW-12207.)
> > > >
> > > > -David
> > > >
> > > > On 2021/04/07 17:18:30, Weston Pace <weston.p...@gmail.com> wrote:
> > > > > I have been working for the last few months on ARROW-7001 [0], which enables nested parallelism by converting dataset scanning to asynchronous (previously announced here [1] and discussed here [2]). In addition to enabling nested parallelism, this also allows for parallel readahead, which gives significant benefits on parallel filesystems (e.g. S3).
> > > > >
> > > > > The good news: the PR is passing CI (it looks like there is one failure in Python that may or may not be related; the RTools 3.5 failure is expected at the moment). David Li has done some great work investigating the performance benefits. There are substantial benefits on S3 for both IPC [3] and Parquet [4] across a variety of parameters. With additional work [5], the CSV reader could also reap these benefits.
> > > > >
> > > > > The bad news: the change is significant. Previously, Micah expressed some concern about the viralness of Futures. Also, when I was requesting a pull review, Antoine expressed some concern about the substantial amount of tricky code we are going to have to maintain in src/arrow/util/async_generator.h. The change is also innately error-prone, as it deals with threading. Last week's trouble with the R build was a direct result of some of the precursor work for this feature.
> > > > >
> > > > > **For the above reasons I am sadly recommending that this feature not target 4.0 as I had previously hoped.**
> > > > >
> > > > > For discussion:
> > > > >
> > > > > 1) Merging
> > > > >
> > > > > The change is significant (+1400/-730 atm). I have done my best to break it into smaller PRs (most of these have since been merged). However, I have reached the limit of how much it can reasonably be split up. The main reason for this is that I am replacing an existing feature: I cannot regress performance and I cannot regress functionality. I could split the PR into smaller chunks if I could waive either of those constraints.
> > > > > Possible solutions:
> > > > >
> > > > > - Just go for it. This would need someone to volunteer a substantial chunk of time for review.
> > > > > - Small chunks via a branch. If someone could create a feature branch, I could break the PR into smaller commits which intentionally break performance and functionality for the sake of piecemeal simplicity.
> > > > > - Small chunks via a "new API". This would be similar to how the legacy ParquetDataset was converted to the new datasets API. A new Scanner (ScannerV2, AsyncScanner, etc.) would be created that would grow to have the same functionality as the old scanner, and then the higher level APIs would switch over. If this is the route we want to go, I could potentially have this new API available via C++ in 4.0 for experimental use.
> > > > >
> > > > > 2) Testing
> > > > >
> > > > > There is very little inter-reader multi-threaded testing in the C++ code base. I am not just talking about reading with "use_threads" set to true or false, but specifically about issuing multiple reads at the same time on potentially overlapping data sources. If the C++ code base is going to support a query engine, then this sort of testing seems essential. Assuming Ursa is willing, I am proposing that I start work on this infrastructure during the 5.0 timeframe. I'll send out an additional email describing this work in more detail sometime after the 4.0 release.
> > > > >
> > > > > 3) Maintenance
> > > > >
> > > > > src/arrow/util/async_generator.h has grown to nearly 1500 lines of code. The short description of these utilities is that they are a pull-based analogue to rxcpp [6]. The file contains the utilities necessary to work with iterators (pull-driven) in an asynchronous fashion. This code is fairly subtle and could be difficult to maintain, although I am happy to help do so for the foreseeable future. It seems, at the very least, that there should be some Arrow committer willing to take on learning and helping maintain the code.
> > > > >
> > > > > Other options:
> > > > >
> > > > > Although I have grown fond of the async generator code, I probably have some ugly-baby syndrome. There are some other potential options we could consider.
> > > > >
> > > > > - Just use rxcpp. It's an external library and would likely need some glue to mesh with arrow::util::Future/arrow::util::Result. It is also push-based instead of pull-based, and that may require some adjustment to the scanner.
> > > > > - Wait and use the Arrow execution engine. There has been discussion of a C++-based execution engine. From what I can glean of the design docs, this would involve an actor framework of sorts and would probably be compatible with dataset scanning.
> > > > >
> > > > > [0] https://github.com/apache/arrow/pull/9607
> > > > > [1] https://mail-archives.apache.org/mod_mbox/arrow-dev/202102.mbox/%3CCAE4AYb18dWZ2DLo-9N-m2bCrk4zDkhUe6-0G%3D%2ByifLt%2B6hZivg%40mail.gmail.com%3E
> > > > > [2] https://docs.google.com/document/d/1tO2WwYL-G2cB_MCPqYguKjKkRT7mZ8C2Gc9ONvspfgo/edit?usp=sharing
> > > > > [3] https://github.com/apache/arrow/pull/9656#discussion_r606289729
> > > > > [4] https://github.com/apache/arrow/pull/9620#issuecomment-814393897
> > > > > [5] https://issues.apache.org/jira/browse/ARROW-11889
> > > > > [6] https://github.com/ReactiveX/RxCpp