I would also lean in the direction of progress, to get user feedback sooner. If our test suite passes stably, then it is probably okay to merge; and if it's possible (without great hardship) to keep a fallback to the non-async version, so there's a workaround if show-stopping bugs turn up, that's even better.
On Wed, Apr 7, 2021 at 1:24 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> 1) Most of the committed changes have been off the main path. The only exception is the streaming CSV reader. Assuming ARROW-12208 is merged (it is close), a stable path would be to revert most of ARROW-12161 and change the legacy scanner to simply wrap each call to the streaming CSV reader with RunInSerialExecutor. This would allow us to keep the streaming CSV reader in, so it isn't zero risk, but that code has been fully reviewed and merged and has had some chance to run CI. This would also make the streaming CSV reader available for a V2 API.
>
> 2) Yes, barring any other feedback, I think I'm leaning in that direction.
>
> 3) If we keep the streaming CSV reader, then I think David's IPC & Parquet changes have been mostly mirror paths or inside the file formats (which we could convert to a mirror path). I think the only challenge would be getting everything split by Friday, but I'll give it a shot.
>
> On Wed, Apr 7, 2021 at 6:33 AM Neal Richardson <neal.p.richard...@gmail.com> wrote:
> >
> > Three thoughts:
> >
> > 1. Given that lots of prerequisite patches have already merged, and we've seen some instability as a result of those, I don't think it's obviously true that holding ARROW-7001 out of 4.0 is lower risk. It could be that the intermediate state we're in now is higher risk. What do you think?
> >
> > 2. A great way to manage the risk is like what David said: ship the new version as ScannerV2, and let client libraries be able to switch back to the old scanner if they encounter an issue. Bugs are bad, but if there's a fallback or workaround, they're less severe. And we're only going to find some of the issues by putting the code out there in a release and getting real usage on it.
> >
> > 3. But given (1), I wonder how easy it is at this point to ship a separate ScannerV2 that leaves the previous implementation working as it was. I guess the old code is there in the git history.
> >
> > Neal
> >
> > On Wed, Apr 7, 2021 at 7:30 AM Adam Lippai <a...@rigo.sk> wrote:
> > >
> > > Hi Weston,
> > >
> > > Objective note: I'm just a user, but I want to add that so far the Arrow releases have been of pretty good quality, which means you are making good calls.
> > >
> > > Personal opinion: There were several annoying bugs, where one would have to change a parameter between Parquet V1/V2 or threaded/non-threaded, but nothing exceptional yet. If you feel the work is ready and you are concerned about its unusual size, then I'd say go with the merge; my experience is that size is not worrying on its own, and there is no need for extra caution. If you feel it's under-tested or under-reviewed compared to the previous code, then that's a different topic; it should be as good as the current *average*. You can make it the default behavior: if the bugs are critical, everybody can stay on 3.x instead of 4.0 until 4.0.1 arrives, or use workarounds (e.g. disable threading). Version 4.0 is not a critical bugfix release one would need to upgrade to instantly, so you wouldn't be depriving anyone of the opportunity to lower the risks or resolve bugs in production.
> > >
> > > Final thought: While they are good enough, the releases in this field - like pandas, dask or airflow - can't be compared to how you deliver new major versions, so be proud and optimistic :)
> > >
> > > Best regards,
> > > Adam Lippai
> > >
> > > On Wed, Apr 7, 2021 at 3:53 PM David Li <lidav...@apache.org> wrote:
> > > >
> > > > Hey Weston,
> > > >
> > > > First, thanks for all your work in getting these changes this far. I think it's also been a valuable experience in working with async code, and hopefully the problems we've run into so far will help inform further work, including with the query engine.
> > > >
> > > > If you're not comfortable merging, then we shouldn't try to rush it through, regardless of reviewer availability. If we're going to split it up, I would prefer the 'ScannerV2' approach: while it'll clutter the API for a bit, it at least gives us a fallback if we observe threading issues 'in the wild'.
> > > >
> > > > As for rxcpp - I took a quick look, and while it doesn't seem abandoned, at least the maintainers are no longer very active. While the library looks mature, there are (for example) issues filed for compatibility with future C++ versions sitting unreviewed, and the blurb about it targeting only the latest C++ compilers might not work so well for us.
> > > >
> > > > I think it may be useful to explicitly align the async generator utilities with their rxcpp or ReactiveX equivalents so that there are some common concepts we can refer back to, especially if we further expand the utilities. While not many of us may be familiar with rxcpp already, at least we'd have a reference for how our utilities are supposed to work.
> > > >
> > > > Using the framework for query execution is an interesting point - doing so might feel like wasted work, but again, hopefully we could apply the lessons here towards the query framework. (For instance, integrating any debug/tracing utilities we may have wanted, as with detecting 'abandoned' futures in ARROW-12207.)
> > > >
> > > > -David
> > > >
> > > > On 2021/04/07 17:18:30, Weston Pace <weston.p...@gmail.com> wrote:
> > > > > I have been working for the last few months on ARROW-7001 [0], which enables nested parallelism by converting dataset scanning to asynchronous (previously announced here [1] and discussed here [2]). In addition to enabling nested parallelism, this also allows for parallel readahead, which gives significant benefits on parallel filesystems (e.g. S3).
> > > > >
> > > > > The good news: the PR is passing CI (it looks like there is one failure in Python that may or may not be related; the RTools 3.5 failure is expected at the moment). David Li has done some great work investigating the performance benefits. There are substantial benefits on S3 for both IPC [3] and Parquet [4] across a variety of parameters. With additional work [5], the CSV reader could also reap these benefits.
> > > > >
> > > > > The bad news: the change is significant. Previously, Micah expressed some concern about the viralness of Futures. Also, when I was requesting a pull review, Antoine expressed some concern about the substantial amount of tricky code we are going to have to maintain in src/arrow/util/async_generator.h. The change is also innately error-prone, as it deals with threading. Last week's trouble with the R build was a direct result of some of the precursor work for this feature.
> > > > >
> > > > > **For the above reasons I am sadly recommending that this feature not target 4.0 as I had previously hoped.**
> > > > >
> > > > > For discussion:
> > > > >
> > > > > 1) Merging
> > > > >
> > > > > The change is significant (+1400/-730 atm). I have done my best to break it into smaller PRs (most of these have since been merged). However, I have reached the limit of how much it can reasonably be split up. The main reason for this is that I am replacing an existing feature: I cannot regress performance and I cannot regress functionality. I could split the PR into smaller chunks if I could waive either of those constraints.
> > > > > Possible solutions:
> > > > >
> > > > > - Just go for it. This would need someone to volunteer a substantial chunk of time for review.
> > > > > - Small chunks via a branch. If someone could create a feature branch, I could break the PR into smaller commits which intentionally break performance and functionality for the sake of piecemeal simplicity.
> > > > > - Small chunks via a "new API". This would be similar to how the legacy ParquetDataset was converted to the new datasets API. A new Scanner (ScannerV2, AsyncScanner, etc.) would be created that would grow to have the same functionality as the old scanner, and then the higher level APIs would switch over. If this is the route we want to go, I could potentially have this new API available via C++ in 4.0 for experimental use.
> > > > >
> > > > > 2) Testing
> > > > >
> > > > > There is very little inter-reader multi-threaded testing in the C++ code base. I am not just talking about reading with "use_threads" set to true or false, but specifically about issuing multiple reads at the same time on potentially overlapping data sources. If the C++ code base is going to support a query engine, then this sort of testing seems essential. Assuming Ursa is willing, I am proposing that I start work on this infrastructure during the 5.0 timeframe. I'll send out an additional email describing this work in more detail sometime after the 4.0 release.
> > > > >
> > > > > 3) Maintenance
> > > > >
> > > > > src/arrow/util/async_generator.h has grown to nearly 1500 lines of code. The short description of these utilities is that they are a pull-based analogue to rxcpp [6]. The file contains the utilities necessary to work with iterators (pull-driven) in an asynchronous fashion. This code is fairly subtle and could be difficult to maintain, although I am happy to help do so for the foreseeable future. It seems, at the very least, that there should be some Arrow committer willing to take on learning and helping maintain the code.
> > > > >
> > > > > Other options:
> > > > >
> > > > > Although I have grown fond of the async generator code, I probably have some ugly-baby syndrome. There are some other potential options we could consider.
> > > > >
> > > > > - Just use rxcpp. It's an external library and would likely need some glue to mesh with arrow::util::Future/arrow::util::Result. It is also push-based instead of pull-based, and that may require some adjustment to the scanner.
> > > > > - Wait and use the Arrow execution engine. There has been discussion of a C++-based execution engine. From what I can glean of the design docs, this would involve an actor framework of sorts and would probably be compatible with dataset scanning.
> > > > >
> > > > > [0] https://github.com/apache/arrow/pull/9607
> > > > > [1] https://mail-archives.apache.org/mod_mbox/arrow-dev/202102.mbox/%3CCAE4AYb18dWZ2DLo-9N-m2bCrk4zDkhUe6-0G%3D%2ByifLt%2B6hZivg%40mail.gmail.com%3E
> > > > > [2] https://docs.google.com/document/d/1tO2WwYL-G2cB_MCPqYguKjKkRT7mZ8C2Gc9ONvspfgo/edit?usp=sharing
> > > > > [3] https://github.com/apache/arrow/pull/9656#discussion_r606289729
> > > > > [4] https://github.com/apache/arrow/pull/9620#issuecomment-814393897
> > > > > [5] https://issues.apache.org/jira/browse/ARROW-11889
> > > > > [6] https://github.com/ReactiveX/RxCpp