Three thoughts:

1. Given that lots of prerequisite patches have already merged, and we've
seen some instability as a result of those, I don't think it's obviously
true that holding ARROW-7001 out of 4.0 is lower risk. It could be that the
intermediate state we're in now is higher risk. What do you think?

2. A great way to manage the risk is what David said: ship the new version
as ScannerV2, and let client libraries switch back to the old scanner if
they encounter an issue. Bugs are bad, but if there's a fallback or
workaround, they're less severe. And we're only going to find some of the
issues by putting the code out there in a release and getting real usage
on it.

3. But given (1), I wonder how easy it is at this point to ship a separate
ScannerV2 that leaves the previous implementation working as it was. I
guess the old code is there in the git history.
Neal

On Wed, Apr 7, 2021 at 7:30 AM Adam Lippai <a...@rigo.sk> wrote:
> Hi Weston,
>
> Objective note:
> I'm just a user, but I want to add that so far the Arrow releases are
> pretty good quality, which means you are making good calls.
>
> Personal opinion:
> There were several annoying bugs, where one would have to change a
> parameter between Parquet V1/V2 or threaded/non-threaded, but nothing
> exceptional yet.
> If you feel the work is ready and you are concerned about its unusual
> size, then I'd say go with the merge; my experience is that size is
> not worrying on its own, and there is no need for extra caution.
> If you feel it's under-tested or under-reviewed compared to the
> previous code, then that's a different topic; it should be as good as
> the current *average*.
> You can make it the default behavior; if the bugs are critical,
> everybody can stay on 3.x instead of 4.0 until 4.0.1 arrives, or use
> workarounds (e.g. disable threading).
> Version 4.0 is not a critical bugfix release one would need to upgrade
> to instantly.
> You wouldn't deprive them of the opportunity to lower the risks or
> resolve bugs in production.
>
> Final thought:
> While they are good enough, the releases in this field - like pandas,
> dask, or airflow - can't be compared to how you deliver new major
> versions, so be proud and optimistic :)
>
> Best regards,
> Adam Lippai
>
> On Wed, Apr 7, 2021 at 3:53 PM David Li <lidav...@apache.org> wrote:
>
> > Hey Weston,
> >
> > First, thanks for all your work in getting these changes this far. I
> > think it's also been a valuable experience in working with async
> > code, and hopefully the problems we've run into so far will help
> > inform further work, including with the query engine.
> >
> > If you're not comfortable merging, then we shouldn't try to rush it
> > through, regardless of reviewer availability.
> > If we're going to split it up, I would prefer the 'ScannerV2'
> > approach: while it'll clutter the API for a bit, it at least gives
> > us a fallback if we observe threading issues 'in the wild'.
> >
> > As for rxcpp - I took a quick look, and while it doesn't seem
> > abandoned, the maintainers are at least no longer very active. While
> > the library looks mature, there are (for example) issues filed for
> > compatibility with future C++ versions sitting unreviewed, and the
> > blurb about it targeting only the latest C++ compilers might not
> > work so well for us.
> >
> > I think it may be useful to explicitly align the async generator
> > utilities with their rxcpp or ReactiveX equivalents so that there
> > are some common concepts we can refer back to, especially if we
> > further expand the utilities. While not many of us may be familiar
> > with rxcpp already, at least we'd have a reference for how our
> > utilities are supposed to work.
> >
> > Using the framework for query execution is an interesting point -
> > doing so might feel like wasted work, but again, hopefully we could
> > apply the lessons here towards the query framework. (For instance,
> > integrating any debug/tracing utilities we may have wanted, as with
> > detecting 'abandoned' futures in ARROW-12207.)
> >
> > -David
> >
> > On 2021/04/07 17:18:30, Weston Pace <weston.p...@gmail.com> wrote:
> > > I have been working for the last few months on ARROW-7001 [0],
> > > which enables nested parallelism by converting the dataset
> > > scanning to be asynchronous (previously announced here [1] and
> > > discussed here [2]). In addition to enabling nested parallelism,
> > > this also allows for parallel readahead, which gives significant
> > > benefits on parallel filesystems (e.g. S3).
> > >
> > > The good news: the PR is passing CI (it looks like there is one
> > > failure in Python that may or may not be related; the RTools 3.5
> > > failure is expected at the moment).
> > > David Li has done some great work investigating the performance
> > > benefits. There are substantial benefits on S3 for both IPC [3]
> > > and Parquet [4] across a variety of parameters. With additional
> > > work [5], the CSV reader could also reap these benefits.
> > >
> > > The bad news: the change is significant. Previously, Micah
> > > expressed some concern about the viral nature of Futures. Also,
> > > when I was requesting a pull review, Antoine expressed some
> > > concern about the substantial amount of tricky code we are going
> > > to have to maintain in src/arrow/util/async_generator.h. The
> > > change is also innately error-prone, as it deals with threading.
> > > Last week's trouble with the R build was a direct result of some
> > > of the precursor work for this feature.
> > >
> > > **For the above reasons I am sadly recommending that this feature
> > > not target 4.0 as I had previously hoped.**
> > >
> > > For discussion:
> > >
> > > 1) Merging
> > >
> > > The change is significant (+1400/-730 at the moment). I have done
> > > my best to break it into smaller PRs (most of these have since
> > > been merged). However, I have reached the limit of how much it can
> > > reasonably be split up. The main reason for this is that I am
> > > replacing an existing feature: I cannot regress performance and I
> > > cannot regress functionality. I could split the PR into smaller
> > > chunks if I could waive either of those constraints.
> > >
> > > Possible solutions:
> > >
> > > - Just go for it. This would need someone to volunteer a
> > > substantial chunk of time for review.
> > > - Small chunks via a feature branch. If someone could create a
> > > feature branch, I could break the PR into smaller commits which
> > > intentionally break performance and functionality for the sake of
> > > piecemeal simplicity.
> > > - Small chunks via a "new API". This would be similar to how the
> > > legacy ParquetDataset was converted to the new datasets API.
> > > A new Scanner (ScannerV2, AsyncScanner, etc.) would be created
> > > that would grow to have the same functionality as the old scanner,
> > > and then the higher-level APIs would switch over. If this is the
> > > route we want to go, I could potentially have this new API
> > > available via C++ in 4.0 for experimental use.
> > >
> > > 2) Testing
> > >
> > > There is very little inter-reader multi-threaded testing in the
> > > C++ code base. I am not just talking about reading with
> > > "use_threads" set to true or false, but specifically about issuing
> > > multiple reads at the same time on potentially overlapping data
> > > sources. If the C++ code base is going to support a query engine,
> > > then this sort of testing seems essential. Assuming Ursa is
> > > willing, I am proposing that I start work on this infrastructure
> > > during the 5.0 timeframe. I'll send out an additional email
> > > describing this work in more detail sometime after the 4.0
> > > release.
> > >
> > > 3) Maintenance
> > >
> > > src/arrow/util/async_generator.h has grown to nearly 1500 lines of
> > > code. The short description of these utilities is that they are a
> > > pull-based analogue to rxcpp [6]. The file contains the utilities
> > > necessary to work with iterators (pull-driven) in an asynchronous
> > > fashion. This code is fairly subtle and could be difficult to
> > > maintain, although I am happy to help do so for the foreseeable
> > > future. It seems, at the very least, that there should be some
> > > Arrow committer willing to take on learning and helping maintain
> > > the code.
> > >
> > > Other options:
> > >
> > > Although I have grown fond of the async generator code, I probably
> > > have some ugly-baby syndrome. There are some other potential
> > > options we could consider:
> > >
> > > - Just use rxcpp. It's an external library and would likely need
> > > some glue to mesh with arrow::util::Future/arrow::util::Result.
> > > It is also push-based instead of pull-based, and that may require
> > > some adjustment to the scanner.
> > > - Wait and use the Arrow execution engine. There has been
> > > discussion of a C++-based execution engine. From what I can glean
> > > of the design docs, this would involve an actor framework of sorts
> > > and would probably be compatible with dataset scanning.
> > >
> > > [0] https://github.com/apache/arrow/pull/9607
> > > [1] https://mail-archives.apache.org/mod_mbox/arrow-dev/202102.mbox/%3CCAE4AYb18dWZ2DLo-9N-m2bCrk4zDkhUe6-0G%3D%2ByifLt%2B6hZivg%40mail.gmail.com%3E
> > > [2] https://docs.google.com/document/d/1tO2WwYL-G2cB_MCPqYguKjKkRT7mZ8C2Gc9ONvspfgo/edit?usp=sharing
> > > [3] https://github.com/apache/arrow/pull/9656#discussion_r606289729
> > > [4] https://github.com/apache/arrow/pull/9620#issuecomment-814393897
> > > [5] https://issues.apache.org/jira/browse/ARROW-11889
> > > [6] https://github.com/ReactiveX/RxCpp