OK, so I have begun splitting ARROW-7001 into smaller tasks that will eventually create an AsyncScanner. The plan:
ARROW-12286 & ARROW-12287 are minor utilities that could have been split out anyway.

ARROW-12288 creates a `Scanner` interface and cleans up the existing implementation somewhat. This PR (along with ARROW-11797) is probably the most important to get community feedback on, as they describe changes to the existing implementation. I'd also really like to ensure these get included in 4.0 so we can start encouraging migration from `Scanner::Scan` to `Scanner::ScanBatches`.

ARROW-12289 will create an initial `AsyncScanner` that will have poor performance (it will be completely serial) but can use the existing formats as is.

ARROW-TBD: once ARROW-12289 starts to solidify we can add support for the different formats one format at a time (possibly reusing / adapting ARROW-11772 and ARROW-11843 for this).

I've got a pretty solid rough draft of ARROW-12288 ready (https://github.com/apache/arrow/pull/9947), so I'd appreciate any feedback that can be provided. I still need to work on the Python/R side a bit (Python just needs the use_async bool, but R has a bit more) and will tackle that tomorrow.

-Weston

On Wed, Apr 7, 2021 at 8:55 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> I would also lean in the direction of progress to get user feedback sooner — if our test suite passes stably then it is probably okay to merge, and if it's possible (without great hardship) to have a fallback to the non-async version (so there's a workaround if there end up being show-stopping bugs) then that's even better.
>
> On Wed, Apr 7, 2021 at 1:24 PM Weston Pace <weston.p...@gmail.com> wrote:
> >
> > 1) Most of the committed changes have been off the main path. The only exception is the streaming CSV reader. Assuming ARROW-12208 is merged (it is close), a stable path would be to revert most of ARROW-12161 and change the legacy scanner to simply wrap each call to the streaming CSV reader with RunInSerialExecutor.
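[Editor's note] The RunInSerialExecutor wrapping described in point (1) amounts to driving an async reader to completion on the calling thread, so that legacy callers still see a plain synchronous iterator. Below is a minimal sketch of that idea, written in Python for brevity and with invented names (`stream_batches`, `run_in_serial_executor`); it is an analogy to the C++ mechanism, not Arrow's actual API:

```python
import asyncio
from typing import AsyncIterator, Iterator, TypeVar

T = TypeVar("T")

async def stream_batches() -> AsyncIterator[int]:
    # Stand-in for an async streaming reader (e.g. the streaming CSV
    # reader); each "batch" is just an int here.
    for i in range(3):
        await asyncio.sleep(0)  # pretend to wait on I/O
        yield i

def run_in_serial_executor(gen: AsyncIterator[T]) -> Iterator[T]:
    # Drive the async generator on the calling thread, one item at a
    # time, so legacy callers still see an ordinary synchronous iterator.
    loop = asyncio.new_event_loop()
    try:
        while True:
            try:
                yield loop.run_until_complete(gen.__anext__())
            except StopAsyncIteration:
                break
    finally:
        loop.close()

batches = list(run_in_serial_executor(stream_batches()))
```

The trade-off is that the calling thread blocks on each batch, which is exactly the legacy behavior being preserved.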
> > This would allow us to keep the streaming CSV reader, so it isn't zero risk, but that code has been fully reviewed and merged and has had some chance to run in CI. This would also make the streaming CSV reader available for a V2 API.
> >
> > 2) Yes, barring any other feedback, I think I'm leaning in that direction.
> >
> > 3) If we keep the streaming CSV reader then I think David's IPC & Parquet changes have been mostly mirror paths or inside the file formats (which we could convert to a mirror path). I think the only challenge would be getting everything split up by Friday, but I'll give it a shot.
> >
> > On Wed, Apr 7, 2021 at 6:33 AM Neal Richardson <neal.p.richard...@gmail.com> wrote:
> > >
> > > Three thoughts:
> > >
> > > 1. Given that lots of prerequisite patches have already merged, and we've seen some instability as a result of those, I don't think it's obviously true that holding ARROW-7001 out of 4.0 is lower risk. It could be that the intermediate state we're in now is higher risk. What do you think?
> > > 2. A great way to manage the risk is like what David said: ship the new version as ScannerV2, and let client libraries switch back to the old scanner if they encounter an issue. Bugs are bad, but if there's a fallback or workaround, they're less severe. And we're only going to find some of the issues by putting the code out there in a release and getting real usage on it.
> > > 3. But given (1), I wonder how easy it is at this point to ship a separate ScannerV2 that leaves the previous implementation working as it was. I guess the old code is there in the git history.
> > > Neal
> > >
> > > On Wed, Apr 7, 2021 at 7:30 AM Adam Lippai <a...@rigo.sk> wrote:
> > > >
> > > > Hi Weston,
> > > >
> > > > Objective note: I'm just a user, but I want to add that so far the Arrow releases are pretty good quality, which means you are making good calls.
> > > >
> > > > Personal opinion: There were several annoying bugs, where one would have to change a parameter between parquet V1/V2 or threaded / non-threaded, but nothing exceptional yet. If you feel the work is ready and you are concerned about its unusual size, then I'd say go with the merge; my experience is that size is not worrying on its own, and there is no need for extra caution. If you feel it's under-tested or under-reviewed compared to the previous code, then that's a different topic; it should be as good as the current *average*. You can make it the default behavior: if the bugs are critical, everybody can stay on 3.x instead of 4.0 until 4.0.1 arrives, or use workarounds (e.g. disable threading). Version 4.0 is not a critical bugfix release one would need to upgrade to instantly, so you wouldn't be taking away anyone's opportunity to lower the risks or resolve bugs in production.
> > > >
> > > > Final thought: While they are good enough, the releases in this field - like pandas, dask or airflow - can't be compared to how you deliver new major versions, so be proud and optimistic :)
> > > >
> > > > Best regards,
> > > > Adam Lippai
> > > >
> > > > On Wed, Apr 7, 2021 at 3:53 PM David Li <lidav...@apache.org> wrote:
> > > > >
> > > > > Hey Weston,
> > > > >
> > > > > First, thanks for all your work in getting these changes this far.
> > > > > I think it's also been a valuable experience in working with async code, and hopefully the problems we've run into so far will help inform further work, including on the query engine.
> > > > >
> > > > > If you're not comfortable merging, then we shouldn't try to rush it through, regardless of reviewer availability. If we're going to split it up, I would prefer the 'ScannerV2' approach: while it'll clutter the API for a bit, it at least gives us a fallback if we observe threading issues 'in the wild'.
> > > > >
> > > > > As for rxcpp - I took a quick look, and while it doesn't seem abandoned, the maintainers are at least no longer very active. While the library looks mature, there are (for example) issues filed for compatibility with future C++ versions sitting unreviewed, and the blurb about it targeting only the latest C++ compilers might not work so well for us.
> > > > >
> > > > > I think it may be useful to explicitly align the async generator utilities with their rxcpp or ReactiveX equivalents so that there are some common concepts we can refer back to, especially if we further expand the utilities. While not many of us may be familiar with rxcpp already, at least we'd have a reference for how our utilities are supposed to work.
> > > > >
> > > > > Using the framework for query execution is an interesting point - doing so might feel like wasted work, but again, hopefully we can apply the lessons here to the query framework. (For instance, integrating any debug/tracing utilities we may have wanted, such as detecting 'abandoned' futures in ARROW-12207.)
> > > > > -David
> > > > >
> > > > > On 2021/04/07 17:18:30, Weston Pace <weston.p...@gmail.com> wrote:
> > > > > >
> > > > > > I have been working for the last few months on ARROW-7001 [0], which enables nested parallelism by converting the dataset scanning to asynchronous (previously announced here [1] and discussed here [2]). In addition to enabling nested parallelism this also allows for parallel readahead, which gives significant benefits on parallel filesystems (e.g. S3).
> > > > > >
> > > > > > The good news: the PR is passing CI (it looks like there is one failure in Python that may or may not be related; the RTools 3.5 failure is expected at the moment). David Li has done some great work investigating the performance benefits. There are substantial benefits on S3 for both IPC [3] and Parquet [4] across a variety of parameters. With additional work [5] the CSV reader could also reap these benefits.
> > > > > >
> > > > > > The bad news: the change is significant. Micah has previously expressed some concern about the viralness of Futures. Also, when I requested a pull review, Antoine expressed some concern about the substantial amount of tricky code we are going to have to maintain in src/arrow/util/async_generator.h. The change is also innately error-prone, as it deals with threading. Last week's trouble with the R build was a direct result of some of the precursor work for this feature.
> > > > > >
> > > > > > **For the above reasons I am sadly recommending that this feature not target 4.0 as I had previously hoped.**
> > > > > >
> > > > > > For discussion:
> > > > > >
> > > > > > 1) Merging
> > > > > >
> > > > > > The change is significant (+1400/-730 atm).
> > > > > > I have done my best to break it into smaller PRs (most of these have since been merged). However, I have reached the limit of how much it can reasonably be split up. The main reason for this is that I am replacing an existing feature: I cannot regress performance and I cannot regress functionality. I could split the PR into smaller chunks if I could waive either of those constraints.
> > > > > >
> > > > > > Possible solutions:
> > > > > >
> > > > > > - Just go for it. This would need someone to volunteer a substantial chunk of time for review.
> > > > > > - Small chunks via a feature branch. If someone could create a feature branch, I could break the PR into smaller commits which intentionally break performance and functionality for the sake of piecemeal simplicity.
> > > > > > - Small chunks via a "new API". This would be similar to how the legacy ParquetDataset was converted to the new datasets API. A new Scanner (ScannerV2, AsyncScanner, etc.) would be created that would grow to have the same functionality as the old scanner, and then the higher-level APIs would switch over. If this is the route we want to go, I could potentially have this new API available via C++ in 4.0 for experimental use.
> > > > > >
> > > > > > 2) Testing
> > > > > >
> > > > > > There is very little inter-reader multi-threaded testing in the C++ code base. I am not just talking about reading with "use_threads" set to true or false, but specifically about issuing multiple reads at the same time on potentially overlapping data sources. If the C++ code base is going to support a query engine then this sort of testing seems essential.
> > > > > > Assuming Ursa is willing, I am proposing to start work on this infrastructure during the 5.0 timeframe. I'll send out an additional email describing this work in more detail sometime after the 4.0 release.
> > > > > >
> > > > > > 3) Maintenance
> > > > > >
> > > > > > src/arrow/util/async_generator.h has grown to nearly 1500 lines of code. The short description of these utilities is that they are a pull-based analogue to rxcpp [6]. The file contains the utilities necessary to work with iterators (pull-driven) in an asynchronous fashion. This code is fairly subtle and could be difficult to maintain, although I am happy to help do so for the foreseeable future. It seems, at the very least, that there should be some Arrow committer willing to take on learning and helping maintain the code.
> > > > > >
> > > > > > Other options:
> > > > > >
> > > > > > Although I have grown fond of the async generator code, I probably have some ugly-baby syndrome. There are some other potential options we could consider.
> > > > > >
> > > > > > - Just use rxcpp. It's an external library and would likely need some glue to mesh with arrow::util::Future/arrow::util::Result. It is also push-based instead of pull-based, and that may require some adjustment to the scanner.
> > > > > > - Wait and use the Arrow execution engine. There has been discussion of a C++-based execution engine. From what I can glean of the design docs, this would involve an actor framework of sorts and would probably be compatible with dataset scanning.
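[Editor's note] For anyone trying to picture what a "pull-based analogue to rxcpp" looks like, here is a deliberately tiny sketch of the core idea: a generator is a callable that, each time it is polled, returns an awaitable for the next item, and combinators wrap one generator in another. It is written in Python for brevity and uses invented names (`from_list`, `map_gen`, `collect`); the real utilities in src/arrow/util/async_generator.h work with arrow::util::Future and are considerably more involved.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# A pull-style async generator: a callable that, each time it is
# polled, returns an awaitable for the next item; None marks
# end-of-stream.
AsyncGen = Callable[[], Awaitable[Optional[int]]]

def from_list(items) -> AsyncGen:
    # Source generator backed by an in-memory sequence.
    it = iter(items)
    async def gen():
        return next(it, None)
    return gen

def map_gen(source: AsyncGen, fn) -> AsyncGen:
    # Combinator: pull one item from the source and transform it.
    async def gen():
        item = await source()
        return None if item is None else fn(item)
    return gen

async def collect(source: AsyncGen):
    # Drain a generator into a list by pulling until exhausted.
    out = []
    while (item := await source()) is not None:
        out.append(item)
    return out

doubled = asyncio.run(collect(map_gen(from_list([1, 2, 3]), lambda x: x * 2)))
```

The key contrast with push-based rxcpp is visible here: nothing happens until the consumer polls, so backpressure is implicit rather than something the producer must be told about.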
> > > > > > [0] https://github.com/apache/arrow/pull/9607
> > > > > > [1] https://mail-archives.apache.org/mod_mbox/arrow-dev/202102.mbox/%3CCAE4AYb18dWZ2DLo-9N-m2bCrk4zDkhUe6-0G%3D%2ByifLt%2B6hZivg%40mail.gmail.com%3E
> > > > > > [2] https://docs.google.com/document/d/1tO2WwYL-G2cB_MCPqYguKjKkRT7mZ8C2Gc9ONvspfgo/edit?usp=sharing
> > > > > > [3] https://github.com/apache/arrow/pull/9656#discussion_r606289729
> > > > > > [4] https://github.com/apache/arrow/pull/9620#issuecomment-814393897
> > > > > > [5] https://issues.apache.org/jira/browse/ARROW-11889
> > > > > > [6] https://github.com/ReactiveX/RxCpp
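[Editor's note] As a closing illustration of the parallel readahead mentioned at the top of the thread: the scanner keeps several reads in flight ahead of the consumer so that I/O latency overlaps with processing. This is a toy Python sketch under invented names (`read_block`, `readahead`); the real implementation is C++ and sits inside the async scanner.

```python
import asyncio
from collections import deque

async def read_block(i: int) -> int:
    # Stand-in for one I/O read (e.g. fetching a byte range from S3).
    await asyncio.sleep(0.01)
    return i

async def readahead(n_blocks: int, depth: int):
    # Keep up to `depth` reads in flight: the consumer awaits the
    # oldest read while newer reads are already running, so I/O
    # latency overlaps instead of accumulating serially.
    in_flight, out = deque(), []
    for i in range(n_blocks):
        in_flight.append(asyncio.ensure_future(read_block(i)))
        if len(in_flight) >= depth:
            out.append(await in_flight.popleft())
    while in_flight:
        out.append(await in_flight.popleft())
    return out

blocks = asyncio.run(readahead(8, depth=4))
```

The readahead depth trades memory (buffered blocks) for throughput; on high-latency filesystems a deeper queue hides more of the latency, which matches the S3 results cited in [3] and [4].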