Re: [C++] (Eventually) merging asynchronous datasets feature

2021-04-08 Thread Weston Pace
Ok, so I have begun splitting ARROW-7001 into smaller tasks that eventually create an AsyncScanner. The plan... ARROW-12286 & ARROW-12287 are minor utilities that could have been split out anyway. ARROW-12288 creates a `Scanner` interface and cleans up the existing implementation somewhat. This

Re: [C++] (Eventually) merging asynchronous datasets feature

2021-04-07 Thread Wes McKinney
I would also lean in the direction of progress to get user feedback sooner — if our test suite passes stably then it is probably okay to merge, and if it's possible (without great hardship) to have a fallback to the non-async version (so there's a workaround if there end up being show-stopping bugs

Re: [C++] (Eventually) merging asynchronous datasets feature

2021-04-07 Thread Weston Pace
1) Most of the committed changes have been off the main path. The only exception is the streaming CSV reader. Assuming ARROW-12208 is merged (it is close) a stable path would be to revert most of ARROW-12161 and change the legacy scanner to simply wrap each call to the streaming CSV reader with R

Re: [C++] (Eventually) merging asynchronous datasets feature

2021-04-07 Thread Neal Richardson
Three thoughts: 1. Given that lots of prerequisite patches have already merged, and we've seen some instability as a result of those, I don't think it's obviously true that holding ARROW-7001 out of 4.0 is lower risk. It could be that the intermediate state we're in now is higher risk. What do you

Re: [C++] (Eventually) merging asynchronous datasets feature

2021-04-07 Thread Adam Lippai
Hi Weston, Objective note: I'm just a user, but I want to add that so far the Arrow releases have been of pretty good quality, which means you are making good calls. Personal opinion: There were several annoying bugs, where one would have to change a parameter between Parquet V1/V2, threaded / non-threaded

Re: [C++] (Eventually) merging asynchronous datasets feature

2021-04-07 Thread David Li
Hey Weston, First, thanks for all your work in getting these changes this far. I think it's also been a valuable experience in working with async code, and hopefully the problems we've run into so far will help inform further work, including with the query engine. If you're not comfortable mergi

[C++] (Eventually) merging asynchronous datasets feature

2021-04-07 Thread Weston Pace
I have been working for the last few months on ARROW-7001 [0], which enables nested parallelism by converting dataset scanning to asynchronous (previously announced here [1] and discussed here [2]). In addition to enabling nested parallelism, this also allows for parallel readahead which gives signifi