Re: [C++] Dataset API simplification

2021-03-26 Thread Weston Pace
@david - I think all the readers (CSV, IPC, & Parquet) will eventually support some intra-file parallelism. I think you're also right that the maximum number of concurrent operations needs some global consideration. For example, if you read ahead 10 files and 10 blocks in each file but only have 16 c

Re: [C++] Dataset API simplification

2021-03-26 Thread Wes McKinney
I agree with making the decomposition of a fragment into tasks an internal detail of the scan implementation. It seems that we want to be moving toward a world of consuming a stream of Future<...> and not (necessarily) pushing the complexity of concurrency management onto the consumer. The nature of mu

Re: [C++] Dataset API simplification

2021-03-26 Thread David Li
I agree we should present a simplified interface, and then also make ScanTask internal, but I think that is orthogonal to whether a fragment produces one or multiple scan tasks. At first, my worry with having (Parquet)ScanTask handle concurrency itself was that it does need to coordinate with

[C++] Dataset API simplification

2021-03-25 Thread Weston Pace
This is a bit of a follow-up on https://issues.apache.org/jira/browse/ARROW-11782 and also a bit of a consequence of my work on https://issues.apache.org/jira/browse/ARROW-7001 (nested scan parallelism). I think the current dataset interface should be simplified. Currently, we have Dataset ->* Fra