@david - I think all the readers (CSV, IPC, and Parquet) will eventually
have support for some intra-file parallelism. I think you're also right
that there is a global consideration around the maximum number of
concurrent operations. For example, if you read ahead 10 files and 10 blocks in
each file but only have 16 c
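
To make the global-limit concern concrete, here is a minimal sketch (plain Python asyncio, not Arrow code) in which file-level and block-level readahead share one limiter, so 10 files x 10 blocks never exceed a fixed number of concurrent operations. The names `read_block`, `read_file`, `scan`, and the cap of 16 are hypothetical and chosen only to mirror the numbers in the example above.

```python
import asyncio

MAX_CONCURRENT = 16  # hypothetical global cap on in-flight operations


async def read_block(file_id, block_id, limiter, stats):
    # Every block read, from any file, competes for the same permits.
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0)  # stand-in for real I/O or decode work
        stats["active"] -= 1
    return (file_id, block_id)


async def read_file(file_id, n_blocks, limiter, stats):
    # Intra-file parallelism: all blocks of this file are requested at once.
    return await asyncio.gather(
        *(read_block(file_id, b, limiter, stats) for b in range(n_blocks))
    )


async def scan(n_files=10, n_blocks=10, max_concurrent=MAX_CONCURRENT):
    # One limiter shared by both levels of readahead.
    limiter = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak": 0}
    per_file = await asyncio.gather(
        *(read_file(f, n_blocks, limiter, stats) for f in range(n_files))
    )
    blocks = [block for file_blocks in per_file for block in file_blocks]
    return blocks, stats["peak"]


blocks, peak = asyncio.run(scan())
```

Without the shared limiter, the two readahead levels multiply (up to 100 operations in flight here); with it, the peak stays at or below the cap regardless of how the work is nested.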
I agree with making the decomposition of a fragment into tasks an
internal detail of the scan implementation. It seems that we want to
be moving toward a world of consuming a stream of
Future<...> and not pushing the complexity of
concurrency management (necessarily) onto the consumer. The nature of
mu
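
As a rough illustration of the "stream of futures" idea, the sketch below (plain Python with `concurrent.futures`, not the Arrow API) exposes a scan as a generator of futures, so the consumer only waits on results and never sees how a fragment is split into tasks. The names `decode_batch` and `scan_batches`, the file names, and the fixed number of batches per fragment are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_batch(fragment, index):
    # Stand-in for reading and decoding one record batch.
    return f"{fragment}:batch{index}"


def scan_batches(fragments, batches_per_fragment, executor):
    """Yield one future per batch; the fragment-to-task split stays internal."""
    for fragment in fragments:
        for index in range(batches_per_fragment):
            yield executor.submit(decode_batch, fragment, index)


with ThreadPoolExecutor(max_workers=4) as pool:
    # Draining the generator submits all the work up front.
    futures = list(scan_batches(["a.parquet", "b.parquet"], 2, pool))
    # The consumer only waits on results, in scan order.
    results = [future.result() for future in futures]
```

The consumer here does no scheduling of its own: the choice of executor and of how many tasks a fragment becomes is made entirely inside `scan_batches`.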
I agree we should present a simplified interface, and then also make ScanTask
internal, but I think that is orthogonal to whether a fragment produces one or
multiple scan tasks.
At first, my worry with having (Parquet)ScanTask handle concurrency itself was
that it does need to coordinate with
This is a bit of a follow-up on
https://issues.apache.org/jira/browse/ARROW-11782 and also a bit of a
consequence of my work on
https://issues.apache.org/jira/browse/ARROW-7001 (nested scan
parallelism).
I think the current dataset interface should be simplified.
Currently, we have Dataset ->* Fra