@david - I think all the readers (CSV, IPC, and Parquet) will eventually
have support for some intra-file parallelism. I think you're also right
that there is some global consideration of max concurrent
operations. For example, if you read ahead 10 files and 10 blocks in
each file but only have 16 c
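
As a rough illustration of the global cap on concurrent operations raised in the message above, here is a minimal sketch in plain Python asyncio. It is not Arrow's scanner code: `read_block`, `scan`, and the `max_concurrent=16` default are hypothetical names and values chosen only to mirror the numbers in the example (10 files, 10 blocks each), and the sleep is a stand-in for real I/O or decoding.

```python
import asyncio


async def read_block(file_id, block_id, limiter, stats):
    # Hypothetical stand-in for reading one block of one file.
    async with limiter:  # at most `max_concurrent` holders at a time
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0)  # placeholder for real I/O or decode work
        stats["active"] -= 1
    return (file_id, block_id)


async def scan(num_files=10, blocks_per_file=10, max_concurrent=16):
    # One shared semaphore acts as the global limit, regardless of how
    # much per-file or per-block readahead has been requested.
    limiter = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak": 0}
    tasks = [
        read_block(f, b, limiter, stats)
        for f in range(num_files)
        for b in range(blocks_per_file)
    ]
    results = await asyncio.gather(*tasks)
    return results, stats["peak"]


results, peak = asyncio.run(scan())
```

All 100 block reads are requested up front, but the shared semaphore keeps the number in flight at or below the cap, which is the kind of global consideration the message describes.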
I made a proposal a while ago that covers a form of RLE encoding [1]. I
haven't had time to work on it, since it is a substantial effort to
implement.
I wouldn't expect an intern to be able to complete the work necessary to
get this merged over the course of a normal 3-month internship.
[1] http
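
For readers unfamiliar with the term, the following is a generic sketch of run-length encoding in Python. It is an assumption-laden illustration of the basic idea only; it does not reproduce the layout in the proposal referenced as [1], and `rle_encode` and `rle_decode` are hypothetical helper names.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


values = ["a", "a", "a", "b", "c", "c"]
runs = rle_encode(values)
# Decoding the runs must give back the original values.
assert rle_decode(runs) == values
```

The round trip is the easy part; most of the implementation effort mentioned above would presumably lie in making existing kernels and readers understand such an encoding.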
OK, originally this was part of
https://issues.apache.org/jira/browse/ARROW-12006 and I was going to just add
some doc on flatc, but I will make this a new bug because it's a little bigger:
https://issues.apache.org/jira/browse/ARROW-12111
On 2021/03/23 23:40:50, Micah Kornfield wrote:
> >
>
I agree with making the decomposition of a fragment into tasks an
internal detail of the scan implementation. It seems that we want to
be moving toward a world of consuming a stream of
Future<...> and not pushing the complexity of
concurrency management (necessarily) onto the consumer. The nature of
mu
I agree we should present a simplified interface, and then also make ScanTask
internal, but I think that is orthogonal to whether a fragment produces one or
multiple scan tasks.
At first, my worry with having (Parquet)ScanTask handle concurrency itself was
that it does need to coordinate with
Arrow Build Report for Job nightly-2021-03-26-0
All tasks:
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-26-0
Failed Tasks:
- conda-linux-gcc-py37-aarch64:
URL:
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-26-0-drone-conda-linux