Re: [C++] Dataset API simplification

2021-03-26 Thread Weston Pace
@david - I think all the readers (CSV, IPC, & Parquet) will eventually support some intra-file parallelism. You're also right, I think, that there is some global consideration of max concurrent operations. For example, if you readahead 10 files and 10 blocks in each file but only have 16 c

Re: sparse data array

2021-03-26 Thread Micah Kornfield
I made a proposal a while ago that covers a form of RLE encoding [1]. I haven't had time to work on it, since it is a substantial effort to implement. I wouldn't expect an intern to be able to complete the work necessary to get this merged over the course of a normal 3-month internship. [1] http

Re: [Java] Source control of generated flatbuffers code

2021-03-26 Thread bobtins
OK, originally this was part of https://issues.apache.org/jira/browse/ARROW-12006 and I was going to just add some doc on flatc, but I will make this a new bug because it's a little bigger: https://issues.apache.org/jira/browse/ARROW-12111
On 2021/03/23 23:40:50, Micah Kornfield wrote:
> > >

Re: [C++] Dataset API simplification

2021-03-26 Thread Wes McKinney
I agree with making the decomposition of a fragment into tasks an internal detail of the scan implementation. It seems that we want to be moving toward a world of consuming a stream of Future> and not pushing the complexity of concurrency management (necessarily) onto the consumer. The nature of mu

Re: [C++] Dataset API simplification

2021-03-26 Thread David Li
I agree we should present a simplified interface, and then also make ScanTask internal, but I think that is orthogonal to whether a fragment produces one or multiple scan tasks. At first, my worry with having (Parquet)ScanTask handle concurrency itself was that it does need to coordinate with

[NIGHTLY] Arrow Build Report for Job nightly-2021-03-26-0

2021-03-26 Thread Crossbow
Arrow Build Report for Job nightly-2021-03-26-0
All tasks: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-26-0
Failed Tasks:
- conda-linux-gcc-py37-aarch64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-26-0-drone-conda-linux