Thanks. I've set up an AWS account for my own testing for now. I've also submitted a PR to add a basic benchmark which can be run self-contained, against a local Minio instance, or against S3: https://github.com/apache/arrow/pull/6675
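To give an idea of its shape, a benchmark along these lines looks roughly
like the following (an illustrative sketch, not the PR's actual code; the
bucket, object, and credentials are made up):

    // Illustrative sketch of a Minio/S3 read benchmark (not the actual
    // code in the PR). Assumes a local Minio server on localhost:9000
    // with made-up credentials and a bucket "bench" holding a 100 MiB
    // object named "data_100M".
    #include <arrow/filesystem/s3fs.h>
    #include <arrow/io/interfaces.h>
    #include <arrow/util/logging.h>
    #include <benchmark/benchmark.h>

    static void BM_ReadAll100Mib(benchmark::State& state) {
      auto options = arrow::fs::S3Options::FromAccessKey("minio", "miniopass");
      options.endpoint_override = "localhost:9000";  // omit to target real S3
      options.scheme = "http";
      auto fs = arrow::fs::S3FileSystem::Make(options).ValueOrDie();
      for (auto _ : state) {
        auto file = fs->OpenInputFile("bench/data_100M").ValueOrDie();
        int64_t size = file->GetSize().ValueOrDie();
        // Read the whole object in a single request.
        auto buffer = file->Read(size).ValueOrDie();
        benchmark::DoNotOptimize(buffer);
      }
      state.SetBytesProcessed(state.iterations() * 100 * 1024 * 1024);
    }
    BENCHMARK(BM_ReadAll100Mib)->UseRealTime();

    int main(int argc, char** argv) {
      // The S3 subsystem must be initialized before any S3 filesystem use.
      arrow::fs::S3GlobalOptions global;
      global.log_level = arrow::fs::S3LogLevel::Fatal;
      ARROW_CHECK_OK(arrow::fs::InitializeS3(global));
      benchmark::Initialize(&argc, argv);
      benchmark::RunSpecifiedBenchmarks();
      return 0;
    }

Pointing the same fixture at real S3 is just a matter of dropping the
endpoint override and supplying real credentials.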
I ran the benchmark from my local machine, and I can test from EC2
sometime as well. Performance is not ideal, but I'm limited by my home
internet connection: coalescing small chunked reads is (as expected)
about as fast as reading the file in one go, and in the PR (testing
against localhost, where we're not limited by bandwidth), it's faster
than either option.

----------------------------------------------------------------------------------
Benchmark                                          Time           CPU  Iterations
----------------------------------------------------------------------------------
MinioFixture/ReadAll1Mib/real_time           223416933 ns     9098743 ns        413  4.47594MB/s  4.47594 items/s
MinioFixture/ReadAll100Mib/real_time        6068938152 ns   553319299 ns         10  16.4773MB/s  0.164773 items/s
MinioFixture/ReadAll500Mib/real_time       30735046155 ns  2620718364 ns          2  16.2681MB/s  0.0325361 items/s
MinioFixture/ReadChunked100Mib/real_time    9625661666 ns   448637141 ns         12  10.3889MB/s  0.103889 items/s
MinioFixture/ReadChunked500Mib/real_time   58736796101 ns  2070237834 ns          2  8.51255MB/s  0.0170251 items/s
MinioFixture/ReadCoalesced100Mib/real_time  6982902546 ns    22553824 ns         10  14.3207MB/s  0.143207 items/s
MinioFixture/ReadCoalesced500Mib/real_time 29923239648 ns   112736805 ns          3  16.7094MB/s  0.0334188 items/s
MinioFixture/ReadParquet250K/real_time     21934689795 ns  2052758161 ns          3  9.90955MB/s  0.0455899 items/s
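The ReadCoalesced numbers above come from merging nearby byte ranges
before issuing any requests, so that many small column-chunk reads turn
into a few larger GETs. The gist is something like the following (a
simplified sketch, not the actual implementation in the PR):

    // Simplified sketch of range coalescing (not the actual Arrow code).
    // Ranges separated by a small enough "hole" are merged: over-reading
    // a few bytes is cheaper than paying for another HTTP round trip.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct ReadRange {
      int64_t offset;
      int64_t length;
    };

    std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                          int64_t hole_size_limit) {
      std::sort(ranges.begin(), ranges.end(),
                [](const ReadRange& a, const ReadRange& b) {
                  return a.offset < b.offset;
                });
      std::vector<ReadRange> coalesced;
      for (const auto& range : ranges) {
        if (!coalesced.empty() &&
            range.offset - (coalesced.back().offset + coalesced.back().length)
                <= hole_size_limit) {
          // Extend the previous range across the hole (std::max handles
          // a range entirely contained in the previous one).
          coalesced.back().length =
              std::max(coalesced.back().length,
                       range.offset + range.length - coalesced.back().offset);
        } else {
          coalesced.push_back(range);
        }
      }
      return coalesced;
    }

A real implementation would also cap the size of a merged range; the flip
side of that, splitting overly large reads, is what ARROW-8152 (linked in
the thread below) is about.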
Best,
David

On 3/22/20, Wes McKinney <wesmck...@gmail.com> wrote:
> On Thu, Mar 19, 2020 at 10:04 AM David Li <li.david...@gmail.com> wrote:
>>
>> > That's why it's important that we set ourselves up to do performance
>> > testing in a realistic environment in AWS rather than simulating it.
>>
>> For my clarification, what are the plans for this (if any)? I couldn't
>> find any prior discussion, though it sounds like the discussion around
>> cloud CI capacity would be one step towards this.
>>
>> In the short term we could make tests/benchmarks configurable to not
>> point at a Minio instance so individual developers can at least try
>> things.
>
> It probably makes sense to begin investing in somewhat portable
> tooling to assist with running S3-related unit tests and benchmarks
> inside AWS. This could include initial Parquet dataset generation and
> other things.
>
> As far as testing, I'm happy to pay for some AWS costs (within
> reason). AWS might be able to donate some credits to us also.
>
>> Best,
>> David
>>
>> On 3/18/20, David Li <li.david...@gmail.com> wrote:
>> > For us it applies to S3-like systems, not only S3 itself, at least.
>> >
>> > It does make sense to limit it to some filesystems. The behavior
>> > would be opt-in at the Parquet reader level, so at the Datasets or
>> > Filesystem layer we can take care of enabling the flag for
>> > filesystems where it actually helps.
>> >
>> > I've filed these issues:
>> > - ARROW-8151 to benchmark S3File+Parquet
>> >   (https://issues.apache.org/jira/browse/ARROW-8151)
>> > - ARROW-8152 to split large reads
>> >   (https://issues.apache.org/jira/browse/ARROW-8152)
>> > - PARQUET-1820 to use a column filter hint with coalescing
>> >   (https://issues.apache.org/jira/browse/PARQUET-1820)
>> >
>> > in addition to PARQUET-1698, which is just about pre-buffering the
>> > entire row group (which we can now do with ARROW-7995).
>> >
>> > Best,
>> > David
>> >
>> > On 3/18/20, Antoine Pitrou <anto...@python.org> wrote:
>> >>
>> >> On 18/03/2020 at 18:30, David Li wrote:
>> >>>> Instead of S3, you can use the Slow streams and Slow filesystem
>> >>>> implementations. It may better protect against varying external
>> >>>> conditions.
>> >>>
>> >>> I think we'd want several different benchmarks - we want to ensure
>> >>> we don't regress local filesystem performance, and we also want to
>> >>> measure in an actual S3 environment. It would also be good to
>> >>> measure S3-compatible systems like Google's.
>> >>>
>> >>>>> - Use the coalescing inside the Parquet reader (even without a
>> >>>>>   column filter hint - this would subsume PARQUET-1698)
>> >>>>
>> >>>> I'm assuming this would be done at the RowGroupReader level, right?
>> >>>
>> >>> Ideally we'd be able to coalesce across row groups as well, though
>> >>> maybe it'd be easier to start with within-row-group-only (I need to
>> >>> familiarize myself with the reader more).
>> >>>
>> >>>> I don't understand what the "advantage" would be. Can you
>> >>>> elaborate?
>> >>>
>> >>> As Wes said, empirically you can get more bandwidth out of S3 with
>> >>> multiple concurrent HTTP requests. There is a cost to doing so
>> >>> (establishing a new connection takes time), hence why the coalescing
>> >>> tries to group small reads (to fully utilize one connection) and
>> >>> split large reads (to be able to take advantage of multiple
>> >>> connections).
>> >>
>> >> If that's S3-specific (or even AWS-specific) it might better be done
>> >> inside the S3 filesystem. For other filesystems I don't think it
>> >> makes sense to split reads.
>> >>
>> >> Regards
>> >>
>> >> Antoine.
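For reference, the "split large reads" idea tracked in ARROW-8152 amounts
to roughly the inverse of the coalescing shown earlier (again a simplified
sketch with hypothetical names, not the actual implementation): a large
range is cut into fixed-size slices so each slice can be fetched on its
own connection and the results reassembled in order.

    // Illustrative sketch of read splitting (hypothetical names, not
    // Arrow's actual code). Ranges larger than split_size are cut into
    // slices that can each be fetched as a separate concurrent request.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct ReadRange {
      int64_t offset;
      int64_t length;
    };

    std::vector<ReadRange> SplitLargeRanges(const std::vector<ReadRange>& ranges,
                                            int64_t split_size) {
      std::vector<ReadRange> pieces;
      for (const auto& range : ranges) {
        int64_t offset = range.offset;
        int64_t remaining = range.length;
        while (remaining > 0) {
          // Each piece becomes one HTTP request; issuing the pieces
          // concurrently is what recovers the extra S3 bandwidth
          // discussed above.
          int64_t n = std::min(remaining, split_size);
          pieces.push_back(ReadRange{offset, n});
          offset += n;
          remaining -= n;
        }
      }
      return pieces;
    }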