Thanks. I've set up an AWS account for my own testing for now. I've also submitted a PR to add a basic benchmark which can be run self-contained, against a local Minio instance, or against S3: https://github.com/apache/arrow/pull/6675
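To give an idea of its shape, a benchmark along these lines looks roughly
like the following (an illustrative sketch, not the PR's actual code; the
bucket, object, and credentials are made up):

    // Illustrative sketch of a Minio/S3 read benchmark (not the actual
    // code in the PR). Assumes a local Minio server on localhost:9000
    // with made-up credentials and a bucket "bench" holding a 100 MiB
    // object named "data_100M".
    #include <arrow/filesystem/s3fs.h>
    #include <arrow/io/interfaces.h>
    #include <arrow/util/logging.h>
    #include <benchmark/benchmark.h>

    static void BM_ReadAll100Mib(benchmark::State& state) {
      auto options = arrow::fs::S3Options::FromAccessKey("minio", "miniopass");
      options.endpoint_override = "localhost:9000";  // omit to target real S3
      options.scheme = "http";
      auto fs = arrow::fs::S3FileSystem::Make(options).ValueOrDie();
      for (auto _ : state) {
        auto file = fs->OpenInputFile("bench/data_100M").ValueOrDie();
        int64_t size = file->GetSize().ValueOrDie();
        // Read the whole object in a single request.
        auto buffer = file->Read(size).ValueOrDie();
        benchmark::DoNotOptimize(buffer);
      }
      state.SetBytesProcessed(state.iterations() * 100 * 1024 * 1024);
    }
    BENCHMARK(BM_ReadAll100Mib)->UseRealTime();

    int main(int argc, char** argv) {
      // The S3 subsystem must be initialized before any S3 filesystem use.
      arrow::fs::S3GlobalOptions global;
      global.log_level = arrow::fs::S3LogLevel::Fatal;
      ARROW_CHECK_OK(arrow::fs::InitializeS3(global));
      benchmark::Initialize(&argc, argv);
      benchmark::RunSpecifiedBenchmarks();
      return 0;
    }

Pointing the same fixture at real S3 is just a matter of dropping the
endpoint override and supplying real credentials.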
I ran the benchmark from my local machine, and I can test from EC2
sometime as well. Performance is not ideal, but I'm limited by my home
internet connection: coalescing small chunked reads is (as expected)
about as fast as reading the file in one go, and in the PR (testing
against localhost, where we're not limited by bandwidth), it's faster
than either option.

----------------------------------------------------------------------------------
Benchmark                                          Time           CPU  Iterations
----------------------------------------------------------------------------------
MinioFixture/ReadAll1Mib/real_time           223416933 ns     9098743 ns        413  4.47594MB/s  4.47594 items/s
MinioFixture/ReadAll100Mib/real_time        6068938152 ns   553319299 ns         10  16.4773MB/s  0.164773 items/s
MinioFixture/ReadAll500Mib/real_time       30735046155 ns  2620718364 ns          2  16.2681MB/s  0.0325361 items/s
MinioFixture/ReadChunked100Mib/real_time    9625661666 ns   448637141 ns         12  10.3889MB/s  0.103889 items/s
MinioFixture/ReadChunked500Mib/real_time   58736796101 ns  2070237834 ns          2  8.51255MB/s  0.0170251 items/s
MinioFixture/ReadCoalesced100Mib/real_time  6982902546 ns    22553824 ns         10  14.3207MB/s  0.143207 items/s
MinioFixture/ReadCoalesced500Mib/real_time 29923239648 ns   112736805 ns          3  16.7094MB/s  0.0334188 items/s
MinioFixture/ReadParquet250K/real_time     21934689795 ns  2052758161 ns          3  9.90955MB/s  0.0455899 items/s
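The ReadCoalesced numbers above come from merging nearby byte ranges
before issuing any requests, so that many small column-chunk reads turn
into a few larger GETs. The gist is something like the following (a
simplified sketch, not the actual implementation in the PR):

    // Simplified sketch of range coalescing (not the actual Arrow code).
    // Ranges separated by a small enough "hole" are merged: over-reading
    // a few bytes is cheaper than paying for another HTTP round trip.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct ReadRange {
      int64_t offset;
      int64_t length;
    };

    std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                          int64_t hole_size_limit) {
      std::sort(ranges.begin(), ranges.end(),
                [](const ReadRange& a, const ReadRange& b) {
                  return a.offset < b.offset;
                });
      std::vector<ReadRange> coalesced;
      for (const auto& range : ranges) {
        if (!coalesced.empty() &&
            range.offset - (coalesced.back().offset + coalesced.back().length)
                <= hole_size_limit) {
          // Extend the previous range across the hole (std::max handles
          // a range entirely contained in the previous one).
          coalesced.back().length =
              std::max(coalesced.back().length,
                       range.offset + range.length - coalesced.back().offset);
        } else {
          coalesced.push_back(range);
        }
      }
      return coalesced;
    }

A real implementation would also cap the size of a merged range; the flip
side of that, splitting overly large reads, is what ARROW-8152 (linked in
the thread below) is about.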
Best,
David

On 3/22/20, Wes McKinney <wesmck...@gmail.com> wrote:
> On Thu, Mar 19, 2020 at 10:04 AM David Li <li.david...@gmail.com> wrote:
>>
>> > That's why it's important that we set ourselves up to do performance
>> > testing in a realistic environment in AWS rather than simulating it.
>>
>> For my clarification, what are the plans for this (if any)? I couldn't
>> find any prior discussion, though it sounds like the discussion around
>> cloud CI capacity would be one step towards this.
>>
>> In the short term we could make tests/benchmarks configurable to not
>> point at a Minio instance so individual developers can at least try
>> things.
>
> It probably makes sense to begin investing in somewhat portable
> tooling to assist with running S3-related unit tests and benchmarks
> inside AWS. This could include initial Parquet dataset generation and
> other things.
>
> As far as testing, I'm happy to pay for some AWS costs (within
> reason). AWS might be able to donate some credits to us also.
>
>> Best,
>> David
>>
>> On 3/18/20, David Li <li.david...@gmail.com> wrote:
>> > For us it applies to S3-like systems, not only S3 itself, at least.
>> >
>> > It does make sense to limit it to some filesystems. The behavior
>> > would be opt-in at the Parquet reader level, so at the Datasets or
>> > Filesystem layer we can take care of enabling the flag for
>> > filesystems where it actually helps.
>> >
>> > I've filed these issues:
>> > - ARROW-8151 to benchmark S3File+Parquet
>> >   (https://issues.apache.org/jira/browse/ARROW-8151)
>> > - ARROW-8152 to split large reads
>> >   (https://issues.apache.org/jira/browse/ARROW-8152)
>> > - PARQUET-1820 to use a column filter hint with coalescing
>> >   (https://issues.apache.org/jira/browse/PARQUET-1820)
>> >
>> > in addition to PARQUET-1698, which is just about pre-buffering the
>> > entire row group (which we can now do with ARROW-7995).
>> >
>> > Best,
>> > David
>> >
>> > On 3/18/20, Antoine Pitrou <anto...@python.org> wrote:
>> >>
>> >> On 18/03/2020 at 18:30, David Li wrote:
>> >>>> Instead of S3, you can use the Slow streams and Slow filesystem
>> >>>> implementations. It may better protect against varying external
>> >>>> conditions.
>> >>>
>> >>> I think we'd want several different benchmarks - we want to ensure
>> >>> we don't regress local filesystem performance, and we also want to
>> >>> measure in an actual S3 environment. It would also be good to
>> >>> measure S3-compatible systems like Google's.
>> >>>
>> >>>>> - Use the coalescing inside the Parquet reader (even without a
>> >>>>>   column filter hint - this would subsume PARQUET-1698)
>> >>>>
>> >>>> I'm assuming this would be done at the RowGroupReader level, right?
>> >>>
>> >>> Ideally we'd be able to coalesce across row groups as well, though
>> >>> maybe it'd be easier to start with within-row-group-only (I need to
>> >>> familiarize myself with the reader more).
>> >>>
>> >>>> I don't understand what the "advantage" would be. Can you
>> >>>> elaborate?
>> >>>
>> >>> As Wes said, empirically you can get more bandwidth out of S3 with
>> >>> multiple concurrent HTTP requests. There is a cost to doing so
>> >>> (establishing a new connection takes time), hence why the coalescing
>> >>> tries to group small reads (to fully utilize one connection) and
>> >>> split large reads (to be able to take advantage of multiple
>> >>> connections).
>> >>
>> >> If that's S3-specific (or even AWS-specific) it might better be done
>> >> inside the S3 filesystem. For other filesystems I don't think it
>> >> makes sense to split reads.
>> >>
>> >> Regards
>> >>
>> >> Antoine.
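For reference, the "split large reads" idea tracked in ARROW-8152 amounts
to roughly the inverse of the coalescing shown earlier (again a simplified
sketch with hypothetical names, not the actual implementation): a large
range is cut into fixed-size slices so each slice can be fetched on its
own connection and the results reassembled in order.

    // Illustrative sketch of read splitting (hypothetical names, not
    // Arrow's actual code). Ranges larger than split_size are cut into
    // slices that can each be fetched as a separate concurrent request.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct ReadRange {
      int64_t offset;
      int64_t length;
    };

    std::vector<ReadRange> SplitLargeRanges(const std::vector<ReadRange>& ranges,
                                            int64_t split_size) {
      std::vector<ReadRange> pieces;
      for (const auto& range : ranges) {
        int64_t offset = range.offset;
        int64_t remaining = range.length;
        while (remaining > 0) {
          // Each piece becomes one HTTP request; issuing the pieces
          // concurrently is what recovers the extra S3 bandwidth
          // discussed above.
          int64_t n = std::min(remaining, split_size);
          pieces.push_back(ReadRange{offset, n});
          offset += n;
          remaining -= n;
        }
      }
      return pieces;
    }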