On Thu, Mar 19, 2020 at 10:04 AM David Li <li.david...@gmail.com> wrote:
>
> > That's why it's important that we set ourselves up to do performance 
> > testing in a realistic environment in AWS rather than simulating it.
>
> For my own clarification, what are the plans for this (if any)? I couldn't
> find any prior discussion, though it sounds like the discussion around
> cloud CI capacity would be one step towards this.
>
> In the short term we could make the tests/benchmarks configurable so
> they don't have to point at a Minio instance, letting individual
> developers at least try things against real S3.

It probably makes sense to begin investing in somewhat portable
tooling to assist with running S3-related unit tests and benchmarks
inside AWS. This could include initial Parquet dataset generation and
other things.
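
As a concrete (and untested) sketch of the dataset-generation piece,
assuming the new pyarrow.fs S3 bindings; the bucket name, credentials,
Minio endpoint, and sizes below are all placeholders:

    # Generate a Parquet file with many column chunks and upload it to
    # S3 (or to a local Minio server standing in for S3).
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyarrow.fs import S3FileSystem

    # Point at Minio locally; drop endpoint_override to hit real AWS S3.
    fs = S3FileSystem(access_key="minio", secret_key="minio123",
                      endpoint_override="localhost:9000", scheme="http")

    # A wide table gives the reader many column chunks to coalesce.
    table = pa.table({"f%d" % i: np.random.randn(100_000)
                      for i in range(100)})
    pq.write_table(table, "arrow-bench/dataset.parquet",
                   filesystem=fs, row_group_size=10_000)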

As far as testing goes, I'm happy to pay for some AWS costs (within
reason). AWS might also be able to donate some credits to us.

> Best,
> David
>
> On 3/18/20, David Li <li.david...@gmail.com> wrote:
> > For us, at least, it applies to S3-like systems, not only S3 itself.
> >
> > It does make sense to limit it to some filesystems. The behavior would
> > be opt-in at the Parquet reader level, so at the Datasets or
> > Filesystem layer we can take care of enabling the flag for filesystems
> > where it actually helps.
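> >
> > To make that concrete, here is a purely hypothetical sketch from the
> > Python side (the coalesce_reads kwarg is invented; none of this
> > exists yet):
> >
> >     import pyarrow.parquet as pq
> >     from pyarrow.fs import S3FileSystem
> >
> >     def open_parquet(path, filesystem):
> >         # Hypothetical opt-in: enable coalescing only on filesystems
> >         # where each small read is expensive (high per-read latency).
> >         coalesce = isinstance(filesystem, S3FileSystem)
> >         return pq.ParquetFile(filesystem.open_input_file(path),
> >                               coalesce_reads=coalesce)  # invented kwarg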
> >
> > I've filed these issues:
> > - ARROW-8151 to benchmark S3File+Parquet
> > (https://issues.apache.org/jira/browse/ARROW-8151)
> > - ARROW-8152 to split large reads
> > (https://issues.apache.org/jira/browse/ARROW-8152)
> > - PARQUET-1820 to use a column filter hint with coalescing
> > (https://issues.apache.org/jira/browse/PARQUET-1820)
> >
> > in addition to PARQUET-1698, which is just about pre-buffering the
> > entire row group (something we can now do with ARROW-7995).
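> >
> > (For illustration only, a toy of what "pre-buffering the entire row
> > group" means, not the actual C++ implementation: one large up-front
> > read, after which column-chunk reads are served from memory.)
> >
> >     class PreBufferedSource:
> >         """Toy: buffer a whole row group with a single large read."""
> >         def __init__(self, f, rg_offset, rg_length):
> >             f.seek(rg_offset)
> >             self.offset = rg_offset
> >             self.buf = f.read(rg_length)  # one big read
> >
> >         def read_at(self, offset, length):
> >             # Serve later column-chunk reads out of the buffer.
> >             start = offset - self.offset
> >             return self.buf[start:start + length]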
> >
> > Best,
> > David
> >
> > On 3/18/20, Antoine Pitrou <anto...@python.org> wrote:
> >>
> >> On 18/03/2020 at 18:30, David Li wrote:
> >>>> Instead of S3, you can use the Slow streams and Slow filesystem
> >>>> implementations.  It may better protect against varying external
> >>>> conditions.
> >>>
> >>> I think we'd want several different benchmarks: we want to ensure we
> >>> don't regress local filesystem performance, and we also want to
> >>> measure performance in an actual S3 environment. It would also be
> >>> good to measure S3-compatible systems like Google Cloud Storage.
> >>>
> >>>>> - Use the coalescing inside the Parquet reader (even without a column
> >>>>> filter hint - this would subsume PARQUET-1698)
> >>>>
> >>>> I'm assuming this would be done at the RowGroupReader level, right?
> >>>
> >>> Ideally we'd be able to coalesce across row groups as well, though
> >>> maybe it'd be easier to start with within-row-group coalescing only
> >>> (I need to familiarize myself with the reader more).
> >>>
> >>>> I don't understand what the "advantage" would be.  Can you elaborate?
> >>>
> >>> As Wes said, empirically you can get more bandwidth out of S3 with
> >>> multiple concurrent HTTP requests. There is a cost to doing so
> >>> (establishing a new connection takes time), which is why the
> >>> coalescing tries to group small reads (to fully utilize one
> >>> connection) and split large reads (to take advantage of multiple
> >>> connections).
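> >>>
> >>> A toy version of that planning logic, just to illustrate (the
> >>> 8 KiB / 8 MiB cutoffs are invented; real values would be tuned
> >>> against the ARROW-8151 benchmarks):
> >>>
> >>>     def plan_reads(ranges, hole_limit=8 << 10, split_limit=8 << 20):
> >>>         """ranges: list of (offset, length) byte ranges to read."""
> >>>         # Group: merge ranges separated by small holes so one
> >>>         # connection stays busy instead of paying per tiny request.
> >>>         merged = []
> >>>         for off, length in sorted(ranges):
> >>>             if merged and off - sum(merged[-1]) <= hole_limit:
> >>>                 start = merged[-1][0]
> >>>                 end = max(sum(merged[-1]), off + length)
> >>>                 merged[-1] = (start, end - start)
> >>>             else:
> >>>                 merged.append((off, length))
> >>>         # Split: break large ranges into chunks that can be fetched
> >>>         # over several concurrent connections.
> >>>         out = []
> >>>         for off, length in merged:
> >>>             while length > split_limit:
> >>>                 out.append((off, split_limit))
> >>>                 off, length = off + split_limit, length - split_limit
> >>>             out.append((off, length))
> >>>         return out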
> >>
> >> If that's S3-specific (or even AWS-specific), it might be better done
> >> inside the S3 filesystem.  For other filesystems I don't think it
> >> makes sense to split reads.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >
