I couldn't get S3 Select to work with a limit (it's not designed for that).
I have used ClickHouse with S3 Select with Arrow support (for Snowflake),
and it works great.

On Sat, 13 Feb 2021, 5:24 pm Rémi Dettai, <rdet...@gmail.com> wrote:

> Thank you Daniel for taking the time to go through the slides!
>
> S3 Select is an interesting beast, but I think the benefit we could draw
> from it in this use case is pretty limited:
> - for now Buzz focuses on Parquet data, which already allows efficient
> projection (it uses HTTP Range requests to download only the relevant
> parts of the files), and once supported by DataFusion, we might even push
> down filters to skip downloading entire row groups.
> - S3 Select can only output CSV and JSON, so in the cases where you have to
> bring back a lot of data, it would actually amplify the volume of data
> fetched from S3 and make the deserialization more expensive.
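
To make the Range-request point above concrete, here is a minimal Python sketch of how a reader can locate the Parquet footer without downloading the whole object. It is not Buzz's code; the function names are hypothetical. It relies only on the Parquet layout for unencrypted files: a file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. No network call is made; the functions only compute the `Range` header values a client would send.

```python
import struct

PARQUET_MAGIC = b"PAR1"
TAIL_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


def tail_range_header(file_size: int) -> str:
    """Range header value covering the last 8 bytes of the object."""
    # A valid file has the magic at the start plus the 8-byte tail.
    if file_size < len(PARQUET_MAGIC) + TAIL_LEN:
        raise ValueError("object is too small to be a Parquet file")
    return f"bytes={file_size - TAIL_LEN}-{file_size - 1}"


def footer_range_header(file_size: int, tail: bytes) -> str:
    """Given the 8-byte tail, return the Range header for the footer metadata."""
    if len(tail) != TAIL_LEN or tail[4:] != PARQUET_MAGIC:
        raise ValueError("tail does not end with the Parquet magic bytes")
    (footer_len,) = struct.unpack("<I", tail[:4])
    start = file_size - TAIL_LEN - footer_len
    if footer_len == 0 or start < len(PARQUET_MAGIC):
        raise ValueError("footer length is inconsistent with the object size")
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={start}-{file_size - TAIL_LEN - 1}"
```

Once the footer metadata is decoded, the column chunk offsets it contains can be turned into further Range requests in the same way, so only the projected columns are fetched.
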
>
> There are still some situations where S3 Select would definitely be
> beneficial, but it would be quite hard to automatically identify those and
> let S3 Select kick in accordingly.
>
> Have you used S3 Select at scale? Does it provide good and consistent
> latencies?
>
> On Wed, 10 Feb 2021 at 19:35, Daniël Heres <danielhe...@gmail.com> wrote:
>
> > Thanks for sharing the slides Rémi! That looks really cool.
> >
> > One question I have after this: do you plan to use S3 Select (
> > https://aws.amazon.com/blogs/aws/s3-glacier-select/)? It seems it would
> > fit your architecture nicely, and I think it shouldn't be too hard to
> > create the query from the filters/projection in the datasource scan
> > method to spend less time in Lambda.
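
As a rough illustration of the idea above (deriving the S3 Select expression from the projection and filters a scan receives), here is a minimal Python sketch. The function and argument names are hypothetical and not part of DataFusion or Buzz. Only the `SELECT ... FROM S3Object s WHERE ...` shape comes from the S3 Select SQL dialect; the literal quoting shown is an assumption that should be checked against the AWS documentation.

```python
ALLOWED_OPS = {"=", "<>", "<", "<=", ">", ">="}


def quote_ident(name: str) -> str:
    """Double-quote a column name; names containing quotes are rejected."""
    if '"' in name:
        raise ValueError(f"unsupported column name: {name!r}")
    return f'"{name}"'


def quote_literal(value) -> str:
    """Render a Python value as a SQL literal (assumed quoting rules)."""
    if isinstance(value, bool):  # must be checked before int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    return "'" + str(value).replace("'", "''") + "'"


def build_select(projection, filters) -> str:
    """Build an S3 Select expression.

    projection: list of column names (empty means all columns)
    filters: list of (column, operator, value) tuples, combined with AND
    """
    cols = ", ".join(f"s.{quote_ident(c)}" for c in projection) or "*"
    sql = f"SELECT {cols} FROM S3Object s"
    predicates = []
    for column, op, value in filters:
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op!r}")
        predicates.append(f"s.{quote_ident(column)} {op} {quote_literal(value)}")
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)
    return sql
```

The resulting string would be passed as the expression of an S3 Select request; filters that cannot be expressed this way would still have to be evaluated after the data is fetched.
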
> >
> > On Wed, Feb 10, 2021, 18:44 Rémi Dettai <rdet...@gmail.com> wrote:
> >
> > > Thanks for the notes Andy. Here is the slide deck I presented, for
> > > further reference:
> > >
> > > https://docs.google.com/presentation/d/1uZ5PbazC1zCX24k0Hh-UItddIh9BRvD5GL7NUDgc9eQ/edit?usp=sharing
> > >
> > > If anyone wants to see how it works in practice and does not have an AWS
> > > account to try it out, feel free to reach out to me and I can walk you
> > > through it!
> > >
> > > On Wed, 10 Feb 2021 at 18:37, Andy Grove <andygrov...@gmail.com> wrote:
> > >
> > > > Attendees
> > > >
> > > >
> > > >    - Andy Grove
> > > >    - Benjamin Blodgett
> > > >    - Marc Prud’Hommeaux
> > > >    - Mike Seddon
> > > >    - Jorge Leitao
> > > >    - Andrew Lamb
> > > >    - Fernando Herrera
> > > >    - Neville Dipale
> > > >    - Remi Dettai
> > > >
> > > >
> > > > (Please let me know if I have misspelled anyone’s name)
> > > >
> > > > Topics Discussed
> > > >
> > > >
> > > >    - Discussion of Jorge’s proposal to redesign Arrow crate to resolve
> > > >      safety violations (following on from mailing list discussion)
> > > >    - Mike has a PR up to implement a large number of Postgres string
> > > >      functions that needs reviewing
> > > >    - Remi gave a short presentation about his Buzz project which provides
> > > >      serverless compute using Arrow and DataFusion
> > > >
> > > >
> > > > Planned for next time:
> > > >
> > > >
> > > >    - Marc Prud’Hommeaux to give a presentation/demo on his use of Arrow
> > > >    - Andy Grove to give a presentation/demo on Ballista, which provides
> > > >      distributed query execution using DataFusion
> > > >
> > > >
> > > > On Wed, Feb 10, 2021 at 8:56 AM Andy Grove <andygrov...@gmail.com> wrote:
> > > >
> > > > > A quick reminder that the bi-weekly Arrow Rust sync call starts about an
> > > > > hour from now. Everyone is welcome.
> > > > >
> > > > > https://meet.google.com/ctp-yujs-aee
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > > >
> > > >
> > >
> >
>
