Re: Query of Arrow Flight SQL with S3 as a storage for parquet files

Susmit Sarkar Wed, 16 Oct 2024 10:13:54 -0700

Thank you, I will keep you posted in this same thread.

Regards,
Susmit


On Wed, Oct 16, 2024 at 9:45 PM Weston Pace <[email protected]> wrote:

> > Do you folks believe DuckDB and DataFusion (the latter being similar to
> > Spark SQL) will be overkill?
>
> No, I don't believe it would be overkill.
>
> I also wouldn't compare either one to Spark SQL.  Spark SQL is meant to be
> a distributed query engine that typically requires a cluster of some sort
> to operate at full performance.  A distributed query engine would probably
> be overkill for your situation.
>
> Both DuckDB and DataFusion are meant to be lightweight, embeddable,
> single-node (i.e. not distributed) query engine libraries.  These are
> probably a good fit for your use case.
>
> -Weston
>
> On Wed, Oct 16, 2024 at 8:17 AM Susmit Sarkar <[email protected]>
> wrote:
>
> > Thanks David and Felipe for your help. I will definitely try this out and
> > keep you folks updated.
> >
> > Do you folks believe DuckDB and DataFusion (the latter being similar to
> > Spark SQL) will be overkill?
> >
> > Thanks,
> > Susmit
> >
> > On Wed, Oct 16, 2024 at 8:25 PM Felipe Oliveira Carvalho <
> > [email protected]> wrote:
> >
> > > Hi Susmit,
> > >
> > > For an example of what David Li is proposing, you can take a look at this
> > > project (https://github.com/voltrondata/sqlflite). It's a Flight SQL server
> > > (in C++ though) that can forward queries to either SQLite or DuckDB.
> > >
> > > --
> > > Felipe
> > >
> > > On Wed, Oct 16, 2024 at 10:22 AM David Li <[email protected]> wrote:
> > >
> > > > If your clients are sending full SQL queries to be executed, and you need
> > > > to execute them against S3 on the server, why not consider something like
> > > > Apache DataFusion or DuckDB to implement that part instead of building the
> > > > query parser/engine yourself? (There are probably already examples of
> > > > wrapping both these projects in Flight SQL floating around.)
> > > >
> > > > On Wed, Oct 16, 2024, at 21:38, Susmit Sarkar wrote:
> > > > > Hi Community Members
> > > > >
> > > > >
> > > > > We are planning to build an Arrow Flight server on top of data lying in
> > > > > S3.
> > > > >
> > > > >
> > > > > *Detailed Use Case:*
> > > > >
> > > > >
> > > > > The requirement is that we need to sync data from HDFS to short-term
> > > > > storage, which is S3 in our case. Basically, a DataSync service between
> > > > > cloud storage systems.
> > > > >
> > > > >
> > > > > I have already built the service using Apache Pekko / Akka HDFS & S3
> > > > > connectors, and the data is in sync between HDFS & S3.
> > > > >
> > > > >
> > > > > Now comes the data reading part for end users. The data is stored in
> > > > > Cloudian S3 (Cloudian-managed S3, not AWS) short-term storage as Parquet.
> > > > > We want to build a Data as a Service on top of the data lying in S3 and
> > > > > expose API endpoints for clients to query. The data will be short-lived:
> > > > > it may be kept for weeks or months (max 3 months), and use cases vary
> > > > > from team to team.
> > > > >
> > > > >
> > > > > So we felt an Apache Arrow Flight SQL server would be best suited for our
> > > > > use case, and the client should send a FlightDescriptor object wrapping
> > > > > the SQL query.
> > > > >
> > > > >
> > > > > We parsed the query, queried S3 using the AWS S3 SDKs, and returned the
> > > > > data, but the issue is that we will end up building our own query parser,
> > > > > which is a bigger task.
> > > > >
> > > > > Is there any other approach we can try out?
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Susmit
> > > >
> > >
> >
>
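
For readers following the thread, here is a minimal sketch of the approach
suggested above: hand the client's SQL string to an embedded engine (DuckDB
here) instead of writing a parser. It is an illustration under assumptions,
not code from any of the participants. The helper names (sql_quote,
s3_settings, run_query), the endpoint, bucket and credentials are all
hypothetical, and path-style addressing is assumed for the S3-compatible
store. It needs the third-party duckdb and pyarrow packages, and the httpfs
extension must be installable on the server.

```python
from typing import List


def sql_quote(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def s3_settings(endpoint: str, access_key: str, secret_key: str,
                use_ssl: bool = True) -> List[str]:
    """Build DuckDB SET statements for an S3-compatible (non-AWS) endpoint.

    `endpoint` is host[:port] without a scheme. Path-style URLs are an
    assumption here; check what your object store expects.
    """
    return [
        f"SET s3_endpoint={sql_quote(endpoint)}",
        f"SET s3_access_key_id={sql_quote(access_key)}",
        f"SET s3_secret_access_key={sql_quote(secret_key)}",
        "SET s3_url_style='path'",
        f"SET s3_use_ssl={'true' if use_ssl else 'false'}",
    ]


def run_query(sql: str, settings: List[str]):
    """Run a client-supplied SQL string in DuckDB and return a pyarrow.Table."""
    import duckdb  # third-party; imported lazily so the helpers above work without it

    con = duckdb.connect()        # in-memory database
    con.execute("INSTALL httpfs")  # extension that provides s3:// access
    con.execute("LOAD httpfs")
    for stmt in settings:
        con.execute(stmt)
    # DuckDB parses, plans and executes the query; no hand-written parser.
    return con.execute(sql).fetch_arrow_table()


# Hypothetical usage (endpoint, bucket and keys are placeholders):
# table = run_query(
#     "SELECT count(*) FROM read_parquet('s3://my-bucket/events/*.parquet')",
#     s3_settings("s3.example.internal:443", "ACCESS_KEY", "SECRET_KEY"),
# )
```

A Flight SQL server would call something like run_query from its statement
handler and stream the resulting Arrow table back to the client. Since the SQL
comes from clients, access control and query limits would still need to be
handled separately.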
