Re: Query of Arrow Flight SQL with S3 as a storage for parquet files

Weston Pace Wed, 16 Oct 2024 09:15:25 -0700

> Do you folks believe Duckdb and Datafusion (latter being similar to spark
> sql) will be an overkill?


No, I don't believe it would be overkill.

I also wouldn't compare either one to Spark SQL.  Spark SQL is meant to be
a distributed query engine that typically requires a cluster of some sort
to operate at full performance.  A distributed query engine would probably
be overkill for your situation.

Both DuckDB and DataFusion are meant to be lightweight, embeddable,
single-node (i.e. not distributed) query engine libraries.  These are
probably a good fit for your use case.
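
As a rough illustration of what "embeddable" means in practice, here is a
minimal sketch in which the engine runs inside the server process and the
client's SQL text is handed to it unchanged, so no custom parser is needed.
It uses Python's built-in sqlite3 module purely as a stand-in, since it needs
no installation (SQLite is one of the engines sqlflite can forward to);
DuckDB or DataFusion would take its place for Parquet files in S3. The
`run_query` helper and the `events` table are hypothetical names made up for
this example.

```python
import sqlite3


def run_query(conn, sql):
    """Forward the client's SQL text to the embedded engine as-is.

    Hypothetical helper: the engine parses, plans and executes the query,
    so the service does not need its own SQL parser.
    """
    cursor = conn.execute(sql)
    columns = [desc[0] for desc in cursor.description]
    return columns, cursor.fetchall()


# The engine lives in-process: there is no server or cluster to start.
conn = sqlite3.connect(":memory:")

# Stand-in data; in the real service this would be the Parquet data in S3.
conn.execute("CREATE TABLE events (team TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 10), ("b", 20), ("a", 30)],
)

columns, rows = run_query(
    conn,
    "SELECT team, SUM(bytes) AS total FROM events GROUP BY team ORDER BY team",
)
print(columns, rows)
```

In a Flight SQL server, the SQL string would presumably come from the
client's request and the rows would go back as Arrow record batches rather
than Python tuples.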

-Weston

On Wed, Oct 16, 2024 at 8:17 AM Susmit Sarkar <[email protected]>
wrote:

> Thanks David and Felipe for your help, I will definitely try out and keep
> you folks updated.
>
> Do you folks believe Duckdb and Datafusion (latter being similar to spark
> sql) will be an overkill?
>
> Thanks,
> Susmit
>
> On Wed, Oct 16, 2024 at 8:25 PM Felipe Oliveira Carvalho <
> [email protected]> wrote:
>
> > Hi Susmit,
> >
> > For an example of what David Li is proposing, you can take a look at this
> > project (https://github.com/voltrondata/sqlflite). It's a Flight SQL server
> > (in C++ though) that can forward queries to either SQLite or DuckDB.
> >
> > --
> > Felipe
> >
> > On Wed, Oct 16, 2024 at 10:22 AM David Li <[email protected]> wrote:
> >
> > > If your clients are sending full SQL queries to be executed, and you need
> > > to execute them against S3 on the server, why not consider something like
> > > Apache DataFusion or DuckDB to implement that part instead of building the
> > > query parser/engine yourself? (There are probably already examples of
> > > wrapping both these projects in Flight SQL floating around.)
> > >
> > > On Wed, Oct 16, 2024, at 21:38, Susmit Sarkar wrote:
> > > > Hi Community Members
> > > >
> > > >
> > > > We are planning to build an Arrow Flight server on top of data stored
> > > > in S3.
> > > >
> > > >
> > > > *Detailed Use Case:*
> > > >
> > > >
> > > > The requirement is that we need to sync data from HDFS to short-term
> > > > storage, S3 in our case. Basically a DataSync service between cloud
> > > > storages.
> > > >
> > > >
> > > > I have already built the service using Apache Pekko / Akka HDFS & S3
> > > > connectors, and data is in sync with HDFS & S3.
> > > >
> > > >
> > > > Now comes the data reading part for end users. The data is stored in
> > > > Cloudian S3 (Cloudian-managed S3, not AWS) short-term storage as
> > > > Parquet. We want to build a Data as a Service on top of the data in S3
> > > > and expose API endpoints for clients to query. The data is short-lived:
> > > > it may be kept for weeks or months (max 3 months), and use cases vary
> > > > from team to team.
> > > >
> > > >
> > > > So we felt an Arrow Flight SQL server would be best suited for our use
> > > > case, and the client should send a FlightDescriptor object wrapping the
> > > > SQL query.
> > > >
> > > >
> > > > We parse the query, query S3 using the AWS S3 SDKs, and return the
> > > > data, but the issue is that we will end up building our own query
> > > > parser, which is a bigger task.
> > > >
> > > > Is there any other approach we can try out?
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Susmit
> > >
> >
>
