Re: Query of Arrow Flight SQL with S3 as a storage for parquet files

Susmit Sarkar Sun, 17 Nov 2024 23:16:46 -0800

Hi Community Members,

We went ahead with DuckDB. I have a very basic question: with this setup, can
we use Arrow ADBC to interact with the Flight SQL server, which internally
is a wrapper on top of DuckDB that queries the data from S3 and streams it
back to the client?
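
For context, the server-side part of this setup (DuckDB reading Parquet from an
S3-compatible store and handing back Arrow record batches) could look roughly
like the sketch below. It is only an illustration, not our actual server code:
the function names and the path-style URL setting are assumptions, and the
endpoint value is expected to be a host (and optional port) without a scheme.

```python
from typing import List


def s3_setup_statements(endpoint: str, access_key: str, secret_key: str) -> List[str]:
    """Build DuckDB statements that point the httpfs extension at an S3-compatible store."""

    def quote(value: str) -> str:
        # Escape single quotes so the value is a valid SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    return [
        "INSTALL httpfs",
        "LOAD httpfs",
        f"SET s3_endpoint = {quote(endpoint)}",
        f"SET s3_access_key_id = {quote(access_key)}",
        f"SET s3_secret_access_key = {quote(secret_key)}",
        # Assumption: a non-AWS store such as Cloudian is addressed path-style.
        "SET s3_url_style = 'path'",
    ]


def stream_parquet(query: str, endpoint: str, access_key: str, secret_key: str,
                   batch_rows: int = 65536):
    """Run `query` in an in-memory DuckDB and return a pyarrow RecordBatchReader."""
    import duckdb  # imported lazily so the helper above works without DuckDB installed

    con = duckdb.connect()
    for statement in s3_setup_statements(endpoint, access_key, secret_key):
        con.execute(statement)
    # The query itself is expected to reference the data, e.g. via
    # read_parquet('s3://some-bucket/some-prefix/*.parquet') (hypothetical path).
    return con.execute(query).fetch_record_batch(batch_rows)
```

The reader returned by stream_parquet can then be fed into the Flight stream
that the server sends back to the client.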


For every client request, the credentials (the access and secret keys)
are passed as part of the DoGet ticket information. Is it possible to
pass the same with ADBC to the Flight server?

*data_stream: FlightStreamReader = client.do_get(ticket)*
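
One option I am considering, as a rough sketch rather than a confirmed answer:
the ADBC Flight SQL driver documents an option prefix
(adbc.flight.sql.rpc.call_header.) that attaches custom headers to outgoing
calls, so the keys could travel as headers instead of inside the ticket. The
header names x-s3-access-key and x-s3-secret-key below are hypothetical, and
the server would need to be changed to read them (for example in a middleware).

```python
from typing import Dict

# Option prefix documented for the ADBC Flight SQL driver: every option named
# "<prefix><header-name>" is sent as a call header with the given value.
RPC_CALL_HEADER_PREFIX = "adbc.flight.sql.rpc.call_header."


def credential_headers(access_key: str, secret_key: str) -> Dict[str, str]:
    """Build driver options that forward the S3 keys as call headers.

    The header names are hypothetical; gRPC header names must be lowercase.
    """
    return {
        RPC_CALL_HEADER_PREFIX + "x-s3-access-key": access_key,
        RPC_CALL_HEADER_PREFIX + "x-s3-secret-key": secret_key,
    }


def connect(uri: str, access_key: str, secret_key: str):
    """Open an ADBC DB-API connection to a Flight SQL server with the headers set."""
    # Imported lazily so credential_headers() works without the driver installed.
    import adbc_driver_flightsql.dbapi as flightsql

    return flightsql.connect(uri, db_kwargs=credential_headers(access_key, secret_key))
```

With this, a client would call connect with a grpc+tls:// URI, run its query
through a cursor, and read the result as Arrow data; TLS matters here because
the keys are sent with every call.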


Thanks,

Susmit

On Wed, Oct 16, 2024 at 10:43 PM Susmit Sarkar <[email protected]>
wrote:

> Thank you, will keep posted in the same thread
>
> Regards,
> Susmit
>
> On Wed, Oct 16, 2024 at 9:45 PM Weston Pace <[email protected]> wrote:
>
>> > Do you folks believe Duckdb and Datafusion (latter being similar to
>> > spark sql) will be an overkill?
>>
>> No, I don't believe it would be overkill.
>>
>> I also wouldn't compare either one to Spark SQL.  Spark SQL is meant to be
>> a distributed query engine that typically requires a cluster of some sort
>> to operate at full performance.  A distributed query engine would probably
>> be overkill for your situation.
>>
>> Both DuckDB and DataFusion are meant to be lightweight, embeddable,
>> single-node (i.e. not distributed) query engine libraries.  These are
>> probably a good fit for your use case.
>>
>> -Weston
>>
>> On Wed, Oct 16, 2024 at 8:17 AM Susmit Sarkar <[email protected]>
>> wrote:
>>
>> > Thanks David and Felipe for your help, I will definitely try out and
>> > keep you folks updated.
>> >
>> > Do you folks believe Duckdb and Datafusion (latter being similar to
>> > spark sql) will be an overkill?
>> >
>> > Thanks,
>> > Susmit
>> >
>> > On Wed, Oct 16, 2024 at 8:25 PM Felipe Oliveira Carvalho <
>> > [email protected]> wrote:
>> >
>> > > Hi Susmit,
>> > >
>> > > For an example of what David Li is proposing, you can take a look at
>> > > this project (https://github.com/voltrondata/sqlflite). It's a Flight
>> > > SQL server (in C++ though) that can forward queries to either SQLite
>> > > or DuckDB.
>> > >
>> > > --
>> > > Felipe
>> > >
>> > > On Wed, Oct 16, 2024 at 10:22 AM David Li <[email protected]> wrote:
>> > >
>> > > > If your clients are sending full SQL queries to be executed, and you
>> > > > need to execute them against S3 on the server, why not consider
>> > > > something like Apache DataFusion or DuckDB to implement that part
>> > > > instead of building the query parser/engine yourself? (There are
>> > > > probably already examples of wrapping both these projects in Flight
>> > > > SQL floating around.)
>> > > >
>> > > > On Wed, Oct 16, 2024, at 21:38, Susmit Sarkar wrote:
>> > > > > Hi Community Members
>> > > > >
>> > > > >
>> > > > > We are planning to build an Arrow Flight server on top of data
>> > > > > lying in S3.
>> > > > >
>> > > > >
>> > > > > *Detailed Use Case:*
>> > > > >
>> > > > >
>> > > > > The requirement is that we need to sync data from HDFS to
>> > > > > short-term storage, S3 in our case. Basically, a DataSync service
>> > > > > between cloud storages.
>> > > > >
>> > > > >
>> > > > > I have already built the service using Apache Pekko / Akka HDFS &
>> > > > > S3 connectors, and data is in sync with HDFS & S3.
>> > > > >
>> > > > >
>> > > > > Now comes the data reading part for end users. The data is stored
>> > > > > in Cloudian S3 (Cloudian-managed S3, not AWS) short-term storage as
>> > > > > Parquet. We want to build a Data as a Service on top of the data
>> > > > > lying in S3 and expose API endpoints for clients to query. The data
>> > > > > will be short-lived, kept for weeks or months (max 3 months); use
>> > > > > cases vary from team to team.
>> > > > >
>> > > > >
>> > > > > So we felt an Apache Arrow Flight SQL server would be best suited
>> > > > > for our use case, and the client should send a FlightDescriptor
>> > > > > object wrapping the SQL query.
>> > > > >
>> > > > >
>> > > > > We parsed the query, queried S3 using the AWS S3 SDKs, and returned
>> > > > > the data, but the issue is that we will end up building our own
>> > > > > query parser, which is a bigger task.
>> > > > >
>> > > > > Is there any other approach we can try out?
>> > > > >
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Susmit
>> > > >
>> > >
>> >
>>
>
