Re: Query of Arrow Flight SQL with S3 as a storage for parquet files

David Li Mon, 18 Nov 2024 00:18:55 -0800

Hi Susmit,

You can pass headers: see the documentation [1].

[1]: https://arrow.apache.org/adbc/current/python/api/adbc_driver_flightsql.html#adbc_driver_flightsql.ConnectionOptions.RPC_CALL_HEADER_PREFIX
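
A minimal sketch of what passing headers could look like from Python, assuming the server is written to read the credentials from gRPC call headers instead of from the DoGet ticket. The header names (x-s3-access-key, x-s3-secret-key), the URI, and the helper functions are hypothetical; the option-key prefix is the string documented for RPC_CALL_HEADER_PREFIX, and gRPC expects header names in lowercase.

```python
# Option-key prefix documented as RPC_CALL_HEADER_PREFIX in the
# ADBC Flight SQL driver; the header name is appended to it.
RPC_CALL_HEADER_PREFIX = "adbc.flight.sql.rpc.call_header."


def header_options(headers):
    """Turn {"header-name": "value"} into ADBC driver option keys.

    Hypothetical helper: header names are lowercased because gRPC
    metadata keys must be lowercase.
    """
    return {
        RPC_CALL_HEADER_PREFIX + name.lower(): value
        for name, value in headers.items()
    }


def run_query(uri, sql, headers):
    """Run `sql` against a Flight SQL server, sending `headers` on each call.

    Hypothetical helper. The driver is imported lazily so that
    header_options() stays usable without the driver installed.
    """
    import adbc_driver_flightsql.dbapi as flight_sql

    with flight_sql.connect(uri, db_kwargs=header_options(headers)) as conn:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetch_arrow_table()


if __name__ == "__main__":
    # Assumed header names; the server decides which headers it reads.
    opts = header_options(
        {"x-s3-access-key": "<access key>", "x-s3-secret-key": "<secret key>"}
    )
    for key in opts:
        print(key)
```

With a running server, run_query("grpc://localhost:31337", "SELECT 1", {...}) would send those headers with each request (the address is a placeholder). On the server side, the headers would have to be read in the Flight SQL handler or middleware and used to configure DuckDB's S3 access; that part is not shown here.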

-David

On Mon, Nov 18, 2024, at 16:16, Susmit Sarkar wrote:
> Hi Community Members,
>
> We went ahead with DuckDB. I have a very basic question: with this setup,
> can we use Arrow ADBC to interact with the Flight SQL server, which
> internally is a wrapper on top of DuckDB that queries the data from S3 and
> streams it back to the client?
>
> For every client request, the credentials (the access and secret keys)
> are passed as part of the DoGet ticket information. Is it possible to
> pass the same with ADBC to the Flight server?
>
> *data_stream: FlightStreamReader = client.do_get(ticket)*
>
>
> Thanks,
>
> Susmit
>
> On Wed, Oct 16, 2024 at 10:43 PM Susmit Sarkar <susmitsir...@gmail.com>
> wrote:
>
>> Thank you, will keep posted in the same thread
>>
>> Regards,
>> Susmit
>>
>> On Wed, Oct 16, 2024 at 9:45 PM Weston Pace <weston.p...@gmail.com> wrote:
>>
>>> > Do you folks believe Duckdb and Datafusion (latter being similar to
>>> > spark sql) will be an overkill?
>>>
>>> No, I don't believe it would be overkill.
>>>
>>> I also wouldn't compare either one to Spark SQL.  Spark SQL is meant to be
>>> a distributed query engine that typically requires a cluster of some sort
>>> to operate at full performance.  A distributed query engine would probably
>>> be overkill for your situation.
>>>
>>> Both DuckDB and DataFusion are meant to be lightweight, embeddable,
>>> single-node (i.e. not distributed) query engine libraries.  These are
>>> probably a good fit for your use case.
>>>
>>> -Weston
>>>
>>> On Wed, Oct 16, 2024 at 8:17 AM Susmit Sarkar <susmitsir...@gmail.com>
>>> wrote:
>>>
>>> > Thanks David and Felipe for your help, I will definitely try out and keep
>>> > you folks updated.
>>> >
>>> > Do you folks believe Duckdb and Datafusion (latter being similar to
>>> > spark sql) will be an overkill?
>>> >
>>> > Thanks,
>>> > Susmit
>>> >
>>> > On Wed, Oct 16, 2024 at 8:25 PM Felipe Oliveira Carvalho <
>>> > felipe...@gmail.com> wrote:
>>> >
>>> > > Hi Susmit,
>>> > >
>>> > > For an example of what David Li is proposing, you can take a look at this
>>> > > project (https://github.com/voltrondata/sqlflite). It's a Flight SQL
>>> > > server (in C++ though) that can forward queries to either SQLite or DuckDB.
>>> > >
>>> > > --
>>> > > Felipe
>>> > >
>>> > > On Wed, Oct 16, 2024 at 10:22 AM David Li <lidav...@apache.org> wrote:
>>> > >
>>> > > > If your clients are sending full SQL queries to be executed, and you need
>>> > > > to execute them against S3 on the server, why not consider something like
>>> > > > Apache DataFusion or DuckDB to implement that part instead of building the
>>> > > > query parser/engine yourself? (There are probably already examples of
>>> > > > wrapping both these projects in Flight SQL floating around.)
>>> > > >
>>> > > > On Wed, Oct 16, 2024, at 21:38, Susmit Sarkar wrote:
>>> > > > > Hi Community Members
>>> > > > >
>>> > > > >
>>> > > > > We are planning to build an Arrow Flight server on top of data lying in S3.
>>> > > > >
>>> > > > >
>>> > > > > *Detailed Use Case:*
>>> > > > >
>>> > > > >
>>> > > > > The requirement is that we need to sync data from HDFS to short-term
>>> > > > > storage, S3 in our case. Basically, a DataSync service between cloud storages.
>>> > > > >
>>> > > > >
>>> > > > > I have already built the service using Apache Pekko / Akka HDFS & S3
>>> > > > > connectors, and data is in sync with HDFS & S3.
>>> > > > >
>>> > > > >
>>> > > > > Now comes the data reading part for end users. The data is stored in
>>> > > > > Cloudian S3 (Cloudian-managed S3, not AWS) short-term storage as Parquet. We
>>> > > > > want to build a Data as a Service on top of the data lying in S3 and expose
>>> > > > > API endpoints for clients to query. The data will be short-term: it
>>> > > > > may be kept for weeks or months (max 3 months); use cases vary from
>>> > > > > team to team.
>>> > > > >
>>> > > > >
>>> > > > > So we felt an Arrow Flight SQL server would be best suited for our use
>>> > > > > case, and the client should send a FlightDescriptor object wrapping
>>> > > > > the SQL query.
>>> > > > >
>>> > > > >
>>> > > > > We parsed the query, queried S3 using the AWS S3 SDKs, and returned the
>>> > > > > data, but the issue is that we will end up building our own query parser,
>>> > > > > which is a bigger task.
>>> > > > >
>>> > > > > Is there any other approach we can try out ?
>>> > > > >
>>> > > > >
>>> > > > > Thanks,
>>> > > > >
>>> > > > > Susmit
>>> > > >
>>> > >
>>> >
>>>
>>
