Hi Community Members,

We went ahead with DuckDB. I have a very basic question: with this setup, can we use Arrow ADBC to interact with the Flight SQL server, which internally is a wrapper on top of DuckDB that queries the data from S3 and streams it back to the client?
For every client request, the credentials (the access and secret keys) are passed as part of the DoGet ticket. Is it possible to pass the same with ADBC to the Flight server?

*data_stream: FlightStreamReader = client.do_get(ticket)*

Thanks,
Susmit

On Wed, Oct 16, 2024 at 10:43 PM Susmit Sarkar <susmitsir...@gmail.com> wrote:

> Thank you, will keep posted in the same thread
>
> Regards,
> Susmit
>
> On Wed, Oct 16, 2024 at 9:45 PM Weston Pace <weston.p...@gmail.com> wrote:
>
>> > Do you folks believe Duckdb and Datafusion (latter being similar to
>> > spark sql) will be an overkill?
>>
>> No, I don't believe it would be overkill.
>>
>> I also wouldn't compare either one to Spark SQL. Spark SQL is meant to be
>> a distributed query engine that typically requires a cluster of some sort
>> to operate at full performance. A distributed query engine would probably
>> be overkill for your situation.
>>
>> Both DuckDb and Datafusion are meant to be lightweight, embeddable, single
>> node (i.e. not distributed) query engine libraries. These are probably a
>> good fit for your use case.
>>
>> -Weston
>>
>> On Wed, Oct 16, 2024 at 8:17 AM Susmit Sarkar <susmitsir...@gmail.com>
>> wrote:
>>
>> > Thanks David and Felipe for your help, I will definitely try out and
>> > keep you folks updated.
>> >
>> > Do you folks believe Duckdb and Datafusion (latter being similar to
>> > spark sql) will be an overkill?
>> >
>> > Thanks,
>> > Susmit
>> >
>> > On Wed, Oct 16, 2024 at 8:25 PM Felipe Oliveira Carvalho
>> > <felipe...@gmail.com> wrote:
>> >
>> > > Hi Susmit,
>> > >
>> > > For an example of what David Li is proposing, you can take a look at
>> > > this project (https://github.com/voltrondata/sqlflite). It's a Flight
>> > > SQL server (in C++ though) that can forward queries to either SQLite
>> > > or DuckDB.
>> > >
>> > > --
>> > > Felipe
>> > >
>> > > On Wed, Oct 16, 2024 at 10:22 AM David Li <lidav...@apache.org> wrote:
>> > >
>> > > > If your clients are sending full SQL queries to be executed, and you
>> > > > need to execute them against S3 on the server, why not consider
>> > > > something like Apache DataFusion or DuckDB to implement that part
>> > > > instead of building the query parser/engine yourself? (There are
>> > > > probably already examples of wrapping both these projects in Flight
>> > > > SQL floating around.)
>> > > >
>> > > > On Wed, Oct 16, 2024, at 21:38, Susmit Sarkar wrote:
>> > > > > Hi Community Members
>> > > > >
>> > > > > We are planning to build an Arrow flight server on top of data
>> > > > > lying in s3.
>> > > > >
>> > > > > *Detailed Use Case:*
>> > > > >
>> > > > > The requirement is we need to sync data from HDFS to a short term
>> > > > > storage S3 is our case. Basically a DataSync Service between cloud
>> > > > > storages
>> > > > >
>> > > > > I have already built the service using Apache Pekko / Akka HDFS &
>> > > > > S3 connectors, and data is in sync with HDFS & S3.
>> > > > >
>> > > > > Now comes the data reading part for end users. The data is stored
>> > > > > in Cloudian s3 (Cloudian managed S3 not AWS) short term storage in
>> > > > > parquet. We want to build a Data as a Service on top of the data
>> > > > > lying in S3 and expose API endpoints for clients to query. The data
>> > > > > lying will be short term, data may be of week or months (max 3
>> > > > > months) use-cases varies from teams to teams.
>> > > > >
>> > > > > So we felt Apache Sql Flight Server will be the best suited for our
>> > > > > use case and the client should send a FlightDescriptor object
>> > > > > wrapped with the sql query.
>> > > > >
>> > > > > We parsed the query and query s3 using the aws s3 sdks, and return
>> > > > > the data, but the issue is we will end up building our own query
>> > > > > parser, which is a bigger task.
>> > > > >
>> > > > > Is there any other approach we can try out ?
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Susmit
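
For anyone following the credentials question at the top of this message, here is a minimal sketch of one way the keys might travel when the client is ADBC rather than a raw Flight client. With Flight SQL the server issues the tickets, so the client cannot put its own fields into them; the sketch instead sends the keys as per-call gRPC headers through the driver's documented `adbc.flight.sql.rpc.call_header.` option prefix. The header names `x-s3-access-key` and `x-s3-secret-key`, the helper names, and the idea that the server reads the keys from headers (for example in a Flight server middleware) are all assumptions for illustration, not something confirmed in this thread.

```python
from typing import Dict

# Option-key prefix documented by the ADBC Flight SQL driver for custom call
# headers. The Python package exposes the same string as
# adbc_driver_flightsql.DatabaseOptions.RPC_CALL_HEADER_PREFIX.
RPC_CALL_HEADER_PREFIX = "adbc.flight.sql.rpc.call_header."


def credential_options(access_key: str, secret_key: str) -> Dict[str, str]:
    """Build ADBC database options that attach the keys as gRPC call headers.

    The header names are hypothetical; client and server must agree on them.
    """
    return {
        RPC_CALL_HEADER_PREFIX + "x-s3-access-key": access_key,
        RPC_CALL_HEADER_PREFIX + "x-s3-secret-key": secret_key,
    }


def run_query(uri: str, sql: str, access_key: str, secret_key: str):
    """Run one query through ADBC and return the result as a pyarrow.Table."""
    # Imported lazily so credential_options() works without the driver installed.
    import adbc_driver_flightsql.dbapi as flight_sql

    options = credential_options(access_key, secret_key)
    with flight_sql.connect(uri, db_kwargs=options) as conn:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetch_arrow_table()
```

A call would look something like `run_query("grpc://localhost:31337", "SELECT 1", access_key, secret_key)`, where the URI is a placeholder. Whether this fits depends on the server being able to read the headers; a server that only looks inside the ticket would need a change on its side.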