Thank you, I will keep you posted in this same thread.

Regards,
Susmit

On Wed, Oct 16, 2024 at 9:45 PM Weston Pace <weston.p...@gmail.com> wrote:

> > Do you folks believe DuckDB and DataFusion (the latter being similar to
> > Spark SQL) would be overkill?
>
> No, I don't believe it would be overkill.
>
> I also wouldn't compare either one to Spark SQL. Spark SQL is meant to be
> a distributed query engine that typically requires a cluster of some sort
> to operate at full performance. A distributed query engine would probably
> be overkill for your situation.
>
> Both DuckDB and DataFusion are meant to be lightweight, embeddable,
> single-node (i.e. not distributed) query engine libraries. These are
> probably a good fit for your use case.
>
> -Weston
>
> On Wed, Oct 16, 2024 at 8:17 AM Susmit Sarkar <susmitsir...@gmail.com>
> wrote:
>
> > Thanks David and Felipe for your help. I will definitely try it out and
> > keep you folks updated.
> >
> > Do you folks believe DuckDB and DataFusion (the latter being similar to
> > Spark SQL) would be overkill?
> >
> > Thanks,
> > Susmit
> >
> > On Wed, Oct 16, 2024 at 8:25 PM Felipe Oliveira Carvalho <
> > felipe...@gmail.com> wrote:
> >
> > > Hi Susmit,
> > >
> > > For an example of what David Li is proposing, you can take a look at
> > > this project (https://github.com/voltrondata/sqlflite). It's a Flight
> > > SQL server (in C++ though) that can forward queries to either SQLite
> > > or DuckDB.
> > >
> > > --
> > > Felipe
> > >
> > > On Wed, Oct 16, 2024 at 10:22 AM David Li <lidav...@apache.org> wrote:
> > >
> > > > If your clients are sending full SQL queries to be executed, and you
> > > > need to execute them against S3 on the server, why not consider
> > > > something like Apache DataFusion or DuckDB to implement that part
> > > > instead of building the query parser/engine yourself? (There are
> > > > probably already examples of wrapping both these projects in Flight
> > > > SQL floating around.)
> > > >
> > > > On Wed, Oct 16, 2024, at 21:38, Susmit Sarkar wrote:
> > > > > Hi Community Members,
> > > > >
> > > > > We are planning to build an Arrow Flight server on top of data
> > > > > stored in S3.
> > > > >
> > > > > *Detailed Use Case:*
> > > > >
> > > > > We need to sync data from HDFS to short-term storage, which is S3
> > > > > in our case. Essentially, it is a data sync service between cloud
> > > > > storage systems.
> > > > >
> > > > > I have already built the service using the Apache Pekko / Akka HDFS
> > > > > and S3 connectors, and the data in HDFS and S3 is in sync.
> > > > >
> > > > > Now comes the data reading part for end users. The data is stored
> > > > > as Parquet in Cloudian S3 (Cloudian-managed S3, not AWS) short-term
> > > > > storage. We want to build a Data as a Service layer on top of the
> > > > > data in S3 and expose API endpoints for clients to query. The data
> > > > > is short-lived: it may be kept for weeks or months (3 months at
> > > > > most), and use cases vary from team to team.
> > > > >
> > > > > So we felt an Apache Arrow Flight SQL server would be best suited
> > > > > to our use case, with the client sending a FlightDescriptor object
> > > > > that wraps the SQL query.
> > > > >
> > > > > We parse the query, query S3 using the AWS S3 SDKs, and return the
> > > > > data, but the issue is that we will end up building our own query
> > > > > parser, which is a bigger task.
> > > > >
> > > > > Is there any other approach we can try out?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Susmit
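
For readers following the thread, here is a minimal sketch of the idea suggested in the replies: the server passes the client's SQL text straight to an embedded, single-node engine instead of parsing it itself. To keep the sketch runnable with only the Python standard library, it uses `sqlite3` as a stand-in for DuckDB or DataFusion, and an in-memory table as a stand-in for the Parquet data in S3. The table name `events`, its columns, and the helper names are all hypothetical and do not come from the thread.

```python
import sqlite3


def build_demo_engine():
    """Create an in-memory SQLite database.

    Assumption: this stands in for an embedded engine such as DuckDB or
    DataFusion pointed at Parquet files in S3. The `events` table and its
    columns are made up for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (team TEXT, day TEXT, bytes_synced INTEGER)"
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            ("alpha", "2024-10-14", 100),
            ("alpha", "2024-10-15", 250),
            ("beta", "2024-10-15", 70),
        ],
    )
    conn.commit()
    return conn


def execute_client_query(conn, sql):
    """Run the client's SQL as-is and return (column names, rows).

    The engine does the parsing, planning, and execution; the server
    code never inspects the SQL grammar itself.
    """
    cursor = conn.execute(sql)
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    return columns, rows


engine = build_demo_engine()
columns, rows = execute_client_query(
    engine,
    "SELECT team, SUM(bytes_synced) AS total "
    "FROM events GROUP BY team ORDER BY team",
)
print(columns)
print(rows)
```

In a real server, the SQL string would arrive inside the Flight SQL request and the result would be returned as Arrow record batches rather than Python tuples; the sketch only illustrates that the query parser does not need to be hand-built.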