1) David, thanks for mentioning that. To be honest, this is the first time I'm
reading about it. If we are talking only about SparkFlightSql, it does seem
similar in the sense that both use Arrow for streaming data back to the
client. The difference is that the Spark Connect client still has a dependency
on Spark (albeit probably a much more lightweight Spark Connect variant),
rather than on the more universal Arrow Flight SQL. Another major difference
that follows from this is that Spark Connect will probably force you to
have a single-process client (i.e. data will be streamed back to a single
client if necessary), while in the case of SparkFlightSql a client might be
another distributed application like Spark itself or something similar.
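
To make the single-process vs. distributed client distinction concrete, here
is a minimal sketch in plain Python. It does not use the real Arrow Flight or
SparkFlightSql APIs: `Endpoint`, `get_endpoints`, `fetch_partition` and the
node addresses are hypothetical stand-ins for a Flight endpoint (a ticket plus
a location), a GetFlightInfo call and a DoGet call. It only illustrates that
when a result is exposed as several endpoints, each one can be pulled by a
separate worker instead of being funnelled through one stream.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical stand-in for a Flight endpoint: a ticket plus a server location."""
    ticket: str
    location: str


# Hypothetical in-memory "cluster": each server node holds one partition of a result.
PARTITIONS = {
    "node-a:9000": [1, 2, 3],
    "node-b:9000": [4, 5],
    "node-c:9000": [6, 7, 8, 9],
}


def get_endpoints(query: str) -> List[Endpoint]:
    """Stand-in for GetFlightInfo: one endpoint per node that holds results."""
    return [Endpoint(ticket=f"{query}#{i}", location=loc)
            for i, loc in enumerate(PARTITIONS)]


def fetch_partition(endpoint: Endpoint) -> List[int]:
    """Stand-in for DoGet: read one partition from the node that holds it."""
    return PARTITIONS[endpoint.location]


def fetch_single_process(query: str) -> List[int]:
    """Single-process client: every partition is pulled into one place, in turn."""
    rows: List[int] = []
    for endpoint in get_endpoints(query):
        rows.extend(fetch_partition(endpoint))
    return rows


def fetch_distributed(query: str, workers: int = 3) -> List[List[int]]:
    """Distributed-style client: each endpoint is handled by its own worker.

    Threads stand in for the separate executors of a distributed application.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_partition, get_endpoints(query)))
```

In the second function the partitions stay separate, which is the property a
distributed consumer such as another Spark job would rely on.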

As for the more general use case, generic Spark-backed FlightServers, I don't
think they are comparable. In the case of Spark Connect, despite being a
more lightweight Spark client, fundamentally it still seems to function
similarly to a Spark application in client deploy mode. Business logic still
has to be written on the client side, and there's no easy way to expose that
business logic to users unless you're fine with clients having control
over it. In the case of Spark-backed FlightServers, the use cases I had in
mind are probably best described as distributed data microservices. For
example, you might want to expose machine learning model inference and
hide it behind application-specific Flight actions and a custom authorization
layer, or handle any other scenario where the client has to be fully unaware of
the server code and a large amount of data needs to go over the wire.
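
A minimal sketch of the "data microservice" idea, again in plain Python rather
than the actual Arrow Flight API. Everything here is hypothetical: the `score`
action, the `ACTIONS` and `ALLOWED_ROLES` tables, and the toy model. The point
is only that the client names an action and sends data, while the logic and
the authorization check stay on the server.

```python
from typing import Callable, Dict, List, Set


def _score(rows: List[float]) -> List[float]:
    """Hypothetical 'model': the client never sees this logic, only its output."""
    return [round(2.0 * x + 1.0, 2) for x in rows]


# Server-side registry of application-specific actions (names are illustrative).
ACTIONS: Dict[str, Callable[[List[float]], List[float]]] = {"score": _score}

# Hypothetical authorization table: which roles may invoke which action.
ALLOWED_ROLES: Dict[str, Set[str]] = {"score": {"analyst", "service"}}


def do_action(role: str, action: str, payload: List[float]) -> List[float]:
    """Authorize the caller, then run the named action on the server side."""
    if action not in ACTIONS:
        raise KeyError(f"unknown action: {action}")
    if role not in ALLOWED_ROLES.get(action, set()):
        raise PermissionError(f"role {role!r} may not call {action!r}")
    return ACTIONS[action](payload)
```

A caller with an allowed role would invoke `do_action("analyst", "score", rows)`;
any other role gets a `PermissionError` before the model is touched.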

(Thinking about it now, it might make more sense to rewrite
SparkFlightManager to use Spark Connect internally in the future rather
than what I have now.)

2) I will have to take a closer look at ADBC, but that sounds like a better
idea if most of the Flight SQL functionality will also be exposed by ADBC.
I couldn't find the Java driver for FlightSql in the repo. Is it yet to be
written?

3) Ideally, yes. It can move to upstream Spark if the Spark community so wishes.
--
Tornike

On Mon, Jul 25, 2022 at 4:36 PM David Li <lidav...@apache.org> wrote:

> So this is now both a Flight SQL producer and consumer for Spark? That is
> very cool.
>
> A couple things I was wondering about:
>
> - How do you think this compares to the Spark Connect proposal? [1]
> - Have you considered ADBC [2] instead of Flight SQL for the DataSourceV2
> implementation? While still under development, the hope is to unify things
> like Flight SQL, arrow-jdbc, etc. under a single umbrella.
> - Lastly, where do you see this progressing from here on out? Do you hope
> to upstream into Spark?
>
> [1]:
> https://databricks.com/blog/2022/07/07/introducing-spark-connect-the-power-of-apache-spark-everywhere.html
> [2]: https://github.com/apache/arrow-adbc
>
> -David
>
> On Sat, Jul 23, 2022, at 21:44, Gavin Ray wrote:
> > This sounds pretty darn nifty!
> > I don't have much of value to offer, but the idea sounds like a great one
> > to me =)
> >
> > On Sat, Jul 23, 2022 at 5:18 PM Tornike Gurgenidze <togur...@freeuni.edu.ge>
> > wrote:
> >
> >> David, thank you for the reply.
> >>
> >> I recently managed to find the time to get back to the repo. I thought I
> >> would post the status update for anyone interested.
> >>
> >> The project started out as just FlightSql implementation, but I ended up
> >> splitting it into smaller components:
> >>
> >> 1. SparkFlightManager - a lower-level, more of a utility class, that
> >> enables easier development of Spark-backed FlightServers. It is supposed to
> >> take care of FlightServer cluster management, distribution of Spark query
> >> results to the FlightServer nodes, service discovery and so on, permitting
> >> a developer to focus on just expressing the intended business logic in
> >> Spark. There's a reference FlightServer implementation
> >> (https://github.com/tokoko/SparkFlightSql/blob/main/src/main/scala/com/tokoko/spark/flight/example/SparkParquetFlightProducer.scala)
> >> that illustrates how a simple parquet reader server can be implemented
> >> using SparkFlightManager.
> >>
> >> 2. SparkFlightSql - SparkFlightSqlProducer class that relies on
> >> SparkFlightManager for most of the technical stuff and focuses on simply
> >> mapping Spark Catalog API metadata to the FlightSql specification.
> >>
> >> 3. FlightSql DataSourceV2 - pretty self-explanatory, there's now also the
> >> beginnings of a DataSourceV2 implementation supporting BATCH_READ.
> >>
> >> Once again, if anyone's interested enough to contribute or maybe has a use
> >> case for SparkFlightManager, please feel free to reach out.
> >> --
> >> Tornike
> >>
> >> On Sun, May 29, 2022 at 5:26 AM David Li <lidav...@apache.org> wrote:
> >>
> >> > Hi Tornike,
> >> >
> >> > I'll have to take a closer look later when I can get back in front of a
> >> > real computer but I just want to say that this is super awesome, and thank
> >> > you for sharing!
> >> >
> >> > I think we've kicked around the idea of "contrib" projects in the past.
> >> > Maybe this can be the impetus to take up that idea? Regardless I want to
> >> > say that if you have any questions or feedback about Arrow and Flight SQL
> >> > please feel free to post it here.
> >> >
> >> > -David
> >> >
> >> > On Sat, May 28, 2022, at 18:48, Tornike Gurgenidze wrote:
> >> > > Hi,
> >> > >
> >> > > I'm not sure this is the right place to be posting this, so I apologize
> >> > > in advance.
> >> > >
> >> > > Recently I started a PoC for Arrow Flight SQL Server with Spark backend
> >> > > (https://github.com/tokoko/SparkFlightSql). The main goal is to create a
> >> > > SparkThriftServer alternative that will benefit from FlightSql protocol
> >> > > and will also be distributed in nature, i.e. query results won't have to
> >> > > pass through a single server.
> >> > >
> >> > > I thought it might be interesting for those of you who are also familiar
> >> > > with Spark. I don't have much of an experience with Arrow, so I would
> >> > > appreciate any sort of involvement from Arrow community.
> >> > >
> >> > > Regards,
> >> > > Tornike
> >>
