Yes, it does make sense.
For #2 (Spark reads Druid), I think Spark also needs to be able to get the
schema from Druid, probably via a query to the broker.
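
For example (just a sketch: the broker address and datasource name are
placeholders, and error handling is omitted), the connector could ask the
broker's SQL endpoint for column metadata:

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    object DruidSchemaProbe {
      def main(args: Array[String]): Unit = {
        // Placeholder broker address and datasource name.
        val broker = "http://localhost:8082"
        val sql =
          "SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS " +
          "WHERE TABLE_NAME = 'wikipedia'"
        val request = HttpRequest.newBuilder(URI.create(s"$broker/druid/v2/sql"))
          .header("Content-Type", "application/json")
          .POST(HttpRequest.BodyPublishers.ofString(s"""{"query": "$sql"}"""))
          .build()
        val response = HttpClient.newHttpClient()
          .send(request, HttpResponse.BodyHandlers.ofString())
        // The connector would map the returned Druid column types onto a
        // Spark StructType.
        println(response.body())
      }
    }
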
I wonder what the UX would look like for Spark SQL users when they specify
the schema. Would they create an EXTERNAL TABLE in Spark that maps to a
Druid datasource, or would that be something users specify as part of the
table properties? (I think those are good things to cover in the design
proposal.)
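
To make the question concrete, here is one possible shape (purely
illustrative: the "druid" data source and these option keys don't exist
today):

    // Assumes an existing SparkSession named `spark`; the "druid" format,
    // the option keys, and the addresses below are placeholders, not a real API.
    spark.sql("""
      CREATE TABLE wikipedia_druid
      USING druid
      OPTIONS (
        'broker.url' = 'http://localhost:8082',
        'datasource' = 'wikipedia'
      )
    """)

    // ...or the equivalent without a catalog table, via the reader API:
    val df = spark.read
      .format("druid")
      .option("broker.url", "http://localhost:8082")
      .option("datasource", "wikipedia")
      .load()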

Regards,
Will


On Wed, Aug 9, 2023 at 2:42 AM Itai Yaffe <itai.ya...@gmail.com> wrote:

> For full disclosure, it's been a while since I used Druid, but here are my
> 2 cents w.r.t. Will's questions (based on what I originally wrote in this
> design doc
> <https://docs.google.com/document/d/112VsrCKhtqtUTph5yXMzsaoxtz9wX1U2poi1vxuDswY/edit#>):
>
>    1. *Spark writes to Druid*:
>       1. Based on what I've seen, the latter would be the more common
>       choice, i.e. *I would assume most users would execute an external
>       Spark job* (external to Druid, that is), e.g. from Databricks/EMR/...
>       That job would process data and write the output into Druid, in the
>       form of Druid segments written directly to Druid's deep storage, plus
>       the required updates to Druid's metadata store (there's a rough
>       sketch of both the write and read paths right after this list).
>       2. If the community chooses to go down that route, I think it's also
>       possible to execute other operations (e.g. compaction) from external
>       Spark jobs, since at a high level they are sort of ingestion jobs:
>       they read Druid segments and write new Druid segments.
>    2. *Spark reads from Druid*:
>       1. You can already issue queries to Druid from Spark using JDBC, so
>       in this case the more appealing option, I think, is *to be able to
>       read segment files directly* (especially for extremely heavy
>       queries).
>       2. In addition, the ability for Spark to read segment files directly
>       is needed in order to support Druid->Druid ingestion (i.e. where your
>       input is another Druid datasource), as well as to support compaction
>       tasks (IIRC).
>    3. Generally speaking, I agree with your observation w.r.t. the bigger
>    interest in the writer.
>    The only comment here is that some of the additional benefits of the
>    writer (e.g. Druid->Druid ingestion, support for compaction tasks)
>    depend on also implementing the reader.
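>
>    To make the above a bit more tangible, here's a rough sketch. The read
>    side uses Druid's existing Avatica JDBC endpoint, which works today; the
>    write side is entirely hypothetical (the "druid" format and its option
>    keys are placeholders for whatever the connector ends up exposing):
>
>       // Assumes an existing SparkSession named `spark` and a Druid broker
>       // at the given address; hosts, ports, and names are placeholders.
>
>       // Read path that already works: query Druid over JDBC (Avatica).
>       val viaJdbc = spark.read
>         .format("jdbc")
>         .option("url",
>           "jdbc:avatica:remote:url=http://localhost:8082/druid/v2/sql/avatica/")
>         .option("driver", "org.apache.calcite.avatica.remote.Driver")
>         .option("query", "SELECT __time, page, added FROM wikipedia")
>         .load()
>
>       // Hypothetical write path for the proposed connector: Spark builds
>       // Druid segments, pushes them to deep storage, and updates the
>       // metadata store.
>       viaJdbc.write
>         .format("druid")                       // does not exist yet
>         .option("datasource", "wikipedia_copy")
>         .option("deepStorage.type", "s3")      // placeholder option keys
>         .mode("overwrite")
>         .save()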
>
> Hope that helps 🙂
>
> Thanks!
>
> On Tue, 8 Aug 2023 at 19:27, Will Xu <will...@imply.io.invalid> wrote:
>
> > As for which version to target, I think we should survey the Druid
> > community and get input. In your case, which version are you currently
> > deploying? Historical experience tells me we should target current and
> > current-1 (3.4.x and 3.3.x).
> >
> > In terms of the writer (Spark writes to Druid), what's the user workflow
> > you envision? Do you think the user would trigger a Spark job from
> > Druid, or is this a user submitting a Spark job that targets a Druid
> > cluster? The former would allow other systems, compaction for example,
> > to use Spark as a runner.
> >
> > In terms of the reader (Spark reads Druid), I'm most curious to find out
> > what experience you are imagining. Should the reader read Druid segment
> > files directly, or would it issue queries to Druid (maybe even to
> > historicals?) so that queries can be parallelized?
> >
> > Of the two, there is a lot more interest in the writer from the people
> I've
> > been talking to.
> >
> > Regards,
> > Will
> >
> >
> > On Tue, Aug 8, 2023 at 8:50 AM Julian Jaffe <julianfja...@gmail.com>
> > wrote:
> >
> > > Hey all,
> > >
> > > There was talk earlier this year about resurrecting the effort to add
> > > direct Spark readers and writers to Druid. Rather than repeat the
> > previous
> > > attempt and parachute in with updated connectors, I’d like to start by
> > > building a little more consensus around what the Druid dev community
> > wants
> > > as potential maintainers.
> > >
> > > To begin with, I want to solicit opinions on two topics:
> > >
> > > 1. Should these connectors be written in Scala or Java? The benefits of
> > > Scala would be that the existing connectors are written in Scala, as are
> > > most open source references for Spark Datasource V2 implementations (a
> > > bare-bones sketch of the Datasource V2 surface involved follows below).
> > > The benefits of Java are that Druid is written in Java, so engineers
> > > interested in contributing to Druid wouldn't need to switch between
> > > languages. Additionally, existing tooling, static checkers, etc. could
> > > be used with minimal effort, keeping code style and developer ergonomics
> > > consistent across Druid instead of needing to keep an alternate Scala
> > > toolchain in sync.
> > > 2. Which Spark version should this effort target? The most recently
> > > released version of Spark is 3.4.1. Should we aim to integrate with the
> > > latest Spark minor version, under the assumption that this will give us
> > > the longest window of support, or should we build against an older minor
> > > line (3.3? 3.2?) since most Spark users tend to lag? For reference, there
> > > are currently 3 stable Spark release versions: 3.2.4, 3.3.2, and 3.4.1.
> > > From a user's point of view, the API is mostly compatible across a major
> > > version (i.e. 3.x), while developer APIs such as the ones we would use to
> > > build these connectors can change between minor versions.
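> > >
> > > For reference, here is a bare-bones sketch of the Datasource V2 entry
> > > points such a connector would implement (class names are placeholders,
> > > and nothing here is the actual proposed implementation). These are plain
> > > Java interfaces under org.apache.spark.sql.connector, so they can be
> > > implemented from either Java or Scala:
> > >
> > >    import java.util
> > >    import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
> > >    import org.apache.spark.sql.connector.expressions.Transform
> > >    import org.apache.spark.sql.connector.read.ScanBuilder
> > >    import org.apache.spark.sql.types.StructType
> > >    import org.apache.spark.sql.util.CaseInsensitiveStringMap
> > >
> > >    class DruidTableProvider extends TableProvider {
> > >      // Would call the broker (e.g. INFORMATION_SCHEMA) to build the schema.
> > >      override def inferSchema(options: CaseInsensitiveStringMap): StructType =
> > >        throw new UnsupportedOperationException("sketch only")
> > >
> > >      override def getTable(schema: StructType,
> > >                            partitioning: Array[Transform],
> > >                            properties: util.Map[String, String]): Table =
> > >        new DruidTable(schema)
> > >    }
> > >
> > >    class DruidTable(tableSchema: StructType) extends Table with SupportsRead {
> > >      override def name(): String = "druid"
> > >      override def schema(): StructType = tableSchema
> > >      override def capabilities(): util.Set[TableCapability] =
> > >        util.EnumSet.of(TableCapability.BATCH_READ)
> > >      // A real implementation would return a ScanBuilder that plans, e.g.,
> > >      // one Spark partition per Druid segment.
> > >      override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
> > >        throw new UnsupportedOperationException("sketch only")
> > >    }
> > >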
> > > There are quite a few nuances and trade-offs inherent to the decisions
> > > above, and my hope is that by hashing these choices out before presenting
> > > an implementation, we can build buy-in from the Druid maintainer
> > > community that will result in this effort succeeding where the first
> > > attempt failed.
> > >
> > > Thanks,
> > > Julian
> >
>
