We also have a use case of reading from Spark. However, we are using HDFS (an on-prem solution) rather than S3. While write support would also be needed, our first requirement is really to query the data from Spark. Today we ingest via Kafka into Druid.
- Rajiv

On 3/5/20, 11:43 AM, "itai yaffe" <itai.ya...@gmail.com> wrote:

Thanks Julian! I'm actually targeting write capabilities for this connector (at least as a first phase), rather than focusing on read capabilities. Having said that, I definitely see the value (even for the use cases in my company) of having a reader that queries S3 segments directly! Funny enough, we too have implemented a mechanism (albeit a very simple one) that gets the locations of the segments through SegmentMetadataQueries, to allow batch-oriented queries to work against the deep storage :)
Anyway, as I said, I think we can focus on write capabilities for now, and worry about read capabilities later (if that's OK).

On 2020/03/05 18:29:09, Julian Jaffe <jja...@pinterest.com.INVALID> wrote:

The spark-druid-connector you shared brings up another design decision we should probably talk through. That connector effectively wraps an HTTP query client with Spark plumbing. An alternative approach (and the one I ended up building due to our business requirements) is to build a reader that operates directly over the S3 segments, shifting load for what are likely very large and non-interactive queries off Druid-specific hardware (with the exception of a few SegmentMetadataQueries to get location info).

On Thu, Mar 5, 2020 at 8:04 AM itai yaffe <itai.ya...@gmail.com> wrote:

I'll let Julian answer, but in the meantime, I just wanted to point out we might be able to draw some inspiration from this Spark-Redshift connector (https://github.com/databricks/spark-redshift#scala). Though it's somewhat outdated, it can probably serve as a reference for the new Spark-Druid connector we're planning. Another project to look at is https://github.com/SharpRay/spark-druid-connector.

On 2020/03/02 14:31:27, Oğuzhan Mangır <sosyalmedya.oguz...@gmail.com> wrote:

I think the second option would be better. Many people use Spark for batch operations with isolated clusters. My friends and I will be spending time on that. Julian, can you share your experiences with that? After that, we can easily write down our aims, requirements, and flows.
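As an illustration of the segment-reading approach Julian describes above, here is a minimal sketch of what the user-facing side might look like. Everything Druid-specific in it is an assumption made for illustration: the "druid-segments" format name and the option keys are hypothetical, not an existing API.

// Hypothetical Spark reader that scans deep-storage segments directly.
// The idea: the connector issues a few SegmentMetadataQueries against the
// Broker to discover segment locations, then reads the segment files in
// S3 from the Spark executors, keeping large, non-interactive batch scans
// off Druid-specific hardware.
import org.apache.spark.sql.SparkSession

object DruidSegmentReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("druid-segment-reader-sketch")
      .getOrCreate()

    val df = spark.read
      .format("druid-segments")                    // hypothetical source name
      .option("broker.uri", "http://broker:8082")  // used only for metadata queries
      .option("dataSource", "events")              // Druid datasource to scan
      .option("interval", "2020-01-01/2020-02-01") // which segments to read
      .load()

    // From here on it's a plain DataFrame; the scan itself runs in Spark,
    // directly over the segment files in deep storage.
    df.groupBy("country").count().show()

    spark.stop()
  }
}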
On 2020/02/26 13:26:13, itai yaffe <itai.ya...@gmail.com> wrote:

Hey,
Per Gian's proposal, and following this thread in the Druid user group (https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this thread in the Druid Slack channel (https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like to start discussing the options for Spark-based ingestion into Druid.

There's already an old project (https://github.com/metamx/druid-spark-batch) for that, so perhaps we can use it as a starting point.

The thread on Slack suggested 2 approaches:

1. *Simply replacing the Hadoop MapReduce ingestion task* - having a Spark batch job that ingests data into Druid, as a simple replacement for the Hadoop MapReduce ingestion task. Meaning - your data pipeline has one Spark job to pre-process the data (similar to what some of us have today), and another Spark job to read the output of the previous job and create Druid segments (again, following the same pattern as the Hadoop MapReduce ingestion task).

2. *Druid output sink for Spark* - rather than having 2 separate Spark jobs, 1 for pre-processing the data and 1 for ingesting the data into Druid, you'd have a single Spark job that pre-processes the data and creates Druid segments directly, e.g. sparkDataFrame.write.format("druid") (as suggested by omngr on Slack; a rough sketch follows at the end of this message).

I personally prefer the 2nd approach - while it might be harder to implement, the benefits seem greater.

I'd like to hear your thoughts and to start getting this ball rolling.

Thanks,
Itai
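To make the write-sink idea concrete, here is a minimal sketch of what approach 2 could look like from the user's side, in the same spirit as the reader sketch above. The "druid" format name and all option keys are assumptions for illustration, not an existing API.

// Hypothetical Druid output sink for Spark (approach 2): a single job
// pre-processes the data and writes Druid segments directly, instead of
// handing its output to a separate Hadoop MapReduce ingestion task.
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object DruidSinkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("druid-sink-sketch")
      .getOrCreate()

    // Step 1: pre-process the raw events as usual.
    val events: DataFrame = spark.read.parquet("hdfs:///raw/events")
      .filter("country = 'US'")

    // Step 2: create and publish Druid segments directly from the same job.
    events.write
      .format("druid")                     // hypothetical sink name
      .mode(SaveMode.Overwrite)
      .option("dataSource", "events")      // target Druid datasource
      .option("timestampColumn", "ts")     // primary time column
      .option("segmentGranularity", "DAY") // one segment per day
      .option("deepStorage.type", "hdfs")  // where segments get pushed
      .save()

    spark.stop()
  }
}

The appeal of this shape is that segment creation becomes just another DataFrame write, so the pre-processing and ingestion stages share one cluster, one job, and one failure domain.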