Hey everyone,
I've created the initial design doc: https://docs.google.com/document/d/112VsrCKhtqtUTph5yXMzsaoxtz9wX1U2poi1vxuDswY/edit?usp=sharing
It lays out the motivation and a few more details (as discussed in the various channels). Let's start working on it together, and then we can get Gian's review.
BTW - the doc is currently open for everyone to edit; let me know if you think I should change that.

On 2020/03/11 22:33:19, itai yaffe <itai.ya...@gmail.com> wrote:
> Hey Rajiv,
> Can you please provide some details on the use-case of querying Druid from Spark (e.g. what type of queries, how big the result set is, and any other information you think is relevant)?
>
> Thanks!
>
> On Tue, Mar 10, 2020 at 6:08 PM Rajiv Mordani <rmord...@vmware.com.invalid> wrote:
> > As part of the requirements, please include querying/reading from Spark as well. This is a high priority for us.
> >
> > - Rajiv
> >
> > On 3/10/20, 1:26 AM, "Oguzhan Mangir" <sosyalmedya.oguz...@gmail.com> wrote:
> > What will we do for that? I think we can start to write the requirements and flows.
> >
> > On 2020/03/05 20:19:38, Julian Jaffe <jja...@pinterest.com.INVALID> wrote:
> > > Yeah, I think the primary objective here is a standalone writer from Spark to Druid.
> > >
> > > On Thu, Mar 5, 2020 at 11:43 AM itai yaffe <itai.ya...@gmail.com> wrote:
> > > > Thanks Julian!
> > > > I'm actually targeting for this connector to provide write capabilities (at least as a first phase), rather than focusing on read capabilities.
> > > > Having said that, I definitely see the value (even for the use-cases in my company) of having a reader that queries S3 segments directly! Funny, we too have implemented a mechanism (although a very simple one) to get the locations of the segments through SegmentMetadataQueries, to allow batch-oriented queries to work against the deep storage :)
> > > >
> > > > Anyway, as I said, I think we can focus on write capabilities for now, and worry about read capabilities later (if that's OK).
> > > >
> > > > On 2020/03/05 18:29:09, Julian Jaffe <jja...@pinterest.com.INVALID> wrote:
> > > > > The spark-druid-connector you shared brings up another design decision we should probably talk through. That connector effectively wraps an HTTP query client with Spark plumbing. An alternative approach (and the one I ended up building, due to our business requirements) is to build a reader that operates directly over the S3 segments, shifting load for what are likely very large and non-interactive queries off Druid-specific hardware (with the exception of a few SegmentMetadataQueries to get location info; a sketch of such a query appears after the thread).
> > > > >
> > > > > On Thu, Mar 5, 2020 at 8:04 AM itai yaffe <itai.ya...@gmail.com> wrote:
> > > > > > I'll let Julian answer, but in the meantime I just wanted to point out we might be able to draw some inspiration from this Spark-Redshift connector (https://github.com/databricks/spark-redshift#scala). Though it's somewhat outdated, it can probably be used as a reference for this new Spark-Druid connector we're planning.
> > > > > > Another project to look at is https://github.com/SharpRay/spark-druid-connector.
> > > > > >
> > > > > > On 2020/03/02 14:31:27, Oğuzhan Mangır <sosyalmedya.oguz...@gmail.com> wrote:
> > > > > > > I think the second option would be better. Many people use Spark for batch operations with isolated clusters. My friends and I will be taking time for that. Julian, can you share your experiences with that? After that, we can write our aims, requirements, and flows easily.
> > > > > > >
> > > > > > > On 2020/02/26 13:26:13, itai yaffe <itai.ya...@gmail.com> wrote:
> > > > > > > > Hey,
> > > > > > > > Per Gian's proposal, and following this thread in the Druid user group (https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this thread in the Druid Slack channel (https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like to start discussing the options for Spark-based ingestion into Druid.
> > > > > > > >
> > > > > > > > There's already an old project (https://github.com/metamx/druid-spark-batch) for that, so perhaps we can use it as a starting point.
> > > > > > > >
> > > > > > > > The thread on Slack suggested 2 approaches:
> > > > > > > >
> > > > > > > > 1. *Simply replacing the Hadoop MapReduce ingestion task* - having a Spark batch job that ingests data into Druid, as a simple replacement of the Hadoop MapReduce ingestion task. Meaning - your data pipeline will have a Spark job to pre-process the data (similar to what some of us have today), and another Spark job to read the output of the previous job and create Druid segments (again - following the same pattern as the Hadoop MapReduce ingestion task).
> > > > > > > > 2.
*Druid output sink for Spark* - rather than having 2 separate Spark jobs, 1 for pre-processing the data and 1 for ingesting the data into Druid, you'll have a single Spark job that pre-processes the data and creates Druid segments directly, e.g. sparkDataFrame.write.format("druid") (as suggested by omngr on Slack; see the sketch after the thread).
> > > > > > > >
> > > > > > > > I personally prefer the 2nd approach - while it might be harder to implement, its benefits seem greater.
> > > > > > > >
> > > > > > > > I'd like to hear your thoughts and to start getting this ball rolling.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Itai
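For concreteness, here is a minimal, hypothetical sketch (in Scala) of what the "Druid output sink for Spark" from option 2 could look like from a user's point of view. The "druid" format name and every option key below are assumptions for illustration only; the actual API would be defined as part of the design doc.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("druid-ingest").getOrCreate()

// Pre-process the raw input with ordinary Spark transformations...
val events = spark.read.parquet("s3://example-bucket/raw-events/")
  .filter("country = 'US'")

// ...then write Druid segments directly from the same job, instead of
// handing the output to a separate Hadoop MapReduce ingestion task.
// The format name and all option keys are hypothetical.
events.write
  .format("druid")                               // hypothetical data source name
  .mode(SaveMode.Overwrite)
  .option("dataSource", "events")                // target Druid datasource (assumed option)
  .option("timestampColumn", "ts")               // assumed option
  .option("segmentGranularity", "DAY")           // assumed option
  .option("deepStorage", "s3://example-bucket/druid/segments") // assumed option
  .save()
```

The appeal of this shape is that pre-processing and segment creation share a single job (and a single cluster), which is exactly the benefit argued for above.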
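For the read side Julian describes (a reader operating directly over the S3 segments), the first step is discovering which segments cover a given interval. Below is a rough sketch of issuing a native segmentMetadata query to a Broker over HTTP; the Broker host, datasource, and interval are placeholders, and resolving the returned segment ids to deep-storage paths is deliberately left out.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import scala.io.Source

// A native segmentMetadata query; the datasource and interval are placeholders.
val queryJson =
  """{
    |  "queryType": "segmentMetadata",
    |  "dataSource": "events",
    |  "intervals": ["2020-01-01/2020-03-01"]
    |}""".stripMargin

// POST to the Broker's native query endpoint ("broker:8082" is a placeholder).
val conn = new URL("http://broker:8082/druid/v2/")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
conn.getOutputStream.write(queryJson.getBytes(StandardCharsets.UTF_8))

// The response is a JSON array with one entry per segment, including its id;
// a direct reader would map those ids to S3 locations before handing the
// segments to Spark tasks (that resolution step is omitted here).
val response = Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
println(response)
```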