We also have a use case of reading from Spark. However, we are using HDFS (an on-prem solution) rather than S3. While write support would also be needed, our first requirement is really to query the data from Spark. Today we ingest via Kafka into Druid.
- Rajiv

On 3/5/20, 11:43 AM, "itai yaffe" <itai.ya...@gmail.com> wrote:

Thanks Julian! I'm actually targeting write capabilities for this connector (at least as a first phase), rather than focusing on read capabilities. Having said that, I definitely see the value (even for the use cases in my company) of having a reader that queries S3 segments directly! Funny enough, we too have implemented a mechanism (albeit a very simple one) that gets the locations of the segments through SegmentMetadataQueries, to allow batch-oriented queries to work against the deep storage :)
Anyway, as I said, I think we can focus on write capabilities for now, and worry about read capabilities later (if that's OK).

On 2020/03/05 18:29:09, Julian Jaffe <jja...@pinterest.com.INVALID> wrote:

The spark-druid-connector you shared brings up another design decision we should probably talk through. That connector effectively wraps an HTTP query client with Spark plumbing. An alternative approach (and the one I ended up building due to our business requirements) is to build a reader that operates directly over the S3 segments, shifting load for what are likely very large and non-interactive queries off Druid-specific hardware (with the exception of a few SegmentMetadataQueries to get location info).

On Thu, Mar 5, 2020 at 8:04 AM itai yaffe <itai.ya...@gmail.com> wrote:

I'll let Julian answer, but in the meantime, I just wanted to point out we might be able to draw some inspiration from this Spark-Redshift connector (https://github.com/databricks/spark-redshift#scala). Though it's somewhat outdated, it can probably serve as a reference for the new Spark-Druid connector we're planning. Another project to look at is https://github.com/SharpRay/spark-druid-connector.

On 2020/03/02 14:31:27, Oğuzhan Mangır <sosyalmedya.oguz...@gmail.com> wrote:

I think the second option would be better. Many people use Spark for batch operations with isolated clusters. My friends and I will be spending time on that. Julian, can you share your experiences with that? After that, we can easily write down our aims, requirements, and flows.
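As an illustration of the segment-reading approach Julian describes above, here is a minimal sketch of what the user-facing side might look like. Everything Druid-specific in it is an assumption made for illustration: the "druid-segments" format name and the option keys are hypothetical, not an existing API.

// Hypothetical Spark reader that scans deep-storage segments directly.
// The idea: the connector issues a few SegmentMetadataQueries against the
// Broker to discover segment locations, then reads the segment files in
// S3 from the Spark executors, keeping large, non-interactive batch scans
// off Druid-specific hardware.
import org.apache.spark.sql.SparkSession

object DruidSegmentReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("druid-segment-reader-sketch")
      .getOrCreate()

    val df = spark.read
      .format("druid-segments")                    // hypothetical source name
      .option("broker.uri", "http://broker:8082")  // used only for metadata queries
      .option("dataSource", "events")              // Druid datasource to scan
      .option("interval", "2020-01-01/2020-02-01") // which segments to read
      .load()

    // From here on it's a plain DataFrame; the scan itself runs in Spark,
    // directly over the segment files in deep storage.
    df.groupBy("country").count().show()

    spark.stop()
  }
}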
On 2020/02/26 13:26:13, itai yaffe <itai.ya...@gmail.com> wrote:

Hey,
Per Gian's proposal, and following this thread in the Druid user group (https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this thread in the Druid Slack channel (https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like to start discussing the options for Spark-based ingestion into Druid.

There's already an old project (https://github.com/metamx/druid-spark-batch) for that, so perhaps we can use it as a starting point.

The thread on Slack suggested 2 approaches:

1. *Simply replacing the Hadoop MapReduce ingestion task* - having a Spark batch job that ingests data into Druid, as a simple replacement for the Hadoop MapReduce ingestion task. Meaning - your data pipeline has one Spark job to pre-process the data (similar to what some of us have today), and another Spark job to read the output of the previous job and create Druid segments (again, following the same pattern as the Hadoop MapReduce ingestion task).

2. *Druid output sink for Spark* - rather than having 2 separate Spark jobs, 1 for pre-processing the data and 1 for ingesting the data into Druid, you'd have a single Spark job that pre-processes the data and creates Druid segments directly, e.g. sparkDataFrame.write.format("druid") (as suggested by omngr on Slack; a rough sketch follows at the end of this message).

I personally prefer the 2nd approach - while it might be harder to implement, the benefits seem greater.

I'd like to hear your thoughts and to start getting this ball rolling.

Thanks,
Itai
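To make the write-sink idea concrete, here is a minimal sketch of what approach 2 could look like from the user's side, in the same spirit as the reader sketch above. The "druid" format name and all option keys are assumptions for illustration, not an existing API.

// Hypothetical Druid output sink for Spark (approach 2): a single job
// pre-processes the data and writes Druid segments directly, instead of
// handing its output to a separate Hadoop MapReduce ingestion task.
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object DruidSinkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("druid-sink-sketch")
      .getOrCreate()

    // Step 1: pre-process the raw events as usual.
    val events: DataFrame = spark.read.parquet("hdfs:///raw/events")
      .filter("country = 'US'")

    // Step 2: create and publish Druid segments directly from the same job.
    events.write
      .format("druid")                     // hypothetical sink name
      .mode(SaveMode.Overwrite)
      .option("dataSource", "events")      // target Druid datasource
      .option("timestampColumn", "ts")     // primary time column
      .option("segmentGranularity", "DAY") // one segment per day
      .option("deepStorage.type", "hdfs")  // where segments get pushed
      .save()

    spark.stop()
  }
}

The appeal of this shape is that segment creation becomes just another DataFrame write, so the pre-processing and ingestion stages share one cluster, one job, and one failure domain.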