gianm commented on issue #9780: URL: https://github.com/apache/druid/issues/9780#issuecomment-4159614606
Additional thoughts: with Hadoop ingestion, two things caused problems: (1) Druid loading Hadoop libraries, which happened in `index_hadoop` tasks, and (2) Hadoop loading Druid libraries, which happened on the YARN cluster, since we had implemented Mapper and Reducer using Druid libraries. The second one was generally the cause of more problems, because it was really hard to control the classloading situation on the remote side (the MR job on the YARN cluster). Also, what had to be loaded on the remote side was generally heavier-weight than the client-side libraries that needed to run in the `index_hadoop` tasks.

If the Spark connector _does_ need to do either (1) or (2), then I think we could make it more workable by making the integration more arm's-length. For example, with the Hadoop integration, (2) could have been structured such that the mapper work (mostly parsing and identifying the shuffle key) and the segment generation work are done by shell-outs to a Druid CLI tool, rather than by actually loading the Druid libraries. With that structure, integrating as a Unix tool rather than as a Java library, the Druid code would run in a separate JVM, presumably on a newer Java version.

I'm not familiar with how the Spark connector works today, so I'm not sure if these thoughts are relevant. They are just things that came to mind from the experience with Hadoop.
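
To make the shell-out idea above concrete, here is a minimal Java sketch of what "integrating as a Unix tool rather than a Java library" could look like from the calling side. This is not Druid code: the class name `ShellOutSketch` and the method `runTool` are made up for illustration, and the comment does not name an actual Druid CLI command for this purpose, so `cat` stands in for the external tool (which assumes a Unix-like system). The point is only that the caller needs nothing beyond the JDK on its classpath, and that the tool runs in its own process with its own JVM and classloading.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch: hand rows to an external tool over stdin and read its
// output from stdout, instead of loading the tool's libraries in-process.
class ShellOutSketch {

    // Runs `command` as a separate process, writes `input` to its stdin,
    // and returns everything it wrote to stdout.
    static String runTool(List<String> command, String input) {
        try {
            ProcessBuilder builder = new ProcessBuilder(command);
            // Let the tool's stderr go straight to ours so failures are visible.
            builder.redirectError(ProcessBuilder.Redirect.INHERIT);
            Process process = builder.start();

            // Write stdin on a separate thread so a large input cannot
            // deadlock against a full stdout pipe.
            byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
            Thread writer = new Thread(() -> {
                try (OutputStream stdin = process.getOutputStream()) {
                    stdin.write(bytes);
                } catch (IOException e) {
                    // The tool closed stdin early; its exit status reports the failure.
                }
            });
            writer.start();

            String output = new String(
                process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            writer.join();

            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IllegalStateException("tool exited with status " + exitCode);
            }
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tool", e);
        }
    }

    public static void main(String[] args) {
        // `cat` is a stand-in; a real integration would invoke a Druid CLI
        // command here, which is an assumption and not an existing interface.
        System.out.print(runTool(List.of("cat"), "2020-01-01,foo,1\n"));
    }
}
```

The trade-off in a design like this is process startup and serialization cost on every call, in exchange for the remote side never having to reconcile Druid's dependencies with whatever the cluster already has loaded.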
