gianm commented on issue #9780:
URL: https://github.com/apache/druid/issues/9780#issuecomment-4159614606

   Additional thoughts: with Hadoop ingestion, two things caused problems:
   
   1) Druid loading Hadoop libraries, which happened in `index_hadoop` tasks.
   2) Hadoop loading Druid libraries, which happened on the YARN cluster, since 
we had implemented Mapper and Reducer using Druid libraries.
   
   The second one generally caused more problems, because it was really hard 
to control the classloading situation on the remote side (the MR job on the 
YARN cluster). Also, the libraries loaded on the remote side were generally 
heavier-weight than the client-side libraries that needed to run in the 
`index_hadoop` tasks.
   
   If the Spark connector _does_ need to do either (1) or (2), then I think we 
could make it more workable by making the integration more arm's-length. For 
example, with the Hadoop integration, (2) could have been structured such that 
the mapper work (mostly parsing and identifying the shuffle key) and the 
segment generation work are done by shell-outs to a Druid CLI tool, rather than 
by actually loading the Druid libraries. With that structure, Druid integrates 
as a Unix tool rather than as a Java library, so it would run in a different 
JVM, presumably a newer version.
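
   A minimal sketch of the shell-out idea, assuming a line-oriented tool that 
reads one record per line on stdin and writes one result per line on stdout. 
No such Druid CLI command is named in this thread, so `cat` is used below 
purely as a stand-in, and the class and method names are hypothetical. The 
point it illustrates is that the caller only starts a process and exchanges 
text with it, so none of the tool's classes are loaded into the caller's JVM.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ShellOutSketch {
    // Runs an external tool in its own process, feeding it one record per line
    // on stdin and collecting one result per line from stdout.
    public static List<String> runTool(List<String> command, List<String> inputLines)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Let the tool's stderr pass through to the caller's stderr.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        // Write on a separate thread so a full pipe buffer cannot deadlock
        // the caller while it is reading the tool's output.
        Thread writer = new Thread(() -> {
            try (BufferedWriter out = new BufferedWriter(
                    new OutputStreamWriter(process.getOutputStream(), StandardCharsets.UTF_8))) {
                for (String line : inputLines) {
                    out.write(line);
                    out.newLine();
                }
            } catch (IOException e) {
                // The tool may exit early; the exit-code check below reports that.
            }
        });
        writer.start();

        List<String> results = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                results.add(line);
            }
        }
        writer.join();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Tool exited with code " + exitCode);
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        // "cat" is a stand-in for a hypothetical Druid CLI command; it echoes its input.
        List<String> out = runTool(List.of("cat"), List.of("row-1", "row-2"));
        out.forEach(System.out::println);
    }
}
```

   In a real mapper the command would be whatever tool is installed on the 
worker nodes, and the line format would be an agreed contract between the two 
sides. The trade-off is process startup and serialization cost in exchange for 
fully separate classpaths.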
   
   I'm not familiar with how the Spark connector works today, so I'm not sure 
if these thoughts are relevant. They are just things that came to mind from the 
experience with Hadoop.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

