On 24 Mar 2016, at 15:27, Koert Kuipers <ko...@tresata.com> wrote:
i think the arguments are convincing, but it also makes me wonder if i live in some kind of alternate universe... we deploy on customers' clusters, where the OS, python version, java version and hadoop distro are not chosen by us. so think centos 6, cdh5 or hdp 2.3, java 7 and python 2.6. we simply have access to a single proxy machine and launch through yarn. asking them to upgrade java is pretty much out of the question or a 6+ month ordeal. of the 10 client clusters i can think of off the top of my head, all of them are on java 7, none are on java 8. so by doing this you would make spark 2 basically unusable for us (unless most of them have plans of upgrading in near term to java 8, i will ask around and report back...).

It's not actually mandatory for the process executing in the YARN cluster to run with the same JVM as the rest of the Hadoop stack; all that is needed is for the environment variables to set up JAVA_HOME and PATH. Switching JVMs is not something which YARN makes easy to do, but it may be possible, especially if Spark itself provides some hooks, so you don't have to manually play with setting things up. That may be something which could significantly ease adoption of Spark 2 in YARN clusters. Same for Python. This is something I could probably help others to address.
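For what it's worth, a rough sketch of what that could look like today using the documented spark.yarn.appMasterEnv.* and spark.executorEnv.* properties. The /opt/jdk1.8.0 and /opt/python2.7 paths, the main class and the jar name below are placeholders for whatever is actually installed and deployed on the cluster nodes; the PYSPARK_PYTHON settings are only relevant for PySpark jobs, and whether the node managers honour the overridden JAVA_HOME when building the container launch command can vary with the Hadoop version:

  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.appMasterEnv.JAVA_HOME=/opt/jdk1.8.0 \
    --conf spark.executorEnv.JAVA_HOME=/opt/jdk1.8.0 \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/opt/python2.7/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=/opt/python2.7/bin/python \
    --class com.example.MyApp \
    my-app.jar

If something along those lines works reliably, the hooks mentioned above would mostly be about turning it into a first-class, documented option rather than something each site has to work out for itself.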