zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-843283503
> I was aware of this, but it seems that downloading dependencies several times is the way of `spark.archives`. It is clear that this is not optimal.

If I read the Spark code correctly, Spark only downloads `spark.archives` on the driver side and then distributes them to the executors via its internal driver-to-executor mechanism.

> By local, do you mean the local file system of the Zeppelin server? In my environment, the Zeppelin user does not have access to the local file system of the Zeppelin server. Therefore, I would prefer a remote endpoint that is under the control of the Zeppelin user.
> I understand your development approach and it sounds great, but I think this is not suitable for a production environment.

Let me clarify: it is not only the local file system, it could be any Hadoop-compatible file system, such as HDFS or S3.

> Maybe we can support the download in `JupyterKernelInterpreter.java` with an additional property. Then it should not matter whether the files were provided by YARN or the download.

Actually, for Spark's YARN mode, the YARN cache mechanism would still be used to distribute archives [1]. Of course, for k8s mode we could use another property to download the archive in `JupyterKernelInterpreter.java` for the pure Python interpreter (a rough sketch follows the links below), but for PySpark it is not necessary, because it is already done by SparkSubmit [2].

* [1] https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1663
* [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L391
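For illustration only, here is a minimal sketch of what such a download hook in `JupyterKernelInterpreter.java` could look like if we went with an extra property. It uses the Hadoop `FileSystem` API so that HDFS, S3 (via `s3a://`), and the local file system are all handled uniformly. The class name, property name, and method here are hypothetical assumptions for the sketch, not the actual Zeppelin API:

```java
import java.io.File;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical helper: if a property such as "zeppelin.jupyter.archive"
 * (name made up for this sketch) is set, fetch the archive from any
 * Hadoop-compatible file system before starting the kernel.
 */
public class ArchiveDownloadSketch {

  /** Copies the archive at archiveUri into localDir and returns the local file. */
  public static File downloadArchive(String archiveUri, File localDir) throws Exception {
    URI uri = URI.create(archiveUri);
    // FileSystem.get resolves the URI scheme (hdfs://, s3a://, file://, ...)
    // to the matching connector on the classpath.
    FileSystem fs = FileSystem.get(uri, new Configuration());

    Path src = new Path(uri);
    Path dst = new Path(new File(localDir, src.getName()).getAbsolutePath());
    // Download the remote archive to the local file system.
    fs.copyToLocalFile(src, dst);
    return new File(dst.toString());
  }

  public static void main(String[] args) throws Exception {
    // Example usage with a made-up S3 location.
    File env = downloadArchive("s3a://my-bucket/conda-env.tar.gz", new File("/tmp"));
    System.out.println("Downloaded to " + env.getAbsolutePath());
  }
}
```

This is essentially what SparkSubmit already does for `spark.archives` [2], which is why the extra property would only matter for the non-Spark (pure Python) interpreter path.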