Hi,

I am using Spark 3.4.1, running on YARN. Hadoop runs on a single node in pseudo-distributed mode.

spark-submit --master yarn --deploy-mode cluster --py-files /tmp/app-submodules.zip app.py

The YARN application ran successfully, but there is a warning in the log:

/opt/hadoop-tmp-dir/nm-local-dir/usercache/bigdata/appcache/application_1691548913900_0002/container_1691548913900_0002_01_000001/pyspark.zip/pyspark/context.py:350: RuntimeWarning: Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

If I use an HDFS path instead:

spark-submit --master yarn --deploy-mode cluster --py-files hdfs://hadoop-namenode:9000/tmp/app-submodules.zip app.py

the warning message looks like this:

/opt/hadoop-tmp-dir/nm-local-dir/usercache/bigdata/appcache/application_1691548913900_0002/container_1691548913900_0002_01_000001/pyspark.zip/pyspark/context.py:350: RuntimeWarning: Failed to add file [hdfs://hadoop-namenode:9000/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

The relevant code in context.py:

filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
if not os.path.exists(filepath):
    shutil.copyfile(path, filepath)

It looks like the path of the submitted Python file still carries its 'file:' or 'hdfs:' URI scheme, and shutil.copyfile treats the scheme as part of the file name, so the copy fails.
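To illustrate (a minimal standalone sketch, using a hypothetical temp file rather than the real app-submodules.zip): shutil.copyfile only understands filesystem paths, so a 'file://' prefix makes it look for a literal directory named 'file:':

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "app-submodules.zip")
    with open(src, "wb") as f:
        f.write(b"dummy zip content")

    # A plain filesystem path copies fine.
    shutil.copyfile(src, os.path.join(tmp, "copy1.zip"))

    # The same path with a URI scheme fails: shutil.copyfile does not
    # strip 'file://', it treats the whole string as a file name.
    try:
        shutil.copyfile("file://" + src, os.path.join(tmp, "copy2.zip"))
        uri_copy_failed = False
    except FileNotFoundError:
        uri_copy_failed = True

print(uri_copy_failed)
```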

I searched but didn't find any useful information. Is this a bug, or did I do something wrong?



