Still get the same error with “pyspark --conf queue=default --conf executor-memory=24G”
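For reference, a sketch of how these settings are usually passed on a YARN cluster (a hedged example, not taken from the thread; it assumes a recent Spark release and simply carries over the queue name "default" and the 24G value from the command above):

pyspark --master yarn --queue default --executor-memory 24G
# or, equivalently, using fully qualified property names via --conf:
pyspark --master yarn --conf spark.yarn.queue=default --conf spark.executor.memory=24g

Without the spark.* prefix, --conf key=value pairs such as queue=default are ignored as non-Spark config properties, which is consistent with the warnings quoted further down in the thread.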
From: ayan guha <guha.a...@gmail.com>
Sent: Thursday, May 20, 2021 12:23 AM
To: Clay McDonald <stuart.mcdon...@bateswhite.com>
Cc: Mich Talebzadeh <mich.talebza...@gmail.com>; user@spark.apache.org
Subject: Re: PySpark Write File Container exited with a non-zero exit code 143

*** EXTERNAL EMAIL ***

Hi -- Notice the additional "y" (as Mich mentioned):

pyspark --conf queue=default --conf executory-memory=24G

On Thu, May 20, 2021 at 12:02 PM Clay McDonald <stuart.mcdon...@bateswhite.com> wrote:

How so?

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: Wednesday, May 19, 2021 5:45 PM
To: Clay McDonald <stuart.mcdon...@bateswhite.com>
Cc: user@spark.apache.org
Subject: Re: PySpark Write File Container exited with a non-zero exit code 143

*** EXTERNAL EMAIL ***

Hi Clay,

Those parameters you are passing are not valid:

pyspark --conf queue=default --conf executory-memory=24G
Python 3.7.3 (default, Apr 3 2021, 20:42:31)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Warning: Ignoring non-Spark config property: executory-memory
Warning: Ignoring non-Spark config property: queue
2021-05-19 22:28:20,521 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Python version 3.7.3 (default, Apr 3 2021 20:42:31)
Spark context Web UI available at http://rhes75:4040
Spark context available as 'sc' (master = local[*], app id = local-1621459701490).
SparkSession available as 'spark'.

Also:

pyspark dynamic_ARRAY_generator_parquet.py
Running python applications through 'pyspark' is not supported as of Spark 2.0.
Use ./bin/spark-submit <python file>

This works:

$SPARK_HOME/bin/spark-submit --master local[4] dynamic_ARRAY_generator_parquet.py

See https://spark.apache.org/docs/latest/submitting-applications.html

HTH

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Wed, 19 May 2021 at 20:10, Clay McDonald <stuart.mcdon...@bateswhite.com> wrote:

Hello all,

I'm hoping someone can give me some direction for troubleshooting this issue. I'm trying to write from Spark on a HortonWorks (Cloudera) HDP cluster. I ssh directly to the first datanode and run PySpark with the following command; however, it always fails no matter how much memory I set in the YARN containers and YARN queues. Any suggestions?
pyspark --conf queue=default --conf executory-memory=24G

--

HDFS_RAW="/HDFS/Data/Test/Original/MyData_data/"
#HDFS_OUT="/ HDFS/Data/Test/Processed/Convert_parquet/Output"
HDFS_OUT="/tmp"
ENCODING="utf-16"
fileList1=[ 'Test _2003.txt' ]

from pyspark.sql.functions import regexp_replace, col

for f in fileList1:
    fname = f
    fname_noext = fname.split('.')[0]
    df = spark.read.option("delimiter", "|") \
        .option("encoding", ENCODING) \
        .option("multiLine", True) \
        .option("wholeFile", "true") \
        .csv('{}/{}'.format(HDFS_RAW, fname), header=True)
    lastcol = df.columns[-1]
    print('showing {}'.format(fname))
    if '\r' in lastcol:
        lastcol = lastcol.replace('\r', '')
        df = df.withColumn(lastcol, regexp_replace(col("{}\r".format(lastcol)), "[\r]", "")) \
            .drop('{}\r'.format(lastcol))
    df.write.format('parquet').mode('overwrite').save("{}/{}".format(HDFS_OUT, fname_noext))

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, DataNode01.mydomain.com, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_e331_1621375512548_0021_01_000006 on host: DataNode01.mydomain.com. Exit status: 143. Diagnostics:
[2021-05-19 18:09:06.392]Container killed on request. Exit code is 143
[2021-05-19 18:09:06.413]Container exited with a non-zero exit code 143.
[2021-05-19 18:09:06.414]Killed by external signal

THANKS!

CLAY

--
Best Regards,
Ayan Guha
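For what it's worth, exit code 143 means the container received SIGTERM (128 + 15); when YARN itself reports "Container killed on request", one common cause is the container exceeding its memory allocation. Below is a hedged sketch of a submit command that combines the points raised in the thread; it is an assumption, not the poster's actual fix.

# A sketch, assuming YARN, Spark 2.3+ property names, and a queue named "default":
#   --queue / --executor-memory are the built-in spark-submit option names;
#   spark.executor.memoryOverhead adds off-heap headroom (in MiB) so YARN is
#   less likely to kill the container with SIGTERM (exit 143);
#   convert_parquet.py is a hypothetical placeholder for the actual script.
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --queue default \
  --executor-memory 24G \
  --conf spark.executor.memoryOverhead=4096 \
  convert_parquet.py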