Hello all, I'm hoping someone can give me some direction for troubleshooting this issue. I'm trying to write data from Spark on a Hortonworks (Cloudera) HDP cluster. I ssh directly to the first datanode and run PySpark with the command and script below; however, the job always fails no matter how much memory I give the YARN containers and YARN queues. Any suggestions?
pyspark --conf queue=default --conf executory-memory=24G

HDFS_RAW = "/HDFS/Data/Test/Original/MyData_data/"
#HDFS_OUT = "/HDFS/Data/Test/Processed/Convert_parquet/Output"
HDFS_OUT = "/tmp"
ENCODING = "utf-16"
fileList1 = ['Test _2003.txt']

from pyspark.sql.functions import regexp_replace, col

for f in fileList1:
    fname = f
    fname_noext = fname.split('.')[0]
    df = spark.read.option("delimiter", "|") \
        .option("encoding", ENCODING) \
        .option("multiLine", True) \
        .option("wholeFile", "true") \
        .csv('{}/{}'.format(HDFS_RAW, fname), header=True)
    lastcol = df.columns[-1]
    print('showing {}'.format(fname))
    if '\r' in lastcol:
        # strip the trailing carriage return from the last column's name and values
        lastcol = lastcol.replace('\r', '')
        df = df.withColumn(lastcol, regexp_replace(col("{}\r".format(lastcol)), "[\r]", "")) \
               .drop('{}\r'.format(lastcol))
    df.write.format('parquet').mode('overwrite').save("{}/{}".format(HDFS_OUT, fname_noext))

This is the error I get every time:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, DataNode01.mydomain.com, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_e331_1621375512548_0021_01_000006 on host: DataNode01.mydomain.com. Exit status: 143. Diagnostics: [2021-05-19 18:09:06.392]Container killed on request. Exit code is 143
[2021-05-19 18:09:06.413]Container exited with a non-zero exit code 143.
[2021-05-19 18:09:06.414]Killed by external signal

Thanks!
Clay
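P.S. I'm also not 100% sure the --conf syntax above is even being picked up. As I understand the Spark docs, the queue and executor memory are normally passed like this (the flags and property names below are just my reading of the documentation, not what I actually ran; 24G is the same value as above):

pyspark --queue default --executor-memory 24G
# or equivalently, via --conf properties
pyspark --conf spark.yarn.queue=default --conf spark.executor.memory=24G

If my memory setting never actually reaches the executors, that might explain why changing the YARN container sizes makes no difference.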