Hello all,

I'm hoping someone can give me some direction for troubleshooting this issue.
I'm trying to write Parquet from Spark on a Hortonworks (Cloudera) HDP cluster.
I SSH directly to the first datanode and run PySpark with the following
command, but the job always fails no matter how much memory I give the YARN
containers and YARN queues. Any suggestions?



pyspark --conf spark.yarn.queue=default --conf spark.executor.memory=24g
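
For reference, a quick way to double-check what the running shell actually
picked up (a minimal sketch; spark here is the SparkSession the PySpark shell
creates) is:

# Print the executor memory and YARN queue the current session is really using.
# spark.executor.memory defaults to 1g when it has not been set explicitly.
for key in ("spark.executor.memory", "spark.yarn.queue"):
    print(key, spark.sparkContext.getConf().get(key, "<not set>"))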

--

HDFS_RAW = "/HDFS/Data/Test/Original/MyData_data/"
#HDFS_OUT = "/HDFS/Data/Test/Processed/Convert_parquet/Output"
HDFS_OUT = "/tmp"
ENCODING = "utf-16"

fileList1 = [
    'Test _2003.txt'
]

from pyspark.sql.functions import regexp_replace, col

for fname in fileList1:
    fname_noext = fname.split('.')[0]
    # Read the pipe-delimited UTF-16 file; multiLine keeps embedded newlines inside quoted fields
    df = (spark.read
          .option("delimiter", "|")
          .option("encoding", ENCODING)
          .option("multiLine", True)
          .option("wholeFile", "true")
          .csv('{}/{}'.format(HDFS_RAW, fname), header=True))
    print('showing {}'.format(fname))
    # The last header sometimes keeps a trailing carriage return; strip it from
    # the column name and from the column values
    lastcol = df.columns[-1]
    if '\r' in lastcol:
        lastcol = lastcol.replace('\r', '')
        df = df.withColumn(lastcol,
                           regexp_replace(col("{}\r".format(lastcol)), "[\r]", "")
                           ).drop('{}\r'.format(lastcol))
    # One Parquet output directory per input file
    df.write.format('parquet').mode('overwrite').save("{}/{}".format(HDFS_OUT, fname_noext))
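
The '\r' handling above is only there because the last header sometimes comes
back with a trailing carriage return; a minimal standalone illustration of that
clean-up (the column names here are made up) is:

from pyspark.sql.functions import regexp_replace, col

# Toy frame whose last column name ends in a carriage return, like the CSV read produces
toy = spark.createDataFrame([("1", "a\r"), ("2", "b\r")], ["id", "val\r"])
# Rebuild the column under the clean name, strip \r from its values, drop the old column
toy = toy.withColumn("val", regexp_replace(col("val\r"), "[\r]", "")).drop("val\r")
toy.show()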

The job fails with:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
1.0 (TID 4, DataNode01.mydomain.com, executor 5): ExecutorLostFailure (executor 
5 exited caused by one of the running tasks) Reason: Container marked as 
failed: container_e331_1621375512548_0021_01_000006 on host: 
DataNode01.mydomain.com. Exit status: 143. Diagnostics: [2021-05-19 
18:09:06.392]Container killed on request. Exit code is 143
[2021-05-19 18:09:06.413]Container exited with a non-zero exit code 143.
[2021-05-19 18:09:06.414]Killed by external signal


THANKS! CLAY
