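Hi,

Notice the additional "y" in "executory-memory" (as Mich mentioned); it should read "executor". The warnings in Mich's output show that Spark ignored both settings:

pyspark --conf queue=default --conf executory-memory=24G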
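For what it's worth, here is a minimal sketch of why both settings were dropped. It assumes the rule implied by the "Ignoring non-Spark config property" warnings: a --conf key is only kept when it starts with "spark.". The helper name split_spark_conf is hypothetical (it is not a Spark API), and the corrected keys spark.yarn.queue and spark.executor.memory are my suggestion, not something tested on Clay's cluster.

```python
def split_spark_conf(pairs):
    """Split "key=value" strings into (accepted, ignored) using the "spark." prefix rule."""
    accepted, ignored = {}, []
    for pair in pairs:
        key, _, value = pair.partition("=")
        if key.startswith("spark."):
            accepted[key] = value      # kept as a Spark property
        else:
            ignored.append(key)        # dropped with a warning
    return accepted, ignored


# The settings from the original command, and a suggested correction (assumption).
original = ["queue=default", "executory-memory=24G"]
corrected = ["spark.yarn.queue=default", "spark.executor.memory=24G"]

for label, pairs in (("original", original), ("corrected", corrected)):
    accepted, ignored = split_spark_conf(pairs)
    for key in ignored:
        print("{}: Ignoring non-Spark config property: {}".format(label, key))
    for key, value in accepted.items():
        print("{}: --conf {}={}".format(label, key, value))
```

With the corrected keys the command would read pyspark --conf spark.yarn.queue=default --conf spark.executor.memory=24G, or equivalently pyspark --queue default --executor-memory 24G. This only makes sure the settings actually reach Spark; whether it also resolves the exit code 143 failure is a separate question.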

On Thu, May 20, 2021 at 12:02 PM Clay McDonald <stuart.mcdon...@bateswhite.com> wrote:

> How so?
>
> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
> *Sent:* Wednesday, May 19, 2021 5:45 PM
> *To:* Clay McDonald <stuart.mcdon...@bateswhite.com>
> *Cc:* user@spark.apache.org
> *Subject:* Re: PySpark Write File Container exited with a non-zero exit code 143
>
> Hi Clay,
>
> Those parameters you are passing are not valid
>
> pyspark --conf queue=default --conf executory-memory=24G
>
> Python 3.7.3 (default, Apr 3 2021, 20:42:31)
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Ignoring non-Spark config property: executory-memory
> Warning: Ignoring non-Spark config property: queue
> 2021-05-19 22:28:20,521 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.1.1
>       /_/
>
> Using Python version 3.7.3 (default, Apr 3 2021 20:42:31)
> Spark context Web UI available at http://rhes75:4040
> Spark context available as 'sc' (master = local[*], app id = local-1621459701490).
> SparkSession available as 'spark'.
>
> Also
>
> pyspark dynamic_ARRAY_generator_parquet.py
>
> Running python applications through 'pyspark' is not supported as of Spark 2.0.
> Use ./bin/spark-submit <python file>
>
> This works
>
> $SPARK_HOME/bin/spark-submit --master local[4] dynamic_ARRAY_generator_parquet.py
>
> See
>
> https://spark.apache.org/docs/latest/submitting-applications.html
>
> HTH
>
> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On Wed, 19 May 2021 at 20:10, Clay McDonald <stuart.mcdon...@bateswhite.com> wrote:
>
> Hello all,
>
> I’m hoping someone can give me some direction for troubleshooting this issue, I’m trying to write from Spark on an HortonWorks(Cloudera) HDP cluster. I ssh directly to the first datanode and run PySpark with the following command; however, it is always failing no matter what size I set memory in Yarn Containers and Yarn Queues. Any suggestions?
>
> pyspark --conf queue=default --conf executory-memory=24G
>
> --
>
> HDFS_RAW="/HDFS/Data/Test/Original/MyData_data/"
> #HDFS_OUT="/ HDFS/Data/Test/Processed/Convert_parquet/Output"
> HDFS_OUT="/tmp"
> ENCODING="utf-16"
>
> fileList1=[
>     'Test _2003.txt'
> ]
> from pyspark.sql.functions import regexp_replace,col
> for f in fileList1:
>     fname=f
>     fname_noext=fname.split('.')[0]
>     df = spark.read.option("delimiter","|").option("encoding",ENCODING).option("multiLine",True).option('wholeFile',"true").csv('{}/{}'.format(HDFS_RAW,fname), header=True)
>     lastcol=df.columns[-1]
>     print('showing {}'.format(fname))
>     if ('\r' in lastcol):
>         lastcol=lastcol.replace('\r','')
>         df=df.withColumn(lastcol, regexp_replace(col("{}\r".format(lastcol)), "[\r]", "")).drop('{}\r'.format(lastcol))
>     df.write.format('parquet').mode('overwrite').save("{}/{}".format(HDFS_OUT,fname_noext))
>
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, DataNode01.mydomain.com, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_e331_1621375512548_0021_01_000006 on host: DataNode01.mydomain.com. Exit status: 143. Diagnostics: [2021-05-19 18:09:06.392]Container killed on request. Exit code is 143
> [2021-05-19 18:09:06.413]Container exited with a non-zero exit code 143.
> [2021-05-19 18:09:06.414]Killed by external signal
>
> THANKS! CLAY

--
Best Regards,
Ayan Guha