Hi -- notice the additional "y" in "executory-memory" (as Mich mentioned):

pyspark --conf queue=default --conf executory-memory=24G
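
For reference, a rough sketch of the same invocation with the typo fixed and the settings passed as options pyspark/spark-submit actually recognise (assuming the target is a YARN queue named "default"):

pyspark --master yarn --queue default --executor-memory 24G

or, equivalently, as Spark config properties:

pyspark --master yarn --conf spark.yarn.queue=default --conf spark.executor.memory=24g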

On Thu, May 20, 2021 at 12:02 PM Clay McDonald <
stuart.mcdon...@bateswhite.com> wrote:

> How so?
>
>
>
> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
> *Sent:* Wednesday, May 19, 2021 5:45 PM
> *To:* Clay McDonald <stuart.mcdon...@bateswhite.com>
> *Cc:* user@spark.apache.org
> *Subject:* Re: PySpark Write File Container exited with a non-zero exit
> code 143
>
>
>
>
>
>
>
>
> Hi Clay,
>
>
>
> Those parameters you are passing are not valid:
>
>
>
> pyspark --conf queue=default --conf executory-memory=24G
>
>
>
> Python 3.7.3 (default, Apr  3 2021, 20:42:31)
>
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
>
> Type "help", "copyright", "credits" or "license" for more information.
>
> Warning: Ignoring non-Spark config property: executory-memory
>
> Warning: Ignoring non-Spark config property: queue
>
> 2021-05-19 22:28:20,521 WARN util.NativeCodeLoader: Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
>
> Setting default log level to "WARN".
>
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
> setLogLevel(newLevel).
>
> Welcome to
>
>       ____              __
>
>      / __/__  ___ _____/ /__
>
>     _\ \/ _ \/ _ `/ __/  '_/
>
>    /__ / .__/\_,_/_/ /_/\_\   version 3.1.1
>
>       /_/
>
>
>
> Using Python version 3.7.3 (default, Apr  3 2021 20:42:31)
>
> Spark context Web UI available at http://rhes75:4040
>
> Spark context available as 'sc' (master = local[*], app id =
> local-1621459701490).
>
> SparkSession available as 'spark'.
>
>
>
> Also
>
>
>
> pyspark dynamic_ARRAY_generator_parquet.py
>
>
>
> Running python applications through 'pyspark' is not supported as of Spark
> 2.0.
>
> Use ./bin/spark-submit <python file>
>
>
>
>
>
> This works
>
>
>
> $SPARK_HOME/bin/spark-submit --master local[4]
> dynamic_ARRAY_generator_parquet.py
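>
> To target the cluster with the queue and executor memory you were trying to
> set, a sketch along the same lines (adjust the master, deploy mode and
> values to your environment):
>
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client \
>   --queue default --executor-memory 24G \
>   dynamic_ARRAY_generator_parquet.py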
>
>
>
>
>
> See
>
>
>
>  https://spark.apache.org/docs/latest/submitting-applications.html
>
>
>
> HTH
>
>
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Wed, 19 May 2021 at 20:10, Clay McDonald <
> stuart.mcdon...@bateswhite.com> wrote:
>
> Hello all,
>
>
>
> I’m hoping someone can give me some direction for troubleshooting this
> issue. I’m trying to write from Spark on a Hortonworks (Cloudera) HDP
> cluster. I ssh directly to the first datanode and run PySpark with the
> following command; however, it always fails no matter how much memory I
> allocate to the YARN containers and YARN queues. Any suggestions?
>
>
>
>
>
>
>
> pyspark --conf queue=default --conf executory-memory=24G
>
>
>
> --
>
>
>
> HDFS_RAW="/HDFS/Data/Test/Original/MyData_data/"
> #HDFS_OUT="/ HDFS/Data/Test/Processed/Convert_parquet/Output"
> HDFS_OUT="/tmp"
> ENCODING="utf-16"
>
> fileList1=[
>     'Test _2003.txt'
> ]
>
> from pyspark.sql.functions import regexp_replace, col
>
> for f in fileList1:
>     fname = f
>     fname_noext = fname.split('.')[0]
>     df = spark.read.option("delimiter", "|") \
>         .option("encoding", ENCODING) \
>         .option("multiLine", True) \
>         .option('wholeFile', "true") \
>         .csv('{}/{}'.format(HDFS_RAW, fname), header=True)
>     lastcol = df.columns[-1]
>     print('showing {}'.format(fname))
>     if '\r' in lastcol:
>         lastcol = lastcol.replace('\r', '')
>         df = df.withColumn(lastcol,
>                 regexp_replace(col("{}\r".format(lastcol)), "[\r]", "")
>             ).drop('{}\r'.format(lastcol))
>     df.write.format('parquet').mode('overwrite').save("{}/{}".format(HDFS_OUT, fname_noext))
>
>
>
>
>
>
>
> Caused by: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task
> 0.3 in stage 1.0 (TID 4, DataNode01.mydomain.com, executor 5):
> ExecutorLostFailure (executor 5 exited caused by one of the running tasks)
> Reason: Container marked as failed:
> container_e331_1621375512548_0021_01_000006 on host:
> DataNode01.mydomain.com. Exit status: 143. Diagnostics: [2021-05-19
> 18:09:06.392]Container killed on request. Exit code is 143
> [2021-05-19 18:09:06.413]Container exited with a non-zero exit code 143.
> [2021-05-19 18:09:06.414]Killed by external signal
>
>
>
>
>
> THANKS! CLAY
>
>
>
>

-- 
Best Regards,
Ayan Guha
