Hi Andrew, Any chance you might give Databricks a try in GCP?
The above transformations look complicated to me. Why are you adding
dataframes to a list?

Regards,
Gourav Sengupta

On Sun, Dec 26, 2021 at 7:00 PM Andrew Davidson <aedav...@ucsc.edu.invalid> wrote:

> Hi
>
> I am having trouble debugging my driver. It runs correctly on smaller
> data sets but fails on large ones. It is very hard to figure out what
> the bug is. I suspect it may have something to do with the way Spark is
> installed and configured. I am using Google Cloud Platform Dataproc
> PySpark.
>
> The log messages are not helpful. The error message will be something
> like "User application exited with status 1"
>
> And
>
> jsonPayload: {
>   class: "server.TThreadPoolServer"
>   filename: "hive-server2.log"
>   message: "Error occurred during processing of message."
>   thread: "HiveServer2-Handler-Pool: Thread-40"
> }
>
> I am able to access the Spark history server; however, it does not
> capture anything if the driver crashes. I am unable to figure out how
> to access the Spark web UI.
>
> My driver program looks something like the pseudocode below: a long
> list of transformations with a single action (i.e. write) at the end.
> Adding log messages is not helpful because of lazy evaluation. I am
> tempted to add something like
>
> logger.warn( "DEBUG df.count():{}".format( df.count() ) )
>
> to try and inline some sort of diagnostic message.
>
> What do you think?
>
> Is there a better way to debug this?
>
> Kind regards
>
> Andy
>
> def run():
>     listOfDF = []
>     for filePath in listOfFiles:
>         df = spark.read.load( filePath, ... )
>         listOfDF.append( df )
>
>     list2OfDF = []
>     for df in listOfDF:
>         df2 = df.select( ... )
>         list2OfDF.append( df2 )
>
>     # will setting the list to None free cache?
>     # or just driver memory?
>     listOfDF = None
>
>     df3 = list2OfDF[0]
>
>     for i in range( 1, len(list2OfDF) ):
>         df = list2OfDF[i]
>         df3 = df3.join( df, ... )
>     # will setting the list to None free cache?
>     # or just driver memory?
>     list2OfDF = None
>
>     # lots of narrow transformations on df3
>
>     return df3
>
> def main():
>     df = run()
>     df.write()
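The manual index loop that folds list2OfDF into df3 one join at a time is the part that usually reads more simply as a fold. Here is a minimal pure-Python sketch of that pattern using functools.reduce; the dicts, the `join_on_key` helper, and the column name "key" in the Spark comment are all hypothetical stand-ins, not anything from the original code.

```python
from functools import reduce

# Toy stand-in for folding a list of DataFrames into one with repeated joins.
# The Spark equivalent (assuming every frame shares a join column, here
# hypothetically named "key") would be something like:
#   df3 = reduce(lambda left, right: left.join(right, on="key"), list2OfDF)

def join_on_key(left, right):
    """Merge two dicts, keeping only keys present in both -- a toy
    analogue of an inner join; values are concatenated lists."""
    return {k: left[k] + right[k] for k in left.keys() & right.keys()}

frames = [
    {"a": [1], "b": [2]},
    {"a": [10], "b": [20]},
    {"a": [100], "c": [300]},
]

merged = reduce(join_on_key, frames)
print(merged)  # only "a" survives all three joins: {'a': [1, 10, 100]}
```

The same reduce call works unchanged whether the list holds two frames or two hundred, which removes the off-by-one risk of the `range(1, len(...))` loop.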