Hi Andrew, Any chance you might give Databricks a try in GCP?
The above transformations look complicated to me. Why are you adding
dataframes to a list?

Regards,
Gourav Sengupta

On Sun, Dec 26, 2021 at 7:00 PM Andrew Davidson <aedav...@ucsc.edu.invalid> wrote:

> Hi
>
> I am having trouble debugging my driver. It runs correctly on smaller
> data sets but fails on large ones. It is very hard to figure out what
> the bug is. I suspect it may have something to do with the way Spark is
> installed and configured. I am using Google Cloud Platform Dataproc
> PySpark.
>
> The log messages are not helpful. The error message will be something
> like "User application exited with status 1"
>
> And
>
> jsonPayload: {
>   class: "server.TThreadPoolServer"
>   filename: "hive-server2.log"
>   message: "Error occurred during processing of message."
>   thread: "HiveServer2-Handler-Pool: Thread-40"
> }
>
> I am able to access the Spark history server; however, it does not
> capture anything if the driver crashes. I am unable to figure out how
> to access the Spark web UI.
>
> My driver program looks something like the pseudocode below: a long
> list of transformations with a single action (i.e. write) at the end.
> Adding log messages is not helpful because of lazy evaluation. I am
> tempted to add something like
>
> logger.warn( "DEBUG df.count():{}".format( df.count() ) )
>
> to try and inline some sort of diagnostic message.
>
> What do you think?
>
> Is there a better way to debug this?
>
> Kind regards
>
> Andy
>
> def run():
>     listOfDF = []
>     for filePath in listOfFiles:
>         df = spark.read.load( filePath, ... )
>         listOfDF.append( df )
>
>     list2OfDF = []
>     for df in listOfDF:
>         df2 = df.select( ... )
>         list2OfDF.append( df2 )
>
>     # will setting the list to None free cache?
>     # or just driver memory?
>     listOfDF = None
>
>     df3 = list2OfDF[0]
>
>     for i in range( 1, len(list2OfDF) ):
>         df = list2OfDF[i]
>         df3 = df3.join( df, ... )
>     # will setting the list to None free cache?
>     # or just driver memory?
>     list2OfDF = None
>
>     # lots of narrow transformations on df3
>
>     return df3
>
> def main():
>     df = run()
>     df.write()
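The manual index loop that folds list2OfDF into df3 one join at a time is the part that usually reads more simply as a fold. Here is a minimal pure-Python sketch of that pattern using functools.reduce; the dicts, the `join_on_key` helper, and the column name "key" in the Spark comment are all hypothetical stand-ins, not anything from the original code.

```python
from functools import reduce

# Toy stand-in for folding a list of DataFrames into one with repeated joins.
# The Spark equivalent (assuming every frame shares a join column, here
# hypothetically named "key") would be something like:
#   df3 = reduce(lambda left, right: left.join(right, on="key"), list2OfDF)

def join_on_key(left, right):
    """Merge two dicts, keeping only keys present in both -- a toy
    analogue of an inner join; values are concatenated lists."""
    return {k: left[k] + right[k] for k in left.keys() & right.keys()}

frames = [
    {"a": [1], "b": [2]},
    {"a": [10], "b": [20]},
    {"a": [100], "c": [300]},
]

merged = reduce(join_on_key, frames)
print(merged)  # only "a" survives all three joins: {'a': [1, 10, 100]}
```

The same reduce call works unchanged whether the list holds two frames or two hundred, which removes the off-by-one risk of the `range(1, len(...))` loop.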