Hi, not sure if it is your case, but if the source data is heavy and deeply nested, I'd recommend explicitly providing the schema when reading the JSON. That avoids the extra full pass over the data that schema inference otherwise triggers.
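Something like this, for example (just a sketch: the fields below are placeholders, you'd mirror your actual nested structure):

import org.apache.spark.sql.types._

// Placeholder schema to illustrate the shape; replace these fields with
// your real nested structure.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("payload", StructType(Seq(
    StructField("name", StringType),
    StructField("events", ArrayType(StructType(Seq(
      StructField("ts", LongType),
      StructField("type", StringType)
    ))))
  )))
))

If hand-writing the whole schema is painful, another option is to infer it once from a small sample of the data and reuse the resulting df.schema for the full read.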
val df = spark.read.schema(schema).json(updated_dataset)

On Thu, 21 Jan 2021 at 04:15, srinivasarao daruna <sree.srin...@gmail.com> wrote:
> Hi,
> I am running a Spark job on a huge dataset. I have allocated 10
> R5.16xlarge machines (each has 64 cores and 512 GB of memory).
>
> The source data is JSON and I need to do some JSON transformations, so I
> read it as text and then convert it to a dataframe:
>
> ds = spark.read.textFile()
> updated_dataset = ds.withColumn(applying my transformations).as[String]
> df = spark.read.json(updated_dataset)
>
> df.write.save()
>
> Some notes:
> The source data is heavy and deeply nested; printSchema shows a lot of
> nested structs.
>
> In the Spark UI, the JSON stage runs first. After it completes, no further
> jobs show up in the UI and it just hangs there.
>
> All executors were dead and only the driver was active.
>
> Thank You,
> Regards,
> Srini