Hi,
Not sure if this is your case, but if the source data is heavy and deeply
nested, I'd recommend explicitly providing the schema when reading the JSON,
so Spark doesn't have to scan the whole dataset just to infer it.

val df = spark.read.schema(schema).json(updated_dataset)
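
For example, here's a minimal sketch in Scala (the field names in option 1,
the 0.001 sample fraction, and Spark 2.3+ for Dataset.sample are my
assumptions, not taken from your data):

import org.apache.spark.sql.types._

// Option 1: declare the schema by hand (field names here are hypothetical).
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("payload", StructType(Seq(
    StructField("name", StringType),
    StructField("value", DoubleType)
  )))
))

// Option 2: infer the schema once from a small sample and reuse it, so the
// full dataset is parsed in a single pass instead of being scanned twice.
val sampled = spark.read.json(updated_dataset.sample(0.001))
val df = spark.read.schema(sampled.schema).json(updated_dataset)

Option 2 still pays one inference pass over the sample, but that is far
cheaper than inferring over the full dataset.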


On Thu, 21 Jan 2021 at 04:15, srinivasarao daruna <sree.srin...@gmail.com>
wrote:

> Hi,
> I am running a Spark job on a huge dataset. I have allocated 10
> R5.16xlarge machines (each has 64 cores and 512 GB of memory).
>
> The source data is JSON and I need to do some JSON transformations, so I
> read the files as text and then convert them to a DataFrame.
>
> val ds = spark.read.textFile(...)
> val updated_dataset = ds.withColumn(...).as[String]  // applying my transformations
> val df = spark.read.json(updated_dataset)
>
> df.write.save()
>
> Some notes:
> The source data is heavy and deeply nested; printSchema shows a lot of
> nested structs.
>
> In the Spark UI, the JSON stage runs first; after it completes, no further
> jobs appear and the application just hangs.
>
> All executors were dead and only the driver was active.
>
> Thank You,
> Regards,
> Srini
>
