Re: reading each JSON file from dataframe...

2022-07-13 Thread Gourav Sengupta
Hi, I think that this is a pure example of over-engineering. Ayan's advice is the best. Please use the Spark SQL function input_file_name() to join the tables. People do not think in terms of RDDs anymore unless absolutely required. Also, if you have different JSON schemas, just use the SPAR
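A minimal sketch of the input_file_name() join suggested above, not a definitive implementation. The object and method names are hypothetical. It assumes `df` has a string column `file_path` whose values are spelled exactly as input_file_name() reports them (a full URI such as gs://... or file:///...), and that the files are in the JSON Lines layout spark.read.json expects by default.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name

object JoinJsonByFileName {
  // Hypothetical helper: `df` is assumed to hold a string column `file_path`.
  def joinWithJson(spark: SparkSession, df: DataFrame): DataFrame = {
    import spark.implicits._
    // Bring the distinct paths to the driver so they can be passed to the reader.
    val paths = df.select("file_path").as[String].distinct().collect().toSeq
    // Tag every parsed JSON record with the file it was read from.
    val json = spark.read.json(paths: _*).withColumn("file_path", input_file_name())
    // Assumption: df("file_path") matches the URI form returned by input_file_name().
    df.join(json, Seq("file_path"))
  }
}
```

The join keeps the other identifier columns next to the parsed JSON fields. If the JSON records carry fields with the same names as columns in `df`, they would need renaming first.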

Re: reading each JSON file from dataframe...

2022-07-12 Thread Muthu Jayakumar
Hello Ayan, Thank you for the suggestion. But, I would lose correlation of the JSON file with the other identifier fields. Also, if there are too many files, will it be an issue? Plus, I may not have the same schema across all the files. Hello Enrico, >how does RDD's mapPartitions make a differe

Re: reading each JSON file from dataframe...

2022-07-12 Thread ayan guha
Another option is:
1. Collect the dataframe with the file paths.
2. Create a list of paths.
3. Create a new dataframe with spark.read.json and pass it the list of paths.
This will save you lots of headaches.
Ayan
On Wed, Jul 13, 2022 at 7:35 AM Enrico Minack wrote: > Hi, > > how does RDD's mapPartitions m
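The three steps Ayan lists could be sketched roughly as follows. This is an illustration under assumptions, not the poster's code: the object and method names are hypothetical, and `df` is assumed to have a string column named `file_path`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadJsonFromPathList {
  // Hypothetical helper: `df` is assumed to hold a string column `file_path`.
  def readAll(spark: SparkSession, df: DataFrame): DataFrame = {
    import spark.implicits._
    // Steps 1 and 2: collect the paths to the driver as a plain list.
    val paths: Seq[String] = df.select("file_path").as[String].distinct().collect().toSeq
    // Step 3: read all files in one pass; Spark infers one schema across them.
    spark.read.json(paths: _*)
  }
}
```

On its own, the result no longer carries entity_id or other_useful_id, so the correlation with the original rows has to be restored separately.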

Re: reading each JSON file from dataframe...

2022-07-12 Thread Enrico Minack
Hi, how does RDD's mapPartitions make a difference regarding 1. and 2. compared to Dataset's mapPartitions / map function? Enrico On 12.07.22 at 22:13, Muthu Jayakumar wrote: Hello Enrico, Thanks for the reply. I found that I would have to use `mapPartitions` API of RDD to perform this s

Re: reading each JSON file from dataframe...

2022-07-12 Thread Muthu Jayakumar
Hello Enrico, Thanks for the reply. I found that I would have to use `mapPartitions` API of RDD to perform this safely as I have to
1. Read each file from GCS using HDFS FileSystem API.
2. Parse each JSON record in a safe manner.
For (1) to work, I do have to broadcast HadoopConfiguration from sp

Re: reading each JSON file from dataframe...

2022-07-11 Thread Enrico Minack
All you need to do is implement a method readJson that reads a single file given its path. Then, you map the values of column file_path to the respective JSON content as a string. This can be done via a UDF or simply Dataset.map: case class RowWithJsonUri(entity_id: String, file_path: String,
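A rough sketch of the readJson plus Dataset.map idea, under stated assumptions. The case classes `FileRow` and `FileRowWithJson` and the object name are hypothetical and are not the truncated `RowWithJsonUri` definition above. The field types are guesses, and the reader below only handles local paths as a stand-in for whatever storage API is really needed.

```scala
import scala.io.Source
import scala.util.Try
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row types; the field types are assumptions.
case class FileRow(entity_id: String, file_path: String, other_useful_id: String)
case class FileRowWithJson(entity_id: String, file_path: String,
                           other_useful_id: String, json: Option[String])

object ReadJsonPerRow {
  // Stand-in reader for local paths; returns None if the file cannot be read.
  def readJson(path: String): Option[String] = Try {
    val src = Source.fromFile(path, "UTF-8")
    try src.mkString finally src.close()
  }.toOption

  // Map each row to the same row plus the raw content of its file.
  def withJson(spark: SparkSession, ds: Dataset[FileRow]): Dataset[FileRowWithJson] = {
    import spark.implicits._
    ds.map(r => FileRowWithJson(r.entity_id, r.file_path, r.other_useful_id,
                                readJson(r.file_path)))
  }
}
```

Keeping the content as an Option[String] means unreadable files yield None rather than failing the job, and each row keeps its identifiers next to the JSON text.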

reading each JSON file from dataframe...

2022-07-10 Thread Muthu Jayakumar
Hello there,
I have a dataframe with the following...
+-+---+---+
|entity_id|file_path |other_useful_id|
+