Hi,
I think this is a pure example of over-engineering.
Ayan's advice is the best. Please use the Spark SQL function
input_file_name() to join the tables. People do not think in terms of RDDs
anymore unless absolutely required.
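A rough sketch of what that could look like (the gs:// glob and the "files" table name are placeholders for your actual input):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder.getOrCreate()

// Read all JSON files at once and keep the source path of every record,
// so it can be joined back to the dataframe carrying entity_id / other_useful_id.
val jsonDf = spark.read.json("gs://bucket/path/*.json")
  .withColumn("file_path", input_file_name())

// placeholder: the dataframe with entity_id, file_path, other_useful_id.
// Note: input_file_name() returns fully qualified URIs, so the stored paths must match that form.
val idsDf = spark.table("files")
val joined = idsDf.join(jsonDf, Seq("file_path"))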
Also if you have different JSON schemas, just use the SPAR
Hello Ayan,
Thank you for the suggestion. But I would lose the correlation of the JSON
file with the other identifier fields. Also, if there are too many files,
would that be an issue? Plus, I may not have the same schema across all the
files.
Another option is:
1. collect the dataframe with the file paths
2. create a list of paths
3. create a new dataframe with spark.read.json, passing in the list of paths
This will save you a lot of headaches.
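A minimal sketch of those three steps (the source table name is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = spark.table("files")  // placeholder: dataframe with entity_id, file_path, other_useful_id

// 1. + 2. collect the distinct file paths to the driver as a list
val paths: Seq[String] = df.select("file_path").distinct().as[String].collect().toSeq

// 3. read them all in one go; Spark infers a schema across the files
val jsonDf = spark.read.json(paths: _*)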
Ayan
On Wed, Jul 13, 2022 at 7:35 AM Enrico Minack wrote:
> Hi,
>
> how does RDD's mapPartitions make a difference regarding 1. and 2.
> compared to Dataset's mapPartitions / map function?
Hi,
how does RDD's mapPartitions make a difference regarding 1. and 2.
compared to Dataset's mapPartitions / map function?
Enrico
On 12.07.22 at 22:13, Muthu Jayakumar wrote:
Hello Enrico,
Thanks for the reply. I found that I would have to use the `mapPartitions` API
of RDD to perform this safely, as I have to:
1. Read each file from GCS using HDFS FileSystem API.
2. Parse each JSON record in a safe manner.
For (1) to work, I do have to broadcast HadoopConfiguration from
sp
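A rough sketch of that mapPartitions approach, assuming the Hadoop configuration is shipped as a plain map of its entries (Configuration itself is not serializable) and that unreadable files should become None rather than fail the job (table name and row types are placeholders):

import scala.collection.JavaConverters._
import scala.util.Try
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

case class FileRow(entity_id: String, file_path: String, other_useful_id: String)
case class RowWithJson(entity_id: String, other_useful_id: String, json: Option[String])

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// ship the driver's Hadoop configuration (incl. GCS connector settings) to the executors
val confEntries: Map[String, String] =
  spark.sparkContext.hadoopConfiguration.iterator().asScala
    .map(e => e.getKey -> e.getValue).toMap
val confBc = spark.sparkContext.broadcast(confEntries)

val withJson = spark.table("files").as[FileRow]  // placeholder source
  .mapPartitions { rows =>
    // rebuild the Configuration once per partition
    val conf = new Configuration(false)
    confBc.value.foreach { case (k, v) => conf.set(k, v) }
    rows.map { r =>
      // 1. read the file via the HDFS FileSystem API (works for gs:// with the GCS connector)
      // 2. keep failures as None instead of failing the task
      val content = Try {
        val path = new Path(r.file_path)
        val fs = path.getFileSystem(conf)
        val in = fs.open(path)
        try scala.io.Source.fromInputStream(in, "UTF-8").mkString finally in.close()
      }.toOption
      RowWithJson(r.entity_id, r.other_useful_id, content)
    }
  }

The actual JSON parsing of the returned string can then be done with whatever parser you prefer, wrapped in the same Try.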
All you need to do is implement a method readJson that reads a single
file given its path. Then, you map the values of column file_path to the
respective JSON content as a string. This can be done via a UDF or
simply Dataset.map:
case class RowWithJsonUri(entity_id: String, file_path: String, other_useful_id: String)
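Continuing from that case class, a minimal sketch (readJson here uses the Hadoop FileSystem API as a stand-in; the source table name is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

case class RowWithJson(entity_id: String, file_path: String, other_useful_id: String, json: String)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// stand-in implementation: read one file's content as a string given its path
def readJson(path: String): String = {
  val p = new Path(path)
  val fs = p.getFileSystem(new Configuration())
  val in = fs.open(p)
  try scala.io.Source.fromInputStream(in, "UTF-8").mkString finally in.close()
}

val ds = spark.table("files").as[RowWithJsonUri]  // placeholder source
val withJson = ds.map(r =>
  RowWithJson(r.entity_id, r.file_path, r.other_useful_id, readJson(r.file_path)))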
Hello there,
I have a dataframe with the following...
+---------+---------+---------------+
|entity_id|file_path|other_useful_id|
+---------+---------+---------------+