Is there an efficient way to save an RDD with saveAsTextFile in such a way
that the data gets shuffled into separate directories according to a key?
(My end goal is to wrap the result in a multi-partitioned Hive table.)
Suppose you have:
case class MyData(val0: Int, val1: String, directory_name: String)
and an RDD called myrdd with type RDD[MyData]. Suppose that you already
have an array of the distinct directory_name values, called distinct_directories.
A very inefficient way to do this is:
distinct_directories.foreach(
  dir_name => myrdd.filter( mydata => mydata.directory_name == dir_name )
    .map( mydata => Array(mydata.val0.toString, mydata.val1).mkString(","))
    .coalesce(5)
    .saveAsTextFile("base_dir_name/" + dir_name)
)
I tried this solution, and it does not run the multiple myrdd.filter calls in
parallel; each directory is filtered and saved one after another.
I'm guessing partitionBy might be in the efficient solution if an easy
efficient solution exists...
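
To make the partitionBy idea concrete, here is a minimal sketch of one possible
approach, not a confirmed solution. It assumes Hadoop's MultipleTextOutputFormat
(old mapred API) is on the classpath and uses saveAsHadoopFile instead of
saveAsTextFile. KeyAsDirectoryOutputFormat, SaveByDirectory and toCsvLine are
hypothetical names made up for illustration, and the sample rows and the
partition count of 5 are assumptions.

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
// On older Spark 1.x releases the pair-RDD methods may also need:
// import org.apache.spark.SparkContext._

case class MyData(val0: Int, val1: String, directory_name: String)

// Hypothetical helper (not part of Spark): sends each record to a
// sub-directory of the output path named after the record's key.
class KeyAsDirectoryOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Replace the key with NullWritable so only the value is written to the file.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // `name` is the default leaf file name for the task, e.g. part-00000.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
}

object SaveByDirectory {
  // Same CSV formatting as the filter-based version above.
  def toCsvLine(d: MyData): String =
    Seq(d.val0.toString, d.val1).mkString(",")

  def main(args: Array[String]): Unit = {
    // Local master and sample rows are assumptions for the sake of a runnable sketch.
    val conf = new SparkConf().setAppName("save-by-directory").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val myrdd = sc.parallelize(Seq(
      MyData(1, "a", "dir_x"),
      MyData(2, "b", "dir_y"),
      MyData(3, "c", "dir_x")
    ))

    myrdd
      .map(d => (d.directory_name, toCsvLine(d)))  // key each row by its directory
      .partitionBy(new HashPartitioner(5))         // all rows of one key land in one partition
      .saveAsHadoopFile(
        "base_dir_name",
        classOf[String],
        classOf[String],
        classOf[KeyAsDirectoryOutputFormat])

    sc.stop()
  }
}
```

With this layout the data is written in a single job rather than one job per
directory, and files should end up under base_dir_name/<directory_name>/. For
the Hive case, returning a name such as "directory_name=" + key from
generateFileNameForKeyValue would probably match Hive's partition directory
convention, but I have not verified that here.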
Thanks,
Arun