Efficient saveAsTextFile by key, directory for each key?

Arun Luthra Tue, 21 Apr 2015 18:01:43 -0700

Is there an efficient way to save an RDD with saveAsTextFile in such a way
that the data gets shuffled into separated directories according to a key?
(My end goal is to wrap the result in a multi-partitioned Hive table)


Suppose you have:

case class MyData(val0: Int, val1: String, directory_name: String)

and an RDD called myrdd with type RDD[MyData]. Suppose that you already
have an array of the distinct directory_name's, called distinct_directories.

A very inefficient way to do this is:

distinct_directories.foreach(
  dir_name => myrdd.filter( mydata => mydata.directory_name == dir_name )
    .map( mydata => Array(mydata.val0.toString, mydata.val1).mkString(","))
    .coalesce(5)
    .saveAsTextFile("base_dir_name/" + dir_name)
)

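I tried this solution, and it does not run the multiple myrdd.filter passes
in parallel: each saveAsTextFile call blocks, so the directories are written
one after another, and every pass scans all of myrdd again.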

I'm guessing partitionBy might be in the efficient solution if an easy
efficient solution exists...
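
For illustration, here is a minimal sketch of one direction that might work,
assuming the old Hadoop "mapred" API is on the classpath. It keys each record
by directory_name, uses partitionBy, and writes everything in a single job via
saveAsHadoopFile with a MultipleTextOutputFormat subclass. The class name
KeyAsDirectoryOutputFormat, the object name SaveByKey, the sample rows and the
local[*] master are all made up for the example; this is untested against a
real cluster and not a confirmed answer.

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
// On Spark versions before 1.3 you would also need:
// import org.apache.spark.SparkContext._

case class MyData(val0: Int, val1: String, directory_name: String)

// Hypothetical helper (not part of Spark or Hadoop): routes each record to
// <base>/<key>/<default part file name>.
class KeyAsDirectoryOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Replace the key with NullWritable so only the value is written to the line.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // `name` is the default leaf file name for the task, e.g. part-00000.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    s"${key.toString}/$name"
}

object SaveByKey {
  def main(args: Array[String]): Unit = {
    // Assumption: local master and made-up sample rows, for illustration only.
    val conf = new SparkConf().setAppName("save-by-key").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val myrdd = sc.parallelize(Seq(
      MyData(1, "a", "dir_x"),
      MyData(2, "b", "dir_y"),
      MyData(3, "c", "dir_x")))

    myrdd
      // Key each record by its target directory; the value is the CSV line.
      .map(d => (d.directory_name, Array(d.val0.toString, d.val1).mkString(",")))
      // Send all records with the same key to the same partition.
      .partitionBy(new HashPartitioner(5))
      // One job for all directories, instead of one filter + save per directory.
      .saveAsHadoopFile(
        "base_dir_name",
        classOf[String],
        classOf[String],
        classOf[KeyAsDirectoryOutputFormat])

    sc.stop()
  }
}
```

With the HashPartitioner, all records for a given directory_name land in one
partition, so each directory should end up with a single part file rather than
the five that coalesce(5) allows. That is a trade-off if a few keys hold most
of the data. Dropping the partitionBy call would spread the writes across more
tasks, at the cost of more, smaller files per directory.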

Thanks,
Arun
