I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so
about 5TB of data in total.
The data is formatted as key \t value (tab-separated). After the union, I
want to remove duplicate keys, so that each key is unique and has only one
value.
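To make the intended rule concrete, here is a minimal sketch on a small local collection, without Spark. The names DedupSketch, parseLine, pickLonger and dedup are hypothetical and only for illustration. It assumes that when a key appears more than once the longer value should win, and that lines without a tab can be dropped.

```scala
object DedupSketch {
  // Split a "key<TAB>value" line once; lines without a tab are dropped
  // (dropping them is an assumption made for this sketch).
  def parseLine(line: String): Option[(String, String)] =
    line.split("\t", 2) match {
      case Array(k, v) => Some((k, v))
      case _           => None
    }

  // Merge rule for duplicated keys: keep the longer value, b on ties.
  def pickLonger(a: String, b: String): String =
    if (a.length > b.length) a else b

  // Group the parsed pairs by key and reduce each group to one value.
  def dedup(lines: Seq[String]): Map[String, String] =
    lines.flatMap(parseLine)
      .groupBy(_._1)
      .map { case (k, kvs) => k -> kvs.map(_._2).reduce(pickLonger) }
}
```

For example, DedupSketch.dedup(Seq("k1\tab", "k1\tabcd", "k2\tx")) keeps "abcd" for k1 and "x" for k2.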
Here is what I am doing.
val folders = Array("folder1", "folder2", /* ... */ "folder10")
// Read one folder and split each line once into (key, value)
def parse(path: String) = sc.textFile(path).map { x =>
  val fields = x.split("\t")
  (fields(0), fields(1))
}
var rawData = parse(folders(0))
// Union the remaining folders into a single RDD
for (a <- 1 until folders.length) {
  rawData = rawData.union(parse(folders(a)))
}
val nodups = rawData.reduceByKey { (a, b) =>
  // For a duplicated key, keep the longer of the two values
  if (a.length > b.length) a else b
}
nodups.saveAsTextFile("/nodups")
Is there anything I could do to make this process faster? Right now my
process dies when it writes the output to HDFS.
Thank you!