Can you run the dedupe locally for each file first and then globally?
Also, I did not fully follow the logic of the part inside reduceByKey. Can you
kindly explain?
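
To make the question concrete, here is a minimal sketch of what I mean, using
plain Scala collections in place of RDDs so it runs without a cluster. The names
DedupeSketch, keepLonger, dedupe and dedupeLocalThenGlobal are made up for
illustration. My reading of the reduceByKey function (keep the longer of the
two values, and keep b on a tie) is an assumption on my part.

```scala
// Hypothetical names; plain Scala collections stand in for RDDs.
object DedupeSketch {
  // Same rule as the quoted reduceByKey function: keep the longer value,
  // and keep b when both values have the same length.
  def keepLonger(a: String, b: String): String =
    if (a.length > b.length) a else b

  // Collapse (key, value) pairs so that each key appears exactly once.
  def dedupe(pairs: Seq[(String, String)]): Map[String, String] =
    pairs.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).reduce(keepLonger)
    }

  // Dedupe each "file" on its own, then dedupe the combined survivors.
  def dedupeLocalThenGlobal(files: Seq[Seq[(String, String)]]): Map[String, String] =
    dedupe(files.flatMap(f => dedupe(f).toSeq))
}
```

As far as I know, reduceByKey already merges the values for a key within each
partition before the shuffle, so a separate per-file pass may or may not buy
much. One more thing to watch: on a tie in length the result depends on the
order in which values are combined, which Spark does not guarantee.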
On 14 Jun 2015 13:58, "Gavin Yue" <[email protected]> wrote:
> I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so
> about 5TB of data in total.
>
> The data is formatted as key \t value. After the union, I want to remove
> the duplicates among keys, so each key should be unique and have only one
> value.
>
> Here is what I am doing.
>
> val folders = Array("folder1", "folder2", ..., "folder10")
>
> var rawData = sc.textFile(folders(0)).map(x => (x.split("\t")(0),
> x.split("\t")(1)))
>
> for (a <- 1 until folders.length) {
> rawData = rawData.union(sc.textFile(folders (a)).map(x =>
> (x.split("\t")(0), x.split("\t")(1))))
> }
>
> val nodups = rawData.reduceByKey((a,b)=>
> {
> if(a.length > b.length)
> {a}
> else
> {b}
> }
> )
> nodups.saveAsTextFile("/nodups")
>
> Is there anything I could do to make this process faster? Right now my
> process dies when writing the output to HDFS.
>
>
> Thank you!
>
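
One small thing on the parsing step above: each line is split twice, and a line
with no value after the tab would fail on the (1) index. A rough sketch of a
single-split parser follows. ParseSketch and parseLine are hypothetical names,
and unlike the original, this keeps any further tabs inside the value instead
of dropping the text after them.

```scala
// Hypothetical helper; not part of the original job.
object ParseSketch {
  // Split a "key<TAB>value" line once, at the first tab.
  // Returns None for lines that contain no tab at all.
  def parseLine(line: String): Option[(String, String)] =
    line.split("\t", 2) match {
      case Array(k, v) => Some((k, v))
      case _           => None
    }
}
```

If this fits your data, something like sc.textFile(path).flatMap(ParseSketch.parseLine)
should drop malformed lines instead of failing the task. I also believe
sc.textFile accepts a comma-separated list of paths, so folders.mkString(",")
might replace the union loop, though I have not tried it at this scale.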