Can you do the dedup locally for each file first and then globally? Also, I did not fully get the logic of the function inside reduceByKey. Can you kindly explain? (A small sketch of how I read it, plus a possible simplification of the read step, is below the quoted code.)

On 14 Jun 2015 13:58, "Gavin Yue" <yue.yuany...@gmail.com> wrote:
> I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so
> 5TB of data in total.
>
> The data is formatted as key \t value. After the union, I want to remove the
> duplicates among keys, so each key should be unique and have only one value.
>
> Here is what I am doing:
>
> val folders = Array("folder1", "folder2", ..., "folder10")
>
> var rawData = sc.textFile(folders(0)).map(x => (x.split("\t")(0), x.split("\t")(1)))
>
> for (a <- 1 to folders.length - 1) {
>   rawData = rawData.union(sc.textFile(folders(a)).map(x => (x.split("\t")(0), x.split("\t")(1))))
> }
>
> val nodups = rawData.reduceByKey((a, b) =>
>   if (a.length > b.length) a
>   else b
> )
>
> nodups.saveAsTextFile("/nodups")
>
> Anything I could do to make this process faster? Right now my process dies
> when writing the output to HDFS.
>
> Thank you!
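If I am reading it right, the function inside reduceByKey just picks, for each key, whichever of the two values being merged is longer, so after the reduce every key keeps its single longest value. A minimal sketch with made-up pairs, assuming a spark-shell session where sc already exists:

// Toy data, only to illustrate the merge logic from the quoted code.
val pairs = sc.parallelize(Seq(
  ("k1", "short"),
  ("k1", "a much longer value"),
  ("k2", "only value")
))

// Same merge function as in the quoted code: of the two values being merged
// for a key, keep the longer one, so one longest value survives per key.
val nodups = pairs.reduceByKey((a, b) => if (a.length > b.length) a else b)

nodups.collect().foreach(println)
// prints, in some order: (k1,a much longer value) and (k2,only value)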
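On doing the dedup locally first: reduceByKey already runs the same function as a map-side combiner before shuffling, so per-partition dedup happens automatically. What might help more is avoiding the 10-way union and giving the reduce enough partitions. A sketch under those assumptions (the single textFile call with comma-separated paths, the split limit of 2, and the 2000 reduce partitions are my guesses, not something tested on this data):

// Hypothetical rewrite of the quoted pipeline; folder names follow the pattern in the email.
val folders = (1 to 10).map(i => s"folder$i")

// textFile accepts a comma-separated list of paths, which avoids the union loop.
val rawData = sc.textFile(folders.mkString(","))
  .map { line =>
    val parts = line.split("\t", 2) // split once; the limit keeps the value intact even if it contains tabs
    (parts(0), parts(1))
  }

// The second argument to reduceByKey sets the number of reduce partitions;
// 2000 is only a rough guess for ~5TB of input.
val nodups = rawData.reduceByKey((a, b) => if (a.length > b.length) a else b, 2000)

nodups.saveAsTextFile("/nodups")

Since saveAsTextFile writes one file per partition, that same partition count is also the knob for the HDFS write that is currently failing.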