Can you do the dedupe locally for each file first and then globally, along the
lines of the sketch below? Also, I did not fully get the logic of the part inside
reduceByKey. Could you kindly explain it?
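
To make sure I am describing the same idea, here is a rough, untested sketch of
what I mean (per folder rather than per file, just to keep it short). The folder
names are the placeholders from your mail, keepLonger is a helper name I made up,
and I am assuming your reduceByKey is meant to keep the longer of the two values
for a key:

// Hypothetical sketch: dedupe each folder on its own first, then dedupe
// globally across the already-reduced results. Same spark-shell session,
// so sc is the usual SparkContext.
val folders = Array("folder1", "folder2", /* ... */ "folder10")

// Assumed merge rule, copied from your reduceByKey: keep the longer value.
val keepLonger = (a: String, b: String) => if (a.length > b.length) a else b

// Local pass: parse and reduce each folder separately.
val perFolder = folders.map { path =>
  sc.textFile(path)
    .map { line =>
      val Array(k, v) = line.split("\t", 2)  // split each line only once
      (k, v)
    }
    .reduceByKey(keepLonger)
}

// Global pass: union the per-folder results and reduce once more, so a key
// that shows up in several folders also ends up with a single value.
val nodups = sc.union(perFolder).reduceByKey(keepLonger)
nodups.saveAsTextFile("/nodups")

My (possibly wrong) thinking is that the per-folder reduceByKey shrinks each
folder before the big shuffle over all 5TB, but I am not sure that is what you
had in mind.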
On 14 Jun 2015 13:58, "Gavin Yue" <yue.yuany...@gmail.com> wrote:

> I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so in
> total about 5TB of data.
>
> The data is formatted as key \t value (tab-separated). After the union, I want
> to remove the duplicates among keys, so each key should be unique and have only
> one value.
>
> Here is what I am doing.
>
> val folders = Array("folder1", "folder2", /* ... */ "folder10")
>
> var rawData = sc.textFile(folders(0))
>   .map(x => (x.split("\t")(0), x.split("\t")(1)))
>
> for (a <- 1 to folders.length - 1) {
>   rawData = rawData.union(sc.textFile(folders(a))
>     .map(x => (x.split("\t")(0), x.split("\t")(1))))
> }
>
> val nodups = rawData.reduceByKey((a, b) =>
>   if (a.length > b.length) a else b
> )
> nodups.saveAsTextFile("/nodups")
>
> Is there anything I could do to make this process faster? Right now my process
> dies when writing the output to HDFS.
>
>
> Thank you!
>
