1. create a temp dir on HDFS, say “/tmp”
2. write a script that creates, in the temp dir, one file per tar file. Each
file contains a single line:
<absolute path of the tar file>
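Step 2 could be sketched as a small shell function (the name make_pointer_files and the part-N naming are illustrative, not from the original mail; the hdfs upload/listing commands are left out):

```shell
#!/bin/sh
# Read tar-file paths from stdin (e.g. from `hdfs dfs -ls -C /path/to/tars`)
# and write one single-line "pointer" file per path into a local staging
# dir ($1). Upload the staging dir to HDFS afterwards with `hdfs dfs -put`.
make_pointer_files() {
  staging="$1"
  mkdir -p "$staging"
  i=0
  while IFS= read -r path; do
    i=$((i + 1))
    printf '%s\n' "$path" > "$staging/part-$i.txt"
  done
}
```

Usage would look something like `hdfs dfs -ls -C /path/to/tars | make_pointer_files /tmp/staging`, followed by an `hdfs dfs -put` of the staging dir.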
3. Write a Spark application along these lines:
val rdd = sc.textFile(<HDFS path of the temp dir>)
rdd.foreach { line =>
  // build an untar command from the path in "line" and run it,
  // e.g. with scala.sys.process: s"tar -xf $line -C <target dir>".!
}
Note the use of foreach (an action) rather than map: map alone is lazy, so
the commands would never actually run.
> On May 19, 2016, at 14:42, ayan guha <[email protected]> wrote:
>
> Hi
>
> I have few tar files in HDFS in a single folder. each file has multiple files
> in it.
>
> tar1:
> - f1.txt
> - f2.txt
> tar2:
> - f1.txt
> - f2.txt
>
> (each tar file will have exact same number of files, same name)
>
> I am trying to find a way (spark or pig) to extract them to their own
> folders.
>
> f1
> - tar1_f1.txt
> - tar2_f1.txt
> f2:
> - tar1_f2.txt
> - tar2_f2.txt
>
> Any help?
>
>
>
> --
> Best Regards,
> Ayan Guha