I don't think you can use textFile(), binaryFiles() or pickleFile()
here; wav is a different format from what they expect.
You could get a list of paths for all the files, distribute it with
sc.parallelize(), and handle each file in foreach():
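import subprocess

def process(path):
    # Use subprocess to launch a process that does the job, and read
    # its stdout as the result (here: the pipeline from your mail).
    cmd = 'hadoop fs -cat %s | sox -t wav - -n stats' % path
    return subprocess.check_output(cmd, shell=True)

files = []  # a list of paths of the wav files
sc.parallelize(files, len(files)).foreach(process)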
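
If you need the sox output back on the driver, something along these lines might work. It is only a sketch under a few assumptions: pipe_commands, wav_stats and collect_stats are names I made up for illustration, hadoop and sox must be on the PATH of every worker, and as far as I know sox prints the stats report on stderr, which is why stderr is merged into stdout here.

```python
import subprocess

def pipe_commands(producer, consumer):
    """Run `producer | consumer` without a shell; return the consumer's
    combined stdout and stderr as bytes."""
    src = subprocess.Popen(producer, stdout=subprocess.PIPE)
    try:
        out = subprocess.check_output(
            consumer, stdin=src.stdout, stderr=subprocess.STDOUT)
    finally:
        # Close our copy of the pipe and reap the producer process.
        src.stdout.close()
        src.wait()
    return out

def wav_stats(path):
    # Assumption: `hadoop` and `sox` are installed on every worker node.
    out = pipe_commands(['hadoop', 'fs', '-cat', path],
                        ['sox', '-t', 'wav', '-', '-n', 'stats'])
    return path, out.decode('utf-8', 'replace')

def collect_stats(sc, files):
    # map() + collect() brings the output back to the driver;
    # foreach() would run the jobs but discard the return values.
    return sc.parallelize(files, len(files)).map(wav_stats).collect()
```

collect_stats(sc, files) would then return a list of (path, report) pairs. Passing the commands as argument lists instead of a shell string avoids quoting problems with unusual file names.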
On Fri, Jan 16, 2015 at 2:11 PM, Venkat, Ankam
<[email protected]> wrote:
> I need to process .wav files in Pyspark. If the files are in local file
> system, I am able to process them. Once I store them on HDFS, I am facing
> issues. For example,
>
> I run a sox program on a wav file like this.
>
> sox ext2187854_03_27_2014.wav -n stats <-- works fine
>
> sox hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav -n stats
> <-- Does not work as sox cannot read HDFS file.
>
> So, I do like this.
>
> hadoop fs -cat hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav |
> sox -t wav - -n stats <-- This works fine
>
> But, I am not able to do this in PySpark.
>
> wavfile =
> sc.textFile('hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav')
>
> wavfile.pipe(subprocess.call(['sox', '-t' 'wav', '-', '-n', 'stats']))
>
> I tried different options like sc.binaryFiles and sc.pickleFile.
>
> Any thoughts?
>
> Regards,
>
> Venkat Ankam