I need to process .wav files in PySpark. When the files are on the local file
system, I am able to process them. Once I store them on HDFS, I run into
issues. For example,
I run sox on a wav file like this:
sox ext2187854_03_27_2014.wav -n stats    <-- works fine
sox hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav -n stats    <-- does not work, as sox cannot read HDFS files
So I do this instead:
hadoop fs -cat hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav | sox -t wav - -n stats    <-- this works fine
But I am not able to do the same in PySpark:
wavfile = sc.textFile('hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav')
wavfile.pipe(subprocess.call(['sox', '-t' 'wav', '-', '-n', 'stats']))
I also tried other options such as sc.binaryFiles and sc.pickleFile.
Any thoughts?
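One possible direction, as a minimal sketch rather than a confirmed fix: sc.textFile splits the file into text lines, which is unlikely to preserve binary audio, and subprocess.call runs sox once on the driver and returns an exit code instead of a command for pipe(). Reading each file whole with sc.binaryFiles, which yields (path, bytes) pairs, and feeding the bytes to sox on stdin inside a map() mirrors the working "hadoop fs -cat ... | sox" command. The helper names pipe_bytes and sox_stats are hypothetical, and the sketch assumes sox is installed on every worker node and that an existing SparkContext is available as sc.

```python
import subprocess


def pipe_bytes(cmd, data):
    """Run cmd with data on its stdin; return (stdout, stderr) as bytes."""
    proc = subprocess.run(cmd, input=data, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, check=True)
    return proc.stdout, proc.stderr


def sox_stats(path_and_bytes):
    """Hypothetical helper: run `sox -t wav - -n stats` on one (path, bytes) pair."""
    path, data = path_and_bytes
    out, err = pipe_bytes(['sox', '-t', 'wav', '-', '-n', 'stats'], data)
    # Both streams are kept, since sox is expected to print the report on stderr.
    return path, (out + err).decode('utf-8', errors='replace')


# Usage on the cluster (assumes an existing SparkContext `sc` and sox on the workers):
# wavfiles = sc.binaryFiles('hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav')
# for path, report in wavfiles.map(sox_stats).collect():
#     print(path)
#     print(report)
```

Because binaryFiles loads each file as a single record, this approach suits files that fit comfortably in a worker's memory.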
Regards,
Venkat Ankam