I don't think you can use textFile() or binaryFile() or pickleFile()
here; they expect a different format than wav.
You could get a list of paths for all the files, then use
sc.parallelize() and foreach():
def process(path):
    # use subprocess to launch a process to do the job, and read its
    # stdout as the result
fil
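
To make the suggestion above more concrete, here is a minimal sketch, not a tested recipe. It assumes the `hdfs` command-line client and `sox` are installed on every worker node, and it streams each HDFS file into sox over stdin with `hdfs dfs -cat`, since sox does not appear to open hdfs:// paths itself. The helper names (build_commands, run_commands, stats_for_all) are made up for illustration. It uses map() plus collect() rather than foreach() so the results come back to the driver, and it captures stderr as well because sox may print its stats there.

```python
import subprocess


def build_commands(path):
    """Return (producer, consumer) argv lists for one wav file.

    producer is None for local paths, where sox can open the file itself.
    """
    if path.startswith("hdfs://"):
        # Assumption: the `hdfs` CLI is available on every worker node.
        producer = ["hdfs", "dfs", "-cat", path]
        # "-t wav -" tells sox to read wav data from stdin.
        consumer = ["sox", "-t", "wav", "-", "-n", "stats"]
        return producer, consumer
    return None, ["sox", path, "-n", "stats"]


def run_commands(producer, consumer):
    """Run consumer, optionally fed by producer's stdout; return its output text."""
    if producer is None:
        done = subprocess.run(
            consumer, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
        )
        return done.stdout.decode("utf-8", "replace")
    src = subprocess.Popen(producer, stdout=subprocess.PIPE)
    dst = subprocess.Popen(
        consumer,
        stdin=src.stdout,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so nothing is lost
    )
    src.stdout.close()  # let the producer see a broken pipe if sox exits early
    out, _ = dst.communicate()
    src.wait()
    return out.decode("utf-8", "replace")


def process(path):
    """Runs on a worker: returns (path, text printed by sox)."""
    producer, consumer = build_commands(path)
    return path, run_commands(producer, consumer)


def stats_for_all(sc, paths):
    """sc is an existing SparkContext; paths is a plain Python list of strings."""
    return sc.parallelize(paths).map(process).collect()
```

On the driver you would build the list of paths yourself (for example by listing the HDFS directory) and call stats_for_all(sc, paths); each element of the result is a (path, output) pair.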
I need to process .wav files in PySpark. If the files are on the local file
system, I am able to process them. Once I store them on HDFS, I run into
issues. For example,
I run the sox program on a wav file like this:
sox ext2187854_03_27_2014.wav -n stats <-- works fine
sox hdfs://xxx:8020/