RE: How to 'Pipe' Binary Data in Apache Spark

Venkat, Ankam Wed, 21 Jan 2015 11:08:31 -0800

I am trying to solve a similar problem.  I am using option #2, as suggested by 
Nick.

I have created an RDD with sc.binaryFiles for a list of .wav files.  However, I 
am not able to pipe its contents to an external program.

For example:
>>> sq = sc.binaryFiles("wavfiles")  <-- All .wav files stored in the "wavfiles" directory on HDFS
>>> sq.keys().collect()  <-- works fine.  Shows the list of file names.
>>> sq.values().collect()  <-- works fine.  Shows the content of the files.
>>> sq.values().pipe(lambda x: subprocess.call(['/usr/local/bin/sox', '-t' 'wav', '-', '-n', 'stats'])).collect()  <-- Does not work.  Tried different options.
AttributeError: 'function' object has no attribute 'read'

Any suggestions?

Regards,
Venkat Ankam
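
A note on the error above: as far as I can tell, PySpark's RDD.pipe expects a 
shell command string, not a Python function, and tokenizes it with shlex.split, 
which would explain the complaint about a missing 'read' attribute. pipe also 
feeds each element to the child process as a line of text, so it is a poor fit 
for raw bytes. One possible workaround is to start the external program from 
inside a map over the binaryFiles values. The sketch below is only that, a 
sketch: pipe_bytes, SOX_ARGV and COUNT_ARGV are hypothetical names, the sox 
arguments are copied from the question (with a comma added between '-t' and 
'wav'), and the Spark call is left as a comment because it needs a live 
SparkContext.

```python
import subprocess
import sys


def pipe_bytes(data, argv):
    """Send raw bytes to an external command on stdin.

    Returns (exit code, stdout bytes, stderr bytes). Both output streams
    are captured because some tools report on stderr rather than stdout.
    """
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate(data)  # writes data, closes stdin, waits
    return proc.returncode, out, err


# Hypothetical argument list, taken from the question. The binary must
# exist at this path on every worker node for the Spark call to succeed.
SOX_ARGV = ["/usr/local/bin/sox", "-t", "wav", "-", "-n", "stats"]

# With a live SparkContext `sc` (not created here), usage might look like:
#   sq = sc.binaryFiles("wavfiles")
#   results = sq.mapValues(lambda content: pipe_bytes(content, SOX_ARGV)).collect()

# Local check that needs neither Spark nor sox: a child Python process
# counts the bytes it receives on stdin.
COUNT_ARGV = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(str(len(sys.stdin.buffer.read())))",
]
returncode, out, err = pipe_bytes(b"\x00\x01\xff", COUNT_ARGV)
```

Because each file arrives as a single record, the external program sees the 
whole file in one go, which matches the one-mapper-per-file caveat Nick 
describes for option 2.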

From: Nick Allen [mailto:[email protected]]
Sent: Friday, January 16, 2015 11:46 AM
To: [email protected]
Subject: Re: How to 'Pipe' Binary Data in Apache Spark

I just wanted to reiterate the solution for the benefit of the community.

The problem is not my use of 'pipe', but that 'textFile' cannot be used to 
read in binary data. (Doh) There are a couple of options to move forward.

1. Implement a custom 'InputFormat' that understands the binary input data. 
(Per Sean Owen)

2. Use 'SparkContext.binaryFiles' to read in the entire binary file as a single 
record. This will impact performance as it prevents the use of more than one 
mapper on the file's data.

In my specific case for #1 I can only find one project from RIPE-NCC 
(https://github.com/RIPE-NCC/hadoop-pcap) that does this. Unfortunately, it 
appears to only support a limited set of network protocols.

On Fri, Jan 16, 2015 at 10:40 AM, Nick Allen 
<[email protected]<mailto:[email protected]>> wrote:
Per your last comment, it appears I need something like this:

https://github.com/RIPE-NCC/hadoop-pcap

Thanks a ton.  That gets me oriented in the right direction.

On Fri, Jan 16, 2015 at 10:20 AM, Sean Owen 
<[email protected]<mailto:[email protected]>> wrote:
Well it looks like you're reading some kind of binary file as text.
That isn't going to work, in Spark or elsewhere, as binary data is not
even necessarily the valid encoding of a string. There are no line
breaks to delimit lines and thus elements of the RDD.
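
To make the point above concrete, here is a small sketch in plain Python (no 
Spark involved). It assumes the text path decodes input as UTF-8 and 
substitutes a replacement character for invalid sequences; the sample bytes 
are arbitrary and the variable names are made up for illustration.

```python
# Arbitrary binary sample: several bytes are not valid UTF-8, and one
# byte happens to be 0x0a, which a line-oriented reader treats as a newline.
raw = bytes([0xD4, 0xC3, 0xB2, 0xA1, 0x0A, 0x00, 0xFF])

# Decoding as text replaces each invalid sequence with U+FFFD, so
# encoding the string again does not give back the original bytes.
as_text = raw.decode("utf-8", errors="replace")
round_trip = as_text.encode("utf-8")
data_survived = round_trip == raw

# A line-based reader would also cut the data wherever 0x0a occurs,
# regardless of where the real record boundaries are.
records = raw.split(b"\n")

print("bytes survived text round trip:", data_survived)
print("number of 'lines' found:", len(records))
```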

Your input has some record structure (or else it's not really useful
to put it into an RDD). You can encode this as a SequenceFile and read
it with objectFile.

You could also write a custom InputFormat that knows how to parse pcap
records directly.

On Fri, Jan 16, 2015 at 3:09 PM, Nick Allen 
<[email protected]<mailto:[email protected]>> wrote:
> I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe
> that binary data to an external program that will translate it to
> string/text data. Unfortunately, it seems that Spark is mangling the binary
> data before it gets passed to the external program.
>
> This code is representative of what I am trying to do. What am I doing
> wrong? How can I pipe binary data in Spark?  Maybe it is getting corrupted
> when I read it in initially with 'textFile'?
>
> bin = sc.textFile("binary-data.dat")
> csv = bin.pipe ("/usr/bin/binary-to-csv.sh")
> csv.saveAsTextFile("text-data.csv")
>
> Specifically, I am trying to use Spark to transform pcap (packet capture)
> data to text/csv so that I can perform an analysis on it.
>
> Thanks!
>
> --
> Nick Allen <[email protected]<mailto:[email protected]>>

--
Nick Allen <[email protected]<mailto:[email protected]>>