The input file is essentially a representation of a graph, and it has a specific structure, which I outline below.
INT INT (A)
LONG (B)
A INTs (Degrees)
A SHORTINTs (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs

A - number of vertices
B - number of edges (the B-sized blocks of INTs/SHORTINTs are edge attributes)

After reading the file, I need to create two RDDs (one with vertices and the other with edges). (A sketch of one way to parse this layout follows the quoted thread below.)

On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:

> Hm, that will indeed be trickier, because this method assumes records are
> the same byte size. Is the file an arbitrary sequence of mixed types, or is
> there structure, e.g. short, long, short, long, etc.?
>
> If you could post a gist with an example of the kind of file and how it
> should look once read in, that would be useful!
>
> -------------------------
> jeremyfreeman.net
> @thefreemanlab
>
> On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>
> Thanks for the reply. Unfortunately, in my case, the binary file is a mix
> of short and long integers. Is there any other way that could be of use here?
>
> My current method happens to have a large overhead (much more than the
> actual computation time). Also, I run short of memory at the driver when it
> has to read the entire file.
>
> On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:
>
>> If it's a flat binary file and each record is the same length (in bytes),
>> you can use Spark's binaryRecords method (defined on the SparkContext),
>> which loads records from one or more large flat binary files into an RDD.
>> Here's an example in Python to show how it works:
>>
>> # write data from an array
>> from numpy import random
>> dat = random.randn(100, 5)
>> f = open('test.bin', 'wb')
>> f.write(dat)
>> f.close()
>>
>> # load the data back in
>> from numpy import frombuffer
>>
>> nrecords = 5     # values per record
>> bytesize = 8     # bytes per value (float64)
>> recordsize = nrecords * bytesize
>> data = sc.binaryRecords('test.bin', recordsize)
>> parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))
>>
>> # these should be equal
>> parsed.first()
>> dat[0, :]
>>
>> Does that help?
>>
>> -------------------------
>> jeremyfreeman.net
>> @thefreemanlab
>>
>> On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>>
>> What are some efficient ways to read a large file into RDDs?
>>
>> For example, could several executors each read a specific/unique portion
>> of the file and construct the RDDs? Is this possible to do in Spark?
>>
>> Currently, I am doing a line-by-line read of the file at the driver and
>> constructing the RDD.
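For reference, here is a minimal sketch (not from the thread) of one way the layout described at the top could be parsed without pulling the whole file through the driver: read only the small header on the driver, compute the byte offset of each block from A and B, and let each Spark task seek into the file and read just its own slice. It assumes INT = 4 bytes, SHORTINT = 2 bytes, LONG = 8 bytes, little-endian byte order, a path ('graph.bin') that every executor can open (e.g. a shared filesystem), and that the two B-sized INT blocks are edge endpoints while the SHORTINT blocks are edge attributes. All names here (read_block, read_edges, num_chunks) are illustrative, not part of the original thread, and sc is an existing SparkContext.

# minimal sketch, under the assumptions stated above
import struct
import numpy as np

path = 'graph.bin'             # hypothetical path, visible to driver and executors
INT, SHORT, LONG = 4, 2, 8     # assumed widths in bytes

# read only the small header on the driver
with open(path, 'rb') as f:
    _, A = struct.unpack('<ii', f.read(2 * INT))   # second INT taken as A (vertex count)
    (B,) = struct.unpack('<q', f.read(LONG))       # LONG taken as B (edge count)

header    = 2 * INT + LONG
off_deg   = header                    # A INTs    (degrees)
off_vattr = off_deg   + A * INT       # A SHORTs  (vertex attributes)
off_src   = off_vattr + A * SHORT     # B INTs    (assumed: edge sources)
off_dst   = off_src   + B * INT       # B INTs    (assumed: edge destinations)
off_ea1   = off_dst   + B * INT       # B SHORTs  (edge attribute 1)
off_ea2   = off_ea1   + B * SHORT     # B SHORTs  (edge attribute 2)

def read_block(offset, count, dtype):
    # each caller (driver or executor task) opens the file and reads only its slice
    with open(path, 'rb') as f:
        f.seek(offset)
        return np.frombuffer(f.read(count * np.dtype(dtype).itemsize), dtype)

# vertex RDD: (vertexId, (degree, attribute))
degrees = read_block(off_deg, A, '<i4')
vattrs  = read_block(off_vattr, A, '<i2')
vertices = sc.parallelize(list(zip(range(A), zip(degrees.tolist(), vattrs.tolist()))))

# edge RDD: split the B edges into ranges that executors read in parallel
num_chunks = 16   # tune to the cluster
bounds = [(i * B // num_chunks, (i + 1) * B // num_chunks) for i in range(num_chunks)]

def read_edges(lo, hi):
    n = hi - lo
    src = read_block(off_src + lo * INT,   n, '<i4')
    dst = read_block(off_dst + lo * INT,   n, '<i4')
    a1  = read_block(off_ea1 + lo * SHORT, n, '<i2')
    a2  = read_block(off_ea2 + lo * SHORT, n, '<i2')
    return zip(src.tolist(), dst.tolist(), a1.tolist(), a2.tolist())

edges = sc.parallelize(bounds, num_chunks).flatMap(lambda b: read_edges(*b))

If the two A-sized vertex blocks are themselves too large to read on the driver, they can be split into (offset, count) ranges and read in parallel exactly like the edge blocks.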