The input file is essentially a representation of a graph, and it has a specific structure, which I outline below.
INT INT (A)
LONG (B)
A INTs (Degrees)
A SHORTINTs (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs

A - number of vertices
B - number of edges (the B-sized blocks of INTs/SHORTINTs are edge attributes)

After reading the file, I need to create two RDDs (one with vertices and the other with edges). (A sketch of one way to parse this layout follows the quoted thread below.)

On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:

> Hm, that will indeed be trickier, because this method assumes records are
> the same byte size. Is the file an arbitrary sequence of mixed types, or is
> there structure, e.g. short, long, short, long, etc.?
>
> If you could post a gist with an example of the kind of file and how it
> should look once read in, that would be useful!
>
> -------------------------
> jeremyfreeman.net
> @thefreemanlab
>
> On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>
> Thanks for the reply. Unfortunately, in my case, the binary file is a mix
> of short and long integers. Is there any other way that could be of use here?
>
> My current method happens to have a large overhead (much more than the
> actual computation time). Also, I run short of memory at the driver when it
> has to read the entire file.
>
> On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:
>
>> If it's a flat binary file and each record is the same length (in bytes),
>> you can use Spark's binaryRecords method (defined on the SparkContext),
>> which loads records from one or more large flat binary files into an RDD.
>> Here's an example in Python to show how it works:
>>
>> # write data from an array
>> from numpy import random
>> dat = random.randn(100, 5)
>> f = open('test.bin', 'wb')
>> f.write(dat)
>> f.close()
>>
>> # load the data back in
>> from numpy import frombuffer
>>
>> nrecords = 5     # values per record
>> bytesize = 8     # bytes per value (float64)
>> recordsize = nrecords * bytesize
>> data = sc.binaryRecords('test.bin', recordsize)
>> parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))
>>
>> # these should be equal
>> parsed.first()
>> dat[0, :]
>>
>> Does that help?
>>
>> -------------------------
>> jeremyfreeman.net
>> @thefreemanlab
>>
>> On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>>
>> What are some efficient ways to read a large file into RDDs?
>>
>> For example, could several executors each read a specific/unique portion
>> of the file and construct the RDDs? Is this possible to do in Spark?
>>
>> Currently, I am doing a line-by-line read of the file at the driver and
>> constructing the RDD.
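For reference, here is a minimal sketch (not from the thread) of one way the layout described at the top could be parsed without pulling the whole file through the driver: read only the small header on the driver, compute the byte offset of each block from A and B, and let each Spark task seek into the file and read just its own slice. It assumes INT = 4 bytes, SHORTINT = 2 bytes, LONG = 8 bytes, little-endian byte order, a path ('graph.bin') that every executor can open (e.g. a shared filesystem), and that the two B-sized INT blocks are edge endpoints while the SHORTINT blocks are edge attributes. All names here (read_block, read_edges, num_chunks) are illustrative, not part of the original thread, and sc is an existing SparkContext.

# minimal sketch, under the assumptions stated above
import struct
import numpy as np

path = 'graph.bin'             # hypothetical path, visible to driver and executors
INT, SHORT, LONG = 4, 2, 8     # assumed widths in bytes

# read only the small header on the driver
with open(path, 'rb') as f:
    _, A = struct.unpack('<ii', f.read(2 * INT))   # second INT taken as A (vertex count)
    (B,) = struct.unpack('<q', f.read(LONG))       # LONG taken as B (edge count)

header    = 2 * INT + LONG
off_deg   = header                    # A INTs    (degrees)
off_vattr = off_deg   + A * INT       # A SHORTs  (vertex attributes)
off_src   = off_vattr + A * SHORT     # B INTs    (assumed: edge sources)
off_dst   = off_src   + B * INT       # B INTs    (assumed: edge destinations)
off_ea1   = off_dst   + B * INT       # B SHORTs  (edge attribute 1)
off_ea2   = off_ea1   + B * SHORT     # B SHORTs  (edge attribute 2)

def read_block(offset, count, dtype):
    # each caller (driver or executor task) opens the file and reads only its slice
    with open(path, 'rb') as f:
        f.seek(offset)
        return np.frombuffer(f.read(count * np.dtype(dtype).itemsize), dtype)

# vertex RDD: (vertexId, (degree, attribute))
degrees = read_block(off_deg, A, '<i4')
vattrs  = read_block(off_vattr, A, '<i2')
vertices = sc.parallelize(list(zip(range(A), zip(degrees.tolist(), vattrs.tolist()))))

# edge RDD: split the B edges into ranges that executors read in parallel
num_chunks = 16   # tune to the cluster
bounds = [(i * B // num_chunks, (i + 1) * B // num_chunks) for i in range(num_chunks)]

def read_edges(lo, hi):
    n = hi - lo
    src = read_block(off_src + lo * INT,   n, '<i4')
    dst = read_block(off_dst + lo * INT,   n, '<i4')
    a1  = read_block(off_ea1 + lo * SHORT, n, '<i2')
    a2  = read_block(off_ea2 + lo * SHORT, n, '<i2')
    return zip(src.tolist(), dst.tolist(), a1.tolist(), a2.tolist())

edges = sc.parallelize(bounds, num_chunks).flatMap(lambda b: read_edges(*b))

If the two A-sized vertex blocks are themselves too large to read on the driver, they can be split into (offset, count) ranges and read in parallel exactly like the edge blocks.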