Hi Ulanov, great question; we've encountered it frequently with scientific 
data (e.g. time series). Agreed, text is inefficient for dense arrays, and we 
also found HDF5+Spark to be a pain.
 
Our strategy has been flat binary files with fixed-length records. Loading 
these is now supported in Spark via the binaryRecords method, which wraps a 
custom Hadoop InputFormat we wrote.

An example (in Python):

> # write data from an array: 100 records of 5 float64 values each
> from numpy import random
> dat = random.randn(100, 5)
> f = open('test.bin', 'wb')   # binary mode
> dat.tofile(f)
> f.close()

> # load the data back in
> from numpy import frombuffer
> nvalues = 5                       # float64 values per record
> bytesize = 8                      # bytes per float64 value
> recordsize = nvalues * bytesize   # 40 bytes per record
> data = sc.binaryRecords('test.bin', recordsize)
> parsed = data.map(lambda v: frombuffer(v, dtype='float64'))

> # these should be equal
> parsed.first()
> dat[0,:]
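
Since your data also carries labels, one option is to pack the label into each 
record as well. A minimal sketch, assuming a made-up layout where the first 
float64 of a record is the label and the remaining values are the features, 
continuing from the parsed RDD above:

> # hypothetical layout: label first, then the feature values
> from pyspark.mllib.regression import LabeledPoint
> labeled = parsed.map(lambda rec: LabeledPoint(rec[0], rec[1:]))
> labeled.first()

The record length stays fixed, so binaryRecords still splits the file cleanly; 
only the interpretation of the bytes changes.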

Compared to something like Parquet, this is a little lighter-weight and plays 
nicer with non-distributed data science tools (e.g. numpy). It also scales well 
(we use it routinely to process TBs of time series), handles single files or 
directories (a quick sketch of the directory case is below), and is extremely 
simple.
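
For the directory case, a minimal sketch (the directory and file names below 
are made up for illustration): write one flat binary file per chunk with 
numpy's tofile, then point binaryRecords at the directory and every file is 
read into a single RDD of fixed-length records.

> # write a few flat binary files into one directory (illustrative names)
> import os
> from numpy import random
> if not os.path.exists('series_dir'):
>     os.mkdir('series_dir')
> for i in range(4):
>     random.randn(100, 5).tofile('series_dir/part-%04d.bin' % i)

> # load every file in the directory as one RDD of 40-byte records
> data = sc.binaryRecords('series_dir', 5 * 8)
> data.count()   # 4 files x 100 records = 400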

-------------------------
jeremyfreeman.net
@thefreemanlab

On Mar 26, 2015, at 2:33 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

> Thanks for the suggestion, but libsvm is a format for storing sparse data in a 
> text file, and I have dense vectors. In my opinion, a text format is not 
> appropriate for storing large dense vectors: parsing strings into numbers adds 
> overhead, and storing numbers as strings is not space-efficient.
> 
> From: Stephen Boesch [mailto:java...@gmail.com]
> Sent: Thursday, March 26, 2015 2:27 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org
> Subject: Re: Storing large data for MLlib machine learning
> 
> There are some convenience methods you might consider including:
> 
>           MLUtils.loadLibSVMFile
> 
> and   MLUtils.loadLabeledPoint
> 
> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander 
> <alexander.ula...@hp.com>:
> Hi,
> 
> Could you suggest a reasonable file format for storing feature vector data for 
> machine learning in Spark MLlib? Are there any best practices for Spark?
> 
> My data is dense feature vectors with labels. Some of the requirements are 
> that the format should be easily loaded/serialized, randomly accessible, and 
> have a small footprint (binary). I am considering Parquet, HDF5, and protocol 
> buffers (protobuf), but I have little to no experience with them, so any 
> suggestions would be really appreciated.
> 
> Best regards, Alexander
> 
