@Alexander, re: using flat binary and metadata, you raise excellent points! At least in our case, we decided on a specific endianness, but we do end up storing an extremely minimal specification in a JSON file, and we have written importers and exporters within our library to parse it. While it does feel a little like reinvention, it’s fast, direct, and scalable, and it seems pretty sensible if you know your data will be dense arrays of numerical features.
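To make that concrete, here is roughly what reading such a file back into MLlib types can look like. This is a sketch only, not our actual importer: the label-then-features float64 layout, the little-endian order, and the path are assumptions, and in practice the sizes would come from the JSON spec.

  import java.nio.{ByteBuffer, ByteOrder}
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // Values that would normally come from the minimal JSON spec (assumed here):
  // each record is one float64 label followed by nFeatures float64s, little-endian.
  val nFeatures = 100
  val recordBytes = (1 + nFeatures) * 8

  // Fixed-width flat binary reads directly into one byte array per record.
  // sc is the usual SparkContext (e.g. the spark-shell one).
  val records = sc.binaryRecords("hdfs:///data/features.bin", recordBytes)
  val points = records.map { bytes =>
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val label = buf.getDouble
    LabeledPoint(label, Vectors.dense(Array.fill(nFeatures)(buf.getDouble)))
  }

Fixed-width records also make the “randomly accessible” requirement easy: record i starts at byte offset i * recordBytes, so you can slice out small subsets without touching the rest of the file.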
-------------------------
jeremyfreeman.net
@thefreemanlab

On Apr 1, 2015, at 3:52 PM, Hector Yee <hector....@gmail.com> wrote:

> Just using sc.textfile then a .map(decode)
> Yes by default it is multiple files .. our training data is 1TB gzipped
> into 5000 shards.
>
> On Wed, Apr 1, 2015 at 12:32 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>
>> Thanks, sounds interesting! How do you load files to Spark? Did you
>> consider having multiple files instead of file lines?
>>
>> *From:* Hector Yee [mailto:hector....@gmail.com]
>> *Sent:* Wednesday, April 01, 2015 11:36 AM
>> *To:* Ulanov, Alexander
>> *Cc:* Evan R. Sparks; Stephen Boesch; dev@spark.apache.org
>> *Subject:* Re: Storing large data for MLlib machine learning
>>
>> I use Thrift and then base64 encode the binary and save it as text file
>> lines that are snappy or gzip encoded.
>>
>> It makes it very easy to copy small chunks locally and play with subsets
>> of the data, and not have dependencies on HDFS / Hadoop for server stuff,
>> for example.
>>
>> On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>
>> Thanks, Evan. What do you think about Protobuf? Twitter has a library to
>> manage protobuf files in hdfs: https://github.com/twitter/elephant-bird
>>
>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> Sent: Thursday, March 26, 2015 2:34 PM
>> To: Stephen Boesch
>> Cc: Ulanov, Alexander; dev@spark.apache.org
>> Subject: Re: Storing large data for MLlib machine learning
>>
>> On binary file formats - I looked at HDF5+Spark a couple of years ago and
>> found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
>> needed filenames as input; you couldn't pass them anything like an
>> InputStream). I don't know if it has gotten any better.
>>
>> Parquet plays much more nicely, and there are lots of Spark-related
>> projects using it already. Keep in mind that it's column-oriented, which
>> might impact performance - but basically you're going to want your features
>> in a byte array, and deser should be pretty straightforward.
>>
>> On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch <java...@gmail.com> wrote:
>>
>> There are some convenience methods you might consider including:
>>
>> MLUtils.loadLibSVMFile
>>
>> and MLUtils.loadLabeledPoint
>>
>> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ula...@hp.com>:
>>
>>> Hi,
>>>
>>> Could you suggest what would be a reasonable file format to store
>>> feature vector data for machine learning in Spark MLlib? Are there any
>>> best practices for Spark?
>>>
>>> My data is dense feature vectors with labels. Some of the requirements
>>> are that the format should be easily loaded/serialized and randomly
>>> accessible, with a small footprint (binary). I am considering Parquet,
>>> hdf5, and protocol buffers (protobuf), but I have little to no experience
>>> with them, so any suggestions would be really appreciated.
>>>
>>> Best regards, Alexander
>>
>> --
>> Yee Yang Li Hector <http://google.com/+HectorYee>
>> *google.com/+HectorYee <http://google.com/+HectorYee>*
>
> --
> Yee Yang Li Hector <http://google.com/+HectorYee>
> *google.com/+HectorYee <http://google.com/+HectorYee>*
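P.S. For anyone finding this thread later, the textFile-then-map(decode) pattern Hector describes above looks roughly like the following. This is only a sketch: TrainingExample stands in for whatever Thrift-generated class the real pipeline uses, and the path is made up.

  import java.util.Base64
  import org.apache.thrift.TDeserializer
  import org.apache.thrift.protocol.TBinaryProtocol

  // Gzipped shards decompress transparently when read as text.
  val lines = sc.textFile("hdfs:///training/shard-*.gz")

  val examples = lines.mapPartitions { iter =>
    // Build one deserializer per partition rather than per record.
    val deser = new TDeserializer(new TBinaryProtocol.Factory())
    iter.map { line =>
      val example = new TrainingExample()   // hypothetical Thrift-generated class
      deser.deserialize(example, Base64.getDecoder.decode(line))
      example
    }
  }

Because each record is just one base64 line of text, you can also grab a small sample locally (e.g. with head on a single shard) and decode it without any HDFS dependencies, which is the portability benefit Hector mentions.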