Just using sc.textFile and then a .map(decode). Yes, by default it is multiple files - our training data is 1 TB gzipped into 5,000 shards.
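Roughly like this - a minimal sketch, where "Example" stands in for your actual Thrift-generated class and the path is a placeholder:

  import java.util.Base64
  import org.apache.thrift.TDeserializer
  import org.apache.thrift.protocol.TBinaryProtocol

  // Gzipped shards are decompressed transparently by textFile.
  val examples = sc.textFile("hdfs:///data/training/*.gz")
    .mapPartitions { lines =>
      // One deserializer per partition; TDeserializer is not serializable.
      val deser = new TDeserializer(new TBinaryProtocol.Factory())
      lines.map { line =>
        val record = new Example() // hypothetical Thrift-generated class
        deser.deserialize(record, Base64.getDecoder.decode(line.trim))
        record
      }
    }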
On Wed, Apr 1, 2015 at 12:32 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

> Thanks, sounds interesting! How do you load the files into Spark? Did you
> consider having multiple files instead of file lines?
>
> *From:* Hector Yee [mailto:hector....@gmail.com]
> *Sent:* Wednesday, April 01, 2015 11:36 AM
> *To:* Ulanov, Alexander
> *Cc:* Evan R. Sparks; Stephen Boesch; dev@spark.apache.org
> *Subject:* Re: Storing large data for MLlib machine learning
>
> I use Thrift, base64-encode the binary, and save it as text-file lines
> that are Snappy- or gzip-encoded.
>
> That makes it very easy to copy small chunks locally and play with
> subsets of the data without taking dependencies on HDFS / Hadoop server
> infrastructure, for example.
>
> On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander
> <alexander.ula...@hp.com> wrote:
>
> Thanks, Evan. What do you think about Protobuf? Twitter has a library to
> manage protobuf files in HDFS: https://github.com/twitter/elephant-bird
>
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Thursday, March 26, 2015 2:34 PM
> To: Stephen Boesch
> Cc: Ulanov, Alexander; dev@spark.apache.org
> Subject: Re: Storing large data for MLlib machine learning
>
> On binary file formats - I looked at HDF5+Spark a couple of years ago and
> found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
> needed filenames as input; you couldn't pass them anything like an
> InputStream). I don't know if it has gotten any better.
>
> Parquet plays much more nicely, and there are lots of Spark-related
> projects using it already. Keep in mind that it's column-oriented, which
> might impact performance - but basically you're going to want your
> features in a byte array, and deserialization should be pretty
> straightforward.
>
> On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> There are some convenience methods you might consider, including:
>
> MLUtils.loadLibSVMFile
>
> and MLUtils.loadLabeledPoints
>
> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ula...@hp.com>:
>
> Hi,
>
> Could you suggest what would be a reasonable file format for storing
> feature-vector data for machine learning in Spark MLlib? Are there any
> best practices for Spark?
>
> My data is dense feature vectors with labels. Some of the requirements
> are that the format should be easy to load/serialize, randomly
> accessible, and have a small footprint (binary). I am considering
> Parquet, HDF5, and protocol buffers (protobuf), but I have little to no
> experience with them, so any suggestions would be really appreciated.
>
> Best regards, Alexander
>
> --
> Yee Yang Li Hector <http://google.com/+HectorYee>
> *google.com/+HectorYee <http://google.com/+HectorYee>*

--
Yee Yang Li Hector <http://google.com/+HectorYee>
*google.com/+HectorYee <http://google.com/+HectorYee>*
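P.S. For the MLUtils convenience methods Stephen mentioned, a minimal sketch (the paths are placeholders):

  import org.apache.spark.mllib.util.MLUtils

  // Load an RDD[LabeledPoint] from LibSVM-formatted text ("label index:value ...").
  val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm")

  // Round-trip: write the labeled points back out in the same format.
  MLUtils.saveAsLibSVMFile(data, "hdfs:///data/train-copy")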