RE: Storing large data for MLlib machine learning

2015-04-01 Thread Ulanov, Alexander
> @Alexander, re: using flat binary and metadata, you raise excellent points! At least in our…

Re: Storing large data for MLlib machine learning

2015-04-01 Thread Jeremy Freeman

Re: Storing large data for MLlib machine learning

2015-04-01 Thread Hector Yee

RE: Storing large data for MLlib machine learning

2015-04-01 Thread Ulanov, Alexander
> I use Thrift and then base64 encode the binary and save it as text file lines that are snappy or gzip encoded. It makes it very easy to copy small chunks locally and play with subsets of the data, and not have dependencies on HDFS / Hadoop for…

Re: Storing large data for MLlib machine learning

2015-04-01 Thread Hector Yee

RE: Storing large data for MLlib machine learning

2015-03-26 Thread Ulanov, Alexander
> Hi Ulanov, great question, we've encountered it frequently with scientific data (e.g. time series). Agreed te…

Re: Storing large data for MLlib machine learning

2015-03-26 Thread Evan R. Sparks
> …files in hdfs https://github.com/twitter/elephant-bird

Re: Storing large data for MLlib machine learning

2015-03-26 Thread Jeremy Freeman
> …not appropriate for storing large dense vectors due to overhead related to parsing from string to digits; also, storing digits as strings is not efficient.
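The overhead claim quoted above is easy to see with a quick size comparison; the vector dimension and values below are arbitrary choices for illustration:

```python
import struct

# A dense 1000-dimensional float64 vector (values are arbitrary).
features = [0.123456789012345] * 1000

# Text form: each value rendered as a decimal string plus separators,
# which must be parsed back into floats on every read.
as_text = ",".join(repr(x) for x in features).encode("ascii")

# Binary form: a fixed 8 bytes per float64, readable with no parsing.
as_binary = struct.pack("<1000d", *features)

print(len(as_text), len(as_binary))  # the text form is roughly 2x larger here
```

On top of the size difference, the text form pays a float-parsing cost on every load, while the binary form is a single memcpy-style unpack.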

RE: Storing large data for MLlib machine learning

2015-03-26 Thread Ulanov, Alexander

Re: Storing large data for MLlib machine learning

2015-03-26 Thread Evan R. Sparks
On binary file formats - I looked at HDF5+Spark a couple of years ago and found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs needed filenames as input, you couldn't pass it anything like an InputStream). I don't know if it has gotten any better. Parquet plays much more nicely…

RE: Storing large data for MLlib machine learning

2015-03-26 Thread Ulanov, Alexander

Re: Storing large data for MLlib machine learning

2015-03-26 Thread Stephen Boesch
There are some convenience methods you might consider, including MLUtils.loadLibSVMFile and MLUtils.loadLabeledPoint.
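For reference, MLUtils.loadLibSVMFile reads the standard LibSVM text format: a label followed by space-separated 1-based index:value pairs. A small pure-Python reader/writer for that line format (the helper names are this sketch's own) might look like:

```python
def parse_libsvm_line(line):
    """Parse one LibSVM line, e.g. '1 3:2.5 10:0.1', into
    (label, 0-based indices, values)."""
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for tok in parts[1:]:
        i, v = tok.split(":")
        indices.append(int(i) - 1)  # LibSVM indices are 1-based
        values.append(float(v))
    return label, indices, values

def format_libsvm_line(label, indices, values):
    """Format a sparse labeled vector back into a LibSVM line."""
    pairs = " ".join("%d:%g" % (i + 1, v) for i, v in zip(indices, values))
    return ("%g %s" % (label, pairs)).strip()
```

Since only nonzero entries are written, the format suits sparse data well; for the dense vectors the original question describes, it degenerates to writing every index and loses its size advantage.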

Storing large data for MLlib machine learning

2015-03-26 Thread Ulanov, Alexander
Hi, Could you suggest what would be a reasonable file format to store feature vector data for machine learning in Spark MLlib? Are there any best practices for Spark? My data is dense feature vectors with labels. Some of the requirements are that the format should be easily loaded/serialized, …