@Alexander, re: using flat binary and metadata, you raise excellent points! At least in our case, we decided on a specific endianness, but we do end up storing an extremely minimal specification in a JSON file, and we have written importers and exporters within our library to parse it. While it does feel a little like reinvention, it’s fast, direct, and scalable, and it seems pretty sensible if you know your data will be dense arrays of numerical features.
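To make that concrete, here is roughly what reading such a file back into MLlib types can look like. This is a sketch only, not our actual importer: the label-then-features float64 layout, the little-endian order, and the path are assumptions, and in practice the sizes would come from the JSON spec.

  import java.nio.{ByteBuffer, ByteOrder}
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // Values that would normally come from the minimal JSON spec (assumed here):
  // each record is one float64 label followed by nFeatures float64s, little-endian.
  val nFeatures = 100
  val recordBytes = (1 + nFeatures) * 8

  // Fixed-width flat binary reads directly into one byte array per record.
  // sc is the usual SparkContext (e.g. the spark-shell one).
  val records = sc.binaryRecords("hdfs:///data/features.bin", recordBytes)
  val points = records.map { bytes =>
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val label = buf.getDouble
    LabeledPoint(label, Vectors.dense(Array.fill(nFeatures)(buf.getDouble)))
  }

Fixed-width records also make the “randomly accessible” requirement easy: record i starts at byte offset i * recordBytes, so you can slice out small subsets without touching the rest of the file.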
-------------------------
jeremyfreeman.net
@thefreemanlab

On Apr 1, 2015, at 3:52 PM, Hector Yee <hector....@gmail.com> wrote:

> Just using sc.textfile then a .map(decode)
> Yes by default it is multiple files .. our training data is 1TB gzipped
> into 5000 shards.
>
> On Wed, Apr 1, 2015 at 12:32 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>
>> Thanks, sounds interesting! How do you load files to Spark? Did you
>> consider having multiple files instead of file lines?
>>
>> *From:* Hector Yee [mailto:hector....@gmail.com]
>> *Sent:* Wednesday, April 01, 2015 11:36 AM
>> *To:* Ulanov, Alexander
>> *Cc:* Evan R. Sparks; Stephen Boesch; dev@spark.apache.org
>> *Subject:* Re: Storing large data for MLlib machine learning
>>
>> I use Thrift and then base64 encode the binary and save it as text file
>> lines that are snappy or gzip encoded.
>>
>> It makes it very easy to copy small chunks locally and play with subsets
>> of the data, and not have dependencies on HDFS / Hadoop for server stuff,
>> for example.
>>
>> On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>
>> Thanks, Evan. What do you think about Protobuf? Twitter has a library to
>> manage protobuf files in hdfs: https://github.com/twitter/elephant-bird
>>
>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> Sent: Thursday, March 26, 2015 2:34 PM
>> To: Stephen Boesch
>> Cc: Ulanov, Alexander; dev@spark.apache.org
>> Subject: Re: Storing large data for MLlib machine learning
>>
>> On binary file formats - I looked at HDF5+Spark a couple of years ago and
>> found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
>> needed filenames as input; you couldn't pass them anything like an
>> InputStream). I don't know if it has gotten any better.
>>
>> Parquet plays much more nicely, and there are lots of Spark-related
>> projects using it already. Keep in mind that it's column-oriented, which
>> might impact performance - but basically you're going to want your features
>> in a byte array, and deser should be pretty straightforward.
>>
>> On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch <java...@gmail.com> wrote:
>>
>> There are some convenience methods you might consider including:
>>
>> MLUtils.loadLibSVMFile
>>
>> and MLUtils.loadLabeledPoint
>>
>> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ula...@hp.com>:
>>
>>> Hi,
>>>
>>> Could you suggest what would be a reasonable file format to store
>>> feature vector data for machine learning in Spark MLlib? Are there any
>>> best practices for Spark?
>>>
>>> My data is dense feature vectors with labels. Some of the requirements
>>> are that the format should be easily loaded/serialized and randomly
>>> accessible, with a small footprint (binary). I am considering Parquet,
>>> hdf5, and protocol buffers (protobuf), but I have little to no experience
>>> with them, so any suggestions would be really appreciated.
>>>
>>> Best regards, Alexander
>>
>> --
>> Yee Yang Li Hector <http://google.com/+HectorYee>
>> *google.com/+HectorYee <http://google.com/+HectorYee>*
>
> --
> Yee Yang Li Hector <http://google.com/+HectorYee>
> *google.com/+HectorYee <http://google.com/+HectorYee>*
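P.S. For anyone finding this thread later, the textFile-then-map(decode) pattern Hector describes above looks roughly like the following. This is only a sketch: TrainingExample stands in for whatever Thrift-generated class the real pipeline uses, and the path is made up.

  import java.util.Base64
  import org.apache.thrift.TDeserializer
  import org.apache.thrift.protocol.TBinaryProtocol

  // Gzipped shards decompress transparently when read as text.
  val lines = sc.textFile("hdfs:///training/shard-*.gz")

  val examples = lines.mapPartitions { iter =>
    // Build one deserializer per partition rather than per record.
    val deser = new TDeserializer(new TBinaryProtocol.Factory())
    iter.map { line =>
      val example = new TrainingExample()   // hypothetical Thrift-generated class
      deser.deserialize(example, Base64.getDecoder.decode(line))
      example
    }
  }

Because each record is just one base64 line of text, you can also grab a small sample locally (e.g. with head on a single shard) and decode it without any HDFS dependencies, which is the portability benefit Hector mentions.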