Storing large data for MLlib machine learning

Ulanov, Alexander Thu, 26 Mar 2015 14:19:53 -0700

Hi,

Could you suggest a reasonable file format for storing feature vector data
for machine learning in Spark MLlib? Are there any best practices for Spark?


My data consists of dense feature vectors with labels. The format should be
easy to load and serialize, support random access, and have a small footprint
(binary). I am considering Parquet, HDF5, and Protocol Buffers (protobuf), but
I have little to no experience with them, so any suggestions would be really
appreciated.

Best regards, Alexander
