Dear all,

Thanks for creating Arrow! I'm part of OpenML.org, an open source
initiative/platform for sharing machine learning datasets and models. We
currently store data in either ARFF or Parquet, but are looking into
whether e.g. Feather, or a mix of Feather and Parquet, could become the
new standard for all(?) our datasets (currently about 20,000 of them).
We have a few questions, though, and would definitely like to hear your
opinion. Apologies in advance if there have been recent announcements
about these that I missed.

* Is Feather a good choice for long-term storage (is the binary format
stable)?
* What metadata is stored? Are the column names and data types always
stored? For categorical columns, can I store the named categories/levels?
Is there a way to attach additional metadata, or is it best to store that
in a separate file (e.g. JSON)? (A sketch of what we have in mind follows
this list.)
* What is the status of support for sparse data? Can I store large sparse
datasets efficiently? I noticed sparse_tensor in Arrow. Is it available in
Feather/Parquet?
* What is the status of compression for Feather? We have some datasets
that are quite large (several GB), and we'd like to onboard datasets like
ImageNet, which is 130GB in TFRecord format (but I read that Parquet can
store it in about 40GB). (See the compression sketch after this list.)
* Would it make more sense to use both Parquet and Feather, depending on
the dataset size or dimensionality? If so, what would be a good
trade-off/threshold in that case?
* Most of our datasets are standard dataframes, but some are also
collections of images or texts. I guess we have to 'manually' convert
those to dataframes first, right? Or do you know of existing tools that
facilitate this? (A rough sketch of what we'd try is below.)
* Can Feather files already be read in Java/Go/C#/...?
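
To make the metadata and categorical questions concrete, here is a
minimal sketch of what we'd hope works, assuming Feather V2 (the Arrow
IPC file format) preserves custom schema metadata and dictionary-encoded
columns via pyarrow:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.feather as feather

    # A categorical column; pyarrow maps it to a dictionary-encoded column.
    df = pd.DataFrame({"color": pd.Categorical(["red", "green", "red"])})
    table = pa.Table.from_pandas(df)

    # Attach custom key/value metadata to the schema (keys/values are bytes).
    # 'openml_dataset_id' is just our own example key, not an Arrow convention.
    table = table.replace_schema_metadata(
        {**(table.schema.metadata or {}), b"openml_dataset_id": b"42"})
    feather.write_feather(table, "data.feather")

    # Round trip: do the categories and our metadata survive?
    restored = feather.read_table("data.feather")
    print(restored.schema.metadata[b"openml_dataset_id"])
    print(restored.schema.field("color").type)  # expect a dictionary type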
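
For compression, this is roughly what we'd hope for (assuming recent
pyarrow's write_feather accepts compression arguments for Feather V2;
please correct me if this isn't the intended API):

    import pyarrow.feather as feather

    # Buffer compression with zstd; lz4 would be the other codec we'd expect.
    feather.write_feather(table, "data.feather",
                          compression="zstd", compression_level=5)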
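
And for collections of images, we'd probably try something like this
rough sketch (the paths and columns are just an example of our own
layout: one row per image, with the raw bytes in a binary column):

    import pathlib

    import pyarrow as pa
    import pyarrow.feather as feather

    paths = sorted(pathlib.Path("images").glob("*.png"))
    table = pa.table({
        "filename": pa.array([p.name for p in paths]),
        "image": pa.array([p.read_bytes() for p in paths], type=pa.binary()),
    })
    feather.write_feather(table, "images.feather")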

Thanks!
Joaquin
