Hi Neal,

Thanks, that explains the Arrow/Parquet relationship very nicely.
So, at the moment you would recommend Parquet for any form of archival
storage, right?
We could also experiment with storing data as both Parquet and Arrow for
now.
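
For the dual-format experiment, I was picturing something along these
lines (just a rough sketch with pyarrow and made-up file names, not a
worked-out plan):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # Hypothetical stand-in for one of our datasets.
    df = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "target": ["a", "b", "a"]})

    # Write the same data once as Parquet (archival copy) and once as
    # Feather (Arrow on disk, for fast local reads).
    table = pa.Table.from_pandas(df)
    pq.write_table(table, "dataset.parquet")
    feather.write_feather(df, "dataset.feather")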

Still curious about the other questions, though: meta-data, sparse data,
compression, Feather support in other languages, and so on.
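
For the meta-data question, what I have in mind is roughly this (again
just a sketch, assuming pyarrow's schema-level key/value metadata; the
keys here are made up):

    import pyarrow as pa

    schema = pa.schema([
        pa.field("feature", pa.float64()),
        # Dictionary-encoded column for the named categories/levels.
        pa.field("target", pa.dictionary(pa.int32(), pa.string())),
    ])
    # Attach extra dataset-level information as key/value metadata.
    schema = schema.with_metadata({"openml_dataset_id": "61",
                                   "license": "CC-BY"})

If that kind of metadata survives a round trip through Feather and
Parquet, we wouldn't need a separate json file for it.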

Cheers,
Joaquin

On Wed, 12 Jun 2019 at 20:25, Neal Richardson <neal.p.richard...@gmail.com>
wrote:

> Hi Joaquin,
> I recognize that this doesn't answer all of your questions, but we are in
> the process of adding a FAQ to the arrow.apache.org website that speaks to
> some of them: https://github.com/apache/arrow/blob/master/site/faq.md
>
> Neal
>
> On Wed, Jun 12, 2019 at 3:39 AM Joaquin Vanschoren <
> joaquin.vanscho...@gmail.com> wrote:
>
> > Dear all,
> >
> > Thanks for creating Arrow! I'm part of OpenML.org, an open source
> > initiative/platform for sharing machine learning datasets and models.
> > We are currently storing data in either ARFF or Parquet, but are
> > looking into whether e.g. Feather or a mix of Feather and Parquet
> > could be the new standard for all(?) our datasets (currently about
> > 20000 of them). We had a few questions though, and would definitely
> > like to hear your opinion. Apologies in advance if there were recent
> > announcements about these that I missed.
> >
> > * Is Feather a good choice for long-term storage (is the binary
> > format stable)?
> > * What meta-data is stored? Are the column names and data types
> > always stored? For categorical columns, can I store the named
> > categories/levels? Is there a way to append additional meta-data, or
> > is it best to store that in a separate file (e.g. json)?
> > * What is the status of support for sparse data? Can I store large
> > sparse datasets efficiently? I noticed sparse_tensor in Arrow. Is it
> > available in Feather/Parquet?
> > * What is the status of compression for Feather? We have some
> > datasets that are quite large (several GB), and we'd like to onboard
> > datasets like Imagenet which are 130GB in TFRecord format (but I read
> > that Parquet can store it in about 40GB).
> > * Would it make more sense to use both Parquet and Feather, depending
> > on the dataset size or dimensionality? If so, what would be a good
> > trade-off/threshold in that case?
> > * Most of our datasets are standard dataframes, but some are also
> > collections of images or texts. I guess we have to 'manually' convert
> > those to dataframes first, right? Or do you know of existing tools to
> > facilitate this?
> > * Can Feather files already be read in Java/Go/C#/...?
> >
> > Thanks!
> > Joaquin
> >
>
