Hi Joaquin,
I recognize that this doesn't answer all of your questions, but we are in
the process of adding a FAQ to the arrow.apache.org website that speaks to
some of them: https://github.com/apache/arrow/blob/master/site/faq.md

Neal

On Wed, Jun 12, 2019 at 3:39 AM Joaquin Vanschoren <
joaquin.vanscho...@gmail.com> wrote:

> Dear all,
>
> Thanks for creating Arrow! I'm part of OpenML.org, an open source
> initiative/platform for sharing machine learning datasets and models. We
> are currently storing data in either ARFF or Parquet, but are looking into
> whether e.g. Feather or a mix of Feather and Parquet could be the new
> standard for all(?) our datasets (currently about 20000 of them). We had a
> few questions though, and would definitely like to hear your opinion.
> Apologies in advance if there were recent announcements about these that I
> missed.
>
> * Is Feather a good choice for long-term storage (is the binary format
> stable)?
> * What meta-data is stored? Are the column names and data types always
> stored? For categorical columns, can I store the named categories/levels?
> Is there a way to append additional meta-data, or is it best to store that
> in a separate file (e.g. json)?
> * What is the status of support for sparse data? Can I store large sparse
> datasets efficiently? I noticed sparse_tensor in Arrow. Is it available in
> Feather/Parquet?
> * What is the status of compression for Feather? We have some datasets that
> are quite large (several GB), and we'd like to onboard datasets like
> ImageNet which are 130GB in TFRecord format (but I read that Parquet can
> store it in about 40GB).
> * Would it make more sense to use both Parquet and Feather, depending on
> the dataset size or dimensionality? If so, what would be a good
> trade-off/threshold in that case?
> * Most of our datasets are standard dataframes, but some are also
> collections of images or texts. I guess we have to 'manually' convert those
> to dataframes first, right? Or do you know of existing tools to facilitate
> this?
> * Can Feather files already be read in Java/Go/C#/...?
>
> Thanks!
> Joaquin
>
