Hi Joaquin,

I recognize that this doesn't answer all of your questions, but we are in the process of adding a FAQ to the arrow.apache.org website that speaks to some of them: https://github.com/apache/arrow/blob/master/site/faq.md

Neal

On Wed, Jun 12, 2019 at 3:39 AM Joaquin Vanschoren <joaquin.vanscho...@gmail.com> wrote:

> Dear all,
>
> Thanks for creating Arrow! I'm part of OpenML.org, an open source
> initiative/platform for sharing machine learning datasets and models. We
> are currently storing data in either ARFF or Parquet, but are looking into
> whether e.g. Feather or a mix of Feather and Parquet could be the new
> standard for all(?) our datasets (currently about 20000 of them). We had a
> few questions though, and would definitely like to hear your opinion.
> Apologies in advance if there were recent announcements about these that I
> missed.
>
> * Is Feather a good choice for long-term storage (is the binary format
> stable)?
> * What meta-data is stored? Are the column names and data types always
> stored? For categorical columns, can I store the named categories/levels?
> Is there a way to append additional meta-data, or is it best to store that
> in a separate file (e.g. json)?
> * What is the status of support for sparse data? Can I store large sparse
> datasets efficiently? I noticed sparse_tensor in Arrow. Is it available in
> Feather/Parquet?
> * What is the status of compression for Feather? We have some datasets that
> are quite large (several GB), and we'd like to onboard datasets like
> Imagenet which are 130GB in TFRecord format (but I read that parquet can
> store it in about 40GB).
> * Would it make more sense to use both Parquet and Feather, depending on
> the dataset size or dimensionality? If so, what would be a good
> trade-off/threshold in that case?
> * Most of our datasets are standard dataframes, but some are also
> collections of images or texts. I guess we have to 'manually' convert those
> to dataframes first, right? Or do you know of existing tools to facilitate
> this?
> * Can Feather files already be read in Java/Go/C#/...?
>
> Thanks!
> Joaquin