Dear all,

Thanks for creating Arrow! I'm part of OpenML.org, an open source initiative/platform for sharing machine learning datasets and models. We currently store data in either ARFF or Parquet, but are looking into whether e.g. Feather, or a mix of Feather and Parquet, could become the new standard for all(?) our datasets (currently about 20,000 of them). We have a few questions, though, and would definitely like to hear your opinion. Apologies in advance if I missed any recent announcements about these.
* Is Feather a good choice for long-term storage (is the binary format stable)?
* What metadata is stored? Are the column names and data types always stored? For categorical columns, can I store the named categories/levels? Is there a way to append additional metadata, or is it best to store that in a separate file (e.g. JSON)?
* What is the status of support for sparse data? Can I store large sparse datasets efficiently? I noticed sparse_tensor in Arrow. Is it available in Feather/Parquet?
* What is the status of compression for Feather? We have some datasets that are quite large (several GB), and we'd like to onboard datasets like ImageNet, which is 130 GB in TFRecord format (but I read that Parquet can store it in about 40 GB).
* Would it make more sense to use both Parquet and Feather, depending on the dataset size or dimensionality? If so, what would be a good trade-off/threshold in that case?
* Most of our datasets are standard dataframes, but some are also collections of images or texts. I guess we have to 'manually' convert those to dataframes first, right? Or do you know of existing tools that facilitate this?
* Can Feather files already be read in Java/Go/C#/...?

Thanks!
Joaquin