Re: Arrow as a common open standard for machine learning data

2020-07-02 Thread Joaquin Vanschoren
Thanks! > You should be able to store different length vectors in Parquet. Think of > strings simply as an array of bytes, and those are variable length. You > would want to make sure you don’t use DICTIONARY_ENCODING in that case. > Interesting. We'll look at that. > No, I'm not aware of any

Re: Arrow as a common open standard for machine learning data

2020-07-02 Thread Nicholas Poorman
Joaquin, > Do you know whether there is any activity on supporting partial read/writes in arrow or fastparquet? I’m not entirely sure about the status of partial read/writes in Arrow’s Parquet implementation, but https://github.com/xitongsys/parquet-go, for example, has this capability. > Even then, t

Re: Arrow as a common open standard for machine learning data

2020-07-02 Thread Joaquin Vanschoren
Hi Nick, all, Thanks! I updated the blog post to specify the requirements better. First, we plan to store the datasets in S3 (on min.io). I agree this works nicely with Parquet. Do you know whether there is any activity on supporting partial read/writes in arrow or fastparquet? That would change th

Re: Arrow as a common open standard for machine learning data

2020-06-30 Thread Wes McKinney
On Tue, Jun 30, 2020 at 8:09 AM Nicholas Poorman wrote: > > Joaquin, > > After reading your proposal I think there may be some things you may want > to consider. > > It sounds like you are trying to come up with a one size fits all solution > but it may be better to define your requirements based

Re: Arrow as a common open standard for machine learning data

2020-06-30 Thread Nicholas Poorman
Joaquin, After reading your proposal I think there may be some things you may want to consider. It sounds like you are trying to come up with a one size fits all solution but it may be better to define your requirements based on your needs and environment. For starters, where do you plan to stor

Re: Arrow as a common open standard for machine learning data

2020-06-30 Thread Joaquin Vanschoren
Hi all, Sorry for restarting an old thread, but we've had a _lot_ of discussions over the past 9 months or so on how to store machine learning datasets internally. We've written a blog post about it and would love to hear your thoughts: https://openml.github.io/blog/openml/data/2020/03/23/Finding-

Re: Arrow as a common open standard for machine learning data

2019-06-20 Thread Wes McKinney
Hi Joaquin -- there would be no practical difference; primarily it would be for the preservation of APIs in Python and R related to the Feather format. Internally, "read_feather" will invoke the same code paths as the Arrow protocol file reader - Wes On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanscho

Re: Arrow as a common open standard for machine learning data

2019-06-20 Thread Joaquin Vanschoren
Thank you all for your very detailed answers! I also read in other threads that the 1.0.0 release might be coming sometime this fall? I'm really looking forward to that. @Wes: will there be any practical difference between Feather and Arrow after the 1.0.0 release? Is it just an alias? What would

Re: Arrow as a common open standard for machine learning data

2019-06-16 Thread Sebastien Binet
hi there, On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield wrote: > > * Can Feather files already be read in Java/Go/C#/...? > > I don't know the status of feather. The arrow file format should be > readable by Java and C++ (I believe all the languages that bind C++ also > support the format,

Re: Arrow as a common open standard for machine learning data

2019-06-16 Thread Wes McKinney
hi Micah and Joaquin, With regards to the Feather format, I have been waiting a _long_ time for the R community to "catch up" with Apache Arrow development and get a release of an Arrow R project out that can be installed by most R users. We are finally approaching that point, and so Feather devel

Re: Arrow as a common open standard for machine learning data

2019-06-15 Thread Micah Kornfield
Hi Joaquin, Answers inline: Thanks, that explains the arrow-parquet relationship very nicely. > So, at the moment you would recommend Parquet for any form of archival > storage, right? Yes, Parquet should be used as an archival format. * Is Feather a good choice for long-term storage (is the bina

Re: Arrow as a common open standard for machine learning data

2019-06-12 Thread Joaquin Vanschoren
Hi Neal, Thanks, that explains the arrow-parquet relationship very nicely. So, at the moment you would recommend Parquet for any form of archival storage, right? We could also experiment with storing data as both Parquet and Arrow for now. Still curious about the other questions, like meta-data,

Re: Arrow as a common open standard for machine learning data

2019-06-12 Thread Neal Richardson
Hi Joaquin, I recognize that this doesn't answer all of your questions, but we are in the process of adding a FAQ to the arrow.apache.org website that speaks to some of them: https://github.com/apache/arrow/blob/master/site/faq.md Neal On Wed, Jun 12, 2019 at 3:39 AM Joaquin Vanschoren < joaquin.

Arrow as a common open standard for machine learning data

2019-06-12 Thread Joaquin Vanschoren
Dear all, Thanks for creating Arrow! I'm part of OpenML.org, an open source initiative/platform for sharing machine learning datasets and models. We are currently storing data in either ARFF or Parquet, but are looking into whether e.g. Feather or a mix of Feather and Parquet could be the new stan