Re: Arrow as a common open standard for machine learning data

Joaquin Vanschoren Tue, 30 Jun 2020 03:47:00 -0700

Hi all,

Sorry for restarting an old thread, but we've had a _lot_ of discussions
over the past 9 months or so on how to store machine learning datasets
internally. We've written a blog post about it and would love to hear your
thoughts:
https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html

To be clear: what we need is a data format for archival storage on the
server, and preferably one that supports versioning/diff, multi-table
storage, and sparse data.
Hence, this is for *internal* storage. When OpenML users want to download a
dataset in parquet or arrow we can always convert it on the fly (or from a
cache). We already use Arrow/Feather to cache the datasets after it is
downloaded (when possible).

One specific concern about Parquet is that we are not entirely sure
whether a Parquet file created by one parser (e.g. in R) can always be read
by another parser (e.g. in Python). We saw some GitHub issues related to
this but we don't know whether this is still a problem. Do you know? Also,
it seems that none of the current Python parsers support partial
reads/writes, is that correct?

Because of these issues, we are still considering a text-based format (e.g.
CSV) for our main dataset storage, mainly because of its broad native
support in all languages and easy versioning/diffs (we could use git-lfs),
and use Parquet/Arrow for later usage where possible. We're still undecided
between CSV and Parquet, though.
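
To make the versioning/diff point concrete, here is a minimal sketch using
only the Python standard library. It is an illustration, not how OpenML
stores or compares data: the function names (csv_diff, changed_rows), the
file labels v1.csv/v2.csv, and the sample rows are all invented for this
example, and it assumes each row has a unique key column.

```python
import csv
import difflib
import io


def csv_diff(old_text, new_text):
    """Line-based unified diff between two versions of a CSV file."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="v1.csv",  # hypothetical file labels
            tofile="v2.csv",
            lineterm="",
        )
    )


def changed_rows(old_text, new_text, key):
    """Row-level summary of changes, matching rows on a key column."""
    old = {row[key]: row for row in csv.DictReader(io.StringIO(old_text))}
    new = {row[key]: row for row in csv.DictReader(io.StringIO(new_text))}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Invented sample data: two versions of a tiny dataset.
V1 = "id,sepal_length,species\n1,5.1,setosa\n2,4.9,setosa\n3,6.3,virginica\n"
V2 = "id,sepal_length,species\n1,5.1,setosa\n2,5.0,setosa\n4,5.8,versicolor\n"

summary = changed_rows(V1, V2, key="id")
diff_lines = csv_diff(V1, V2)

print(summary)
print("\n".join(diff_lines))
```

The line-based diff is roughly what a text-oriented version control tool
would show for a CSV file; the keyed comparison is one way to get a
row-level view that does not depend on row order.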

Do you have any thoughts or comments?

Thanks!
Joaquin

On Thu, 20 Jun 2019 at 23:47, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Joaquin -- there would be no practical difference, primarily it
> would be for the preservation of APIs in Python and R related to the
> Feather format. Internally "read_feather" will invoke the same code
> paths as the Arrow protocol file reader
>
> - Wes
>
> On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren
> <joaquin.vanscho...@gmail.com> wrote:
> >
> > Thank you all for your very detailed answers! I also read in other
> > threads that the 1.0.0 release might be coming sometime this fall? I'm
> > really looking forward to that.
> > @Wes: will there be any practical difference between Feather and Arrow
> > after the 1.0.0 release? Is it just an alias? What would be the benefits
> > of using Feather rather than Arrow at that point?
> >
> > Thanks!
> > Joaquin
> >
> >
> >
> > On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch> wrote:
> >
> > > hi there,
> > >
> > > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > > > > *  Can Feather files already be read in Java/Go/C#/...?
> > > >
> > > > I don't know the status of Feather. The Arrow file format should be
> > > > readable by Java and C++ (I believe all the languages that bind C++
> > > > also support the format; these include Python, Ruby and R). A quick
> > > > code search of the repo makes me think that there is also support for
> > > > C#, Rust and JavaScript. It doesn't look like the file format is
> > > > supported in Go yet, but it probably wouldn't be too hard to do.
> > > >
> > > Go doesn't handle Feather files.
> > > But there is support (not yet feature complete, see [1]) for Arrow
> > > files (r/w):
> > > -  https://godoc.org/github.com/apache/arrow/go/arrow/ipc
> > >
> > > hth,
> > > -s
> > >
> > > [1]: https://issues.apache.org/jira/browse/ARROW-3679
> > >
>
