Hi Micah and Joaquin,

With regard to the Feather format, I have been waiting a _long_ time for the
R community to "catch up" with Apache Arrow development and get a release of
an Arrow R package out that can be installed by most R users. We are finally
approaching that point, so Feather development has been in a holding pattern
for more than three years as a result.
Since Feather is popular in practice, my idea has been to preserve the file
format name and have it be a simple container around the Arrow IPC file
format. So Feather would become a stable binary format once we release a
1.0.0 protocol version. If the goal is to have a stable, memory-mappable
binary format, then at that point Feather is something I'd recommend. If the
Arrow protocol acquires compression, then Feather files will get
compression. My plan is to carry out this Feather evolution after the 0.14.0
release.

I would recommend using Parquet files in general without hesitation, though
they trade deserialization cost for compactness. Arrow and Parquet are
designed to be used together.

- Wes

On Sat, Jun 15, 2019 at 11:07 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Joaquin,
> Answers inline:
>
> > Thanks, that explains the arrow-parquet relationship very nicely.
> > So, at the moment you would recommend Parquet for any form of archival
> > storage, right?
>
> Yes, Parquet should be used as an archival format.
>
> > * Is Feather a good choice for long-term storage (is the binary format
> > stable)?
>
> It is worth mentioning that the Arrow file format and the Feather format
> are not the same thing. My understanding is that Feather is not being
> actively developed, and the idea is that it will be deprecated once there
> is wider support for the Arrow file format.
>
> > * What meta-data is stored? Are the column names and data types always
> > stored? For categorical columns, can I store the named categories/levels?
> > Is there a way to append additional meta-data, or is it best to store
> > that in a separate file (e.g. json)?
>
> The Arrow file format (
> https://arrow.apache.org/docs/format/IPC.html#file-format) always has a
> schema as the first message, which denotes column names and data types.
> Categorical columns can be supported via dictionary encoding.
> Custom metadata is supported at the Schema (
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L323), Column
> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291), and
> batch level (
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L98). There
> was also a recent proposal to add it to the Footer of the file as well.
>
> > * What is the status of support for sparse data? Can I store large
> > sparse datasets efficiently? I noticed sparse_tensor in Arrow. Is it
> > available in Feather/Parquet?
>
> I don't know the answer to this for Feather or Parquet. Currently Arrow
> doesn't support sparse data other than the sparse tensor you mentioned
> (there has been some discussion on adding more support for it in the past,
> but not enough developer bandwidth to follow up on it).
>
> > * What is the status of compression for Feather? We have some datasets
> > that are quite large (several GB), and we'd like to onboard datasets
> > like Imagenet which are 130GB in TFRecord format (but I read that
> > parquet can store it in about 40GB).
>
> I don't know about Feather, but compression is not directly supported
> within the Arrow format at the moment (it has the same status as
> sparseness, i.e. it has been discussed but nobody has worked on it). You
> can always compress the entire file externally, though.
>
> > * Would it make more sense to use both Parquet and Feather, depending on
> > the dataset size or dimensionality? If so, what would be a good
> > trade-off/threshold in that case?
>
> See above; for archival purposes Parquet is still probably preferred.
>
> > * Most of our datasets are standard dataframes, but some are also
> > collections of images or texts. I guess we have to 'manually' convert
> > those to dataframes first, right? Or do you know of existing tools to
> > facilitate this?
>
> Without more details, I would guess you would need to manually convert
> these to dataframes first.
>
> > * Can Feather files already be read in Java/Go/C#/...?
>
> I don't know the status of Feather. The Arrow file format should be
> readable by Java and C++ (I believe all the languages that bind C++ also
> support the format; these include Python, Ruby, and R). A quick code
> search of the repo makes me think that there is also support for C#, Rust,
> and JavaScript. It doesn't look like the file format is supported in Go
> yet, but it probably wouldn't be too hard to do.
>
> Thanks,
> Micah
>
> On Wed, Jun 12, 2019 at 12:02 PM Joaquin Vanschoren <
> joaquin.vanscho...@gmail.com> wrote:
>
> > Hi Neal,
> >
> > Thanks, that explains the arrow-parquet relationship very nicely.
> > So, at the moment you would recommend Parquet for any form of archival
> > storage, right?
> > We could also experiment with storing data as both Parquet and Arrow
> > for now.
> >
> > Still curious about the other questions, like meta-data, sparse data,
> > Feather support, etc.
> >
> > Cheers,
> > Joaquin
> >
> > On Wed, 12 Jun 2019 at 20:25, Neal Richardson
> > <neal.p.richard...@gmail.com> wrote:
> >
> > > Hi Joaquin,
> > > I recognize that this doesn't answer all of your questions, but we
> > > are in the process of adding a FAQ to the arrow.apache.org website
> > > that speaks to some of them:
> > > https://github.com/apache/arrow/blob/master/site/faq.md
> > >
> > > Neal
> > >
> > > On Wed, Jun 12, 2019 at 3:39 AM Joaquin Vanschoren <
> > > joaquin.vanscho...@gmail.com> wrote:
> > >
> > > > Dear all,
> > > >
> > > > Thanks for creating Arrow! I'm part of OpenML.org, an open source
> > > > initiative/platform for sharing machine learning datasets and
> > > > models. We are currently storing data in either ARFF or Parquet,
> > > > but are looking into whether e.g. Feather or a mix of Feather and
> > > > Parquet could be the new standard for all(?) our datasets
> > > > (currently about 20,000 of them).
> > > > We had a few questions though, and would definitely like to hear
> > > > your opinion. Apologies in advance if there were recent
> > > > announcements about these that I missed.
> > > >
> > > > * Is Feather a good choice for long-term storage (is the binary
> > > > format stable)?
> > > > * What meta-data is stored? Are the column names and data types
> > > > always stored? For categorical columns, can I store the named
> > > > categories/levels? Is there a way to append additional meta-data,
> > > > or is it best to store that in a separate file (e.g. json)?
> > > > * What is the status of support for sparse data? Can I store large
> > > > sparse datasets efficiently? I noticed sparse_tensor in Arrow. Is
> > > > it available in Feather/Parquet?
> > > > * What is the status of compression for Feather? We have some
> > > > datasets that are quite large (several GB), and we'd like to
> > > > onboard datasets like Imagenet which are 130GB in TFRecord format
> > > > (but I read that parquet can store it in about 40GB).
> > > > * Would it make more sense to use both Parquet and Feather,
> > > > depending on the dataset size or dimensionality? If so, what would
> > > > be a good trade-off/threshold in that case?
> > > > * Most of our datasets are standard dataframes, but some are also
> > > > collections of images or texts. I guess we have to 'manually'
> > > > convert those to dataframes first, right? Or do you know of
> > > > existing tools to facilitate this?
> > > > * Can Feather files already be read in Java/Go/C#/...?
> > > >
> > > > Thanks!
> > > > Joaquin