Joaquin,

> Do you know whether there is any activity on supporting partial read/writes
> in arrow or fastparquet?

I’m not entirely sure about the status of partial read/writes in Arrow’s
Parquet implementation, but https://github.com/xitongsys/parquet-go, for
example, has this capability.
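
If row-group-level access counts as a partial read for your use case,
pyarrow can already do that much. A rough sketch (the file and column names
here are hypothetical, and I haven't looked at the partial-write side):

    import pyarrow.parquet as pq

    # Open the file without reading it all into memory.
    pf = pq.ParquetFile("images.parquet")  # hypothetical file
    print(pf.num_row_groups)

    # Read only one row group, and only the columns you need.
    subset = pf.read_row_group(0, columns=["image_id", "tags"])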

> Even then, there are different numbers of bounding boxes / tags per image.
> Can you store different-length vectors in Parquet?

You should be able to store different-length vectors in Parquet. Think of
strings: they are simply arrays of bytes, and those are variable length. You
would want to make sure you don’t use DICTIONARY_ENCODING in that case.
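
Something along these lines ought to work with pyarrow (just a sketch; the
column names and data are made up):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical example: each image has a variable number of boxes and tags.
    table = pa.table({
        "image": pa.array([b"<png bytes>", b"<png bytes>"], type=pa.binary()),
        "boxes": pa.array(
            [[0.1, 0.2, 0.5, 0.5], [0.3, 0.3, 0.2, 0.2, 0.6, 0.1, 0.3, 0.3]],
            type=pa.list_(pa.float32()),
        ),
        "tags": pa.array([["cat"], ["dog", "person"]], type=pa.list_(pa.string())),
    })

    # Skip dictionary encoding, as suggested above.
    pq.write_table(table, "images.parquet", use_dictionary=False)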

> I've looked at DeltaLake, but as far as I understand, its commit log
> depends on Spark operations done on the dataframe? Hence, any change to the
> dataset has to be performed via Spark? Is that correct?

Until someone replicates the functionality outside of Spark, yes that is
the drawback and why I have been hesitant to adopt DeltaLake.

> Do you know of any tools to compute diffs between Parquet files? What I
> could find was basically: export both files to CSV and run git diff.
> DeltaLake would help here, but again, it seems that it only 'tracks' Spark
> operations done directly on the file?

No, I'm not aware of any tools that do diffs between Parquet files. I'm not
sure how you could perform a byte-for-byte diff without reading one into
memory and decoding it. My question here would be: who is trying to consume
the diff you want to generate? Is the diff something you want to display to
a user? i.e. column A, row 132 was "foo" but has now changed to "bar". Or
are you looking to apply an update to a dataset? i.e. I recently trained
and stored embeddings and now I need to update them, but I don't want to
overwrite the data because I would like to be able to retrieve what they
were in the last training iteration so I can roll back, run parallel tests,
etc.
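
If the goal is just to see what changed between two versions, the naive
approach is to decode both files and diff in memory. A sketch only (the file
names are hypothetical, and it assumes the two files share a schema with
hashable column values):

    import pyarrow.parquet as pq

    old = pq.read_table("dataset_v1.parquet").to_pandas()
    new = pq.read_table("dataset_v2.parquet").to_pandas()

    # Outer join on all shared columns; the indicator says where each row came from.
    merged = old.merge(new, how="outer", indicator=True)
    removed = merged[merged["_merge"] == "left_only"]   # rows only in the old file
    added = merged[merged["_merge"] == "right_only"]    # rows only in the new file
    print(len(added), "rows added/changed,", len(removed), "rows removed/changed")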

I believe DeltaLake has a commit log. However, it probably doesn't provide
a diff. The commit log does give you the ability to ask "What did the data
look like at this point in time?".
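
If I remember correctly, the point-in-time query looks roughly like this
through PySpark (a sketch, not something I've verified recently; the path is
made up, and `spark` is the SparkSession you'd already have in a PySpark
shell with the Delta Lake package loaded):

    # Current state of the table.
    df_now = spark.read.format("delta").load("/data/openml/dataset")

    # The same table as of an earlier commit version, or an earlier timestamp.
    df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/data/openml/dataset")
    df_old = spark.read.format("delta").option("timestampAsOf", "2020-06-01").load("/data/openml/dataset")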

Thanks, you may mention me as a contributor to the blog post if you'd like!

Best,
Nick Poorman



On Thu, Jul 2, 2020 at 9:40 AM Joaquin Vanschoren <j.vanscho...@tue.nl>
wrote:

> Hi Nick, all,
>
> Thanks! I updated the blog post to specify the requirements better.
>
> First, we plan to store the datasets in S3 (on min.io). I agree this works
> nicely with Parquet.
>
> Do you know whether there is any activity on supporting partial read/writes
> in arrow or fastparquet? That would change things a lot.
>
>
> > If doing the transform from/to various file formats is a feature you feel
> > strongly about, I would suggest doing the transforms via out-of-band ETL
> > jobs where the user can then request the files asynchronously later.
>
>
> That's what we were thinking about, yes. We need a 'core' format to store
> the data and write ETL jobs for, but secondary formats could be stored in
> S3 and returned on demand.
>
>
> > To your point of storing images with metadata such as tags: I haven’t
> > actually tried it, but I suppose you could in theory write the images in
> > one Parquet binary-type column and the tags in another.
> >
>
> Even then, there are different numbers of bounding boxes / tags per image.
> Can you store different-length vectors in Parquet?
>
>
> > Versioning is difficult and I believe there are many attempts at this
> > right now. DeltaLake, for example, has the ability to query a dataset at
> > a point in time. They basically have Parquet files with some extra json
> > files on the side describing the changes.
>
>
> I've looked at DeltaLake, but as far as I understand, its commit log
> depends on Spark operations done on the dataframe? Hence, any change to the
> dataset has to be performed via Spark? Is that correct?
>
>
> > Straight up versions of file could be achieved with your underlying file
> > system. S3 has file versioning.
> >
>
> Do you know of any tools to compute diffs between Parquet files? What I
> could find was basically: export both files to CSV and run git diff.
> DeltaLake would help here, but again, it seems that it only 'tracks' Spark
> operations done directly on the file?
>
> Thanks!
> Joaquin
>
> PS. Nick, would you like to be mentioned as a contributor in the blog post?
> Your comments helped a lot to improve it ;).
>
>
>
>
> > On Tue, Jun 30, 2020 at 6:46 AM Joaquin Vanschoren <j.vanscho...@tue.nl>
> > wrote:
> >
> > > Hi all,
> > >
> > > Sorry for restarting an old thread, but we've had a _lot_ of
> > > discussions over the past 9 months or so on how to store machine
> > > learning datasets internally. We've written a blog post about it and
> > > would love to hear your thoughts:
> > >
> > >
> > > https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
> > >
> > > To be clear: what we need is a data format for archival storage on the
> > > server, and preferably one that supports versioning/diff, multi-table
> > > storage, and sparse data.
> > > Hence, this is for *internal* storage. When OpenML users want to
> > > download a dataset in parquet or arrow we can always convert it on the
> > > fly (or from a cache). We already use Arrow/Feather to cache the
> > > datasets after they are downloaded (when possible).
> > >
> > > One specific concern about parquet is that we are not entirely sure
> > > whether a parquet file created by one parser (e.g. in R) can always be
> > > read by another parser (e.g. in Python). We saw some github issues
> > > related to this but we don't know whether this is still an issue. Do
> > > you know? Also, it seems that none of the current python parsers
> > > support partial read/writes, is that correct?
> > >
> > > Because of these issues, we are still considering a text-based format
> > > (e.g. CSV) for our main dataset storage, mainly because of its broad
> > > native support in all languages and easy versioning/diffs (we could use
> > > git-lfs), and use parquet/arrow for later usage where possible. We're
> > > still undecided between CSV and Parquet, though.
> > >
> > > Do you have any thoughts or comments?
> > >
> > > Thanks!
> > > Joaquin
> > >
> > > On Thu, 20 Jun 2019 at 23:47, Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > >
> > > > hi Joaquin -- there would be no practical difference, primarily it
> > > > would be for the preservation of APIs in Python and R related to the
> > > > Feather format. Internally "read_feather" will invoke the same code
> > > > paths as the Arrow protocol file reader
> > > >
> > > > - Wes
> > > >
> > > > On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren
> > > > <joaquin.vanscho...@gmail.com> wrote:
> > > > >
> > > > > Thank you all for your very detailed answers! I also read in other
> > > > > threads that the 1.0.0 release might be coming somewhere this fall?
> > > > > I'm really looking forward to that.
> > > > > @Wes: will there be any practical difference between Feather and
> > > > > Arrow after the 1.0.0 release? Is it just an alias? What would be
> > > > > the benefits of using Feather rather than Arrow at that point?
> > > > >
> > > > > Thanks!
> > > > > Joaquin
> > > > >
> > > > >
> > > > >
> > > > > On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch>
> > > > > wrote:
> > > > >
> > > > > > hi there,
> > > > > >
> > > > > > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield
> > > > > > <emkornfi...@gmail.com> wrote:
> > > > > >
> > > > > > > > *  Can Feather files already be read in Java/Go/C#/...?
> > > > > > >
> > > > > > > I don't know the status of feather. The arrow file format
> > > > > > > should be readable by Java and C++ (I believe all the languages
> > > > > > > that bind C++ also support the format; these include python,
> > > > > > > ruby and R). A quick code search of the repo makes me think that
> > > > > > > there is also support for C#, Rust and Javascript. It doesn't
> > > > > > > look like the file format is supported in Go yet, but it
> > > > > > > probably wouldn't be too hard to do.
> > > > > > >
> > > > > > Go doesn't handle Feather files.
> > > > > > But there is support (not yet feature complete, see [1]) for
> > > > > > Arrow files (r/w):
> > > > > > -  https://godoc.org/github.com/apache/arrow/go/arrow/ipc
> > > > > >
> > > > > > hth,
> > > > > > -s
> > > > > >
> > > > > > [1]: https://issues.apache.org/jira/browse/ARROW-3679
> > > > > >
> > > >
> > >
> >
>
