Thanks!
> You should be able to store different-length vectors in Parquet. Think of
> strings simply as arrays of bytes, and those are variable length. You
> would want to make sure you don’t use DICTIONARY_ENCODING in that case.

Interesting. We'll look at that.
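To make sure I understand, is it something like this pyarrow sketch? (Just
an illustration, not tested on our real data; the column name and values
are made up.) One variable-length list per row, with dictionary encoding
switched off at write time:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # one variable-length vector per row
    vectors = pa.array([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]],
                       type=pa.list_(pa.float32()))
    table = pa.table({"vector": vectors})

    # plain encoding instead of DICTIONARY_ENCODING
    pq.write_table(table, "vectors.parquet", use_dictionary=False)
    print(pq.read_table("vectors.parquet").column("vector"))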
> No, I'm not aware of any tools that do diffs between Parquet files. I'm
> not sure how you could perform a byte-for-byte diff without reading one
> into memory and decoding it. My question here would be who is trying to
> consume the diff you want to generate? Is the diff something you want to
> display to a user? i.e. column A, row 132 was "foo" but has now changed
> to "bar"

Yes. A typical scenario is that there is a public dataset, and different
people have made incremental improvements. This could be, for instance,
removal of constant columns, fixing typos, formatting dates, removing data
from a broken sensor, ... It would be interesting if users could see how
two datasets differ. Another scenario is a reviewing process where the
author of a dataset wants to review changes made by a contributor before
accepting them.
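Even a naive diff would already help for those scenarios. As you say, it
seems to require reading and decoding both files; a rough pandas sketch,
assuming both versions fit in memory (the file names are placeholders):

    import pandas as pd

    old = pd.read_parquet("dataset_v1.parquet")
    new = pd.read_parquet("dataset_v2.parquet")

    # full outer join on all shared columns; the indicator column marks
    # rows that exist only in one of the two versions
    merged = old.merge(new, how="outer", indicator=True)
    print(merged[merged["_merge"] != "both"])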
> Or are you looking to apply an update to a dataset? i.e. I recently
> trained and stored embeddings and now I need to update them but I don't
> want to override the data because I would like to be able to retrieve
> what they were in the last training iteration so I can roll back, run
> parallel tests, etc.

Possibly, although updating an embedding will likely change every value in
the dataset. That seems to call for file versioning and metadata about the
process that generated it.

> Thanks, you may mention me as a contributor to the blog post if you'd
> like!

Done ;). Thanks again,
Joaquin

> On Thu, Jul 2, 2020 at 9:40 AM Joaquin Vanschoren <j.vanscho...@tue.nl>
> wrote:

>> Hi Nick, all,
>>
>> Thanks! I updated the blog post to specify the requirements better.
>>
>> First, we plan to store the datasets in S3 (on min.io). I agree this
>> works nicely with Parquet.
>>
>> Do you know whether there is any activity on supporting partial
>> read/writes in arrow or fastparquet? That would change things a lot.
>>
>> > If doing the transform from/to various file formats is a feature you
>> > feel strongly about, I would suggest doing the transforms via
>> > out-of-band ETL jobs where the user can then request the files
>> > asynchronously later.
>>
>> That's what we were thinking about, yes. We need a 'core' format to
>> store the data and write ETL jobs for, but secondary formats could be
>> stored in S3 and returned on demand.
>>
>> > To your point of storing images with metadata such as tags. I haven’t
>> > actually tried it but I suppose you could in theory write the images
>> > in one Parquet binary type column and the tags in another.
>>
>> Even then, there are different numbers of bounding boxes / tags per
>> image. Can you store different-length vectors in Parquet?
>>
>> > Versioning is difficult and I believe there are many attempts at this
>> > right now. DeltaLake for example has the ability to query a dataset
>> > at a point in time. They basically have Parquet files with some extra
>> > JSON files on the side describing the changes.
>>
>> I've looked at DeltaLake, but as far as I understand, its commit log
>> depends on Spark operations done on the dataframe? Hence, any change to
>> the dataset has to be performed via Spark? Is that correct?
>>
>> > Straight-up versions of files could be achieved with your underlying
>> > file system. S3 has file versioning.
>>
>> Do you know of any tools to compute diffs between Parquet files? What I
>> could find was basically: export both files to CSV and run git diff.
>> DeltaLake would help here, but again, it seems that it only 'tracks'
>> Spark operations done directly on the file?
>>
>> Thanks!
>> Joaquin
>>
>> PS. Nick, would you like to be mentioned as a contributor in the blog
>> post? Your comments helped a lot to improve it ;).
>>
>> > On Tue, Jun 30, 2020 at 6:46 AM Joaquin Vanschoren
>> > <j.vanscho...@tue.nl> wrote:
>> >
>> > > Hi all,
>> > >
>> > > Sorry for restarting an old thread, but we've had a _lot_ of
>> > > discussions over the past 9 months or so on how to store machine
>> > > learning datasets internally. We've written a blog post about it
>> > > and would love to hear your thoughts:
>> > >
>> > > https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
>> > >
>> > > To be clear: what we need is a data format for archival storage on
>> > > the server, and preferably one that supports versioning/diff,
>> > > multi-table storage, and sparse data. Hence, this is for *internal*
>> > > storage. When OpenML users want to download a dataset in parquet or
>> > > arrow we can always convert it on the fly (or from a cache). We
>> > > already use Arrow/Feather to cache the datasets after they are
>> > > downloaded (when possible).
>> > >
>> > > One specific concern about parquet is that we are not entirely sure
>> > > whether a parquet file created by one parser (e.g. in R) can always
>> > > be read by another parser (e.g. in Python). We saw some GitHub
>> > > issues related to this but we don't know whether this is still an
>> > > issue. Do you know? Also, it seems that none of the current Python
>> > > parsers support partial read/writes, is that correct?
>> > >
>> > > Because of these issues, we are still considering a text-based
>> > > format (e.g. CSV) for our main dataset storage, mainly because of
>> > > its broad native support in all languages and easy versioning/diffs
>> > > (we could use git-lfs), and use parquet/arrow for later usage where
>> > > possible. We're still undecided between CSV and Parquet, though.
>> > >
>> > > Do you have any thoughts or comments?
>> > >
>> > > Thanks!
>> > > Joaquin
>> > >
>> > > On Thu, 20 Jun 2019 at 23:47, Wes McKinney <wesmck...@gmail.com>
>> > > wrote:
>> > >
>> > > > hi Joaquin -- there would be no practical difference; primarily
>> > > > it would be for the preservation of APIs in Python and R related
>> > > > to the Feather format. Internally "read_feather" will invoke the
>> > > > same code paths as the Arrow protocol file reader.
>> > > >
>> > > > - Wes
>> > > >
>> > > > On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren
>> > > > <joaquin.vanscho...@gmail.com> wrote:
>> > > > >
>> > > > > Thank you all for your very detailed answers! I also read in
>> > > > > other threads that the 1.0.0 release might be coming somewhere
>> > > > > this fall? I'm really looking forward to that.
>> > > > > @Wes: will there be any practical difference between Feather
>> > > > > and Arrow after the 1.0.0 release? Is it just an alias? What
>> > > > > would be the benefits of using Feather rather than Arrow at
>> > > > > that point?
>> > > > >
>> > > > > Thanks!
>> > > > > Joaquin
>> > > > >
>> > > > > On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch>
>> > > > > wrote:
>> > > > >
>> > > > > > hi there,
>> > > > > >
>> > > > > > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield
>> > > > > > <emkornfi...@gmail.com> wrote:
>> > > > > >
>> > > > > > > > * Can Feather files already be read in Java/Go/C#/...?
>> > > > > > >
>> > > > > > > I don't know the status of Feather. The Arrow file format
>> > > > > > > should be readable by Java and C++ (I believe all the
>> > > > > > > languages that bind C++ also support the format; these
>> > > > > > > include Python, Ruby and R). A quick code search of the
>> > > > > > > repo makes me think that there is also support for C#,
>> > > > > > > Rust and JavaScript. It doesn't look like the file format
>> > > > > > > is supported in Go yet, but it probably wouldn't be too
>> > > > > > > hard to do.
>> > > > > >
>> > > > > > Go doesn't handle Feather files.
>> > > > > > But there is support (not yet feature complete, see [1]) for
>> > > > > > Arrow files (r/w):
>> > > > > > - https://godoc.org/github.com/apache/arrow/go/arrow/ipc
>> > > > > >
>> > > > > > hth,
>> > > > > > -s
>> > > > > >
>> > > > > > [1]: https://issues.apache.org/jira/browse/ARROW-3679
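A small illustration of Wes's point above that "read_feather" and the
Arrow file reader share the same code path: with pyarrow 1.0 or later, an
uncompressed Feather V2 file opens through either API. A minimal sketch
(file name is just a placeholder):

    import pyarrow as pa
    import pyarrow.feather as feather

    table = pa.table({"a": [1, 2, 3]})
    feather.write_feather(table, "demo.feather",
                          compression="uncompressed")

    # the same file opens through both the Feather API and the
    # generic Arrow IPC file reader
    via_feather = feather.read_table("demo.feather")
    via_ipc = pa.ipc.open_file("demo.feather").read_all()
    print(via_feather.equals(via_ipc))  # True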