On Tue, Jun 30, 2020 at 8:09 AM Nicholas Poorman <nickpoor...@gmail.com> wrote:
>
> Joaquin,
>
> After reading your proposal I think there are some things you may want to consider.
>
> It sounds like you are trying to come up with a one-size-fits-all solution, but it may be better to define your requirements based on your needs and environment.
>
> For starters, where do you plan to store these files? Do you plan on putting them in cloud object storage like S3, or do you plan on having disk volumes attached to servers you are managing? A format like Parquet is going to be useful for object storage such as S3 because you bundle up everything in memory and then write it out at once. Any append-able format is going to require disk volumes where you have append functionality. There are a few active projects that build commit logs and tombstones on top of Parquet to add this functionality, for example Hudi and Databricks DeltaLake. Also, if you plan on doing anything at scale you might run into issues with lock contention if you choose something backed by b-trees, such as SQLite.
>
> There are two ways you could handle the issues with Parquet implementations that are currently unable to read partial files. One, you could contribute back to the Parquet implementation so that it is capable of doing so. Or two, you could partition your Parquet files and write them in smaller chunks so they could be selectively read. I'm currently in the process of writing the Parquet implementation in Go, so partial read functionality is something I will consider.
>
> If you plan on having a service that can read in one format and return another format to a user, you are either going to need a format capable of stream decoding/encoding or a whole lot of memory on the instances running the service. Something like CSV would allow for stream decoding/encoding. Compression algorithms are for the most part going to be streaming. A b-tree is going to be streamable, as you can do something like a breadth-first iteration over it. Parquet, on the other hand, is either going to require you to write files in small partitions (this is generally bad, and Spark users refer to it as the "small files problem"), or you will need to use an implementation that supports partial reads. There are Parquet implementations in Java and Go that support partial reads. The issue you will face is doing the streaming writes back to the user. If, for example, a user wanted their data returned as Parquet, you would have to do the transformation in memory all at once and then stream it to the user. If doing the transform from/to various file formats is a feature you feel strongly about, I would suggest doing the transforms via out-of-band ETL jobs where the user can then request the files asynchronously later. Doing the transform in-band of the request/response lifecycle doesn't seem scalable given the constraints of some file formats and instance memory.
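Note: to make the partial-read and chunked-write pattern described above concrete, here is a rough pyarrow sketch. The file name, schema, and chunk size are made up for illustration, and this shows the Python implementation rather than the Go one Nick mentions:

    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

    # Chunked write: append one row group at a time instead of holding
    # the whole dataset in memory before writing.
    with pq.ParquetWriter("example.parquet", schema) as writer:
        for start in range(0, 1_000_000, 100_000):  # hypothetical chunk size
            chunk = pa.table(
                {"id": list(range(start, start + 100_000)),
                 "value": [0.0] * 100_000},
                schema=schema,
            )
            writer.write_table(chunk)

    # Partial read: fetch only the columns and row groups you need.
    pf = pq.ParquetFile("example.parquet")
    for i in range(pf.num_row_groups):
        part = pf.read_row_group(i, columns=["id"])
        # ... process `part`, then drop it before reading the next group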
> To your point of storing images with metadata such as tags: I haven't actually tried it, but I suppose you could in theory write the images in one Parquet binary-type column and the tags in another.
>
> Versioning is difficult and I believe there are many attempts at this right now. DeltaLake for example has the ability to query a dataset at a point in time. They basically have Parquet files with some extra JSON files on the side describing the changes. You first read the JSON files to understand the changes and then read the Parquet files they reference. Straight-up versioning of files could be achieved with your underlying file system: S3 has file versioning, Docker has its own internal delta-changes file system layer, etc.
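Note: a deliberately simplified sketch of the "Parquet files plus a JSON log on the side" idea described above. The manifest layout and paths are invented for illustration; real systems such as Hudi and DeltaLake track adds, removes, and far more metadata:

    import json
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical layout: versions/<n>.json lists the Parquet files that
    # make up version <n> of the dataset, e.g.
    #   versions/3.json -> {"version": 3, "files": ["part-0.parquet", "part-1.parquet"]}

    def read_version(version: int) -> pa.Table:
        with open(f"versions/{version}.json") as f:
            manifest = json.load(f)
        tables = [pq.read_table(path) for path in manifest["files"]]
        return pa.concat_tables(tables)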
> I would not recommend storing the files in Feather for long-term storage, as your file size and costs are going to explode compared to a column-oriented format that supports compression.

Note: Feather files support compression now.
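Note: for reference, a compressed Feather (V2) write looks roughly like this with pyarrow; the codec and file name are chosen arbitrarily for the example:

    import pyarrow as pa
    import pyarrow.feather as feather

    table = pa.table({"id": [1, 2, 3], "label": ["a", "b", "c"]})

    # Feather V2 supports per-buffer compression; pyarrow exposes
    # "zstd", "lz4", and "uncompressed" here.
    feather.write_feather(table, "example.feather", compression="zstd")

    roundtrip = feather.read_table("example.feather")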
> Best,
> Nick
>
> On Tue, Jun 30, 2020 at 6:46 AM Joaquin Vanschoren <j.vanscho...@tue.nl> wrote:
> >
> > Hi all,
> >
> > Sorry for restarting an old thread, but we've had a _lot_ of discussions over the past 9 months or so on how to store machine learning datasets internally. We've written a blog post about it and would love to hear your thoughts:
> >
> > https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
> >
> > To be clear: what we need is a data format for archival storage on the server, and preferably one that supports versioning/diff, multi-table storage, and sparse data. Hence, this is for *internal* storage. When OpenML users want to download a dataset in Parquet or Arrow we can always convert it on the fly (or from a cache). We already use Arrow/Feather to cache the datasets after they are downloaded (when possible).
> >
> > One specific concern about Parquet is that we are not entirely sure whether a Parquet file created by one parser (e.g. in R) can always be read by another parser (e.g. in Python). We saw some GitHub issues related to this but we don't know whether this is still an issue. Do you know? Also, it seems that none of the current Python parsers support partial reads/writes, is that correct?
> >
> > Because of these issues, we are still considering a text-based format (e.g. CSV) for our main dataset storage, mainly because of its broad native support in all languages and easy versioning/diffs (we could use git-lfs), and using Parquet/Arrow for later usage where possible. We're still doubting between CSV and Parquet, though.
> >
> > Do you have any thoughts or comments?
> >
> > Thanks!
> > Joaquin
> >
> > On Thu, 20 Jun 2019 at 23:47, Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > hi Joaquin -- there would be no practical difference, primarily it would be for the preservation of APIs in Python and R related to the Feather format. Internally "read_feather" will invoke the same code paths as the Arrow protocol file reader
> > >
> > > - Wes
> > >
> > > On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren <joaquin.vanscho...@gmail.com> wrote:
> > > >
> > > > Thank you all for your very detailed answers! I also read in other threads that the 1.0.0 release might be coming somewhere this fall? I'm really looking forward to that.
> > > > @Wes: will there be any practical difference between Feather and Arrow after the 1.0.0 release? Is it just an alias? What would be the benefits of using Feather rather than Arrow at that point?
> > > >
> > > > Thanks!
> > > > Joaquin
> > > >
> > > > On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch> wrote:
> > > > >
> > > > > hi there,
> > > > >
> > > > > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > > > >
> > > > > > * Can Feather files already be read in Java/Go/C#/...?
> > > > > >
> > > > > > I don't know the status of feather. The Arrow file format should be readable by Java and C++ (I believe all the languages that bind C++ also support the format; these include python, ruby and R). A quick code search of the repo makes me think that there is also support for C#, Rust and Javascript. It doesn't look like the file format is supported in Go yet, but it probably wouldn't be too hard to do.
> > > > >
> > > > > Go doesn't handle Feather files.
> > > > > But there is support (not yet feature complete, see [1]) for Arrow files (r/w):
> > > > > - https://godoc.org/github.com/apache/arrow/go/arrow/ipc
> > > > >
> > > > > hth,
> > > > > -s
> > > > >
> > > > > [1]: https://issues.apache.org/jira/browse/ARROW-3679
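Note: Wes's remark above is that "read_feather" goes through the same code paths as the Arrow file reader. Since Feather V2 is the Arrow IPC file format, an uncompressed Feather file should also open with pyarrow's generic IPC reader. A small sketch, with an arbitrary file name and the round trip assumed from that equivalence:

    import pyarrow as pa
    import pyarrow.feather as feather

    table = pa.table({"x": [1, 2, 3]})

    # Written uncompressed so the bytes on disk are a plain Arrow IPC file.
    feather.write_feather(table, "example.arrow", compression="uncompressed")

    # Read back through the generic Arrow IPC file reader.
    reader = pa.ipc.open_file("example.arrow")
    same_table = reader.read_all()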