Joaquin,

After reading your proposal, I think there are some things you may want
to consider.

It sounds like you are trying to come up with a one-size-fits-all solution,
but it may be better to define your requirements based on your own needs
and environment.

For starters, where do you plan to store these files? Do you plan on
putting them in a cloud object store like S3, or do you plan on having
disk volumes attached to servers you are managing? A format like Parquet
is useful for object storage such as S3 because you bundle everything up
in memory and then write it out at once. Any append-able format is going
to require disk volumes with append functionality. There are a few active
projects that build commit logs and tombstones on top of Parquet to add
this functionality, for example Hudi and Databricks Delta Lake. Also, if
you plan on doing anything at scale, you might run into issues with lock
contention if you choose something backed by b-trees, such as SQLite.
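
For what it's worth, here is a rough sketch of that buffer-then-write
pattern with pyarrow and s3fs (the bucket and object names are made up):

    import pyarrow as pa
    import pyarrow.parquet as pq
    import s3fs

    # Build the whole table in memory first.
    table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

    # Then write it out to object storage in one shot.
    fs = s3fs.S3FileSystem()
    with fs.open("my-bucket/datasets/example.parquet", "wb") as f:
        pq.write_table(table, f)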

There are two ways you could handle the issue that some Parquet
implementations are currently unable to read partial files. One, you could
contribute back to the Parquet implementation so that it is capable of
doing so. Or two, you could partition your Parquet files and write them in
smaller chunks so they can be read selectively. I’m currently working on
the Parquet implementation in Go, so partial-read functionality is
something I will consider.
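
To illustrate the second option, here is an untested sketch with pyarrow
that partitions a dataset on a column so readers can pick up only the
chunks they need (the column names and paths are hypothetical):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "year": [2018, 2018, 2019, 2020],
        "feature": [0.1, 0.2, 0.3, 0.4],
    })

    # Writes one directory per partition value, e.g. dataset/year=2019/...
    pq.write_to_dataset(table, root_path="dataset", partition_cols=["year"])

    # A reader can then load only the partitions it cares about.
    subset = pq.ParquetDataset("dataset", filters=[("year", "=", 2019)]).read()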

If you plan on having a service that can read in one format and return
another format to a user, you are either going to need a format capable of
streaming decoding/encoding or a whole lot of memory on the instances
running the service. Something like CSV allows for streaming
decoding/encoding. Compression algorithms are, for the most part, going to
be streamable. A b-tree can also be streamed, since you can do something
like an in-order iteration over it. Parquet, on the other hand, is either
going to require you to write files in small partitions (this is generally
bad, and Spark users refer to it as the “small files problem”), or you will
need to use an implementation that supports partial reads. There are
Parquet implementations in Java and Go that support partial reads. The
issue you will face is doing the streaming writes back to the user. If, for
example, a user wanted their data returned as Parquet, you would have to do
the transformation in memory all at once and then stream it to the user. If
doing the transform from/to various file formats is a feature you feel
strongly about, I would suggest doing the transforms via out-of-band ETL
jobs, where the user can then request the files asynchronously later. Doing
the transform in-band of the request/response lifecycle doesn’t seem
scalable given the constraints of some file formats and instance memory.
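
As a rough, untested sketch of a partial read, something like this reads a
Parquet file one row group at a time and streams each chunk back out as
CSV (the file name and the send_to_user call are hypothetical):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")
    for i in range(pf.num_row_groups):
        # Only this row group is materialized in memory.
        chunk = pf.read_row_group(i)
        csv_text = chunk.to_pandas().to_csv(index=False, header=(i == 0))
        send_to_user(csv_text)  # hypothetical streaming response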

To your point about storing images with metadata such as tags: I haven’t
actually tried it, but I suppose you could in theory write the images in
one Parquet binary-type column and the tags in another.
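
Something along these lines might work, though again I haven’t tried it
(the file names are made up):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical image files; store the raw bytes in a binary column
    # and the tags in a list<string> column alongside them.
    images = [open(p, "rb").read() for p in ["cat.png", "dog.png"]]
    tags = [["animal", "cat"], ["animal", "dog"]]

    table = pa.table({
        "image": pa.array(images, type=pa.binary()),
        "tags": pa.array(tags, type=pa.list_(pa.string())),
    })
    pq.write_table(table, "images.parquet")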

Versioning is difficult, and I believe there are many attempts at this
right now. Delta Lake, for example, has the ability to query a dataset at a
point in time. They basically have Parquet files with some extra JSON files
on the side describing the changes. You first read the JSON files to
understand the changes and then read the Parquet files they reference.
Straightforward versioning of files could be achieved with your underlying
file system: S3 has object versioning, Docker has its own internal
delta-based filesystem layers, etc.
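
As a toy illustration of the commit-log idea (this is not Delta Lake’s
actual format, just the general shape of it): each JSON commit lists the
Parquet files it adds or removes, and replaying the log up to a given
version tells you which files to read.

    import glob
    import json

    def files_at_version(log_dir, version):
        # Replay JSON commits 0..version and return the live Parquet files.
        live = set()
        for path in sorted(glob.glob(f"{log_dir}/*.json"))[: version + 1]:
            with open(path) as f:
                commit = json.load(f)
            live.update(commit.get("add", []))
            live.difference_update(commit.get("remove", []))
        return sorted(live)

    # e.g. read the dataset as of version 3:
    # tables = [pq.read_table(p) for p in files_at_version("_commit_log", 3)]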

I would not recommend Feather for long-term storage, as your file sizes
and costs are going to explode compared to a column-oriented format that
supports compression.

Best,
Nick


On Tue, Jun 30, 2020 at 6:46 AM Joaquin Vanschoren <j.vanscho...@tue.nl>
wrote:

> Hi all,
>
> Sorry for restarting an old thread, but we've had a _lot_ of discussions
> over the past 9 months or so on how to store machine learning datasets
> internally. We've written a blog post about it and would love to hear your
> thoughts:
>
> https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
>
> To be clear: what we need is a data format for archival storage on the
> server, and preferably one that supports versioning/diff, multi-table
> storage, and sparse data.
> Hence, this is for *internal* storage. When OpenML users want to download a
> dataset in parquet or arrow we can always convert it on the fly (or from a
> cache). We already use Arrow/Feather to cache the datasets after it is
> downloaded (when possible).
>
> One specific concern about parquet is that we are not entirely sure
> whether a parquet file created by one parser (e.g. in R) can always be read
> by another parser (e.g. in Python). We saw some github issues related to
> this but we don't know whether this is still an issue. Do you know? Also,
> it seems that none of the current python parsers support partial
> read/writes, is that correct?
>
> Because of these issues, we are still considering a text-based format (e.g.
> CSV) for our main dataset storage, mainly because of its broad native
> support in all languages and easy versioning/diffs (we could use git-lfs),
> and use parquet/arrow for later usage where possible. We're still doubting
> between CSV and Parquet, though.
>
> Do you have any thoughts or comments?
>
> Thanks!
> Joaquin
>
> On Thu, 20 Jun 2019 at 23:47, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Joaquin -- there would be no practical difference, primarily it
> > would be for the preservation of APIs in Python and R related to the
> > Feather format. Internally "read_feather" will invoke the same code
> > paths as the Arrow protocol file reader
> >
> > - Wes
> >
> > On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren
> > <joaquin.vanscho...@gmail.com> wrote:
> > >
> > > Thank you all for your very detailed answers! I also read in other
> > threads
> > > that the 1.0.0 release might be coming somewhere this fall? I'm really
> > > looking forward to that.
> > > @Wes: will there be any practical difference between Feather and Arrow
> > > after the 1.0.0 release? It is just an alias? What would be the
> benefits
> > of
> > > using Feather rather than Arrow at that point?
> > >
> > > Thanks!
> > > Joaquin
> > >
> > >
> > >
> > > On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch> wrote:
> > >
> > > > hi there,
> > > >
> > > > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield <
> emkornfi...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > > *  Can Feather files already be read in Java/Go/C#/...?
> > > > >
> > > > > I don't know the status of feather.   The arrow file format should
> be
> > > > > readable by Java and C++ (I believe all the languages that bind C++
> > also
> > > > > support the format, these include python, ruby and R) .  A quick
> code
> > > > > search of the repo makes me think that there is also support for
> C#,
> > Rust
> > > > > and Javascript. It doesn't look like the file format isn't
> supported
> > in
> > > > Go
> > > > > yet but it probably wouldn't be too hard to do.
> > > > >
> > > > Go doesn't handle Feather files.
> > > > But there is support (not yet feature complete, see [1]) for Arrow
> > files
> > > > (r/w):
> > > > -  https://godoc.org/github.com/apache/arrow/go/arrow/ipc
> > > >
> > > > hth,
> > > > -s
> > > >
> > > > [1]: https://issues.apache.org/jira/browse/ARROW-3679
> > > >
> >
>
