On Tue, Jun 30, 2020 at 8:09 AM Nicholas Poorman <nickpoor...@gmail.com> wrote:
>
> Joaquin,
>
> After reading your proposal I think there are some things you may want to consider.
>
> It sounds like you are trying to come up with a one-size-fits-all solution, but it may be better to define your requirements based on your needs and environment.
>
> For starters, where do you plan to store these files? Do you plan on putting them in cloud object storage like S3, or do you plan on having disk volumes attached to servers you are managing? A format like Parquet is going to be useful for object storage such as S3 because you bundle up everything in memory and then write it out at once. Any append-able format is going to require disk volumes where you have append functionality. There are a few active projects that build commit logs and tombstones on top of Parquet to add this functionality, for example Hudi and Databricks DeltaLake. Also, if you plan on doing anything at scale you might run into issues with lock contention if you choose something backed by b-trees, such as SQLite.
>
> There are two ways you could handle the issues with Parquet implementations that are currently unable to read partial files. One, you could contribute back to the Parquet implementation so that it is capable of doing so. Or two, you could partition your Parquet files and write them in smaller chunks so they could be selectively read. I'm currently in the process of writing the Parquet implementation in Go, so partial read functionality is something I will consider.
>
> If you plan on having a service that can read in one format and return another format to a user, you are either going to need a format capable of stream decoding/encoding or a whole lot of memory on the instances running the service. Something like CSV would allow for stream decoding/encoding. Compression algorithms are for the most part going to be streaming. A b-tree is going to be streamable, as you can do something like a breadth-first iteration over it. Parquet, on the other hand, is either going to require you to write files in small partitions (this is generally bad, and Spark users refer to it as the "small files problem"), or you will need to use an implementation that supports partial reads. There are Parquet implementations in Java and Go that support partial reads. The issue you will face is doing the streaming writes back to the user. If, for example, a user wanted their data returned as Parquet, you would have to do the transformation in memory all at once and then stream it to the user. If doing the transform from/to various file formats is a feature you feel strongly about, I would suggest doing the transforms via out-of-band ETL jobs where the user can then request the files asynchronously later. Doing the transform in-band of the request/response lifecycle doesn't seem scalable given the constraints of some file formats and instance memory.
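Note: to make the partial-read and chunked-write pattern described above concrete, here is a rough pyarrow sketch. The file name, schema, and chunk size are made up for illustration, and this shows the Python implementation rather than the Go one Nick mentions:

    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

    # Chunked write: append one row group at a time instead of holding
    # the whole dataset in memory before writing.
    with pq.ParquetWriter("example.parquet", schema) as writer:
        for start in range(0, 1_000_000, 100_000):  # hypothetical chunk size
            chunk = pa.table(
                {"id": list(range(start, start + 100_000)),
                 "value": [0.0] * 100_000},
                schema=schema,
            )
            writer.write_table(chunk)

    # Partial read: fetch only the columns and row groups you need.
    pf = pq.ParquetFile("example.parquet")
    for i in range(pf.num_row_groups):
        part = pf.read_row_group(i, columns=["id"])
        # ... process `part`, then drop it before reading the next group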
> To your point of storing images with metadata such as tags: I haven't actually tried it, but I suppose you could in theory write the images in one Parquet binary-type column and the tags in another.
>
> Versioning is difficult and I believe there are many attempts at this right now. DeltaLake for example has the ability to query a dataset at a point in time. They basically have Parquet files with some extra JSON files on the side describing the changes. You first read the JSON files to understand the changes and then read the Parquet files they reference. Straight-up versioning of files could be achieved with your underlying file system: S3 has file versioning, Docker has its own internal delta-changes file system layer, etc.
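Note: a deliberately simplified sketch of the "Parquet files plus a JSON log on the side" idea described above. The manifest layout and paths are invented for illustration; real systems such as Hudi and DeltaLake track adds, removes, and far more metadata:

    import json
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical layout: versions/<n>.json lists the Parquet files that
    # make up version <n> of the dataset, e.g.
    #   versions/3.json -> {"version": 3, "files": ["part-0.parquet", "part-1.parquet"]}

    def read_version(version: int) -> pa.Table:
        with open(f"versions/{version}.json") as f:
            manifest = json.load(f)
        tables = [pq.read_table(path) for path in manifest["files"]]
        return pa.concat_tables(tables)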
> I would not recommend storing the files in Feather for long-term storage, as your file size and costs are going to explode compared to a column-oriented format that supports compression.

Note: Feather files support compression now.
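Note: for reference, a compressed Feather (V2) write looks roughly like this with pyarrow; the codec and file name are chosen arbitrarily for the example:

    import pyarrow as pa
    import pyarrow.feather as feather

    table = pa.table({"id": [1, 2, 3], "label": ["a", "b", "c"]})

    # Feather V2 supports per-buffer compression; pyarrow exposes
    # "zstd", "lz4", and "uncompressed" here.
    feather.write_feather(table, "example.feather", compression="zstd")

    roundtrip = feather.read_table("example.feather")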
> Best,
> Nick
>
> On Tue, Jun 30, 2020 at 6:46 AM Joaquin Vanschoren <j.vanscho...@tue.nl> wrote:
> >
> > Hi all,
> >
> > Sorry for restarting an old thread, but we've had a _lot_ of discussions over the past 9 months or so on how to store machine learning datasets internally. We've written a blog post about it and would love to hear your thoughts:
> >
> > https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
> >
> > To be clear: what we need is a data format for archival storage on the server, and preferably one that supports versioning/diff, multi-table storage, and sparse data. Hence, this is for *internal* storage. When OpenML users want to download a dataset in Parquet or Arrow we can always convert it on the fly (or from a cache). We already use Arrow/Feather to cache the datasets after they are downloaded (when possible).
> >
> > One specific concern about Parquet is that we are not entirely sure whether a Parquet file created by one parser (e.g. in R) can always be read by another parser (e.g. in Python). We saw some GitHub issues related to this but we don't know whether this is still an issue. Do you know? Also, it seems that none of the current Python parsers support partial reads/writes, is that correct?
> >
> > Because of these issues, we are still considering a text-based format (e.g. CSV) for our main dataset storage, mainly because of its broad native support in all languages and easy versioning/diffs (we could use git-lfs), and using Parquet/Arrow for later usage where possible. We're still doubting between CSV and Parquet, though.
> >
> > Do you have any thoughts or comments?
> >
> > Thanks!
> > Joaquin
> >
> > On Thu, 20 Jun 2019 at 23:47, Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > hi Joaquin -- there would be no practical difference, primarily it would be for the preservation of APIs in Python and R related to the Feather format. Internally "read_feather" will invoke the same code paths as the Arrow protocol file reader
> > >
> > > - Wes
> > >
> > > On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren <joaquin.vanscho...@gmail.com> wrote:
> > > >
> > > > Thank you all for your very detailed answers! I also read in other threads that the 1.0.0 release might be coming somewhere this fall? I'm really looking forward to that.
> > > > @Wes: will there be any practical difference between Feather and Arrow after the 1.0.0 release? Is it just an alias? What would be the benefits of using Feather rather than Arrow at that point?
> > > >
> > > > Thanks!
> > > > Joaquin
> > > >
> > > > On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch> wrote:
> > > > >
> > > > > hi there,
> > > > >
> > > > > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > > > >
> > > > > > * Can Feather files already be read in Java/Go/C#/...?
> > > > > >
> > > > > > I don't know the status of feather. The Arrow file format should be readable by Java and C++ (I believe all the languages that bind C++ also support the format; these include python, ruby and R). A quick code search of the repo makes me think that there is also support for C#, Rust and Javascript. It doesn't look like the file format is supported in Go yet, but it probably wouldn't be too hard to do.
> > > > >
> > > > > Go doesn't handle Feather files.
> > > > > But there is support (not yet feature complete, see [1]) for Arrow files (r/w):
> > > > > - https://godoc.org/github.com/apache/arrow/go/arrow/ipc
> > > > >
> > > > > hth,
> > > > > -s
> > > > >
> > > > > [1]: https://issues.apache.org/jira/browse/ARROW-3679
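Note: Wes's remark above is that "read_feather" goes through the same code paths as the Arrow file reader. Since Feather V2 is the Arrow IPC file format, an uncompressed Feather file should also open with pyarrow's generic IPC reader. A small sketch, with an arbitrary file name and the round trip assumed from that equivalence:

    import pyarrow as pa
    import pyarrow.feather as feather

    table = pa.table({"x": [1, 2, 3]})

    # Written uncompressed so the bytes on disk are a plain Arrow IPC file.
    feather.write_feather(table, "example.arrow", compression="uncompressed")

    # Read back through the generic Arrow IPC file reader.
    reader = pa.ipc.open_file("example.arrow")
    same_table = reader.read_all()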