Thanks!
> You should be able to store different-length vectors in Parquet. Think of
> strings simply as arrays of bytes, and those are variable length. You
> would want to make sure you don’t use DICTIONARY_ENCODING in that case.

Interesting. We'll look at that.
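To make sure I understand, is it something like this pyarrow sketch? (Just
an illustration, not tested on our real data; the column name and values
are made up.) One variable-length list per row, with dictionary encoding
switched off at write time:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # one variable-length vector per row
    vectors = pa.array([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]],
                       type=pa.list_(pa.float32()))
    table = pa.table({"vector": vectors})

    # plain encoding instead of DICTIONARY_ENCODING
    pq.write_table(table, "vectors.parquet", use_dictionary=False)
    print(pq.read_table("vectors.parquet").column("vector"))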
> No, I'm not aware of any tools that do diffs between Parquet files. I'm
> not sure how you could perform a byte-for-byte diff without reading one
> into memory and decoding it. My question here would be who is trying to
> consume the diff you want to generate? Is the diff something you want to
> display to a user? i.e. column A, row 132 was "foo" but has now changed
> to "bar"

Yes. A typical scenario is that there is a public dataset, and different
people have made incremental improvements. This could be, for instance,
removal of constant columns, fixing typos, formatting dates, removing data
from a broken sensor, ... It would be interesting if users could see how
two datasets differ. Another scenario is a reviewing process where the
author of a dataset wants to review changes made by a contributor before
accepting them.
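Even a naive diff would already help for those scenarios. As you say, it
seems to require reading and decoding both files; a rough pandas sketch,
assuming both versions fit in memory (the file names are placeholders):

    import pandas as pd

    old = pd.read_parquet("dataset_v1.parquet")
    new = pd.read_parquet("dataset_v2.parquet")

    # full outer join on all shared columns; the indicator column marks
    # rows that exist only in one of the two versions
    merged = old.merge(new, how="outer", indicator=True)
    print(merged[merged["_merge"] != "both"])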
> Or are you looking to apply an update to a dataset? i.e. I recently
> trained and stored embeddings and now I need to update them but I don't
> want to override the data because I would like to be able to retrieve
> what they were in the last training iteration so I can roll back, run
> parallel tests, etc.

Possibly, although updating an embedding will likely change every value in
the dataset. That seems to call for file versioning and metadata about the
process that generated it.

> Thanks, you may mention me as a contributor to the blog post if you'd
> like!

Done ;). Thanks again,
Joaquin

> On Thu, Jul 2, 2020 at 9:40 AM Joaquin Vanschoren <j.vanscho...@tue.nl>
> wrote:

>> Hi Nick, all,
>>
>> Thanks! I updated the blog post to specify the requirements better.
>>
>> First, we plan to store the datasets in S3 (on min.io). I agree this
>> works nicely with Parquet.
>>
>> Do you know whether there is any activity on supporting partial
>> read/writes in arrow or fastparquet? That would change things a lot.
>>
>> > If doing the transform from/to various file formats is a feature you
>> > feel strongly about, I would suggest doing the transforms via
>> > out-of-band ETL jobs where the user can then request the files
>> > asynchronously later.
>>
>> That's what we were thinking about, yes. We need a 'core' format to
>> store the data and write ETL jobs for, but secondary formats could be
>> stored in S3 and returned on demand.
>>
>> > To your point of storing images with metadata such as tags. I haven’t
>> > actually tried it but I suppose you could in theory write the images
>> > in one Parquet binary type column and the tags in another.
>>
>> Even then, there are different numbers of bounding boxes / tags per
>> image. Can you store different-length vectors in Parquet?
>>
>> > Versioning is difficult and I believe there are many attempts at this
>> > right now. DeltaLake for example has the ability to query a dataset
>> > at a point in time. They basically have Parquet files with some extra
>> > JSON files on the side describing the changes.
>>
>> I've looked at DeltaLake, but as far as I understand, its commit log
>> depends on Spark operations done on the dataframe? Hence, any change to
>> the dataset has to be performed via Spark? Is that correct?
>>
>> > Straight-up versions of files could be achieved with your underlying
>> > file system. S3 has file versioning.
>>
>> Do you know of any tools to compute diffs between Parquet files? What I
>> could find was basically: export both files to CSV and run git diff.
>> DeltaLake would help here, but again, it seems that it only 'tracks'
>> Spark operations done directly on the file?
>>
>> Thanks!
>> Joaquin
>>
>> PS. Nick, would you like to be mentioned as a contributor in the blog
>> post? Your comments helped a lot to improve it ;).
>>
>> > On Tue, Jun 30, 2020 at 6:46 AM Joaquin Vanschoren
>> > <j.vanscho...@tue.nl> wrote:
>> >
>> > > Hi all,
>> > >
>> > > Sorry for restarting an old thread, but we've had a _lot_ of
>> > > discussions over the past 9 months or so on how to store machine
>> > > learning datasets internally. We've written a blog post about it
>> > > and would love to hear your thoughts:
>> > >
>> > > https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
>> > >
>> > > To be clear: what we need is a data format for archival storage on
>> > > the server, and preferably one that supports versioning/diff,
>> > > multi-table storage, and sparse data. Hence, this is for *internal*
>> > > storage. When OpenML users want to download a dataset in parquet or
>> > > arrow we can always convert it on the fly (or from a cache). We
>> > > already use Arrow/Feather to cache the datasets after they are
>> > > downloaded (when possible).
>> > >
>> > > One specific concern about parquet is that we are not entirely sure
>> > > whether a parquet file created by one parser (e.g. in R) can always
>> > > be read by another parser (e.g. in Python). We saw some GitHub
>> > > issues related to this but we don't know whether this is still an
>> > > issue. Do you know? Also, it seems that none of the current Python
>> > > parsers support partial read/writes, is that correct?
>> > >
>> > > Because of these issues, we are still considering a text-based
>> > > format (e.g. CSV) for our main dataset storage, mainly because of
>> > > its broad native support in all languages and easy versioning/diffs
>> > > (we could use git-lfs), and use parquet/arrow for later usage where
>> > > possible. We're still undecided between CSV and Parquet, though.
>> > >
>> > > Do you have any thoughts or comments?
>> > >
>> > > Thanks!
>> > > Joaquin
>> > >
>> > > On Thu, 20 Jun 2019 at 23:47, Wes McKinney <wesmck...@gmail.com>
>> > > wrote:
>> > >
>> > > > hi Joaquin -- there would be no practical difference; primarily
>> > > > it would be for the preservation of APIs in Python and R related
>> > > > to the Feather format. Internally "read_feather" will invoke the
>> > > > same code paths as the Arrow protocol file reader.
>> > > >
>> > > > - Wes
>> > > >
>> > > > On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren
>> > > > <joaquin.vanscho...@gmail.com> wrote:
>> > > > >
>> > > > > Thank you all for your very detailed answers! I also read in
>> > > > > other threads that the 1.0.0 release might be coming somewhere
>> > > > > this fall? I'm really looking forward to that.
>> > > > > @Wes: will there be any practical difference between Feather
>> > > > > and Arrow after the 1.0.0 release? Is it just an alias? What
>> > > > > would be the benefits of using Feather rather than Arrow at
>> > > > > that point?
>> > > > >
>> > > > > Thanks!
>> > > > > Joaquin
>> > > > >
>> > > > > On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch>
>> > > > > wrote:
>> > > > >
>> > > > > > hi there,
>> > > > > >
>> > > > > > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield
>> > > > > > <emkornfi...@gmail.com> wrote:
>> > > > > >
>> > > > > > > > * Can Feather files already be read in Java/Go/C#/...?
>> > > > > > >
>> > > > > > > I don't know the status of Feather. The Arrow file format
>> > > > > > > should be readable by Java and C++ (I believe all the
>> > > > > > > languages that bind C++ also support the format; these
>> > > > > > > include Python, Ruby and R). A quick code search of the
>> > > > > > > repo makes me think that there is also support for C#,
>> > > > > > > Rust and JavaScript. It doesn't look like the file format
>> > > > > > > is supported in Go yet, but it probably wouldn't be too
>> > > > > > > hard to do.
>> > > > > >
>> > > > > > Go doesn't handle Feather files.
>> > > > > > But there is support (not yet feature complete, see [1]) for
>> > > > > > Arrow files (r/w):
>> > > > > > - https://godoc.org/github.com/apache/arrow/go/arrow/ipc
>> > > > > >
>> > > > > > hth,
>> > > > > > -s
>> > > > > >
>> > > > > > [1]: https://issues.apache.org/jira/browse/ARROW-3679
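A small illustration of Wes's point above that "read_feather" and the
Arrow file reader share the same code path: with pyarrow 1.0 or later, an
uncompressed Feather V2 file opens through either API. A minimal sketch
(file name is just a placeholder):

    import pyarrow as pa
    import pyarrow.feather as feather

    table = pa.table({"a": [1, 2, 3]})
    feather.write_feather(table, "demo.feather",
                          compression="uncompressed")

    # the same file opens through both the Feather API and the
    # generic Arrow IPC file reader
    via_feather = feather.read_table("demo.feather")
    via_ipc = pa.ipc.open_file("demo.feather").read_all()
    print(via_feather.equals(via_ipc))  # True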