Hello Uwe,

(I messed up my mailing list settings; sorry if this message shows up outside the original thread.)

I created a PR for the "Powered by" page. Thanks for the suggestion!

We persist the schema in the Parquet metadata, as a custom field.

We are currently working on a version that would use Arrow tables as the primary data storage (previously we converted to py_dict very early in our data flow and worked with Python/numpy types).

I am still catching up on Arrow's data structures, and maybe you can shed some light on this. I did not find a way to create an Array of pa.Tensor objects (imagine a rowgroup of images). As a result, I end up keeping the data as an array of lists and use side channels to transmit the shapes, which makes the code clunkier.

The next step would be to stream tensors to Tensorflow directly from Arrow tables. I guess native support for Tensors there could help.

Regards,

- Yevgeni

> ---------- Forwarded message ----------
> From: "Uwe L. Korn" <uw...@xhochy.com>
> To: dev@arrow.apache.org
> Cc:
> Bcc:
> Date: Fri, 05 Oct 2018 17:38:13 +0200
> Subject: Re: Petastorm: PyArrow based library for Tensorflow, PyTorch and others...
>
> Hello Yevgeni,
>
> this looks interesting. Can you make a PR to https://github.com/apache/arrow so that Petastorm is listed on https://arrow.apache.org/powered_by/ ?
>
> I browsed a bit through your code. As far as I can see, your approach is to store a set of Parquet files in a directory with a schema that can be
> translated for Spark, Tensorflow, Torch, … Is this schema persisted in the Parquet file metadata or as a separate file alongside the dataset? Could
> we extend Arrow's type system a bit to better suit all the frameworks you are targeting? As you had to build a more general schema class, I guess
> there are definitely things that could not be expressed in Arrow's schema definition. Not sure whether we could extend pyarrow's schema classes to
> fully support your use case, but I would like to understand how to better support it.
>
> Uwe
>
> On Wed, Sep 26, 2018, at 8:59 PM, Yevgeni Litvin wrote:
> > Hi,
> >
> > My name is Yevgeni Litvin. I am working on ML infra with a small team
> > within Uber ATG. Our team has recently open sourced the Petastorm library. It
> > relies heavily on Apache Arrow, so I wanted to share it with the community.
> >
> > The goal of the project is to provide a convenient way for the deep learning
> > community to use an Apache Parquet store with sensor data from Tensorflow,
> > PyTorch or other Python based ML frameworks.
> >
> > I believe our use of Parquet is different from mainstream applications, as
> > our field sizes are asymmetric (some are huge, such as images, and others
> > are small) and rowgroup sizes are relatively small (<100). That required
> > some adaptations.
> >
> > We use PyArrow mostly for loading the data. We do see great potential for
> > further optimizations and speedups by relying more heavily on Arrow as an
> > in-memory store.
> >
> > You can find more information about our project here:
> >
> > http://eng.uber.com/petastorm/
> > https://github.com/uber/petastorm
> >
> > Would be more than happy to hear comments, feedback and suggestions!
> >
> > Thank you,
> >
> > - Yevgeni Litvin