Hello Yevgeni,

This looks interesting. Can you make a PR to https://github.com/apache/arrow so 
that Petastorm is listed on https://arrow.apache.org/powered_by/?

I browsed a bit through your code. As far as I can see, your approach is to 
store a set of Parquet files in a directory with a schema that can be 
translated for Spark, TensorFlow, Torch, … Is this schema persisted in the 
Parquet file metadata or as a separate file alongside the dataset? Could we 
extend Arrow's type system a bit to better suit all the frameworks you are 
targeting? Since you had to build a more general schema class, I guess there 
are things that could not be expressed in Arrow's schema definition. I'm not 
sure whether we could extend pyarrow's schema classes to fully support your 
use case, but I would like to understand how to support it better.
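
Just to illustrate what I mean by persisting it in the file metadata, here is 
a rough pyarrow sketch (the field names and the "petastorm.unischema" key are 
made up; this is not necessarily how Petastorm does it today):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Carry the extra, framework-specific schema description in the
    # Arrow/Parquet key-value metadata so it travels with the dataset.
    schema = pa.schema(
        [pa.field("image", pa.binary()), pa.field("label", pa.int64())],
        metadata={"petastorm.unischema": "<serialized schema here>"},
    )
    table = pa.Table.from_pydict({"image": [b"..."], "label": [0]}, schema=schema)
    pq.write_table(table, "part-00000.parquet")
    # pq.read_schema("part-00000.parquet").metadata returns these key/value
    # pairs on the reader side.

If the schema instead lives in a separate file alongside the dataset, I'd be 
curious what it can express that this kind of metadata could not.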

Uwe

On Wed, Sep 26, 2018, at 8:59 PM, Yevgeni Litvin wrote:
> Hi,
> 
> My name is Yevgeni Litvin. I am working on ML infra with a small team
> within Uber ATG. Our team has recently open sourced the Petastorm
> library. It heavily relies on Apache Arrow, so I wanted to share it with
> the community.
> 
> The goal of the project is to provide a convenient way for the deep
> learning community to use an Apache Parquet store with sensor data from
> TensorFlow, PyTorch, or other Python-based ML frameworks.
> 
> I believe our use of Parquet is different from mainstream applications,
> as our field sizes are asymmetric (some are huge, such as images, and
> others are small) and row group sizes are relatively small (<100). That
> required some adaptations.
> 
> We use PyArrow mostly for loading the data. We see great potential for
> further optimizations and speedups by relying more heavily on Arrow as
> an in-memory store.
> 
> You can find more information about our project here:
> 
> http://eng.uber.com/petastorm/
> https://github.com/uber/petastorm
> 
> Would be more than happy to hear comments, feedback and suggestions!
> 
> Thank you,
> 
> - Yevgeni Litvin
