Hi all,

I just put together a document to help with creating and organizing
JIRA issues related to the Datasets project that we've been discussing
over the last six months:

https://docs.google.com/document/d/1QOuz_6rIUskM0Dcxk5NwP8KhKn_qK6o_rFV3fbHQ_AM/edit?usp=sharing

I've left out work relating to expanding filesystem support, such as
S3, GCS, and Azure -- since we have a general-purpose filesystem API
now, the initial Datasets implementation work need not be coupled to
implementing new filesystems (though some optimizations or options may
be required to improve performance for systems like S3, which have
very different performance characteristics from local disk).

One concrete goal of this is to port the Parquet-specific Dataset
logic in pyarrow/parquet.py into C++ so that we can have feature
parity across Python, R, and Ruby. Similarly, we wish to generalize
this logic beyond Parquet so we can also handle JSON, CSV, ORC, and
later Avro files.
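To illustrate the kind of decoupling this implies (a sketch only, not the actual Arrow API -- all class and method names here are hypothetical), the core idea is that Dataset machinery dispatches through a file-format object rather than hard-coding Parquet logic:

```python
# Hypothetical sketch of a format-agnostic Dataset design; these names
# are invented for illustration and are not the Arrow C++/Python API.
import csv
import io
import json


class CsvFormat:
    def read(self, stream):
        # Parse CSV text into a list of row dicts.
        return list(csv.DictReader(stream))


class JsonFormat:
    def read(self, stream):
        # Parse newline-delimited JSON into a list of row dicts.
        return [json.loads(line) for line in stream if line.strip()]


class Dataset:
    """Format-agnostic: the same Dataset code serves any file format."""

    def __init__(self, streams, file_format):
        self.streams = streams
        self.file_format = file_format

    def to_rows(self):
        # Scan every fragment of the dataset with the pluggable format.
        rows = []
        for stream in self.streams:
            rows.extend(self.file_format.read(stream))
        return rows


# The same Dataset code reads both formats; only the format object varies.
csv_rows = Dataset([io.StringIO("a,b\n1,2\n")], CsvFormat()).to_rows()
json_rows = Dataset([io.StringIO('{"a": "1", "b": "2"}\n')], JsonFormat()).to_rows()
```

Adding ORC or Avro support would then mean implementing one more format object, with no changes to the Dataset logic itself.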

I know there are a number of people interested in this project, so I
don't want to get in anyone's way. I'm tied up with other work this
month at least, so I likely won't be able to write any patches for
this until September at the earliest. I'll be glad to give edit access
to anyone who finds this document helpful and wants to add to it
(e.g., JIRA links).

Thanks,
Wes
