hi all, I just put together a document to help with creating and organizing JIRA issues related to the Datasets project that we've been discussing over the last 6 months
https://docs.google.com/document/d/1QOuz_6rIUskM0Dcxk5NwP8KhKn_qK6o_rFV3fbHQ_AM/edit?usp=sharing

I've left out work relating to expanding filesystem support, such as S3, GCS, and Azure -- since we have a general-purpose filesystem API now, the initial Datasets implementation work need not be coupled to implementing new filesystems (though some optimizations or options may be required to improve performance for systems like S3, whose performance characteristics differ a great deal from local disk).

One concrete goal of this is to port the Parquet-specific Dataset logic in pyarrow/parquet.py into C++ so that we can have feature parity here across Python, R, and Ruby. Similarly, we want to make this logic format-agnostic rather than Parquet-specific, so we can also deal with JSON, CSV, ORC, and later Avro files.

I know there are a number of people interested in this project, so I don't want to get in anyone's way. I'm tied up with other work this month at least, so I likely won't be able to write any patches for this until September at the earliest.

I'll be glad to give edit access to anyone who finds this document helpful and wants to add to it (e.g., JIRA links).

Thanks,
Wes
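P.S. To make the "not Parquet-specific" point concrete: one example of the logic currently living in pyarrow/parquet.py is hive-style partition discovery (parsing key=value directory segments out of file paths). A rough sketch of that idea in plain Python is below -- the function name and behavior here are illustrative only, not the actual pyarrow implementation. Note there is nothing Parquet-specific about it, which is exactly why it can move into a shared C++ layer and serve CSV, JSON, and ORC datasets as well.

```python
def parse_hive_partition_keys(path):
    """Extract hive-style key=value partition segments from a relative
    file path, e.g. 'year=2019/month=08/part-0.parquet' yields
    {'year': '2019', 'month': '08'}.  (Illustrative sketch only.)"""
    keys = {}
    # Every directory segment except the trailing file name may encode
    # a partition key; real discovery logic would also validate ordering
    # and infer value types.
    for segment in path.split("/")[:-1]:
        if "=" in segment:
            key, _, value = segment.partition("=")
            keys[key] = value
    return keys

# The same discovery applies regardless of the file format at the leaf:
print(parse_hive_partition_keys("year=2019/month=08/part-0.parquet"))
print(parse_hive_partition_keys("year=2019/month=08/part-0.csv"))
# both print {'year': '2019', 'month': '08'}
```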