Dear Arrow Community, In our previous discussion, we planned on implementing a new Dataset API like InMemoryDataset to interact with objects containing IPC data stored in Ceph/RADOS <https://ceph.io/>. We had implemented this design and raised a PR <https://github.com/apache/arrow/pull/8647>. But when we started adding the dataset discovery functionality, we found ourselves reimplementing filesystem abstractions and its metadata management. We closed the original PR and raised a new PR <https://github.com/apache/arrow/pull/10431> where we redesigned our implementation to use the Ceph filesystem as our file I/O interface since it provides fast metadata support via the Ceph metadata servers (MDS). We also decided to store data using one of the file formats supported by Arrow. One of our driving use cases favored Parquet.
Since we perform the scan operation inside the storage layer using Ceph Object class <https://docs.ceph.com/en/latest/rados/api/objclass-sdk/#:~:text=Ceph%20can%20be%20extended%20by,object%20classes%20within%20the%20tree.> methods which need to be invoked directly on objects, we utilize the striping strategy information provided by CephFS to translate filename in CephFS to object id in RADOS. To be able to have this one-to-one mapping, we split Parquet files in a manner similar to how Spark splits Parquet files for HDFS and ensure that each fragment is backed by a single RADOS object. We are planning a new PR, we extend the FileFormat interface to create a RadosParquetFileFormat <https://github.com/uccross/skyhookdm-arrow/blob/arrow-master/cpp/src/arrow/dataset/file_rados_parquet.h#L129> interface that offloads Parquet file scan operations to the RADOS layer in Ceph. Since we now utilize a filesystem interface, we can just use the FileSystemDataset API and plug in our new format to offload scan operations. We have also added Python bindings for the new APIs that we implemented. In all, our patch only consists of around 3,000 LoC and introduces new dependencies to Ceph’s librados and object class SDK only (that can be disabled via cmake flags). We have added an architecture <https://github.com/uccross/skyhookdm-arrow/blob/rados-parquet-pr/cpp/src/arrow/adapters/arrow-rados-cls/docs/architecture.md> document with our PR which describes the overall architecture along with the life of a dataset scan on using RadosParquet. Additionally, we recently wrote up a paper <https://arxiv.org/abs/2105.09894> describing our design and implementation along with some initial benchmarks given there. We plan to raise a PR <https://github.com/apache/arrow/pull/10431> to upstream our format to apache/arrow soon and hence look forward to your comments and thoughts on this new feature. Please let us know if you have any questions. Thank you. Best regards, Jayjeet Chakraborty On 2020/09/15 18:06:56, Micah Kornfield <emkornfi...@gmail.com> wrote: > gmock is already a dependency. We haven't upgraded gmock/gtest in a while, > we might want to consider doing that (but this is orthogonal). > > On Tue, Sep 15, 2020 at 10:16 AM Antoine Pitrou <anto...@python.org> wrote: > > > > > Hi Ivo, > > > > You can open a JIRA once you've got a PR ready. No need to do it before > > you think you're ready for submission. > > > > AFAIK, gmock is already a dependency. > > > > Regards > > > > Antoine. > > > > > > > > Le 15/09/2020 à 18:49, Ivo Jimenez a écrit : > > > Hi again, > > > > > > We noticed in the contribution guidelines that there needs to be an > > issue for every PR in JIRA. Should we open one for the eventual PR for the > > work we're doing on implementing the dataset on Ceph's RADOS? > > > > > > Also, on a related note, we would like to mock the RADOS client so that > > we can integrate it in CI tests. Would it be OK to include gmock as a > > dependency? > > > > > > thanks! > > > > > > On 2020/09/02 22:05:51, Ivo Jimenez <ivo.jime...@gmail.com> wrote: > > >> Hi Ben, > > >> > > >> > > >>>> Our main concern is that this new arrow::dataset::RadosFormat class > > will > > >>> be > > >>>> deriving from the arrow::dataset::FileFormat class, which seems to > > raise > > >>> a > > >>>> conceptual mismatch as there isn’t really a RADOS format > > >>> > > >>> IIUC RADOS doesn't interact with a filesystem directly, so > > RadosFileFormat > > >>> would > > >>> indeed be a conceptually problematic point of extension. If a RADOS > > file > > >>> system > > >>> is not viable then I think the ideal approach would be to directly > > >>> implement the > > >>> Fragment [1] and Dataset [2] interfaces, forgoing a FileFormat > > >>> implementation altogether. > > >>> Unfortunately the only example we have of this approach is > > >>> InMemoryFragment, > > >>> which simply wraps a vector of record batches. > > >>> > > >> > > >> This is what we will go with, as this seems to be the quickest way for > > us > > >> to have a PoC and start experimenting with this. > > >> > > >> Thanks a lot for the invaluable feedback! 🙏 > > >> > > >