Re: Arrow Dataset API on Ceph

Jayjeet Chakraborty Tue, 01 Jun 2021 13:06:05 -0700

Dear Arrow Community,

In our previous discussion, we planned on implementing a new Dataset API
like InMemoryDataset to interact with objects containing IPC data stored in
Ceph/RADOS <https://ceph.io/>. We had implemented this design and raised a
PR <https://github.com/apache/arrow/pull/8647>. But when we started adding
the dataset discovery functionality, we found ourselves reimplementing
filesystem abstractions and its metadata management. We closed the original
PR and raised a new PR <https://github.com/apache/arrow/pull/10431> where
we redesigned our implementation to use the Ceph filesystem as our file I/O
interface since it provides fast metadata support via the Ceph metadata
servers (MDS). We also decided to store data using one of the file formats
supported by Arrow. One of our driving use cases favored Parquet.

Since we perform the scan operation inside the storage layer using Ceph
Object class
<https://docs.ceph.com/en/latest/rados/api/objclass-sdk/#:~:text=Ceph%20can%20be%20extended%20by,object%20classes%20within%20the%20tree.>
methods which need to be invoked directly on objects, we utilize the
striping strategy information provided by CephFS to translate filename in
CephFS to  object id in RADOS. To be able to have this one-to-one mapping,
we split Parquet files in a manner similar to how Spark splits Parquet
files for HDFS and ensure that each fragment is backed by a single RADOS
object.

We are planning a new PR, we extend the FileFormat interface to create a
RadosParquetFileFormat
<https://github.com/uccross/skyhookdm-arrow/blob/arrow-master/cpp/src/arrow/dataset/file_rados_parquet.h#L129>
interface that offloads Parquet file scan operations to the RADOS layer in
Ceph. Since we now utilize a filesystem interface, we can just use the
FileSystemDataset API and plug in our new format to offload scan
operations. We have also added Python bindings for the new APIs that we
implemented. In all, our patch only consists of around 3,000 LoC and
introduces new dependencies to Ceph’s librados and object class SDK only
(that can be disabled via cmake flags).

We have added an architecture
<https://github.com/uccross/skyhookdm-arrow/blob/rados-parquet-pr/cpp/src/arrow/adapters/arrow-rados-cls/docs/architecture.md>
document with our PR which describes the overall architecture along with
the life of a dataset scan on using RadosParquet. Additionally, we recently
wrote up a paper <https://arxiv.org/abs/2105.09894> describing our design
and implementation along with some initial benchmarks given there. We plan
to raise a PR <https://github.com/apache/arrow/pull/10431> to upstream our
format to apache/arrow soon and hence look forward to your comments and
thoughts on this new feature. Please let us know if you have any questions.
Thank you.

Best regards,

Jayjeet Chakraborty

On 2020/09/15 18:06:56, Micah Kornfield <[email protected]> wrote:
> gmock is already a dependency.  We haven't upgraded gmock/gtest in a
while,
> we might want to consider doing that (but this is orthogonal).
>
> On Tue, Sep 15, 2020 at 10:16 AM Antoine Pitrou <[email protected]>
wrote:
>
> >
> > Hi Ivo,
> >
> > You can open a JIRA once you've got a PR ready.  No need to do it before
> > you think you're ready for submission.
> >
> > AFAIK, gmock is already a dependency.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > Le 15/09/2020 à 18:49, Ivo Jimenez a écrit :
> > > Hi again,
> > >
> > > We noticed in the contribution guidelines that there needs to be an
> > issue for every PR in JIRA. Should we open one for the eventual PR for
the
> > work we're doing on implementing the dataset on Ceph's RADOS?
> > >
> > > Also, on a related note, we would like to mock the RADOS client so
that
> > we can integrate it in CI tests. Would it be OK to include gmock as a
> > dependency?
> > >
> > > thanks!
> > >
> > > On 2020/09/02 22:05:51, Ivo Jimenez <[email protected]> wrote:
> > >> Hi Ben,
> > >>
> > >>
> > >>>> Our main concern is that this new arrow::dataset::RadosFormat class
> > will
> > >>> be
> > >>>> deriving from the arrow::dataset::FileFormat class, which seems to
> > raise
> > >>> a
> > >>>> conceptual mismatch as there isn’t really a RADOS format
> > >>>
> > >>> IIUC RADOS doesn't interact with a filesystem directly, so
> > RadosFileFormat
> > >>> would
> > >>> indeed be a conceptually problematic point of extension. If a RADOS
> > file
> > >>> system
> > >>> is not viable then I think the ideal approach would be to directly
> > >>> implement the
> > >>> Fragment [1] and Dataset [2] interfaces, forgoing a FileFormat
> > >>> implementation altogether.
> > >>> Unfortunately the only example we have of this approach is
> > >>> InMemoryFragment,
> > >>> which simply wraps a vector of record batches.
> > >>>
> > >>
> > >> This is what we will go with, as this seems to be the quickest way
for
> > us
> > >> to have a PoC and start experimenting with this.
> > >>
> > >> Thanks a lot for the invaluable feedback! 🙏
> > >>
> >
>

Re: Arrow Dataset API on Ceph

Reply via email to