Hey all,

To support use cases that pass data allocated on non-CPU devices between
environments (for example between Python and C++), we should enhance the
C-Data API to carry memory and device information alongside the arrays
themselves. Since more and more devices (GPUs, FPGAs, etc.) are entering
common usage, we want a design that accommodates future use cases without
breaking any existing usages of the C-Data API (we don't want to break
the ABI). After research and discussion with others, I'd like to propose
an update to the C-Data format, which can be found as a draft PR[1] on
the Arrow repo.

When developing this, a few things were used as inspiration, including
dlpack[2] and the CUDA Array Interface[3] (which is leveraged by Numba and
others). That said, there are probably use cases that I did not consider,
so I'd like to solicit feedback and opinions from others on the design and
proposal to ensure that it is sound and flexible enough for the ways
people will want to use it.

To explain a bit about the design:
    * To avoid breaking the existing ABI, new ArrowDeviceArray and
ArrowDeviceArrayStream structs were created (a rough C sketch of both
follows this list).
    * The ArrowDeviceArray contains a pointer to an ArrowArray alongside
the device information describing where the memory was allocated. The
reason for using a pointer is that future modifications of the ArrowArray
struct then cannot change the size of this struct (it would still just be
a pointer to the ArrowArray struct).
    * To future-proof the design a bit, we also include an int64_t[2] at
the end of the struct, manually padding it with 128 bits (16 bytes) that
we can reserve for future modifications. In this way we can update the
API and struct definition without changing the size of the struct itself,
and therefore without breaking the ABI.
    * The ArrowDeviceType enum uses the same values for the same devices
that dlpack uses because dlpack is one of the most widely used/supported
interfaces for raw device buffers[4], ensuring compatibility and easy
conversions.
    * Because different frameworks use different types to represent
"streams" for processing, we use a void* that is passed to the get_next
function of the ArrowDeviceArrayStream struct. The consumer indicates
which stream it wants to use, and the producer must ensure that the data
it produces can be accessed by that stream (by waiting on the required
events, synchronizing streams, etc.).
    * More bits and bobs of the design are explained in comments that can
be seen in the aforementioned draft PR.
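
To make the layout concrete, here's a rough sketch in C of what the two
structs could look like. The field names and exact ordering below are
illustrative only (the authoritative definitions are in the draft PR[1]),
and only two of the dlpack-compatible device values are shown:

    #include <stdint.h>

    struct ArrowArray;  /* the existing struct from the C-Data API */

    /* values mirror dlpack's DLDeviceType enum */
    typedef int32_t ArrowDeviceType;
    #define ARROW_DEVICE_CPU 1   /* kDLCPU in dlpack */
    #define ARROW_DEVICE_CUDA 2  /* kDLCUDA in dlpack */

    struct ArrowDeviceArray {
      /* a pointer rather than an embedded struct, so future changes to
         ArrowArray don't change the size of ArrowDeviceArray */
      struct ArrowArray* array;
      /* where the buffers (including those of children) were allocated */
      ArrowDeviceType device_type;
      int64_t device_id;
      /* 128 bits of manual padding reserved for future additions,
         keeping the struct size (and thus the ABI) stable */
      int64_t reserved[2];
    };

    struct ArrowDeviceArrayStream {
      /* the device that every array this stream produces lives on */
      ArrowDeviceType device_type;
      /* `stream` is an opaque, framework-specific stream/queue handle;
         the producer must make the produced data accessible on it
         (waiting on events, synchronizing streams, etc.) */
      int (*get_next)(struct ArrowDeviceArrayStream* self,
                      void* stream,
                      struct ArrowDeviceArray* out);
      /* release and private_data would follow the conventions of the
         existing ArrowArrayStream */
      void (*release)(struct ArrowDeviceArrayStream* self);
      void* private_data;
    };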

Remaining Concerns that I can think of:
    * Alignment and padding of allocations can have a larger impact when
dealing with non-CPU devices than with CPUs, and this design provides no
way to communicate potential extra padding on a per-buffer basis. We
could attempt to codify a convention that allocations should have a
specific alignment and padding, but that wouldn't actually enforce
anything, nor would it let a producer communicate that those conventions
weren't followed. Should we add some way of passing this info, or punt it
to a future modification?
    * This design assumes that for a given Arrow Array/Record Batch, ALL
buffers are allocated on the same device (including those of child
arrays); it does not provide buffer-level granularity to indicate
situations like putting the null validity bitmap on one device and the
raw data on another. Allowing for this would be quite difficult and would
likely be better served by creating an entirely new ArrowArray struct
from the ground up rather than embedding a pointer to the existing one.
So if there is a desire for that level of granularity, I figured we could
simply evolve the ABI to use something like an ArrowArrayV2, etc. In
general it would be quite inefficient to split the buffers like that for
any sort of processing, so I didn't prioritize that case.
    * In order to know what type of "stream" pointer should be passed to
get_next, there is a DeviceType in the ArrowDeviceArrayStream struct
which should be the device that all arrays in that stream are produced
on. The idea is that if you want to produce for multiple devices, you
should provide multiple streams (see the usage sketch after this list).
Alternately, we could have a callback like `get_next_device`, or allow
the consumer to specify the device they want the data on in the
`get_next` function.
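
For illustration, here is a hypothetical consumer-side flow using CUDA
and the sketched definitions above. Note that exactly what the void*
means for each device type (e.g. a cudaStream_t handle versus a pointer
to one) is something the spec would need to codify:

    #include <cuda_runtime.h>

    void consume(struct ArrowDeviceArrayStream* stream_iface) {
      if (stream_iface->device_type != ARROW_DEVICE_CUDA) {
        return;  /* this consumer only handles CUDA-resident data */
      }

      /* the consumer owns the stream it wants the data usable on */
      cudaStream_t my_stream;
      cudaStreamCreate(&my_stream);

      struct ArrowDeviceArray batch;
      /* the producer waits on events / synchronizes as needed so that
         the returned buffers are safe to access on my_stream */
      if (stream_iface->get_next(stream_iface, (void*)my_stream, &batch) == 0
          && batch.array != NULL) {
        /* ... launch kernels on my_stream against batch's buffers ... */
        batch.array->release(batch.array);  /* usual C-Data release rules */
      }
      cudaStreamDestroy(my_stream);
    }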

Thanks in advance to everyone who contributes to the discussion here;
hopefully we can all settle on a design we're happy with, and then I'll
start implementing it and updating the documentation. This is part of a
wider effort I'm undertaking to improve non-CPU memory support in the
Arrow libraries, such as enhanced Buffer types in the C++ library that
will carry device_id and device_type information in addition to the
`is_cpu` flag that currently exists.

Looking forward to any concerns people might have, suggestions, or other
use cases that aren't addressed by this design. Thanks everyone!!

--Matt

[1]: https://github.com/apache/arrow/pull/34972
[2]: https://github.com/dmlc/dlpack/blob/main/include/dlpack/dlpack.h
[3]: https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html
[4]: https://data-apis.org/array-api/latest/design_topics/data_interchange.html#data-interchange
