Hey all,

To facilitate use cases that pass data allocated on non-CPU devices between environments (such as between Python and C++), we should enhance the C-Data API to carry memory and device information alongside the arrays themselves. We also want to make sure the design accommodates future use cases, since more and more devices (GPUs, FPGAs, etc.) are entering common usage, without breaking any existing usages of the C-Data API (we don't want to break the ABI). After research and discussion with others, I'd like to propose an update to the C-Data format, which can be found as a draft PR[1] on the Arrow repo.
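To make the discussion concrete, here is a rough sketch of the shape of the two new structs as described below. The names, exact layout, and callback set (which mirrors the existing ArrowArrayStream interface) are illustrative only; the draft PR[1] has the authoritative definitions:

```c
#include <stdint.h>

/* From the existing C-Data interface. */
struct ArrowSchema;
struct ArrowArray;

/* Values intentionally match dlpack's DLDeviceType so that
 * conversions are trivial; names here are illustrative. */
typedef enum {
  ARROW_DEVICE_CPU = 1,  /* kDLCPU */
  ARROW_DEVICE_CUDA = 2, /* kDLCUDA */
  /* ... one entry per dlpack device type ... */
} ArrowDeviceType;

struct ArrowDeviceArray {
  /* A pointer (rather than an embedded struct) so that future
   * changes to ArrowArray never change the size of this struct. */
  struct ArrowArray* array;
  /* Which device, and which instance of it, holds the buffers. */
  int64_t device_id;
  ArrowDeviceType device_type;
  /* Reserved padding for future ABI-compatible additions. */
  int64_t reserved[2];
};

struct ArrowDeviceArrayStream {
  /* Every array produced by this stream lives on this device type;
   * producing for multiple devices means exposing multiple streams. */
  ArrowDeviceType device_type;
  int (*get_schema)(struct ArrowDeviceArrayStream* self,
                    struct ArrowSchema* out);
  /* `queue` is a framework-specific stream/queue handle (e.g. a
   * cudaStream_t); the producer must ensure the returned data is
   * accessible from that stream, e.g. by waiting on events. */
  int (*get_next)(struct ArrowDeviceArrayStream* self, void* queue,
                  struct ArrowDeviceArray* out);
  const char* (*get_last_error)(struct ArrowDeviceArrayStream* self);
  void (*release)(struct ArrowDeviceArrayStream* self);
  void* private_data;
};
```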
When developing this, I drew inspiration from a few places, including dlpack[2] and the CUDA Array Interface[3] (which is leveraged by Numba and others). That said, there are probably use cases I did not consider, so I'd like to solicit feedback and opinions on the design to ensure it is sound and flexible enough for the ways people will want to use it.

To explain a bit about the design:

* To avoid breaking the existing ABI, new ArrowDeviceArray and ArrowDeviceArrayStream structs were created.
* The ArrowDeviceArray contains a pointer to an ArrowArray alongside the device information related to the allocation. Using a pointer means that future modifications to the ArrowArray struct won't change the size of this struct, since it would still just hold a pointer.
* To future-proof the design a bit, we also include an int64_t[2] at the end of the struct: 128 bits of manually reserved padding for future modifications. This way we can update the API and struct definition without breaking the ABI by changing the size of the struct itself.
* The ArrowDeviceType enum uses the same values for the same devices as dlpack, because dlpack is one of the most widely used/supported interfaces for raw device buffers[4]; this ensures compatibility and easy conversions.
* Because different frameworks use different types to represent "streams" for processing, we pass a void* to the get_next function of the ArrowDeviceArrayStream struct. The consumer indicates which stream it wants to use, and the producer must ensure the data it produces can be accessed by that stream (by waiting on the required events, synchronizing streams, etc.). A sketch of a consumer-side call follows the list of concerns below.
* More bits and bobs of the design are explained in the comments on the aforementioned draft PR.

Remaining concerns that I can think of:

* Alignment and padding of allocations can have a larger impact on non-CPU devices than on CPUs, and this design provides no way to communicate extra padding on a per-buffer basis. We could attempt to codify a convention that allocations should have a specific alignment and padding, but that wouldn't actually enforce anything, nor allow communicating when those conventions weren't followed. Should we add some way of passing this info, or punt it to a future modification?
* This design assumes that for a given Arrow Array/Record Batch, ALL buffers are allocated on the same device (including those of child arrays); it provides no buffer-level granularity for situations like putting the null validity bitmap on one device and the raw data on another. Allowing for this would be quite difficult and would likely be better served by creating an entirely new ArrowArray struct from the ground up rather than embedding a pointer to the existing one. If there is a desire for that level of granularity, I figured we could evolve the ABI to use something like an ArrowArrayV2. In general, splitting the buffers like that would be quite inefficient for any sort of processing, so I didn't prioritize that case.
* In order to know what type of "stream" pointer should be passed to get_next, the ArrowDeviceArrayStream struct has a DeviceType member, which should be the device that all arrays in that stream are produced on. The idea being that if you want to produce for multiple devices, you should provide multiple streams. Alternately, we could add a callback like `get_next_device`, or allow the consumer to specify the device they want the data on in the `get_next` function.
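For illustration, here is a hypothetical consumer-side call under that design, using the struct sketch from earlier and assuming a CUDA-based producer (cudaStream_t is itself a pointer type, so it passes through the void* parameter directly). Error handling and end-of-stream conventions are elided, and the function name is made up:

```c
#include <cuda_runtime.h>

/* Hypothetical consumer: `src` came from a producer whose
 * device_type is CUDA. */
int consume_one(struct ArrowDeviceArrayStream* src) {
  cudaStream_t my_stream;
  cudaStreamCreate(&my_stream);

  struct ArrowDeviceArray chunk;
  /* Hand the producer our stream; it must ensure the returned
   * buffers are safe to read from work enqueued on `my_stream`
   * (e.g. by synchronizing or recording/waiting on events). */
  int rc = src->get_next(src, (void*)my_stream, &chunk);
  if (rc == 0 && chunk.array != NULL) {
    /* ... launch kernels on my_stream that read chunk's buffers ... */
    chunk.array->release(chunk.array);
  }

  cudaStreamDestroy(my_stream);
  return rc;
}
```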
Thanks in advance to everyone who contributes to the discussion here; hopefully we can all settle on a design we're happy with, and then I'll start implementing the idea and updating the documentation. This is part of a wider effort I'm attempting to address to improve non-CPU memory support in the Arrow libraries, such as enhanced Buffer types in the C++ library that would carry device_id and device_type information in addition to the `is_cpu` flag that currently exists.

Looking forward to any concerns people might have, suggestions, or other use cases that aren't addressed by this design. Thanks everyone!!

--Matt

[1]: https://github.com/apache/arrow/pull/34972
[2]: https://github.com/dmlc/dlpack/blob/main/include/dlpack/dlpack.h
[3]: https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html
[4]: https://data-apis.org/array-api/latest/design_topics/data_interchange.html#data-interchange