Hi Wes,

Yes, it makes sense.

If I understand you correctly, defining a device abstraction would also
bring Buffer and CudaBuffer under the same umbrella (which would be the
opposite of the approach in ARROW-2446, btw).

This issue is also related to
  https://github.com/dmlc/dlpack/blob/master/include/dlpack/dlpack.h
that defines a specification for data locality (for ndarrays but the
concept is the same for buffers).
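
For reference, dlpack expresses data locality as a (device type, device
id) pair attached to each data pointer; from memory, the relevant part of
the header above is roughly of this shape (the exact names and enum
values are in the header itself):

  typedef enum {
    kDLCPU = 1,
    kDLGPU = 2,
    kDLCPUPinned = 3,
    /* ... */
  } DLDeviceType;

  typedef struct {
    DLDeviceType device_type;
    int device_id;
  } DLContext;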

ARROW-2447 defines an API that uses Buffer::cpu_data(), which by
extension suggests Buffer::cuda_data(), Buffer::disk_data(), etc.
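
For concreteness, such accessors might be declared roughly like this
(purely illustrative; the actual ARROW-2447 design may differ):

  #include <cstdint>

  // Illustrative sketch only, not the actual ARROW-2447 API.
  class Buffer {
   public:
    virtual ~Buffer() = default;
    // Pointer that is safe to dereference from host code; device-backed
    // implementations would copy the data to host memory on demand.
    virtual const uint8_t* cpu_data() const = 0;
    // Device-resident pointer, valid only on the owning device.
    virtual const uint8_t* cuda_data() const = 0;
  };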

I would like to propose a more general model (no guarantees that it would
make sense implementation-wise :) ):
0. The CPU would be treated like any other device (this would be in line
with dlpack). To name a few devices: HOST, CUDA, DISK, FPGA, etc., and
why not remote databases identified by a URL.
1. A device is defined as a unit that has (i) memory for holding data and
(ii) optionally one or more processors for computing on that data. For
instance, a HOST device has RAM and CPU(s); a CUDA device has device
memory and GPU(s); a DISK device has memory but no processing unit; etc.
(see the first sketch after this list).
2. Different devices can access each other's memory through the same API
methods (say, Buffer.data()). When a device processes data (assuming the
device has a processor), the data is copied to that device's memory
on-demand, unless the data already resides on the same device as the
processor. For instance, to process CUDA data with the CPU, the HOST
device would need to copy the CUDA device data to HOST memory (that works
currently) and vice versa (that works as well, e.g. using CudaHostBuffer).
In another setup, a CUDA device might need to use data from DISK:
according to this proposal, the DISK data would be copied directly to the
CUDA device (bypassing HOST memory if technically possible).
So, in short, before processing data the implementation has to check
whether the processor and the memory are on the same device; if not, the
data is copied using the on-demand approach. By the on-demand approach, I
mean that data references are passed around as a pair: (device id, device
pointer); see the second sketch after this list.
3. All of the above is controlled from a master device process. Usually,
the master device would be HOST, but it does not always have to be.
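
To make points 0 and 1 concrete, here is a minimal sketch of the device
model (all names are made up for illustration; this is not an API
proposal per se):

  #include <cstdint>
  #include <string>

  // Hypothetical sketch of the proposed device model.
  enum class DeviceKind { HOST, CUDA, DISK, FPGA, REMOTE };

  struct Device {
    DeviceKind kind;
    int id;              // e.g. CUDA device ordinal
    bool has_processor;  // HOST/CUDA: true, DISK: false
    std::string uri;     // only meaningful for REMOTE devices
  };

  // A data reference is always the pair (device, device pointer), so any
  // device can name memory that lives on any other device.
  struct DeviceBuffer {
    Device device;
    const uint8_t* address;
    int64_t size;
  };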
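
And a sketch of the on-demand rule from point 2, continuing the types
above; CopyBetweenDevices is a hypothetical helper that would pick the
most direct path between the two devices (e.g. DISK -> CUDA without
staging through HOST, where the hardware allows it):

  // Hypothetical helper; declaration only.
  DeviceBuffer CopyBetweenDevices(const DeviceBuffer& src, const Device& dst);

  // Before a processor touches the data, make sure the data is local to it.
  DeviceBuffer EnsureOnDevice(const DeviceBuffer& buf, const Device& target) {
    if (buf.device.kind == target.kind && buf.device.id == target.id) {
      return buf;  // memory and processor already on the same device
    }
    return CopyBetweenDevices(buf, target);  // copy on demand
  }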

PS: I realize that this discussion diverges from the original subject;
feel free to rename the subject if needed.

Best regards,
Pearu

On Fri, Sep 28, 2018 at 12:49 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Pearu,
>
> Yes, I think it would be a good idea to develop some tools to make
> interacting with device memory using the existing data structures work
> seamlessly.
>
> This is all closely related to
>
> https://issues.apache.org/jira/browse/ARROW-2447
>
> I would say step 1 would be defining the device abstraction. Then we
> can add methods or properties to the data structures in pyarrow to
> show the location of the memory, whether CUDA or host RAM, etc. We
> could also have a memory-mapped device for memory maps to be able to
> communicate that data is on disk. We could then define virtual APIs
> for host-side data access to ensure that memory is copied to the host
> if needed (e.g. in the case of indexing into the values of an array).
>
> There are some small details around the handling of device in the case
> of hierarchical memory references. So if we say `buffer->GetDevice()`
> then even if it's a sliced buffer (which will be the case after using
> any IPC reader APIs), it needs to return the right device. This means
> that we probably need to define a SlicedBuffer type that delegates
> GetDevice() calls to the parent buffer.
>
> Let me know if what I'm saying makes sense. Kou and Antoine probably
> have some thoughts about this also.
>
> - Wes
> On Fri, Sep 28, 2018 at 5:34 AM Pearu Peterson
> <pearu.peter...@quansight.com> wrote:
> >
> > Hi,
> >
> > Consider the following use case:
> >
> > schema = <pa.Schema instance>
> > cbuf = <pa.cuda.CudaBuffer instance>
> > cbatch = pa.cuda.read_record_batch(schema, cbuf)
> >
> > Note that cbatch is a pa.RecordBatch instance where the data pointers
> > are device pointers.
> >
> > for col in cbatch.columns:
> >     # here col is, say, a FloatArray whose data pointer is a device pointer
> >     # as a result, accessing col data, say, taking a slice, leads to segfaults
> >     print(col[0])
> >
> > The aim of this message would be establishing a user-friendly way to
> > access, say, a slice of the device data so that only the requested data
> > is copied to host.
> >
> > Or more generally, should there be a CUDA specific RecordBatch that
> > implements RecordBatch API that can be used from host?
> >
> > For instance, this would be similar to DeviceNDArray in numba that
> > basically implements ndarray API for device data while the API can be
> > used from host.
> >
> > What do you think? What would be the proper approach? (I can do the
> > implementation).
> >
> > Best regards,
> > Pearu
>
