What are the action items on this? Sounds like we need to start a
design document. I'm afraid I don't have the bandwidth to champion GPU
functionality at the moment but I will participate in design
discussions and help break down complex tasks into more accessible
JIRA issues.

Thanks
Wes
On Fri, Sep 28, 2018 at 9:44 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> Seems like there is a fair bit of work to do to specify APIs and
> semantics. I suggest we create a Google document or something
> collaborative where we can enumerate and discuss the issues we want to
> resolve, and then make a list of concrete development tasks.
>
> The underlying problem IMHO in ARROW-2446 is that we do not have the
> notion of a device. An instance of CudaBuffer is only necessary so that
> the appropriate virtual dtor can be invoked to release the memory. As
> long as a buffer referencing device memory is aware of the underlying
> device, our code can dispatch to the correct code paths. At the moment
> we can only really detect whether an arrow::Buffer* is a device buffer
> via dynamic_cast, and even that is not reliable because the buffer may
> be a slice.
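>
> In Python terms, the kind of device-based dispatch I have in mind might
> look like the throwaway sketch below. The "device" attribute is invented
> purely for illustration; copy_to_host(position=..., nbytes=...) is, if I
> remember correctly, the existing method on pyarrow.cuda.CudaBuffer:
>
> # Illustration only: dispatch on a device tag carried by the buffer
> # rather than on the buffer's concrete type (the Python analogue of
> # relying on dynamic_cast).
> def read_value_bytes(buf, offset, nbytes):
>     device = getattr(buf, "device", "cpu")  # hypothetical device tag
>     if device == "cpu":
>         return bytes(buf)[offset:offset + nbytes]
>     # device buffer: copy just the bytes we need to host memory first
>     return bytes(buf.copy_to_host(position=offset, nbytes=nbytes))
>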
> On Fri, Sep 28, 2018 at 7:17 AM Pearu Peterson
> <pearu.peter...@quansight.com> wrote:
> >
> > Hi Wes,
> >
> > Yes, it makes sense.
> >
> > If I understand you correctly, defining a device abstraction would also
> > bring Buffer and CudaBuffer under the same umbrella (which would be the
> > opposite approach to ARROW-2446, btw).
> >
> > This issue is also related to
> >   https://github.com/dmlc/dlpack/blob/master/include/dlpack/dlpack.h
> > which defines a specification for data locality (for ndarrays, but the
> > concept is the same for buffers).
> >
> > ARROW-2447 defines an API that uses Buffer::cpu_data(), and hence would
> > suggest Buffer::cuda_data(), Buffer::disk_data(), etc.
> >
> > I would like to propose a more general model (no guarantees that it would
> > make sense implementation-wise :) ); a rough sketch follows after the list:
> > 0. CPU would be considered like any other device (this would be in line
> > with dlpack). To name a few devices: HOST, CUDA, DISK, FPGA, etc., and
> > why not remote databases identified by a URL.
> > 1. A device is defined as a unit that (i) has memory for holding data,
> > and (ii) may have processor(s) for processing the data (computations).
> > For instance, a HOST device has RAM and CPU(s); a CUDA device has device
> > memory and GPU(s); a DISK device has memory but no processing unit; etc.
> > 2. Different devices can access other devices' memory using the same API
> > methods (say, Buffer.data()). When a device processes data (in case the
> > device has a processor), the data is copied to that device's memory
> > on demand, unless the data is already stored on the same device as the
> > processor. For instance, to process CUDA data with a CPU, the HOST device
> > would need to copy the CUDA device data to HOST memory (that works
> > currently), and vice versa (that works as well, e.g. using CudaHostBuffer).
> > In another setup, a CUDA device might need to use data from DISK:
> > according to this proposal, the DISK data would be copied directly to the
> > CUDA device (bypassing HOST memory if technically possible).
> > So, in short, the implementation has to check whether the processor and
> > the memory are on the same device before processing the data; if not, the
> > data is copied using the on-demand approach. By the on-demand approach, I
> > mean that data references are passed around as a pair: (device id, device
> > pointer).
> > 3. All of the above is controlled from a master device process. Usually,
> > the master device would be HOST, but it does not always have to be.
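> >
> > To make the on-demand idea above concrete, here is a rough, purely
> > illustrative Python sketch. Nothing below is an existing Arrow API:
> > DataRef, ensure_on and copy_between_devices are invented names, and the
> > "pointer" is just an integer stand-in.
> >
> > from dataclasses import dataclass
> >
> > HOST, CUDA, DISK = "host", "cuda:0", "disk"   # hypothetical device ids
> >
> > @dataclass
> > class DataRef:
> >     """Data is always referenced as a (device id, device pointer) pair."""
> >     device: str
> >     address: int      # stand-in for a raw device pointer
> >     nbytes: int
> >
> > def copy_between_devices(ref, target):
> >     # Placeholder for cudaMemcpy, a file read, a peer-to-peer copy, etc.
> >     return DataRef(device=target, address=ref.address, nbytes=ref.nbytes)
> >
> > def ensure_on(ref, processor_device):
> >     """On-demand copy: move the data only if the processor lives elsewhere."""
> >     if ref.device == processor_device:
> >         return ref
> >     return copy_between_devices(ref, processor_device)
> >
> > # A kernel running on CUDA that receives DISK-resident data would call
> > # ensure_on(disk_ref, CUDA) and get the data copied (directly, if possible).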
> >
> > PS: I realize that this discussion diverges from the original subject;
> > feel free to rename the subject if needed.
> >
> > Best regards,
> > Pearu
> >
> > On Fri, Sep 28, 2018 at 12:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > hi Pearu,
> > >
> > > Yes, I think it would be a good idea to develop some tools so that
> > > interacting with device memory through the existing data structures
> > > works seamlessly.
> > >
> > > This is all closely related to
> > >
> > > https://issues.apache.org/jira/browse/ARROW-2447
> > >
> > > I would say step 1 is defining the device abstraction. Then we can
> > > add methods or properties to the data structures in pyarrow to show
> > > the location of the memory, whether CUDA or host RAM, etc. We could
> > > also have a memory-mapped device so that memory maps can communicate
> > > that the data is on disk. We could then define virtual APIs for
> > > host-side data access to ensure that memory is copied to the host if
> > > needed (e.g. in the case of indexing into the values of an array).
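> > >
> > > As a strawman for what the Python side of that could look like, here
> > > is a hedged sketch. The "device" attribute and ensure_cpu() are
> > > invented names for this thread, not existing pyarrow API; the only
> > > real call is copy_to_host() on pyarrow.cuda.CudaBuffer:
> > >
> > > def describe_location(buf):
> > >     # Hypothetical: a buffer would advertise which device owns its
> > >     # memory, e.g. "cpu", "cuda:0", or "mmap:/path/to/file".
> > >     return getattr(buf, "device", "cpu")
> > >
> > > def ensure_cpu(buf):
> > >     # Hypothetical host-side access API: return the buffer unchanged
> > >     # if it already lives in host RAM, otherwise copy it to the host.
> > >     if describe_location(buf) == "cpu":
> > >         return buf
> > >     return buf.copy_to_host()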
> > >
> > > There are some small details around the handling of the device in the
> > > case of hierarchical memory references. If we call `buffer->GetDevice()`,
> > > then even if it's a sliced buffer (which will be the case after using
> > > any IPC reader APIs), it needs to return the right device. This means
> > > that we probably need to define a SlicedBuffer type that delegates
> > > GetDevice() calls to the parent buffer.
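> > >
> > > A minimal sketch of that delegation, with invented names (SlicedBuffer
> > > here is just an illustration, not an existing class):
> > >
> > > class SlicedBuffer:
> > >     """Illustration only: a slice answers device queries by asking its
> > >     parent, so slicing never loses the device information."""
> > >
> > >     def __init__(self, parent, offset, length):
> > >         self.parent = parent
> > >         self.offset = offset
> > >         self.length = length
> > >
> > >     def get_device(self):
> > >         # Delegate to the parent buffer (which may itself be a slice).
> > >         return self.parent.get_device()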
> > >
> > > Let me know if what I'm saying makes sense. Kou and Antoine probably
> > > have some thoughts about this also.
> > >
> > > - Wes
> > > On Fri, Sep 28, 2018 at 5:34 AM Pearu Peterson
> > > <pearu.peter...@quansight.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Consider the following use case:
> > > >
> > > > schema = <pa.Schema instance>
> > > > cbuf = <pa.cuda.CudaBuffer instance>
> > > > cbatch = pa.cuda.read_record_batch(schema, cbuf)
> > > >
> > > > Note that cbatch is a pa.RecordBatch instance where the data pointers
> > > > are device pointers.
> > > >
> > > > for col in cbatch.columns:
> > > >     # here col is, say, a FloatArray whose data pointer is a device pointer
> > > >     # as a result, accessing col data, say, taking a slice, leads to segfaults
> > > >     print(col[0])
> > > >
> > > > The aim of this message is to establish a user-friendly way to access,
> > > > say, a slice of the device data so that only the requested data is
> > > > copied to the host.
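> > > >
> > > > For context, the coarse workaround that works today (if I am not
> > > > mistaken) is to copy the whole device buffer back to host memory and
> > > > re-read the batch there, which copies everything rather than just the
> > > > requested slice. Continuing the example above, and assuming cbuf holds
> > > > a serialized record batch message:
> > > >
> > > > import pyarrow as pa
> > > >
> > > > host_buf = cbuf.copy_to_host()   # existing CudaBuffer method; copies all bytes
> > > > host_batch = pa.ipc.read_record_batch(host_buf, schema)
> > > > print(host_batch.columns[0][0])  # safe: the data now lives in host RAM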
> > > >
> > > > Or, more generally, should there be a CUDA-specific RecordBatch that
> > > > implements the RecordBatch API and can be used from the host?
> > > >
> > > > For instance, this would be similar to DeviceNDArray in numba, which
> > > > basically implements the ndarray API for device data while the API can
> > > > be used from the host.
> > > >
> > > > What do you think? What would be the proper approach? (I can do the
> > > > implementation).
> > > >
> > > > Best regards,
> > > > Pearu
> > >
