Hi,
Currently, Arrow's host memory management includes a MemoryPool to
accelerate memory operations (allocation/free).
Would there be interest in supporting the same concept in CUDA memory
management to reduce the overhead of cudaMalloc/cudaFree?
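For illustration only (this is not an existing pyarrow API, and the class
name is made up), a minimal caching allocator can already be sketched in
user code on top of pyarrow.cuda: freed CudaBuffers are kept in per-size
free lists so that repeated allocations of the same size skip
cudaMalloc/cudaFree entirely.

# Illustrative sketch only -- not an existing Arrow API. A tiny
# size-bucketed cache of CudaBuffers so that repeated allocations of
# the same size avoid cudaMalloc/cudaFree round trips.
from collections import defaultdict
from pyarrow import cuda

class CudaBufferPool:
    def __init__(self, device_number=0):
        self.ctx = cuda.Context(device_number)
        self._free = defaultdict(list)      # nbytes -> [CudaBuffer, ...]

    def allocate(self, nbytes):
        bucket = self._free[nbytes]
        if bucket:
            return bucket.pop()             # reuse a cached device buffer
        return self.ctx.new_buffer(nbytes)  # falls back to cudaMalloc

    def release(self, cbuf):
        # keep the buffer around for reuse instead of freeing it
        self._free[cbuf.size].append(cbuf)

pool = CudaBufferPool()
buf = pool.allocate(1 << 20)   # 1 MiB device buffer
pool.release(buf)              # returned to the pool, not freed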
Best regards,
Pearu

On Wed, Oct 3, 2018 at 11:44 PM Pearu Peterson <pearu.peter...@quansight.com>
wrote:

> Hi,
> I can make the initial design document from the existing comments.
> Do you have examples of some earlier design documents used for similar
> purpose? Would shared google docs be OK?
>
> Btw, I also figured out an answer to my original question; here is the
> partial codelet for accessing the batch columns that I was missing:
>
> cbuf = <CudaBuffer instance>
> cbatch = pa.cuda.read_record_batch(cbuf, schema)
> for col in cbatch:
>     null_buf, data_buf = col.buffers()
>     cdata_buf = CudaBuffer.from_buffer(data_buf)
>     if null_buf is not None: ...
>     ...
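> To actually materialize values on the host, one can copy back just the
> bytes that are needed, e.g. (untested sketch, assuming a float32 column
> and that cdata_buf is the CudaBuffer obtained above):
>
> import numpy as np
> # copy only the first 10 values of the column back to host memory
> host_buf = cdata_buf.copy_to_host(position=0, nbytes=10 * 4)
> values = np.frombuffer(host_buf, dtype=np.float32)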
>
> This pattern is used in CudaNDArray, which allows accessing the items
> from the host, very similar to numba.cuda's DeviceNDArray:
>   https://github.com/Quansight/pygdf/blob/arrow-cuda/pygdf/cudaarray.py
> (excuse the code, it's WIP and experimental)
>
> Best regards,
> Pearu
>
>
>
>
> On Wed, Oct 3, 2018 at 11:29 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> What are the action items on this? Sounds like we need to start a
>> design document. I'm afraid I don't have the bandwidth to champion GPU
>> functionality at the moment but I will participate in design
>> discussions and help break down complex tasks into more accessible
>> JIRA issues.
>>
>> Thanks
>> Wes
>> On Fri, Sep 28, 2018 at 9:44 AM Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> > Seems like there is a fair bit of work to do to specify APIs and
>> > semantics. I suggest we create a Google document or something
>> > collaborative where we can enumerate and discuss the issues we want to
>> > resolve, and then make a list of the concrete development tasks.
>> >
>> > The underlying problem IMHO in ARROW-2446 is that we do not have the
>> > notion of device. An instance of CudaBuffer is only necessary so that
>> > the appropriate virtual dtor can be invoked to release the memory. As
>> > long as a buffer referencing it is aware of the underlying device,
>> > then our code can dispatch to the correct code paths. At the moment we
>> > can only really detect whether an arrow::Buffer* is a device buffer by
>> > dynamic_cast, and even that is not reliable because we may be dealing
>> > with a slice whose parent is the device buffer.
>> > On Fri, Sep 28, 2018 at 7:17 AM Pearu Peterson
>> > <pearu.peter...@quansight.com> wrote:
>> > >
>> > > Hi Wes,
>> > >
>> > > Yes, it makes sense.
>> > >
>> > > If I understand you correctly, then defining a device abstraction
>> > > would also bring Buffer and CudaBuffer under the same umbrella (that
>> > > would be the opposite approach to ARROW-2446, btw).
>> > >
>> > > This issue is also related to
>> > >   https://github.com/dmlc/dlpack/blob/master/include/dlpack/dlpack.h
>> > > that defines a specification for data locality (for ndarrays but the
>> > > concept is the same for buffers).
>> > >
>> > > ARROW-2447 defines API that uses Buffer::cpu_data(), hence also
>> > > Buffer::cuda_data(), Buffer::disk_data() etc.
>> > >
>> > > I would like to propose a more general model (no guarantees that it
>> > > would make sense implementation-wise :) ):
>> > > 0. CPU would be considered like any other device (this would be in
>> > > line with dlpack). To name a few devices: HOST, CUDA, DISK, FPGA,
>> > > etc., and why not remote databases defined by a URL.
>> > > 1. A device is defined as a unit that (i) has memory for holding
>> > > data, and (ii) may have processor(s) for processing the data
>> > > (computations). For instance, the HOST device has RAM and CPU(s); a
>> > > CUDA device has device memory and GPU(s); a DISK device has memory
>> > > but no processing unit, etc.
>> > > 2. Different devices can access other devices' memory using the same
>> > > API methods (say, Buffer.data()). For processing the data by a device
>> > > (in case the device has a processor), the data is copied to that
>> > > device's memory on demand, unless the data is already stored on the
>> > > same device as the processor. For instance, to process CUDA data with
>> > > the CPU, the HOST device would need to copy the CUDA device data to
>> > > HOST memory (that works currently) and vice-versa (that works as
>> > > well, e.g. using CudaHostBuffer). In another setup, a CUDA device
>> > > might need to use data from DISK: according to this proposal, the
>> > > DISK data would be copied directly to the CUDA device (bypassing HOST
>> > > memory if technically possible).
>> > > So, in short, the implementation has to check whether the processor
>> > > and the memory are on the same device before processing the data; if
>> > > not, the data is copied using the on-demand approach. By the
>> > > on-demand approach, I mean that data references are passed around as
>> > > a pair: (device id, device pointer). A rough pseudo-sketch is given
>> > > below.
>> > > 3. All of the above is controlled from a master device process.
>> > > Usually, the master device would be HOST, but it does not always have
>> > > to be so.
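>> > > For illustration only (none of this exists in Arrow today, and the
>> > > names are made up), the on-demand rule from point 2 could look
>> > > roughly like this:
>> > >
>> > > # Pseudo-sketch of the proposed model: data is passed around as a
>> > > # (device, pointer) pair and copied only when the processing device
>> > > # differs from the device holding the memory.
>> > > class DeviceBuffer:
>> > >     def __init__(self, device, address, size):
>> > >         self.device = device      # e.g. 'HOST', 'CUDA:0', 'DISK'
>> > >         self.address = address
>> > >         self.size = size
>> > >
>> > > def data_for(buf, processor_device, copy_fn):
>> > >     if buf.device == processor_device:
>> > >         return buf                 # already local, no copy needed
>> > >     # copy_fn would dispatch to cudaMemcpy, a file read, etc.
>> > >     return copy_fn(buf, processor_device)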
>> > >
>> > > PS: I realize that this discussion diverges from the original
>> > > subject; feel free to rename the subject if needed.
>> > >
>> > > Best regards,
>> > > Pearu
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Fri, Sep 28, 2018 at 12:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> > >
>> > > > hi Pearu,
>> > > >
>> > > > Yes, I think it would be a good idea to develop some tools to make
>> > > > interacting with device memory using the existing data structures
>> > > > work seamlessly.
>> > > >
>> > > > This is all closely related to
>> > > >
>> > > > https://issues.apache.org/jira/browse/ARROW-2447
>> > > >
>> > > > I would say step 1 would be defining the device abstraction. Then we
>> > > > can add methods or properties to the data structures in pyarrow to
>> > > > show the location of the memory, whether CUDA or host RAM, etc. We
>> > > > could also have a memory-mapped device for memory maps to be able to
>> > > > communicate that data is on disk. We could then define virtual APIs
>> > > > for host-side data access to ensure that memory is copied to the
>> > > > host if needed (e.g. in the case of indexing into the values of an
>> > > > array).
>> > > >
>> > > > There are some small details around the handling of device in the
>> > > > case of hierarchical memory references. So if we say
>> > > > `buffer->GetDevice()`, then even if it's a sliced buffer (which will
>> > > > be the case after using any IPC reader APIs), it needs to return the
>> > > > right device. This means that we probably need to define a
>> > > > SlicedBuffer type that delegates GetDevice() calls to the parent
>> > > > buffer.
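>> > > > As a rough sketch of that delegation idea (the class and method
>> > > > names here are made up, not Arrow APIs):
>> > > >
>> > > > class SlicedBuffer:
>> > > >     def __init__(self, parent, offset, length):
>> > > >         self.parent = parent
>> > > >         self.offset = offset
>> > > >         self.length = length
>> > > >
>> > > >     def get_device(self):
>> > > >         # a slice lives wherever its parent lives
>> > > >         return self.parent.get_device()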
>> > > >
>> > > > Let me know if what I'm saying makes sense. Kou and Antoine probably
>> > > > have some thoughts about this also.
>> > > >
>> > > > - Wes
>> > > > On Fri, Sep 28, 2018 at 5:34 AM Pearu Peterson
>> > > > <pearu.peter...@quansight.com> wrote:
>> > > > >
>> > > > > Hi,
>> > > > >
>> > > > > Consider the following use case:
>> > > > >
>> > > > > schema = <pa.Schema instance>
>> > > > > cbuf = <pa.cuda.CudaBuffer instance>
>> > > > > cbatch = pa.cuda.read_record_batch(schema, cbuf)
>> > > > >
>> > > > > Note that cbatch is a pa.RecordBatch instance where the data
>> > > > > pointers are device pointers.
>> > > > >
>> > > > > for col in cbatch.columns:
>> > > > >     # here col is, say, a FloatArray whose data pointer is a
>> > > > >     # device pointer; as a result, accessing col data, say,
>> > > > >     # taking a slice, leads to segfaults
>> > > > >     print(col[0])
>> > > > >
>> > > > > The aim of this message is to establish a user-friendly way to
>> > > > > access, say, a slice of the device data so that only the requested
>> > > > > data is copied to the host.
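>> > > > > For example (a hypothetical sketch of the kind of wrapper I have
>> > > > > in mind; CudaColumn is not an existing class), indexing could copy
>> > > > > only the requested element to the host:
>> > > > >
>> > > > > import numpy as np
>> > > > >
>> > > > > class CudaColumn:
>> > > > >     def __init__(self, cuda_buf, dtype):
>> > > > >         self.cuda_buf = cuda_buf          # a pa.cuda.CudaBuffer
>> > > > >         self.dtype = np.dtype(dtype)
>> > > > >
>> > > > >     def __getitem__(self, i):
>> > > > >         # copy just one item's worth of bytes back to the host
>> > > > >         host = self.cuda_buf.copy_to_host(
>> > > > >             position=i * self.dtype.itemsize,
>> > > > >             nbytes=self.dtype.itemsize)
>> > > > >         return np.frombuffer(host, dtype=self.dtype)[0]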
>> > > > >
>> > > > > Or more generally, should there be a CUDA-specific RecordBatch
>> > > > > that implements the RecordBatch API and can be used from the host?
>> > > > >
>> > > > > For instance, this would be similar to DeviceNDArray in numba,
>> > > > > which basically implements the ndarray API for device data while
>> > > > > the API can be used from the host.
>> > > > >
>> > > > > What do you think? What would be the proper approach? (I can do
>> > > > > the implementation.)
>> > > > >
>> > > > > Best regards,
>> > > > > Pearu
>> > > >
>>
>
