Hi Anthony,

Antoine is right that a Device abstraction is needed. I hadn't seen ARROW-2447 (I was on vacation in April), but I will comment there.
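Roughly, I imagine something along these lines (a hypothetical sketch only -- all of the names are invented, and the real design discussion belongs on the JIRA):

    // Hypothetical Device abstraction (names invented for illustration).
    // Buffers would carry a handle to the device that owns their memory,
    // and cross-device copies would go through this interface instead of
    // ad hoc CUDA driver calls scattered through the code.
    #include <memory>
    #include <string>
    #include <arrow/buffer.h>
    #include <arrow/status.h>

    class Device {
     public:
      virtual ~Device() = default;

      virtual std::string name() const = 0;  // e.g. "cpu" or "cuda:0"
      virtual bool is_cpu() const = 0;

      // Copy `src`, which this device owns, into memory owned by `dest`
      virtual arrow::Status CopyTo(const std::shared_ptr<arrow::Buffer>& src,
                                   Device* dest,
                                   std::shared_ptr<arrow::Buffer>* out) = 0;
    };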
It would be helpful to collect more requirements from GPU users -- one of the reasons that I set up the arrow/gpu project to begin with was to help catalyze collaborations with the GPU community. Unfortunately, that hasn't really happened yet after nearly a year, so hopefully we can get more folks involved in the near future. Some answers to your questions inline:

On Tue, Jun 26, 2018 at 11:55 AM, Anthony Scopatz <scop...@gmail.com> wrote:

> Hello All,
>
> As some of you may know, a few of us at Quansight (in partnership with
> NVIDIA) have started looking at Arrow's GPU capabilities. We are excited
> to help improve and expand Arrow's GPU support, but we did have a few
> initial scoping questions.
>
> Feel free to break these out into separate discussion threads if needed.
> Hopefully, some of them will be easy enough to answer.
>
> 1. What is the status of the GPU code in arrow now? E.g.
> https://github.com/apache/arrow/tree/master/cpp/src/arrow/gpu Is anyone
> actively working on this part of the code base? Are there other folks
> working on GPU support? I'd love to chat, if so!

The code there is basically waiting for one or more stakeholder users to get involved and help drive the roadmap. What is there now is pretty basic.

To give you some context, I observed that some parts of this project (IPC / data structure reconstruction on GPUs) were being reimplemented in https://github.com/gpuopenanalytics/libgdf. So I started by setting up basic abstractions to plug the CUDA driver API into Arrow's various abstract interfaces for memory management and IO. I then implemented GPU-specialized IPC read and write functions so that these code paths in arrow/ipc can function without the data being addressable in CPU memory. See the GPU IPC unit tests here:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/gpu/cuda-test.cc#L311

I contributed some patches to MapD and hoped to rework more of their Arrow interop to use these functions, but didn't get 100% of the way there last winter. With MapD, libgdf, BlazingDB, and other current and future GPU Arrow producers and consumers, I think there are plenty of components like these that it would make sense to develop here.

> 2. Should arrow compute assume that everything fits in memory? Arrow
> seems to handle data that is larger than memory via the Buffer API. Are
> there restrictions that using Buffers implies that we should be aware of?

This is a large question. Many database systems work on larger-than-memory datasets by splitting the problem into fragments that do fit into memory. I think it would be reasonable to start by developing computational functions that operate on in-memory data, then leave it up to a task scheduler implementation to orchestrate an algorithm on larger-than-memory datasets. This is similar to how Dask has used pandas to work in an out-of-core and distributed setting.
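To make that division of labor concrete, here is a toy sketch (not code that exists in Arrow today -- I'm assuming an int64 column and the current Status-returning C++ APIs, which may change). The inner computation only ever sees one in-memory record batch, and the surrounding loop plays the role that a Dask-like scheduler would play:

    // Toy sketch: out-of-core sum over an Arrow IPC file, one record
    // batch at a time. Only the current batch needs to fit in memory.
    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <arrow/ipc/reader.h>

    arrow::Status SumInt64Column(const std::string& path, int column_index,
                                 int64_t* total) {
      std::shared_ptr<arrow::io::MemoryMappedFile> file;
      ARROW_RETURN_NOT_OK(arrow::io::MemoryMappedFile::Open(
          path, arrow::io::FileMode::READ, &file));

      std::shared_ptr<arrow::ipc::RecordBatchFileReader> reader;
      ARROW_RETURN_NOT_OK(
          arrow::ipc::RecordBatchFileReader::Open(file.get(), &reader));

      *total = 0;
      for (int i = 0; i < reader->num_record_batches(); ++i) {
        // Materialize one fragment of the larger-than-memory dataset
        std::shared_ptr<arrow::RecordBatch> batch;
        ARROW_RETURN_NOT_OK(reader->ReadRecordBatch(i, &batch));
        auto values = std::static_pointer_cast<arrow::Int64Array>(
            batch->column(column_index));
        for (int64_t j = 0; j < values->length(); ++j) {
          if (values->IsValid(j)) {
            *total += values->Value(j);
          }
        }
      }
      return arrow::Status::OK();
    }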
> 3. What is the imagined interface between pyarrow and a GPU DataFrame?
> One idea is to have the selection of main memory and the GPU be totally
> transparent to the user. Another possible suggestion is to be explicit to
> the user about where the data lives, for example:
>
> >>> import pyarrow as pa
> >>> a = pa.array(..., type=...)          # create pyarrow array instance
> >>> a_g = a.to_gpu(<device parameters>)  # send `a` to GPU
> >>> def foo(a): ... return ...           # a function doing operations with `a`
> >>> r = foo(a)                           # perform operations with `a`, runs on CPU
> >>> r_g = foo(a_g)                       # perform operations with `a_g`, runs on GPU
> >>> assert r == r_g.to_mem()             # results are the same

Data frames are kind of a semantic construct. As an example, pandas utilizes data structures and a mix of low-level algorithms that run against NumPy arrays to define the semantics of what a "pandas DataFrame" is. But since the Arrow columnar format was born from the needs of analytic database systems and in-memory analytics systems like pandas, we've captured more of the semantics of data frames than a generic array computing library does.

In the case of Arrow, we have strived to be "front end agnostic", so if the objective is to develop front ends for Python programmers, then our goal would be to provide within pyarrow the data structures, metadata, IO / data access, and computational building blocks to do that. The pyarrow API is intended to give the developer explicit control over as much as possible, so they can decide what happens, and when, in their application or front-end API.

> 4. Who has been working on arrow compute kernels, are there any design
> docs or discussions we should look at? We've been following the Gandiva
> discussions and also the Ursa Labs Roadmap
> <https://ursalabs.org/tech/#arrow-native-computation-engine>.

On the C++ side, it's been mostly me, Uwe, Phillip Cloud, and Antoine. We built a few things to unblock some use cases we had (like type casting). I expect that longer term we'll have a mix of pre-compiled kernels (similar to TensorFlow's operator-kernel subsystem -- the nearest analogue I can think of) and runtime-compiled kernels (i.e. LLVM / Gandiva-like).

I wrote up some of my thoughts on this in the Ursa Labs document you cited, but we don't have much in the way of roadmap documents for function kernels in the Arrow community. I started a separate thread about documentation organization in part to kickstart more roadmapping -- I would say that the ASF Confluence space for Arrow would be the best place for this work to happen.

> 5. Should the user be able to switch between compute
> implementations at runtime, or only at compile time?

It has been my hope to develop kernel dispatch machinery that can take into account the execution device in addition to the input types. Currently, we are only doing kernel selection based on input types and other kernel parameters. If, at dispatch time / runtime, the code indicated that the data was on the GPU, then a GPU kernel would be selected (see the sketch in the P.S. below).

> 6. Arrow's CI doesn't currently seem to support GPUs. If a free GPU CI
> service were to come along, would Arrow be open to using it?

Yes, I think so. Apache Spark has a Jenkins instance administered by UC Berkeley that's integrated with their GitHub. I can imagine a similar system where a bot triggers builds in a GPU-enabled Jenkins when certain conditions are met (commit message flags) or when a developer requests it.

> Other than that, we'd love to know where and how we can plug in and help
> out!

Thanks! Glad to have more folks involved on this.

- Wes
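P.S. To make the device-aware dispatch idea from #5 a little more concrete, here is a rough sketch. None of this machinery exists today and all of the names are invented for illustration; the point is just that the dispatch key grows a device dimension alongside the input type:

    // Hypothetical sketch: a kernel registry keyed by (device, input type).
    // Today we select kernels based on input types and other kernel
    // parameters only; adding the device would let the same logical
    // function resolve to a CPU or GPU implementation at runtime.
    #include <functional>
    #include <map>
    #include <utility>
    #include <arrow/status.h>
    #include <arrow/type.h>

    enum class DeviceType { CPU, CUDA };

    struct Kernel {
      // Placeholder signature; a real kernel would receive typed inputs
      // and outputs
      std::function<arrow::Status()> exec;
    };

    class KernelRegistry {
     public:
      void Register(DeviceType device, arrow::Type::type input_type,
                    Kernel kernel) {
        kernels_[{device, input_type}] = std::move(kernel);
      }

      // At dispatch time we know both where the data lives and what type
      // it is, so we can pick the matching implementation (or report that
      // no kernel is registered for that combination)
      const Kernel* Lookup(DeviceType device,
                           arrow::Type::type input_type) const {
        auto it = kernels_.find({device, input_type});
        return it == kernels_.end() ? nullptr : &it->second;
      }

     private:
      std::map<std::pair<DeviceType, arrow::Type::type>, Kernel> kernels_;
    };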