Hi Anthony,

Antoine is right that a Device abstraction is needed. I hadn't seen ARROW-2447 (I was on vacation in April), but I will comment there.
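Roughly, I imagine something along these lines (a hypothetical sketch only -- all of the names are invented, and the real design discussion belongs on the JIRA):

    // Hypothetical Device abstraction (names invented for illustration).
    // Buffers would carry a handle to the device that owns their memory,
    // and cross-device copies would go through this interface instead of
    // ad hoc CUDA driver calls scattered through the code.
    #include <memory>
    #include <string>
    #include <arrow/buffer.h>
    #include <arrow/status.h>

    class Device {
     public:
      virtual ~Device() = default;

      virtual std::string name() const = 0;  // e.g. "cpu" or "cuda:0"
      virtual bool is_cpu() const = 0;

      // Copy `src`, which this device owns, into memory owned by `dest`
      virtual arrow::Status CopyTo(const std::shared_ptr<arrow::Buffer>& src,
                                   Device* dest,
                                   std::shared_ptr<arrow::Buffer>* out) = 0;
    };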
It would be helpful to collect more requirements from GPU users -- one of the reasons that I set up the arrow/gpu project to begin with was to help catalyze collaborations with the GPU community. Unfortunately, that hasn't really happened yet after nearly a year, so hopefully we can get more folks involved in the near future. Some answers to your questions inline:

On Tue, Jun 26, 2018 at 11:55 AM, Anthony Scopatz <scop...@gmail.com> wrote:

> Hello All,
>
> As some of you may know, a few of us at Quansight (in partnership with
> NVIDIA) have started looking at Arrow's GPU capabilities. We are excited
> to help improve and expand Arrow's GPU support, but we did have a few
> initial scoping questions.
>
> Feel free to break these out into separate discussion threads if needed.
> Hopefully, some of them will be easy enough to answer.
>
> 1. What is the status of the GPU code in arrow now? E.g.
> https://github.com/apache/arrow/tree/master/cpp/src/arrow/gpu Is anyone
> actively working on this part of the code base? Are there other folks
> working on GPU support? I'd love to chat, if so!

The code there is basically waiting for one or more stakeholder users to get involved and help drive the roadmap. What is there now is pretty basic.

To give you some context, I observed that some parts of this project (IPC / data structure reconstruction on GPUs) were being reimplemented in https://github.com/gpuopenanalytics/libgdf. So I started by setting up basic abstractions to plug the CUDA driver API into Arrow's various abstract interfaces for memory management and IO. I then implemented GPU-specialized IPC read and write functions so that these code paths in arrow/ipc can function without the data being addressable in CPU memory. See the GPU IPC unit tests here:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/gpu/cuda-test.cc#L311

I contributed some patches to MapD and hoped to rework more of their Arrow interop to use these functions, but didn't get 100% of the way there last winter. With MapD, libgdf, BlazingDB, and other current and future GPU Arrow producers and consumers, I think there are plenty of components like these that it would make sense to develop here.

> 2. Should arrow compute assume that everything fits in memory? Arrow
> seems to handle data that is larger than memory via the Buffer API. Are
> there restrictions that using Buffers implies that we should be aware of?

This is a large question. Many database systems work on larger-than-memory datasets by splitting the problem into fragments that do fit into memory. I think it would be reasonable to start by developing computational functions that operate on in-memory data, then leave it up to a task scheduler implementation to orchestrate an algorithm on larger-than-memory datasets. This is similar to how Dask has used pandas to work in an out-of-core and distributed setting.
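To make that division of labor concrete, here is a toy sketch (not code that exists in Arrow today -- I'm assuming an int64 column and the current Status-returning C++ APIs, which may change). The inner computation only ever sees one in-memory record batch, and the surrounding loop plays the role that a Dask-like scheduler would play:

    // Toy sketch: out-of-core sum over an Arrow IPC file, one record
    // batch at a time. Only the current batch needs to fit in memory.
    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <arrow/ipc/reader.h>

    arrow::Status SumInt64Column(const std::string& path, int column_index,
                                 int64_t* total) {
      std::shared_ptr<arrow::io::MemoryMappedFile> file;
      ARROW_RETURN_NOT_OK(arrow::io::MemoryMappedFile::Open(
          path, arrow::io::FileMode::READ, &file));

      std::shared_ptr<arrow::ipc::RecordBatchFileReader> reader;
      ARROW_RETURN_NOT_OK(
          arrow::ipc::RecordBatchFileReader::Open(file.get(), &reader));

      *total = 0;
      for (int i = 0; i < reader->num_record_batches(); ++i) {
        // Materialize one fragment of the larger-than-memory dataset
        std::shared_ptr<arrow::RecordBatch> batch;
        ARROW_RETURN_NOT_OK(reader->ReadRecordBatch(i, &batch));
        auto values = std::static_pointer_cast<arrow::Int64Array>(
            batch->column(column_index));
        for (int64_t j = 0; j < values->length(); ++j) {
          if (values->IsValid(j)) {
            *total += values->Value(j);
          }
        }
      }
      return arrow::Status::OK();
    }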
> 3. What is the imagined interface between pyarrow and a GPU DataFrame?
> One idea is to have the selection of main memory and the GPU be totally
> transparent to the user. Another possible suggestion is to be explicit to
> the user about where the data lives, for example:
>
> >>> import pyarrow as pa
> >>> a = pa.array(..., type=...)          # create pyarrow array instance
> >>> a_g = a.to_gpu(<device parameters>)  # send `a` to GPU
> >>> def foo(a): ... return ...           # a function doing operations with `a`
> >>> r = foo(a)                           # perform operations with `a`, runs on CPU
> >>> r_g = foo(a_g)                       # perform operations with `a_g`, runs on GPU
> >>> assert r == r_g.to_mem()             # results are the same

Data frames are kind of a semantic construct. As an example, pandas utilizes data structures and a mix of low-level algorithms that run against NumPy arrays to define the semantics of what a "pandas DataFrame" is. But since the Arrow columnar format was born from the needs of analytic database systems and in-memory analytics systems like pandas, we've captured more of the semantics of data frames than a generic array computing library does.

In the case of Arrow, we have strived to be "front end agnostic", so if the objective is to develop front ends for Python programmers, then our goal would be to provide within pyarrow the data structures, metadata, IO / data access, and computational building blocks to do that. The pyarrow API is intended to give the developer explicit control over as much as possible, so they can decide what happens, and when, in their application or front-end API.

> 4. Who has been working on arrow compute kernels, are there any design
> docs or discussions we should look at? We've been following the Gandiva
> discussions and also the Ursa Labs Roadmap
> <https://ursalabs.org/tech/#arrow-native-computation-engine>.

On the C++ side, it's been mostly me, Uwe, Phillip Cloud, and Antoine. We built a few things to unblock some use cases we had (like type casting). I expect that longer term we'll have a mix of pre-compiled kernels (similar to TensorFlow's operator-kernel subsystem -- the nearest analogue I can think of) and runtime-compiled kernels (i.e. LLVM / Gandiva-like).

I wrote up some of my thoughts on this in the Ursa Labs document you cited, but we don't have much in the way of roadmap documents for function kernels in the Arrow community. I started a separate thread about documentation organization in part to kickstart more roadmapping -- I would say that the ASF Confluence space for Arrow would be the best place for this work to happen.

> 5. Should the user be able to switch between compute
> implementations at runtime, or only at compile time?

It has been my hope to develop kernel dispatch machinery that can take into account the execution device in addition to the input types. Currently, we are only doing kernel selection based on input types and other kernel parameters. If, at dispatch time / runtime, the code indicated that the data was on the GPU, then a GPU kernel would be selected (see the sketch in the P.S. below).

> 6. Arrow's CI doesn't currently seem to support GPUs. If a free GPU CI
> service were to come along, would Arrow be open to using it?

Yes, I think so. Apache Spark has a Jenkins instance administered by UC Berkeley that's integrated with their GitHub. I can imagine a similar system where a bot triggers builds in a GPU-enabled Jenkins when certain conditions are met (commit message flags) or when a developer requests it.

> Other than that, we'd love to know where and how we can plug in and help
> out!

Thanks! Glad to have more folks involved on this.

- Wes
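P.S. To make the device-aware dispatch idea from #5 a little more concrete, here is a rough sketch. None of this machinery exists today and all of the names are invented for illustration; the point is just that the dispatch key grows a device dimension alongside the input type:

    // Hypothetical sketch: a kernel registry keyed by (device, input type).
    // Today we select kernels based on input types and other kernel
    // parameters only; adding the device would let the same logical
    // function resolve to a CPU or GPU implementation at runtime.
    #include <functional>
    #include <map>
    #include <utility>
    #include <arrow/status.h>
    #include <arrow/type.h>

    enum class DeviceType { CPU, CUDA };

    struct Kernel {
      // Placeholder signature; a real kernel would receive typed inputs
      // and outputs
      std::function<arrow::Status()> exec;
    };

    class KernelRegistry {
     public:
      void Register(DeviceType device, arrow::Type::type input_type,
                    Kernel kernel) {
        kernels_[{device, input_type}] = std::move(kernel);
      }

      // At dispatch time we know both where the data lives and what type
      // it is, so we can pick the matching implementation (or report that
      // no kernel is registered for that combination)
      const Kernel* Lookup(DeviceType device,
                           arrow::Type::type input_type) const {
        auto it = kernels_.find({device, input_type});
        return it == kernels_.end() ? nullptr : &it->second;
      }

     private:
      std::map<std::pair<DeviceType, arrow::Type::type>, Kernel> kernels_;
    };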