Hello Wes, Antoine,

Thanks for your very detailed responses!
It is really good to know that what is in arrow/gpu now is already set
up to integrate with various GPU producers / consumers. The other
responses made sense (assume in-memory and rely on orchestration,
explicit over implicit, roadmap discussions on Confluence, integrating
CIs). Thanks again!

Be Well
Anthony

On Tue, Jun 26, 2018 at 1:04 PM Wes McKinney <[email protected]> wrote:

> hi Anthony,
>
> Antoine is right that a Device abstraction is needed. I hadn't seen
> ARROW-2447 (I was on vacation in April), but I will comment there.
>
> It would be helpful to collect more requirements from GPU users -- one
> of the reasons that I set up the arrow/gpu project to begin with was
> to help catalyze collaborations with the GPU community. Unfortunately,
> that hasn't really happened yet after nearly a year, so hopefully we
> can get more folks involved in the near future.
>
> Some answers to your questions inline:
>
> On Tue, Jun 26, 2018 at 11:55 AM, Anthony Scopatz <[email protected]> wrote:
> > Hello All,
> >
> > As some of you may know, a few of us at Quansight have started (in
> > partnership with NVIDIA) looking at Arrow's GPU capabilities. We are
> > excited to help improve and expand Arrow's GPU support, but we did
> > have a few initial scoping questions.
> >
> > Feel free to break these out into separate discussion threads if
> > needed. Hopefully, some of them will be easy enough to answer.
> >
> > 1. What is the status of the GPU code in arrow now? E.g.
> > https://github.com/apache/arrow/tree/master/cpp/src/arrow/gpu
> > Is anyone actively working on this part of the code base? Are there
> > other folks working on GPU support? I'd love to chat, if so!
>
> The code there is basically waiting for one or more stakeholder users
> to get involved and help drive the roadmap. What is there now is
> pretty basic.
>
> To give you some context, I observed that some parts of this project
> (IPC / data structure reconstruction on GPUs) were being reimplemented
> in https://github.com/gpuopenanalytics/libgdf. So I started by setting
> up basic abstractions to plug the CUDA driver API into Arrow's various
> abstract interfaces for memory management and IO. I then implemented
> GPU-specialized IPC read and write functions so that these code paths
> in arrow/ipc can function without the data being addressable in CPU
> memory. See the GPU IPC unit tests here:
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/gpu/cuda-test.cc#L311
>
> I contributed some patches to MapD and hoped to rework more of their
> Arrow interop to use these functions, but didn't get 100% of the way
> there last winter.
>
> With MapD, libgdf, BlazingDB, and other current and future GPU Arrow
> producers and consumers, I think there are plenty of components like
> these that it would make sense to develop here.
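For a rough illustration of what those GPU IPC paths enable, here is a
minimal sketch. It uses the Python-level pyarrow.cuda bindings as a
stand-in for the C++ functions described above; those bindings postdate
this thread and are an assumption here (they require a CUDA-enabled
build of pyarrow and a GPU to run):

    # Sketch only: assumes pyarrow built with CUDA support and an
    # available GPU; pyarrow.cuda wraps the C++ arrow/gpu code paths.
    import pyarrow as pa
    from pyarrow import cuda

    ctx = cuda.Context(0)  # CUDA driver context for device 0

    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3], type=pa.int64())],
        names=["x"],
    )

    # Write the batch into device memory via the GPU IPC write path...
    device_buf = cuda.serialize_record_batch(batch, ctx)

    # ...then reconstruct a record batch whose buffers point at device
    # memory, without copying the data back to the host.
    gpu_batch = cuda.read_record_batch(device_buf, batch.schema)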
> > 2. Should arrow compute assume that everything fits in memory?
> > Arrow seems to handle data that is larger than memory via the
> > Buffer API. Are there restrictions that using Buffers imply that
> > we should be aware of?
>
> This is a large question. Many database systems work on
> larger-than-memory datasets by splitting the problem into fragments
> that do fit into memory. I think it would be reasonable to start by
> developing computational functions that operate on in-memory data,
> then leave it up to a task scheduler implementation to orchestrate an
> algorithm on larger-than-memory datasets. This is similar to how Dask
> has used pandas to work in an out-of-core and distributed setting.
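To make that division of labor concrete, here is a minimal sketch of
in-memory kernels applied one record batch at a time, with the looping
left to an outer driver or scheduler. The pyarrow.compute module used
here arrived after this thread, so treat it as an assumption:

    # Sketch only: pyarrow.compute postdates this thread.
    import pyarrow as pa
    import pyarrow.compute as pc

    def chunked_sum(batches, column_index=0):
        # Each record batch is assumed to fit in memory; the loop (or
        # a task scheduler, as in Dask) supplies the out-of-core
        # orchestration around the in-memory kernel.
        total = 0
        for batch in batches:
            total += pc.sum(batch.column(column_index)).as_py()
        return total

    batches = [
        pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"]),
        pa.RecordBatch.from_arrays([pa.array([4, 5, 6])], names=["x"]),
    ]
    assert chunked_sum(batches) == 21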
> > 3. What is the imagined interface between pyarrow and a GPU
> > DataFrame? One idea is to have the selection of main memory or the
> > GPU be totally transparent to the user. Another possible
> > suggestion is to be explicit to the user about where the data
> > lives, for example:
> >
> > >>> import pyarrow as pa
> > >>> a = pa.array(..., type=...)  # create pyarrow array instance
> > >>> a_g = a.to_gpu(<device parameters>)  # send `a` to GPU
> > >>> def foo(a): ... return ...  # a function doing operations with `a`
> > >>> r = foo(a)  # perform operations with `a`, runs on CPU
> > >>> r_g = foo(a_g)  # perform operations with `a_g`, runs on GPU
> > >>> assert r == r_g.to_mem()  # results are the same
>
> Data frames are kind of a semantic construct. As an example, pandas
> utilizes data structures and a mix of low-level algorithms that run
> against NumPy arrays to define the semantics for what a "pandas
> DataFrame" is. But, since the Arrow columnar format was born from the
> needs of analytic database systems and in-memory analytics systems
> like pandas, we've captured more of the semantics of data frames than
> a generic array computing library does.
>
> In the case of Arrow, we have strived to be "front end agnostic", so
> if the objective is to develop front ends for Python programmers, then
> our goal would be to provide within pyarrow the data structures,
> metadata, IO / data access, and computational building blocks to do
> that. The pyarrow API is intended to give the developer explicit
> control over as much as possible, so they can decide what happens and
> when in their application or front-end API.
>
> > 4. Who has been working on arrow compute kernels? Are there any
> > design docs or discussions we should look at? We've been following
> > the Gandiva discussions and also the Ursa Labs Roadmap
> > <https://ursalabs.org/tech/#arrow-native-computation-engine>.
>
> On the C++ side, it's been mostly me, Uwe, Phillip Cloud, and Antoine.
> We built a few things to unblock some use cases we had (like type
> casting). I expect that longer term we'll have a mix of pre-compiled
> kernels (similar to TensorFlow's operator-kernel subsystem -- the
> nearest analogue I can think of) and runtime-compiled kernels (i.e.
> LLVM / Gandiva-like).
>
> I wrote up some of my thoughts on this in the Ursa Labs document you
> cited, but we don't have much in the way of roadmap documents for
> function kernels in the Arrow community. I started a separate thread
> about documentation organization in part to kickstart more roadmapping
> -- I would say that the ASF Confluence space for Arrow would be the
> best place for this work to happen.
>
> > 5. Should the user be able to switch between compute
> > implementations at runtime, or only at compile time?
>
> It has been my hope to develop kernel dispatch machinery that can take
> into account the execution device in addition to the input types.
> Currently, we are only doing kernel selection based on input types and
> other kernel parameters. If, at dispatch time / runtime, the code
> indicated that the data was on the GPU, then a GPU kernel would be
> selected.
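As a toy sketch of what that kind of type- and device-aware dispatch
could look like: a registry keyed on (operation, device, input type).
Everything below is a hypothetical illustration, not Arrow API:

    # Toy sketch: none of these names are part of Arrow; they only
    # illustrate dispatching on device as well as input type.
    from dataclasses import dataclass

    @dataclass
    class TypedData:
        values: list
        dtype: str    # e.g. "int64"
        device: str   # e.g. "cpu" or "gpu"

    KERNELS = {}

    def register(op, device, dtype):
        def wrap(fn):
            KERNELS[(op, device, dtype)] = fn
            return fn
        return wrap

    @register("sum", "cpu", "int64")
    def sum_cpu_int64(values):
        return sum(values)

    @register("sum", "gpu", "int64")
    def sum_gpu_int64(values):
        raise NotImplementedError("would launch a CUDA kernel here")

    def dispatch(op, data):
        # Kernel selection considers where the data lives, not just
        # its type.
        return KERNELS[(op, data.device, data.dtype)](data.values)

    assert dispatch("sum", TypedData([1, 2, 3], "int64", "cpu")) == 6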
> > 6. Arrow's CI doesn't currently seem to support GPUs. If a free
> > GPU CI service were to come along, would Arrow be open to using it?
>
> Yes, I think so. Apache Spark has a Jenkins instance administered by
> UC Berkeley that's integrated with their GitHub. I can imagine a
> similar system where a bot will trigger builds in a GPU-enabled
> Jenkins when certain conditions are met (commit message flags) or if
> the developer requests it.
>
> > Other than that, we'd love to know where and how we can plug in
> > and help out!
>
> Thanks! Glad to have more folks involved on this.
>
> - Wes
>
> > Be Well
> > Anthony

--

Asst. Prof. Anthony Scopatz
Nuclear Engineering Program
Mechanical Engineering Dept.
University of South Carolina
[email protected]
Cell: (512) 827-8239
Book a meeting with me at https://scopatz.youcanbook.me/
Open up an issue: https://github.com/scopatz/me/issues
Check my calendar
<https://www.google.com/calendar/embed?src=scopatz%40gmail.com>