Alex - great stuff, and the nvblas numbers are pretty remarkable (almost too good... did you check the results for correctness? Also, is it possible that the "unified memory model" of nvblas is somehow hiding PCI transfer time?)
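For the correctness question, even a quick spot check against a naive multiply would settle it - a minimal sketch along these lines (Breeze in the spark-shell; the size and the expected error bound are ballpark):

import breeze.linalg.{DenseMatrix, max}
import breeze.numerics.abs

val n = 512
val a = DenseMatrix.rand(n, n)
val b = DenseMatrix.rand(n, n)
val c = a * b  // dispatched through netlib (nvblas, if LD_PRELOAD took effect)

// naive O(n^3) reference multiply in pure Scala
val ref = DenseMatrix.zeros[Double](n, n)
for (i <- 0 until n; j <- 0 until n) {
  var s = 0.0
  var k = 0
  while (k < n) { s += a(i, k) * b(k, j); k += 1 }
  ref(i, j) = s
}

println("max abs error: " + max(abs(c - ref)))  // should be ~1e-10 or smaller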
this last bit (getting nvblas + netlib-java to play together) sounds like it's non-trivial and took you a while to figure out! Would you mind posting a gist or something of maybe the shell scripts/exports you used to make this work - I can imagine it being highly useful for others in the future. Thanks! Evan

On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> Hi again,
>
> I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBlas or MKL might be a better choice. This correlates with the original nvblas presentation at GPU Tech Conf 2013 (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>
> My results: https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Just in case: these tests are not meant as a generalization about the performance of the different libraries. I just want to pick the library that does dense matrix multiplication best for my task.
>
> P.S. My previous issue with nvblas was the following: it provides Fortran blas functions, while netlib-java uses C cblas functions. So one needs a cblas shared library to use nvblas through netlib-java. Fedora does not ship cblas (Debian and Ubuntu do), so I had to compile it myself. I could not use the cblas from Atlas or OpenBlas because they link to their own implementations and not to Fortran blas.
>
> Best regards, Alexander
>
> -----Original Message-----
> From: Ulanov, Alexander
> Sent: Tuesday, March 24, 2015 6:57 PM
> To: Sam Halliday
> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
> Hi,
>
> I am trying to use nvblas with netlib-java from Spark. The nvblas functions should replace the current blas function calls after setting LD_PRELOAD, as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java. It seems to work for a simple Java example, but I cannot make it work with Spark. I run the following:
>
> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
>
> In nvidia-smi I observe that Java is set up to use the GPU:
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |=============================================================================|
> |    0      8873     C  bash                                           39MiB  |
> |    0      8910     C  /usr/lib/jvm/java-1.7.0/bin/java               39MiB  |
> +-----------------------------------------------------------------------------+
>
> In the Spark shell I do a matrix multiplication and see the following:
>
> 15/03/25 06:48:01 INFO JniLoader: successfully loaded /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>
> So I am sure that netlib-native is loaded and cblas is supposedly used. However, the matrix multiplication executes on the CPU, since I see 16% CPU usage and 0% GPU usage. I also checked different matrix sizes, from 100x100 to 12000x12000.
>
> Could you suggest why LD_PRELOAD might not affect the Spark shell?
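> For reference, the "simple Java example" that does pick up nvblas is essentially just a direct dgemm call through netlib-java - a minimal sketch (written in Scala here so it can also be pasted into the spark-shell; the size is arbitrary):
>
> import com.github.fommil.netlib.BLAS
>
> val n = 4096
> val a = Array.fill(n * n)(math.random)
> val b = Array.fill(n * n)(math.random)
> val c = new Array[Double](n * n)
> // column-major GEMM: c = 1.0 * a * b + 0.0 * c
> BLAS.getInstance.dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
> println(c(0)) // keep the result alive
>
> While it runs, nvidia-smi shows GPU utilization when nvblas has been picked up.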
> Best regards, Alexander
>
> From: Sam Halliday [mailto:sam.halli...@gmail.com]
> Sent: Monday, March 09, 2015 6:01 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
> Thanks so much for following up on this!
>
> Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...
>
> On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:
> Hi Everyone, I've updated the benchmark as Xiangrui suggested. I added a comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for Double in the current source code) and ran the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Best regards, Alexander
>
> -----Original Message-----
> From: Sam Halliday [mailto:sam.halli...@gmail.com]
> Sent: Tuesday, March 03, 2015 1:54 PM
> To: Xiangrui Meng; Joseph Bradley
> Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> BTW, is anybody on this list going to the London Meetup in a few weeks?
>
> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>
> Would be nice to meet other people working on the guts of Spark! :-)
>
> Xiangrui Meng <men...@gmail.com> writes:
>
> > Hey Alexander,
> >
> > I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java?
> >
> > CC'ed Sam, the author of netlib-java.
> >
> > Best,
> > Xiangrui
> >
> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:
> >> Better documentation for linking would be very helpful! Here's a JIRA:
> >> https://issues.apache.org/jira/browse/SPARK-6019
> >>
> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
> >>
> >>> Thanks for compiling all the data and running these benchmarks, Alex. The big takeaways here can be seen with this chart:
> >>>
> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
> >>>
> >>> 1) A properly configured GPU matrix multiply implementation (e.g. BIDMat+GPU) can provide a substantial (but less than an order of magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or netlib-java+openblas-compiled).
> >>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref) can be 1-2 orders of magnitude worse than a well-tuned CPU implementation, particularly for larger matrices. This is not to pick on netlib - this basically agrees with the author's own benchmarks (https://github.com/fommil/netlib-java).
> >>>
> >>> I think that most of our users are in a situation where using GPUs may not be practical - although we could consider having a good GPU backend available as an option. However, *ALL* users of MLlib could benefit (potentially tremendously) from using a well-tuned CPU-based BLAS implementation.
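> >>> For concreteness, "well-tuned" currently means something along these lines on a Debian-flavored Linux box (a sketch - package names and paths vary by distro):
> >>>
> >>> sudo apt-get install libopenblas-base            # or build OpenBLAS from source
> >>> sudo update-alternatives --config libblas.so.3   # select OpenBLAS as the system BLAS
> >>> sudo update-alternatives --config liblapack.so.3
> >>>
> >>> plus making sure netlib-java takes the native path, e.g. with -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS.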
> >>> Perhaps we should consider updating the mllib guide with a more complete section on enabling high-performance binaries on OSX and Linux? Or better, figure out a way for the system to fetch these automatically.
> >>>
> >>> - Evan
> >>>
> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>
> >>>> Just to summarize this thread, I was finally able to make all the performance comparisons that we discussed. It turns out that:
> >>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
> >>>>
> >>>> Below is the link to the spreadsheet with the full results.
> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
> >>>>
> >>>> One thing still needs exploration: does BIDMat-cublas perform copying to/from the machine's RAM?
> >>>>
> >>>> -----Original Message-----
> >>>> From: Ulanov, Alexander
> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
> >>>> To: Evan R. Sparks
> >>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though the original one discusses a slightly different topic. I was able to link netlib with the MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a 60MB library.
> >>>>
> >>>> |A*B size                 | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
> >>>> +-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
> >>>> |100x100*100x100          | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
> >>>> |1000x1000*1000x1000      | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
> >>>> |10000x10000*10000x10000  | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
> >>>>
> >>>> It turns out that the pre-compiled MKL is faster than the precompiled OpenBlas on my machine. Probably I'll add two more columns with locally compiled openblas and cuda.
> >>>>
> >>>> Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>> Sent: Monday, February 09, 2015 6:06 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
> >>>>
> >>>> It seems like this is going to be somewhat exploratory for a while (and there's probably only a handful of us who really care about fast linear algebra!)
> >>>>
> >>>> - Evan
> >>>>
> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>> Hi Evan,
> >>>>
> >>>> Thank you for the explanation and the useful link. I am going to build OpenBLAS, link it with Netlib-java and run the benchmark again.
> >>>>
> >>>> Do I understand correctly that the BIDMat binaries contain a statically linked Intel MKL BLAS? That might be the reason why I am able to run BIDMat without having MKL BLAS installed on my server.
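> >>>> (A quick way to verify the static linking - a sketch; the exact name of the BIDMat native library may differ from what I write here:
> >>>>
> >>>> ldd libbidmatcpu-linux-x86_64.so | grep -i mkl       # nothing: no dynamic MKL dependency
> >>>> strings libbidmatcpu-linux-x86_64.so | grep -ci mkl  # many hits: MKL symbols baked into the binary
> >>>>
> >>>> No libmkl*.so among the dynamic dependencies while MKL symbol names appear inside the binary itself would mean MKL is statically linked in.)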
> >>>> If that is true, I wonder if it is OK, given that Intel sells this library. Nevertheless, it seems that in my case the precompiled MKL performs better than the precompiled OpenBLAS, given that BIDMat and Netlib-java are supposed to be on par in terms of JNI overheads.
> >>>>
> >>>> Though it might be interesting to link Netlib-java with Intel MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (Netlib-java) would be interested in comparing their libraries.
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>> Sent: Friday, February 06, 2015 5:58 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware - this is often a very tricky process (see, e.g., ATLAS), but we found that on relatively modern Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL.
> >>>>
> >>>> To make sure the right library is getting used, you have to make sure it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas (the directory containing the library) will do the trick here.
> >>>>
> >>>> For some examples of getting netlib-java set up on an EC2 node and some example benchmarking code we ran a while back, see: https://github.com/shivaram/matrix-bench
> >>>>
> >>>> In particular, build-openblas-ec2.sh shows you how to build the library and set up symlinks correctly, and scala/run-netlib.sh shows you how to get the path set up and get that library picked up by netlib-java.
> >>>>
> >>>> In this way you could probably get cuBLAS set up to be used by netlib-java as well.
> >>>>
> >>>> - Evan
> >>>>
> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java to load the right blas? For netlib, there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can force it to use the Java implementation. I am not sure I understand how to force the use of a specific blas (as opposed to a specific wrapper for blas).
> >>>>
> >>>> Btw, I have installed openblas (yum install openblas), so I suppose that netlib is using it.
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>> Sent: Friday, February 06, 2015 5:19 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Getting breeze to pick up the right blas library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it).
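> >>>> In practice that looks something like this (a sketch; paths are illustrative):
> >>>>
> >>>> mkdir -p ~/blas
> >>>> ln -s /opt/OpenBLAS/lib/libopenblas.so ~/blas/libblas.so.3    # netlib-java's native_system loader looks for libblas
> >>>> ln -s /opt/OpenBLAS/lib/libopenblas.so ~/blas/liblapack.so.3  # OpenBLAS bundles LAPACK as well
> >>>> export LD_LIBRARY_PATH=~/blas:$LD_LIBRARY_PATH
> >>>> ./spark-shell --driver-java-options "-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS"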
> >>>> It might make sense to force BIDMat to use the same underlying BLAS library as well.
> >>>>
> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>> Hi Evan, Joseph,
> >>>>
> >>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than netlib-java+breeze:
> >>>>
> >>>> |A*B size                 | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> >>>> +-------------------------+-------------+-----------------------------------------------+----------------------------+
> >>>> |100x100*100x100          | 0,00205596  | 0,03810324                                    | 0,002556                   |
> >>>> |1000x1000*1000x1000      | 0,018320947 | 0,51803557                                    | 1,638475459                |
> >>>> |10000x10000*10000x10000  | 23,78046632 | 445,0935211                                   | 1569,233228                |
> >>>>
> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11.
> >>>>
> >>>> Later I will run tests with Cuda. I need to install a new Cuda version for this purpose.
> >>>>
> >>>> Do you have any ideas why breeze-netlib with native blas is so much slower than BIDMat MKL?
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Joseph Bradley [mailto:jos...@databricks.com]
> >>>> Sent: Thursday, February 05, 2015 5:29 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Evan R. Sparks; dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Hi Alexander,
> >>>>
> >>>> Using GPUs with Spark would be very exciting. Small comment: concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice).
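> >>>> Roughly the pattern I have in mind, as a Scala sketch (the copyToGpu / gpuGradientStep / copyFromGpu helpers are hypothetical, just to show where the transfers would sit):
> >>>>
> >>>> // one upload and one download per partition, many local iterations between them
> >>>> val updates = data.mapPartitions { part =>
> >>>>   val onGpu = copyToGpu(part.toArray)      // pay the PCIe cost once
> >>>>   var model = initialModel
> >>>>   for (_ <- 1 to localIters)
> >>>>     model = gpuGradientStep(model, onGpu)  // intermediates stay in GPU memory
> >>>>   Iterator(copyFromGpu(model))             // and once more on the way back
> >>>> }
> >>>> val newModel = updates.reduce(mergeModels)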
> >>>> Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking.
> >>>>
> >>>> Joseph
> >>>>
> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>> Thank you for the explanation! I've watched the BIDMach presentation by John Canny and I am really inspired by his talk and the comparisons with Spark MLlib.
> >>>>
> >>>> I am very interested to find out which will be better within Spark: BIDMat or netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I do benchmarks on artificial neural networks in batch mode. While it is not a "pure" test of linear algebra, it involves some other things that are essential to machine learning.
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>> Sent: Thursday, February 05, 2015 1:29 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout and fewer levels of indirection - it's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra. I know that in the past Canny has gone as far as writing custom GPU kernels for performance-critical regions of code. [1]
> >>>>
> >>>> BIDMach is highly optimized for single-node performance or performance on small clusters. [2] Once data doesn't fit easily in GPU memory (or can't be batched that way) the performance tends to fall off. Canny argues for hardware/software codesign and as such prefers machine configurations that are quite different from what we find in most commodity cluster nodes - e.g. 10 disk channels and 4 GPUs.
> >>>>
> >>>> In contrast, MLlib was designed for horizontal scalability on commodity clusters and works best on very big datasets - on the order of terabytes.
> >>>>
> >>>> For the most part, these projects developed concurrently to address slightly different use cases. That said, there may be bits of BIDMach we could repurpose for MLlib - keep in mind we need to be careful about maintaining cross-language compatibility for our Java and Python users, though.
> >>>>
> >>>> - Evan
> >>>>
> >>>> [1] http://arxiv.org/abs/1409.5402
> >>>> [2] http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
> >>>>
> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>> Hi Evan,
> >>>>
> >>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know what makes it faster than netlib-java?
> >>>>
> >>>> The same group has the BIDMach library that implements machine learning. For some examples they use the Caffe convolutional neural network library owned by another group in Berkeley. Could you elaborate on how all of these might be connected with Spark MLlib? If you take BIDMat for linear algebra, why don't you take BIDMach for optimization and learning?
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>> Sent: Thursday, February 05, 2015 12:09 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in many cases.
> >>>> You might consider taking a look at the codepaths that BIDMat (https://github.com/BIDData/BIDMat) takes and comparing them to netlib-java/breeze. John Canny et al. have done a bunch of work optimizing to make this work really fast from Scala. I've run it on my laptop, compared it to MKL, and in certain cases it's 10x faster at matrix multiply. There are a lot of layers of indirection here and you really want to avoid data copying as much as possible.
> >>>>
> >>>> We could also consider swapping out Breeze for BIDMat, but that would be a big project, and if we can figure out how to get breeze+cublas to comparable performance, that would be a big win.
> >>>>
> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>> Dear Spark developers,
> >>>>
> >>>> I am exploring how to make linear algebra operations faster within Spark. One way of doing this is to use the Scala Breeze library that is bundled with Spark. For matrix operations it employs Netlib-java, which has a Java wrapper for BLAS (basic linear algebra subprograms) and LAPACK native binaries, if they are available on the worker node. It also has its own optimized Java implementation of BLAS. It is worth mentioning that the native binaries provide better performance only for BLAS level 3, i.e. matrix-matrix operations or general matrix multiplication (GEMM). This is confirmed by the GEMM test on the Netlib-java page https://github.com/fommil/netlib-java. I also confirmed it in my experiments with training an artificial neural network https://github.com/apache/spark/pull/1290#issuecomment-70313952. However, I would like to boost performance even more.
> >>>>
> >>>> GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA implementation of BLAS, called cublas. I have one Linux server with an Nvidia GPU and I was able to do the following. I linked cublas (instead of a CPU-based blas) with the Netlib-java wrapper and put it into Spark, so Breeze/Netlib is using it. Then I did some performance measurements of artificial neural network batch learning in Spark MLlib, which involves matrix-matrix multiplications. It turns out that for matrices of size less than ~1000x780, GPU cublas has the same speed as CPU blas. Cublas becomes slower for bigger matrices. It is worth mentioning that this was not a test of ONLY multiplication, since other operations are involved. One of the reasons for the slowdown might be the overhead of copying the matrices from main memory to graphics card memory and back.
> >>>>
> >>>> So, a few questions:
> >>>> 1) Do these results with CUDA make sense?
> >>>> 2) If the problem is the copy overhead, are there any libraries that allow forcing intermediate results to stay in graphics card memory, thus removing the overhead?
> >>>> 3) Any other options to speed up linear algebra in Spark?
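> >>>> (A back-of-the-envelope estimate for question 2, assuming ~6 GB/s effective PCIe bandwidth and ~1 Tflop/s double-precision GPU throughput - both rough guesses: multiplying two 1000x1000 Double matrices means moving three 8 MB matrices, about 24 MB, i.e. ~4 ms of transfer, while the multiply itself is 2x10^9 flops, i.e. ~2 ms of compute. So the copies alone can cost more than the computation, which is consistent with the slowdown I see.)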
> >>>> Thank you, Alexander
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>>> For additional commands, e-mail: dev-h...@spark.apache.org
>
> --
> Best regards,
> Sam