Alex - great stuff, and the nvblas numbers are pretty remarkable (almost too good... did you check the results for correctness? Also, is it possible that the "unified memory model" of nvblas is somehow hiding PCI transfer time?)
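For the correctness question, even a quick spot check against a naive multiply would settle it - a minimal sketch along these lines (Breeze in the spark-shell; the size and the expected error bound are ballpark):

import breeze.linalg.{DenseMatrix, max}
import breeze.numerics.abs

val n = 512
val a = DenseMatrix.rand(n, n)
val b = DenseMatrix.rand(n, n)
val c = a * b  // dispatched through netlib (nvblas, if LD_PRELOAD took effect)

// naive O(n^3) reference multiply in pure Scala
val ref = DenseMatrix.zeros[Double](n, n)
for (i <- 0 until n; j <- 0 until n) {
  var s = 0.0
  var k = 0
  while (k < n) { s += a(i, k) * b(k, j); k += 1 }
  ref(i, j) = s
}

println("max abs error: " + max(abs(c - ref)))  // should be ~1e-10 or smaller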
this last bit (getting nvblas + netlib-java to play together) sounds like it's non-trivial and took you a while to figure out! Would you mind posting a gist or something of maybe the shell scripts/exports you used to make this work - I can imagine it being highly useful for others in the future. Thanks! Evan

On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> Hi again,
>
> I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBlas or MKL might be a better choice. This correlates with the original nvblas presentation at GPU Tech Conf 2013 (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>
> My results: https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Just in case: these tests are not meant as a generalization about the performance of the different libraries. I just want to pick the library that does dense matrix multiplication best for my task.
>
> P.S. My previous issue with nvblas was the following: it provides Fortran blas functions, while netlib-java uses C cblas functions. So one needs a cblas shared library to use nvblas through netlib-java. Fedora does not ship cblas (Debian and Ubuntu do), so I had to compile it myself. I could not use the cblas from Atlas or OpenBlas because they link to their own implementations and not to Fortran blas.
>
> Best regards, Alexander
>
> -----Original Message-----
> From: Ulanov, Alexander
> Sent: Tuesday, March 24, 2015 6:57 PM
> To: Sam Halliday
> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
> Hi,
>
> I am trying to use nvblas with netlib-java from Spark. The nvblas functions should replace the current blas function calls after setting LD_PRELOAD, as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java. It seems to work for a simple Java example, but I cannot make it work with Spark. I run the following:
>
> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
>
> In nvidia-smi I observe that Java is set up to use the GPU:
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |=============================================================================|
> |    0      8873     C  bash                                           39MiB  |
> |    0      8910     C  /usr/lib/jvm/java-1.7.0/bin/java               39MiB  |
> +-----------------------------------------------------------------------------+
>
> In the Spark shell I do a matrix multiplication and see the following:
>
> 15/03/25 06:48:01 INFO JniLoader: successfully loaded /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>
> So I am sure that netlib-native is loaded and cblas is supposedly used. However, the matrix multiplication executes on the CPU, since I see 16% CPU usage and 0% GPU usage. I also checked different matrix sizes, from 100x100 to 12000x12000.
>
> Could you suggest why LD_PRELOAD might not affect the Spark shell?
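> For reference, the "simple Java example" that does pick up nvblas is essentially just a direct dgemm call through netlib-java - a minimal sketch (written in Scala here so it can also be pasted into the spark-shell; the size is arbitrary):
>
> import com.github.fommil.netlib.BLAS
>
> val n = 4096
> val a = Array.fill(n * n)(math.random)
> val b = Array.fill(n * n)(math.random)
> val c = new Array[Double](n * n)
> // column-major GEMM: c = 1.0 * a * b + 0.0 * c
> BLAS.getInstance.dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
> println(c(0)) // keep the result alive
>
> While it runs, nvidia-smi shows GPU utilization when nvblas has been picked up.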
> Best regards, Alexander
>
> From: Sam Halliday [mailto:sam.halli...@gmail.com]
> Sent: Monday, March 09, 2015 6:01 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
> Thanks so much for following up on this!
>
> Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...
>
> On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:
> Hi Everyone, I've updated the benchmark as Xiangrui suggested. I added a comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for Double in the current source code) and ran the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Best regards, Alexander
>
> -----Original Message-----
> From: Sam Halliday [mailto:sam.halli...@gmail.com]
> Sent: Tuesday, March 03, 2015 1:54 PM
> To: Xiangrui Meng; Joseph Bradley
> Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> BTW, is anybody on this list going to the London Meetup in a few weeks?
>
> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>
> Would be nice to meet other people working on the guts of Spark! :-)
>
> Xiangrui Meng <men...@gmail.com> writes:
>
> > Hey Alexander,
> >
> > I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java?
> >
> > CC'ed Sam, the author of netlib-java.
> >
> > Best,
> > Xiangrui
> >
> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:
> >> Better documentation for linking would be very helpful! Here's a JIRA:
> >> https://issues.apache.org/jira/browse/SPARK-6019
> >>
> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
> >>
> >>> Thanks for compiling all the data and running these benchmarks, Alex. The big takeaways here can be seen with this chart:
> >>>
> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
> >>>
> >>> 1) A properly configured GPU matrix multiply implementation (e.g. BIDMat+GPU) can provide a substantial (but less than an order of magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or netlib-java+openblas-compiled).
> >>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref) can be 1-2 orders of magnitude worse than a well-tuned CPU implementation, particularly for larger matrices. This is not to pick on netlib - this basically agrees with the author's own benchmarks (https://github.com/fommil/netlib-java).
> >>>
> >>> I think that most of our users are in a situation where using GPUs may not be practical - although we could consider having a good GPU backend available as an option. However, *ALL* users of MLlib could benefit (potentially tremendously) from using a well-tuned CPU-based BLAS implementation.
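> >>> For concreteness, "well-tuned" currently means something along these lines on a Debian-flavored Linux box (a sketch - package names and paths vary by distro):
> >>>
> >>> sudo apt-get install libopenblas-base            # or build OpenBLAS from source
> >>> sudo update-alternatives --config libblas.so.3   # select OpenBLAS as the system BLAS
> >>> sudo update-alternatives --config liblapack.so.3
> >>>
> >>> plus making sure netlib-java takes the native path, e.g. with -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS.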
> >>> Perhaps we should consider updating the mllib guide with a more complete section on enabling high-performance binaries on OSX and Linux? Or better, figure out a way for the system to fetch these automatically.
> >>>
> >>> - Evan
> >>>
> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>
> >>>> Just to summarize this thread, I was finally able to make all the performance comparisons that we discussed. It turns out that:
> >>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
> >>>>
> >>>> Below is the link to the spreadsheet with the full results.
> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
> >>>>
> >>>> One thing still needs exploration: does BIDMat-cublas perform copying to/from the machine's RAM?
> >>>>
> >>>> -----Original Message-----
> >>>> From: Ulanov, Alexander
> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
> >>>> To: Evan R. Sparks
> >>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though the original one discusses a slightly different topic. I was able to link netlib with the MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a 60MB library.
> >>>>
> >>>> |A*B size                 | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
> >>>> +-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
> >>>> |100x100*100x100          | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
> >>>> |1000x1000*1000x1000      | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
> >>>> |10000x10000*10000x10000  | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
> >>>>
> >>>> It turns out that the pre-compiled MKL is faster than the precompiled OpenBlas on my machine. Probably I'll add two more columns with locally compiled openblas and cuda.
> >>>>
> >>>> Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>> Sent: Monday, February 09, 2015 6:06 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
> >>>>
> >>>> It seems like this is going to be somewhat exploratory for a while (and there's probably only a handful of us who really care about fast linear algebra!)
> >>>>
> >>>> - Evan
> >>>>
> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>> Hi Evan,
> >>>>
> >>>> Thank you for the explanation and the useful link. I am going to build OpenBLAS, link it with Netlib-java and run the benchmark again.
> >>>>
> >>>> Do I understand correctly that the BIDMat binaries contain a statically linked Intel MKL BLAS? That might be the reason why I am able to run BIDMat without having MKL BLAS installed on my server.
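> >>>> (A quick way to verify the static linking - a sketch; the exact name of the BIDMat native library may differ from what I write here:
> >>>>
> >>>> ldd libbidmatcpu-linux-x86_64.so | grep -i mkl       # nothing: no dynamic MKL dependency
> >>>> strings libbidmatcpu-linux-x86_64.so | grep -ci mkl  # many hits: MKL symbols baked into the binary
> >>>>
> >>>> No libmkl*.so among the dynamic dependencies while MKL symbol names appear inside the binary itself would mean MKL is statically linked in.)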
> >>>> If that is true, I wonder if it is OK, given that Intel sells this library. Nevertheless, it seems that in my case the precompiled MKL performs better than the precompiled OpenBLAS, given that BIDMat and Netlib-java are supposed to be on par in terms of JNI overheads.
> >>>>
> >>>> Though it might be interesting to link Netlib-java with Intel MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (Netlib-java) would be interested in comparing their libraries.
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>> Sent: Friday, February 06, 2015 5:58 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware - this is often a very tricky process (see, e.g., ATLAS), but we found that on relatively modern Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL.
> >>>>
> >>>> To make sure the right library is getting used, you have to make sure it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas (the directory containing the library) will do the trick here.
> >>>>
> >>>> For some examples of getting netlib-java set up on an EC2 node and some example benchmarking code we ran a while back, see: https://github.com/shivaram/matrix-bench
> >>>>
> >>>> In particular, build-openblas-ec2.sh shows you how to build the library and set up symlinks correctly, and scala/run-netlib.sh shows you how to get the path set up and get that library picked up by netlib-java.
> >>>>
> >>>> In this way you could probably get cuBLAS set up to be used by netlib-java as well.
> >>>>
> >>>> - Evan
> >>>>
> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java to load the right blas? For netlib, there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can force it to use the Java implementation. I am not sure I understand how to force the use of a specific blas (as opposed to a specific wrapper for blas).
> >>>>
> >>>> Btw, I have installed openblas (yum install openblas), so I suppose that netlib is using it.
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>> Sent: Friday, February 06, 2015 5:19 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Getting breeze to pick up the right blas library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it).
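> >>>> In practice that looks something like this (a sketch; paths are illustrative):
> >>>>
> >>>> mkdir -p ~/blas
> >>>> ln -s /opt/OpenBLAS/lib/libopenblas.so ~/blas/libblas.so.3    # netlib-java's native_system loader looks for libblas
> >>>> ln -s /opt/OpenBLAS/lib/libopenblas.so ~/blas/liblapack.so.3  # OpenBLAS bundles LAPACK as well
> >>>> export LD_LIBRARY_PATH=~/blas:$LD_LIBRARY_PATH
> >>>> ./spark-shell --driver-java-options "-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS"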
> >>>> It might make sense to force BIDMat to use the same underlying BLAS library as well.
> >>>>
> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>> Hi Evan, Joseph,
> >>>>
> >>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than netlib-java+breeze:
> >>>>
> >>>> |A*B size                 | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> >>>> +-------------------------+-------------+-----------------------------------------------+----------------------------+
> >>>> |100x100*100x100          | 0,00205596  | 0,03810324                                    | 0,002556                   |
> >>>> |1000x1000*1000x1000      | 0,018320947 | 0,51803557                                    | 1,638475459                |
> >>>> |10000x10000*10000x10000  | 23,78046632 | 445,0935211                                   | 1569,233228                |
> >>>>
> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11.
> >>>>
> >>>> Later I will run tests with Cuda. I need to install a new Cuda version for this purpose.
> >>>>
> >>>> Do you have any ideas why breeze-netlib with native blas is so much slower than BIDMat MKL?
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Joseph Bradley [mailto:jos...@databricks.com]
> >>>> Sent: Thursday, February 05, 2015 5:29 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Evan R. Sparks; dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Hi Alexander,
> >>>>
> >>>> Using GPUs with Spark would be very exciting. Small comment: concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice).
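> >>>> Roughly the pattern I have in mind, as a Scala sketch (the copyToGpu / gpuGradientStep / copyFromGpu helpers are hypothetical, just to show where the transfers would sit):
> >>>>
> >>>> // one upload and one download per partition, many local iterations between them
> >>>> val updates = data.mapPartitions { part =>
> >>>>   val onGpu = copyToGpu(part.toArray)      // pay the PCIe cost once
> >>>>   var model = initialModel
> >>>>   for (_ <- 1 to localIters)
> >>>>     model = gpuGradientStep(model, onGpu)  // intermediates stay in GPU memory
> >>>>   Iterator(copyFromGpu(model))             // and once more on the way back
> >>>> }
> >>>> val newModel = updates.reduce(mergeModels)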
> >>>> Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking.
> >>>>
> >>>> Joseph
> >>>>
> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>> Thank you for the explanation! I've watched the BIDMach presentation by John Canny and I am really inspired by his talk and the comparisons with Spark MLlib.
> >>>>
> >>>> I am very interested to find out which will be better within Spark: BIDMat or netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I do benchmarks on artificial neural networks in batch mode. While it is not a "pure" test of linear algebra, it involves some other things that are essential to machine learning.
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>> Sent: Thursday, February 05, 2015 1:29 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout and fewer levels of indirection - it's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra. I know that in the past Canny has gone as far as writing custom GPU kernels for performance-critical regions of code. [1]
> >>>>
> >>>> BIDMach is highly optimized for single-node performance or performance on small clusters. [2] Once data doesn't fit easily in GPU memory (or can't be batched that way) the performance tends to fall off. Canny argues for hardware/software codesign and as such prefers machine configurations that are quite different from what we find in most commodity cluster nodes - e.g. 10 disk channels and 4 GPUs.
> >>>>
> >>>> In contrast, MLlib was designed for horizontal scalability on commodity clusters and works best on very big datasets - on the order of terabytes.
> >>>>
> >>>> For the most part, these projects developed concurrently to address slightly different use cases. That said, there may be bits of BIDMach we could repurpose for MLlib - keep in mind we need to be careful about maintaining cross-language compatibility for our Java and Python users, though.
> >>>>
> >>>> - Evan
> >>>>
> >>>> [1] http://arxiv.org/abs/1409.5402
> >>>> [2] http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
> >>>>
> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>> Hi Evan,
> >>>>
> >>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know what makes it faster than netlib-java?
> >>>>
> >>>> The same group has the BIDMach library that implements machine learning. For some examples they use the Caffe convolutional neural network library owned by another group in Berkeley. Could you elaborate on how all of these might be connected with Spark MLlib? If you take BIDMat for linear algebra, why don't you take BIDMach for optimization and learning?
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>> Sent: Thursday, February 05, 2015 12:09 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in many cases.
> >>>> You might consider taking a look at the codepaths that BIDMat (https://github.com/BIDData/BIDMat) takes and comparing them to netlib-java/breeze. John Canny et al. have done a bunch of work optimizing to make this work really fast from Scala. I've run it on my laptop, compared it to MKL, and in certain cases it's 10x faster at matrix multiply. There are a lot of layers of indirection here and you really want to avoid data copying as much as possible.
> >>>>
> >>>> We could also consider swapping out Breeze for BIDMat, but that would be a big project, and if we can figure out how to get breeze+cublas to comparable performance, that would be a big win.
> >>>>
> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>> Dear Spark developers,
> >>>>
> >>>> I am exploring how to make linear algebra operations faster within Spark. One way of doing this is to use the Scala Breeze library that is bundled with Spark. For matrix operations it employs Netlib-java, which has a Java wrapper for BLAS (basic linear algebra subprograms) and LAPACK native binaries, if they are available on the worker node. It also has its own optimized Java implementation of BLAS. It is worth mentioning that the native binaries provide better performance only for BLAS level 3, i.e. matrix-matrix operations or general matrix multiplication (GEMM). This is confirmed by the GEMM test on the Netlib-java page https://github.com/fommil/netlib-java. I also confirmed it in my experiments with training an artificial neural network https://github.com/apache/spark/pull/1290#issuecomment-70313952. However, I would like to boost performance even more.
> >>>>
> >>>> GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA implementation of BLAS, called cublas. I have one Linux server with an Nvidia GPU and I was able to do the following. I linked cublas (instead of a CPU-based blas) with the Netlib-java wrapper and put it into Spark, so Breeze/Netlib is using it. Then I did some performance measurements of artificial neural network batch learning in Spark MLlib, which involves matrix-matrix multiplications. It turns out that for matrices of size less than ~1000x780, GPU cublas has the same speed as CPU blas. Cublas becomes slower for bigger matrices. It is worth mentioning that this was not a test of ONLY multiplication, since other operations are involved. One of the reasons for the slowdown might be the overhead of copying the matrices from main memory to graphics card memory and back.
> >>>>
> >>>> So, a few questions:
> >>>> 1) Do these results with CUDA make sense?
> >>>> 2) If the problem is the copy overhead, are there any libraries that allow forcing intermediate results to stay in graphics card memory, thus removing the overhead?
> >>>> 3) Any other options to speed up linear algebra in Spark?
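> >>>> (A back-of-the-envelope estimate for question 2, assuming ~6 GB/s effective PCIe bandwidth and ~1 Tflop/s double-precision GPU throughput - both rough guesses: multiplying two 1000x1000 Double matrices means moving three 8 MB matrices, about 24 MB, i.e. ~4 ms of transfer, while the multiply itself is 2x10^9 flops, i.e. ~2 ms of compute. So the copies alone can cost more than the computation, which is consistent with the slowdown I see.)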
> >>>> Thank you, Alexander
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>>> For additional commands, e-mail: dev-h...@spark.apache.org
>
> --
> Best regards,
> Sam