Hi Rajesh,
FYI, we are developing our own version of BIDMach integration with
Spark, and achieving large gains over Spark MLlib for both CPU and GPU
computation. You can find the project here:
https://github.com/BIDData/BIDMach_Spark
I'm not sure I follow your comment: "However, I think comparing
end-to-end results would not be appropriate as we are affected by
Spark's runtime costs; specifically, a single Spark function to convert
RDD to arrays is very expensive and impacts our end-to-end performance
severely (from 200+ gain for the GPU kernel to 25+ for the Spark library
function). In contrast, BIDMach has a very light and efficient layer
between their GPU kernel and the user program."
RDDs can be defined over any base class. In our Spark implementation, we
use RDDs of our matrix objects and our own high-performance SerDe to
pull data directly from HDFS, wrapped as a SequenceFile RDD. We get
within 30% of the performance of BIDMach running on the native
filesystem; e.g. for MNIST digit data, we get 300-500 MB/s per node from
the filesystem. Spark helps with caching and with data and code
distribution, but we substitute our own code for the compute- and
I/O-intensive parts and process data in large chunks to get
near-roofline throughput overall. That said, I don't understand why it
would take any longer to convert any other RDD format (once) to an RDD
of matrices than to read the first RDD; converting and saving the latter
with binary lz4 should be much faster than reading text or
Java-serialized data. That read will always limit performance, but I
don't see why you wouldn't store the data in HDFS or the Spark
filesystem in the faster format.
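To make that concrete, here is a rough sketch (not our actual SerDe; the
block size, packing and codec choice are just illustrative) of converting
a generic RDD of float rows into packed binary blocks and saving them
once as an lz4-compressed SequenceFile, e.g. from the spark-shell:

    import java.nio.ByteBuffer
    import org.apache.hadoop.io.{BytesWritable, LongWritable}
    import org.apache.hadoop.io.compress.Lz4Codec
    import org.apache.spark.rdd.RDD

    // Pack groups of rows into contiguous byte blocks and write them once as
    // an lz4-compressed SequenceFile; later passes then read fixed-size
    // binary blocks that deserialize directly into matrices.
    def saveAsBinaryBlocks(rows: RDD[Array[Float]], ncols: Int, path: String): Unit = {
      val blocks = rows.mapPartitionsWithIndex { (part, it) =>
        it.grouped(10000).zipWithIndex.map { case (chunk, i) =>
          val buf = ByteBuffer.allocate(4 * ncols * chunk.size)
          chunk.foreach(row => row.foreach(v => buf.putFloat(v)))
          (new LongWritable((part.toLong << 32) | i), new BytesWritable(buf.array))
        }
      }
      blocks.saveAsSequenceFile(path, Some(classOf[Lz4Codec]))
    }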
Some things to keep in mind when using GPUs for machine learning in general:
1. Dense BLAS account for a relatively small amount of overall data
analysis in a web company (speaking as a half-timer at Yahoo right now).
Sparse BLAS are far more important, and we've spent most of our time
developing and improving support for sparse computation on GPUs. It's a
lesser-known fact that GPUs have a significant main-memory speed
advantage, which typically gives an order-of-magnitude speedup for the
sparse operations that dominate most algorithms.
2. To get full performance from the GPU you should do virtually all the
computation there: dense/sparse matrix operations, tensor operations,
random number generation, transcendentals, slicing/dicing, math
operators, sorting, merging, etc. We have all those primitives,
implemented for both CPU and GPU, dense and sparse matrices, and single
and double precision (while GPUs have significantly slower
double-precision arithmetic in most cases, *memory bandwidth* dominates
sparse computation, so sparse GPU double-precision arithmetic often has
a *larger* speedup over CPUs than single precision), and many of them
for integer and long element types. You can forget the CPU/GPU boundary
if you think of the GPU as "the computer" and the rest of the machine as
the I/O system.
3. We have found that a high-level API (in the style of numpy or Breeze,
rather than BLAS-level) is very important for programmer productivity.
BIDMach now has a very large suite of learning algorithms, and the cost
to develop, test and deploy them is very low. Virtually all algorithms
are written against abstract matrix types and operations and have been
tested to work on CPU or GPU, sparse or dense inputs, and single- or
double-precision arithmetic. With BIDMach_Spark we are working to erase
the boundary between single-node and cluster execution by running a
separate thread that synchronizes the local model in the background.
4. Memory management is a headache for GPUs. You can always do explicit
storage management, but complex machine learning algorithms are, well,
complex, and it's a distraction to have to figure out the lifetimes of
dozens of matrix objects. The alternative is to take a piece of Matlab
or scipy code with "a = b * c" operations and run it as is. We strongly
advocate the latter, which requires a memory management strategy. We
have found that caching works great for iterative ML algorithms, whereas
any simple version of GC doesn't. BIDMach was built from the ground up
using this caching scheme. I'm not sure how you're handling this, since
Spark was written assuming a GC to manage Breeze matrices.
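To illustrate the kind of caching I mean in point 4 (a toy sketch only,
not BIDMach's actual scheme, which keys cached results more finely on the
operands and operator), the idea is to recycle result matrices by shape,
so an iterative "a = b * c" loop allocates each temporary exactly once:

    import scala.collection.mutable

    // Toy matrix cache: iterative ML loops create the same-shaped temporaries
    // on every pass, so recycling by shape avoids both explicit lifetime
    // management and GC / cudaMalloc churn.
    object MatCache {
      private val pool = mutable.HashMap.empty[(Int, Int, String), Array[Float]]

      def result(nrows: Int, ncols: Int, tag: String): Array[Float] =
        pool.getOrElseUpdate((nrows, ncols, tag), new Array[Float](nrows * ncols))
    }

    // A hypothetical elementwise multiply that reuses its output buffer
    // across iterations instead of allocating a fresh matrix each time.
    def emult(a: Array[Float], b: Array[Float], nrows: Int, ncols: Int): Array[Float] = {
      val out = MatCache.result(nrows, ncols, "emult")
      var i = 0
      while (i < out.length) { out(i) = a(i) * b(i); i += 1 }
      out
    }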
Lastly, the single most important factor for improving Spark performance
is fast distributed model updates. Spark has very limited support for
this (via aggregation on the driver node), and it is already the
bottleneck for many Spark algorithms. In our experience, Spark's batch
algorithms cannot match the statistical efficiency (number of passes
over the data for a given accuracy) of a state-of-the-art minibatch
implementation such as VW or BIDMach, e.g. for logistic regression on
newsgroup classification. We have tried with Spark's SGD logistic
regression and LBFGS, but have not been able to get close. It would be
good to see what accuracies you are finding. Most competing systems,
e.g. parameter-server systems, are largely optimized to do such
minibatch model updates. We recently published a rooflined sparse
allreduce, which you can think of as an optimized parameter server that
uses the original nodes to synchronize models. It's a pure-Java
implementation that gives near-network-limit performance on EC2 and
allows distributed model updates on a timescale of fractions of a second:
Huasha Zhao and John Canny, *Kylix: A Sparse Allreduce for Commodity
Clusters*, Proc. Int. Conference on Parallel Processing (ICPP 2014).
PDF: http://www.cs.berkeley.edu/%7Ejfc/papers/14/Kylix.pdf
This is not integrated with our BIDMach_Spark system (yet), but that's
the main goal of the project. With it in place, Spark should be able to
achieve near-optimal multi-node speedups for most state-of-the-art ML
algorithms. So far we have implemented a batch algorithm (K-Means),
which doesn't require minibatch updating, to validate I/O and model and
code distribution in Spark (this requires every BIDMach class to be
serializable, including all the GPU matrix classes, which we have done);
it gives the numbers I mentioned above.
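To make the minibatch-update pattern concrete, here is a rough
single-node sketch of minibatch SGD for logistic regression; the
syncModel call marks where a distributed implementation would run a
sparse allreduce on the model every fraction of a second (it is only a
placeholder here, not Kylix's API):

    // Minibatch SGD for binary logistic regression on dense arrays (the real
    // feature matrices would usually be sparse). syncModel marks where a
    // distributed implementation would exchange model updates.
    def train(data: Seq[(Array[Double], Double)],    // (features, label in {0,1})
              dim: Int, batchSize: Int, lr: Double, passes: Int): Array[Double] = {
      val w = new Array[Double](dim)
      def syncModel(model: Array[Double]): Unit = () // allreduce placeholder

      for (_ <- 1 to passes; batch <- data.grouped(batchSize)) {
        val grad = new Array[Double](dim)
        for ((x, y) <- batch) {
          val margin = (0 until dim).map(i => w(i) * x(i)).sum
          val p = 1.0 / (1.0 + math.exp(-margin))          // predicted probability
          for (i <- 0 until dim) grad(i) += (p - y) * x(i) // logistic gradient
        }
        for (i <- 0 until dim) w(i) -= lr * grad(i) / batch.size
        syncModel(w)  // one (distributed) model update per minibatch
      }
      w
    }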
-John
On 1/22/2016 7:47 AM, Rajesh Bordawekar wrote:
Hi Alexander,
We, at IBM Watson Research, are also working on GPU acceleration of
Spark, but we have taken an approach that is complementary to
Ishizaki-san's direction. Our focus is to develop runtime
infrastructure to enable multi-node, multi-GPU exploitation in the
Spark environment. The key goal of our approach is to enable
**transparent** invocation of GPUs, without requiring the user to
change a single line of code. Users may need to add a Spark
configuration flag to direct the system's GPU usage (the exact
semantics are currently being debated).
Currently, we have LBFGS-based logistic regression model building and
prediction implemented in a multi-node, multi-GPU environment (the
model building is done on a single node). We are using our own
implementation of LBFGS as the baseline for the GPU code. The GPU code
uses cuBLAS (I presume that's what you meant by NVBLAS) wherever
possible, and indeed, we arrange the execution so that cuBLAS operates
on larger matrices. We are using JNI to invoke CUDA from Scala and we
have not seen any performance degradation due to JNI-based invocation.
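As a rough illustration of why we batch (sketched here with Breeze
rather than our actual GPU code): computing the logistic gradient for a
whole minibatch as matrix products gives the underlying BLAS, or cuBLAS
behind it, large operands to work on instead of one small dot product
per example:

    import breeze.linalg.{DenseMatrix, DenseVector}
    import breeze.numerics.sigmoid

    // Logistic-regression gradient over a whole minibatch, phrased as two
    // matrix-vector products instead of a loop of per-example dot products.
    // Larger operands give the underlying BLAS much more work per call.
    def logisticGradient(X: DenseMatrix[Double],   // batchSize x numFeatures
                         y: DenseVector[Double],   // labels in {0, 1}
                         w: DenseVector[Double]): DenseVector[Double] = {
      val preds = sigmoid(X * w)                   // all predictions at once
      (X.t * (preds - y)) / X.rows.toDouble        // average gradient
    }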
We are in the process of implementing an ADMM-based distributed
optimization function, which would build the model in parallel (it
currently uses LBFGS as its per-node kernel, which can be replaced by
any other kernel). The ADMM function would also be accelerated in a
multi-node, multi-GPU environment. We are planning to shift to
Datasets/DataFrames soon and to support other logistic regression
kernels such as quasi-Newton-based approaches.
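For context, consensus ADMM alternates a local solve on each node with a
global averaging step and a dual update; a minimal sketch of the loop
(illustrative only, with the per-node LBFGS solve abstracted as a
callback, not our actual implementation) looks like this:

    // Consensus ADMM skeleton. localSolve(i, target, rho) should approximately
    // minimize f_i(x) + (rho/2)*||x - target||^2 on node i, e.g. with a few
    // LBFGS iterations; z is the consensus model, u the scaled duals.
    def admm(numNodes: Int, dim: Int, rho: Double, iters: Int,
             localSolve: (Int, Array[Double], Double) => Array[Double]): Array[Double] = {
      val x = Array.fill(numNodes)(new Array[Double](dim))   // per-node models
      val u = Array.fill(numNodes)(new Array[Double](dim))   // scaled dual variables
      var z = new Array[Double](dim)                         // consensus model

      for (_ <- 1 to iters) {
        // x-update: each node pulls toward z - u_i while fitting its own data
        for (i <- 0 until numNodes)
          x(i) = localSolve(i, Array.tabulate(dim)(j => z(j) - u(i)(j)), rho)

        // z-update: average of x_i + u_i (plus a proximal step if regularized)
        z = Array.tabulate(dim)(j =>
          (0 until numNodes).map(i => x(i)(j) + u(i)(j)).sum / numNodes)

        // u-update: dual ascent on the consensus constraint x_i = z
        for (i <- 0 until numNodes; j <- 0 until dim) u(i)(j) += x(i)(j) - z(j)
      }
      z
    }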
We have also enabled the Spark MLlib ALS algorithm to run on a
multi-node, multi-GPU system (the ALS code also uses cuBLAS/cuSPARSE).
Next, we will be covering additional functions for GPU exploitation,
e.g. word2vec (CBOW and skip-gram with negative sampling), GloVe, etc.
Regarding comparison to BIDMat/BIDMach, we have studied it in detail
and have been using it as a guide on integrating GPU code with Scala.
However, I think comparing end-to-end results would not be appropriate
as we are affected by Spark's runtime costs; specifically, a single
Spark function to convert RDD to arrays is very expensive and impacts
our end-to-end performance severely (from 200+ gain for the GPU kernel
to 25+ for the Spark library function). In contrast, BIDMach has a
very light and efficient layer between their GPU kernel and the user
program.
Finally, we are building a comprehensive multi-node, multi-GPU resource
management and discovery component in Spark. We are planning to
augment the existing Spark resource management UI to include GPU
resources.
Please let me know if you have questions/comments! I will be attending
Spark Summit East and can meet in person to discuss any details.
-regards,
Rajesh
----- Forwarded by Randy Swanberg/Austin/IBM on 01/21/2016 09:31 PM -----
From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Kazuaki Ishizaki <ishiz...@jp.ibm.com>, "dev@spark.apache.org"
<dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>
Cc: John Canny <ca...@berkeley.edu>, "Evan R. Sparks"
<evan.spa...@gmail.com>, Xiangrui Meng <men...@gmail.com>, Sam
Halliday <sam.halli...@gmail.com>
Date: 01/21/2016 01:16 PM
Subject: RE: Using CUDA within Spark / boosting linear algebra
------------------------------------------------------------------------
Hi Kazuaki,
Indeed, moving data to/from the GPU is costly, and this benchmark
summarizes the costs of moving different data sizes with regard to
matrix multiplication. These costs are paid for the convenience of
using the standard BLAS API that Nvidia NVBLAS provides. The point is
that no code changes are required in Spark; one just needs to reference
the BLAS implementation with a system variable. Naturally, a
hardware-specific implementation will always be faster than the default.
The benchmark results show that by comparing jCuda (by means of BIDMat)
with NVBLAS. However, they also show that it is worth using NVBLAS for
large matrices, because it can take advantage of several GPUs and will
be faster despite the copying overhead. That is also a known point
advertised by Nvidia.
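For anyone who wants to check which BLAS Spark actually picked up, a
quick probe through netlib-java (which MLlib's linear algebra goes
through) can be run in the spark-shell; the JVM property in the comment
is netlib-java's standard switch, but treat the exact NVBLAS setup
(nvblas.conf, library preloading) as something to verify against
Nvidia's documentation:

    import com.github.fommil.netlib.BLAS

    // F2jBLAS = the pure-Java fallback; NativeSystemBLAS = the system library
    // referenced by the environment, which is where NVBLAS can be slotted in.
    println(s"BLAS in use: ${BLAS.getInstance().getClass.getName}")

    // JVM switch (check the netlib-java docs for details):
    //   -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS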
By the way, I don't think that the column- vs. row-major format is an
issue, because one can use transposed matrices to fit the required
format. I believe that is just a software preference.
My suggestion with regard to your prototype would be to compare it with
Spark's implementation of logistic regression (which does not take
advantage of GPUs) and also with BIDMach's (which does). That would give
users a better understanding of your implementation's performance.
Currently you compare it with Spark's example logistic regression
implementation, which is intended as a reference for learning Spark
rather than as a performance benchmark.
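For reference, the non-example baseline I mean is MLlib's
LogisticRegressionWithLBFGS, along these lines (spark-shell sketch; the
data path is only a placeholder):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.util.MLUtils

    // Spark's production LBFGS-based logistic regression (not the examples/
    // code), run from the spark-shell where `sc` is already defined.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/data").cache()  // placeholder path
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(data)
    println(s"weights: ${model.weights}")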
Best regards, Alexander
------------------------------------------------------
Rajesh R. Bordawekar
Research Staff Member
IBM T. J. Watson Research Center
bor...@us.ibm.com
Office: 914-945-2097