Hi Rajesh,
FYI, we are developing our own version of BIDMach integration with Spark and achieving large gains over Spark MLlib for both CPU and GPU computation. You can find the project here:
https://github.com/BIDData/BIDMach_Spark

I'm not sure I follow your comment: "However, I think comparing end-to-end results would not be appropriate as we are affected by Spark's runtime costs; specifically, a single Spark function to convert RDD to arrays is very expensive and impacts our end-to-end performance severely (from 200+ gain for the GPU kernel to 25+ for the Spark library function). In contrast, BIDMach has a very light and efficient layer between their GPU kernel and the user program."

RDDs can be defined over any base class. In our Spark implementation, we use RDDs of our matrix objects and our own high-performance SerDe to pull data directly from HDFS, wrapped as a SequenceFile RDD. We get within 30% of the performance of BIDMach running on the native filesystem; e.g., for MNIST digit data, we get 300-500 GB/s per node from the filesystem. Spark helps with caching and with data and code distribution, but we substitute our own code for the compute- or I/O-intensive parts and process data in large chunks to get near-roofline throughput overall.

That said, I don't understand why it would take any longer to convert any other RDD format (once) to an RDD of matrices than to read the first RDD; converting and saving the latter with binary lz4 should be much faster than reading text or Java-serialized data. That read will always be a limit on performance, but I don't see why you wouldn't store data in HDFS or the Spark filesystem in the faster format.
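As a concrete illustration of the convert-once idea, here is a minimal sketch that uses only stock Spark APIs and made-up HDFS paths; it is not the BIDMach_Spark code itself (Spark object files are Java-serialized, whereas our reader uses its own SerDe and matrix blocks, which is where the remaining speed difference comes from):

import org.apache.spark.{SparkConf, SparkContext}

object ConvertOnce {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("convert-once")
      .set("spark.io.compression.codec", "lz4")  // lz4 for Spark's internal compressed blocks
    val sc = new SparkContext(conf)

    // One-time pass: parse the text data into (label, features) pairs and save a binary copy.
    sc.textFile("hdfs:///data/mnist.txt")
      .map { line =>
        val fields = line.split(' ')
        (fields.head.toDouble, fields.tail.map(_.toDouble))
      }
      .saveAsObjectFile("hdfs:///data/mnist-bin")

    // Every later job reads the binary copy instead of re-parsing text.
    val train = sc.objectFile[(Double, Array[Double])]("hdfs:///data/mnist-bin").cache()
    println(train.count())
    sc.stop()
  }
}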

Some things to keep in mind when using GPUs for machine learning in general:

1. Dense BLAS account for a relatively small share of overall data analysis at a web company (speaking as a half-timer at Yahoo right now). Sparse BLAS are far more important, and we've spent most of our time developing and improving support for sparse computation on GPUs. It's a lesser-known fact that GPUs have a significant main-memory speed advantage, which typically gives an order-of-magnitude advantage for the sparse operations that dominate most algorithms.

2. To get full performance from the GPU you should do virtually all the computation there: dense and sparse matrix operations, tensor operations, random number generation, transcendentals, slicing/dicing, math operators, sorting, merging, etc. We have all those primitives, implemented for both CPU and GPU, dense and sparse matrices, single and double precision (while GPUs have significantly slower double-precision arithmetic in most cases, *memory bandwidth* dominates sparse computation, so sparse GPU double-precision arithmetic often has a *larger* speedup over CPUs than single precision), and many for integer and long fields. You can forget the CPU/GPU boundary if you think of the GPU as "the computer" and the rest of the machine as the I/O system.

3. We have found that a high-level API (in the style of numpy or Breeze, rather than at the BLAS level) is very important for programmer productivity. BIDMach now has a very large suite of learning algorithms, and the cost to develop, test and deploy them is very low. Virtually all algorithms are written with abstract matrix types and operations and have been tested to work on CPU or GPU, with sparse or dense inputs, in single- or double-precision arithmetic. With BIDMach_Spark we are working to erase the boundary between single-node and cluster execution by running a separate thread that synchronizes the local model in the background.

4. Memory management is a headache for GPUs. You can always do explicit storage management, but complex machine learning algorithms are, well, complex, and it's a distraction to have to figure out the lifetimes of dozens of matrix objects. The alternative is to take a piece of Matlab or SciPy code with "a = b * c" operations and run it as-is. We strongly advocate the latter, which requires a memory-management strategy. We have found that caching works great for iterative ML algorithms, whereas any simple version of GC doesn't; a sketch of the caching idea follows below. BIDMach was built from the ground up using this caching scheme. I'm not sure how you're handling this, since Spark was written assuming a GC to manage Breeze matrices.
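Here is a minimal sketch of the caching idea in point 4, with made-up class names rather than BIDMach's actual implementation; the point is only that the output of an operator is keyed by the GUIDs of its inputs plus an op code, so in an iterative algorithm the same temporary is reused on every pass instead of being churned through the garbage collector:

import scala.collection.mutable

// Cache of operator results, keyed by (input GUIDs, op code). Hypothetical sketch only.
object MatCache {
  private val cache = mutable.Map.empty[(Long, Long, Int), DenseMat]
  def result(aGuid: Long, bGuid: Long, op: Int, rows: Int, cols: Int): DenseMat =
    cache.getOrElseUpdate((aGuid, bGuid, op), new DenseMat(rows, cols))
}

class DenseMat(val rows: Int, val cols: Int) {
  val guid: Long = DenseMat.nextGuid()
  val data = new Array[Float](rows * cols)

  def *(that: DenseMat): DenseMat = {
    // 1 is the (made-up) op code for multiply; the cached output is reused on every call
    // with the same operands, so "a = b * c" inside a training loop allocates only once.
    val out = MatCache.result(this.guid, that.guid, 1, rows, that.cols)
    // ... multiply this.data by that.data into out.data, on CPU or GPU ...
    out
  }
}

object DenseMat {
  private var counter = 0L
  def nextGuid(): Long = synchronized { counter += 1; counter }
}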

Lastly, the single most important factor in improving Spark performance is fast distributed model updates. Spark has very limited support for this (via aggregation on the driver node), and even that is already the bottleneck for many Spark algorithms. In our experience, it's not possible with Spark's batch algorithms to get statistical efficiency (number of passes over the data for a given accuracy) comparable to a state-of-the-art minibatch implementation such as VW or BIDMach, e.g. for logistic regression on newsgroup classification. We have tried with Spark's SGD logistic regression and LBFGS, but have not been able to get close. It would be good to see what accuracies you are finding. Most competing systems, e.g. parameter-server systems, are largely optimized to do such minibatch model updates.

We recently published a rooflined sparse allreduce, which you can think of as an optimized parameter server that uses the original nodes to synchronize models. It's a pure-Java implementation that achieves near-network-limit performance on EC2 and allows distributed model updates on a timescale of fractions of a second:

Huasha Zhao and John Canny, "Kylix: A Sparse Allreduce for Commodity Clusters," Proc. Int. Conference on Parallel Processing (ICPP 2014). PDF: http://www.cs.berkeley.edu/%7Ejfc/papers/14/Kylix.pdf

This is not integrated with our BIDMach_Spark system (yet), but that's the main goal of the project. With it in place, Spark should be able to achieve near-optimal multi-node speedups for most state-of-the-art ML algorithms. So far we have implemented a batch algorithm (k-means), which doesn't require minibatch updating, to validate I/O and model and code distribution in Spark (this requires every BIDMach class to be serializable, including all the GPU matrix classes, which we have done); it gives the numbers I mentioned above.
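To be concrete about the minibatch pattern discussed above, here is a minimal sketch using stock Spark APIs, with a placeholder per-example gradient function (it is not BIDMach's or MLlib's code). Note that every update pays for a broadcast of the model plus a treeAggregate back to the driver; that per-update synchronization is exactly what a rooflined sparse allreduce like Kylix replaces with direct node-to-node exchange:

import org.apache.spark.rdd.RDD

object MinibatchSketch {
  // data: (label, features) pairs; grad(w, y, x) returns the per-example gradient.
  def minibatchSgd(data: RDD[(Double, Array[Double])], dim: Int,
                   steps: Int, lr: Double, fraction: Double,
                   grad: (Array[Double], Double, Array[Double]) => Array[Double]): Array[Double] = {
    val w = new Array[Double](dim)
    for (step <- 0 until steps) {
      val wb = data.sparkContext.broadcast(w)
      // One minibatch: sample a fraction of the data and sum per-example gradients.
      val (gSum, count) = data.sample(false, fraction, step)
        .treeAggregate((new Array[Double](dim), 0L))(
          { case ((g, c), (y, x)) =>
              val gi = grad(wb.value, y, x)
              var j = 0; while (j < dim) { g(j) += gi(j); j += 1 }
              (g, c + 1)
          },
          { case ((g1, c1), (g2, c2)) =>
              var j = 0; while (j < dim) { g1(j) += g2(j); j += 1 }
              (g1, c1 + c2)
          })
      // Model update on the driver; the next step re-broadcasts the updated model.
      if (count > 0) { var j = 0; while (j < dim) { w(j) -= lr * gSum(j) / count; j += 1 } }
      wb.unpersist()
    }
    w
  }
}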

-John


On 1/22/2016 7:47 AM, Rajesh Bordawekar wrote:

Hi Alexander,

We, at IBM Watson Research, are also working on GPU acceleration of Spark, but we have taken an approach that is complementary to Ishizaki-san's direction. Our focus is to develop runtime infrastructure to enable multi-node, multi-GPU exploitation in the Spark environment. The key goal of our approach is to enable **transparent** invocation of GPUs, without requiring the user to change a single line of code. Users may need to add a Spark configuration flag to direct the system's GPU usage (the exact semantics are currently being debated).

Currently, we have LBFGS-based logistic regression model building and prediction implemented in a multi-node, multi-GPU environment (the model building is done on a single node). We are using our own implementation of LBFGS as the baseline for the GPU code. The GPU code uses cuBLAS (I presume that's what you meant by NVBLAS) wherever possible, and indeed, we arrange the execution so that cuBLAS operates on larger matrices. We are using JNI to invoke CUDA from Scala, and we have not seen any performance degradation due to JNI-based invocation.
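A minimal sketch of this JNI pattern, with made-up names rather than the actual code; the Scala side only declares the native method and loads the shared library, and the C/CUDA side implements Java_GpuKernels_logisticGradient (the cuBLAS calls and kernel launches live entirely on the native side):

class GpuKernels {
  System.loadLibrary("gpukernels")  // idempotent; resolves libgpukernels.so on java.library.path

  // Accumulates the logistic-loss gradient for a minibatch into grad, computed on the GPU.
  @native def logisticGradient(w: Array[Double], labels: Array[Double],
                               features: Array[Double], grad: Array[Double]): Unit
}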

We are in the process of implementing an ADMM-based distributed optimization function, which would build the model in parallel (it currently uses LBFGS as the per-node kernel, which can be replaced by any other kernel). The ADMM function would also be accelerated in a multi-node multi-user environment. We are planning to shift to Datasets/DataFrames soon and to support other logistic regression kernels such as quasi-Newton-based approaches.
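A minimal sketch of the consensus-ADMM loop described here, with a placeholder localSolve standing in for the per-node LBFGS kernel (this is a sketch of the general scheme, not the actual implementation): each data block minimizes f_i(x) + (rho/2)*||x - (z - u_i)||^2, the driver averages the results into the consensus variable z, and the dual variables u are updated locally.

import org.apache.spark.rdd.RDD

object AdmmSketch {
  // blocks: one local dataset per block, keyed by block ids 0..nBlocks-1.
  // localSolve(data, target, rho) should minimize f_i(x) + (rho/2)*||x - target||^2.
  def admm[D](blocks: RDD[(Int, D)], dim: Int, rho: Double, iters: Int,
              localSolve: (D, Array[Double], Double) => Array[Double]): Array[Double] = {
    val sc = blocks.sparkContext
    val nBlocks = blocks.count().toInt
    var z = new Array[Double](dim)
    var u = Array.fill(nBlocks)(new Array[Double](dim))
    for (_ <- 0 until iters) {
      val zb = sc.broadcast(z)
      val ub = sc.broadcast(u)
      // x-update: each block solves its local subproblem (e.g. with LBFGS).
      val xs = blocks.map { case (i, d) =>
        val target = zb.value.zip(ub.value(i)).map { case (zj, uj) => zj - uj }
        (i, localSolve(d, target, rho))
      }.collect().sortBy(_._1).map(_._2)
      // z-update: average of (x_i + u_i); a proximal step for any regularizer would go here.
      z = Array.tabulate(dim)(j => (0 until nBlocks).map(i => xs(i)(j) + u(i)(j)).sum / nBlocks)
      // u-update: u_i <- u_i + x_i - z.
      u = Array.tabulate(nBlocks)(i => Array.tabulate(dim)(j => u(i)(j) + xs(i)(j) - z(j)))
      zb.unpersist(); ub.unpersist()
    }
    z
  }
}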

We have also enabled the Spark MLlib ALS algorithm to run on a multi-node, multi-GPU system (the ALS code also uses cuBLAS/cuSPARSE). Next, we will cover additional functions for GPU exploitation, e.g., word2vec (CBOW and skip-gram with negative sampling), GloVe, etc.

Regarding comparison to BIDMat/BIDMach, we have studied it in detail and have been using it as a guide on integrating GPU code with Scala. However, I think comparing end-to-end results would not be appropriate as we are affected by Spark's runtime costs; specifically, a single Spark function to convert RDD to arrays is very expensive and impacts our end-to-end performance severely (from 200+ gain for the GPU kernel to 25+ for the Spark library function). In contrast, BIDMach has a very light and efficient layer between their GPU kernel and the user program.

Finally, we are building a comprehensive multi-node, multi-GPU resource management and discovery component in Spark. We are planning to augment the existing Spark resource management UI to include GPU resources.

Please let me know if you have questions/comments! I will be attending Spark Summit East and can meet in person to discuss any details.

-regards,
Rajesh


----- Forwarded by Randy Swanberg/Austin/IBM on 01/21/2016 09:31 PM -----

From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Kazuaki Ishizaki <ishiz...@jp.ibm.com>, "dev@spark.apache.org" <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>
Cc: John Canny <ca...@berkeley.edu>, "Evan R. Sparks" <evan.spa...@gmail.com>, Xiangrui Meng <men...@gmail.com>, Sam Halliday <sam.halli...@gmail.com>
Date: 01/21/2016 01:16 PM
Subject: RE: Using CUDA within Spark / boosting linear algebra

------------------------------------------------------------------------



Hi Kazuaki,

Indeed, moving data to/from the GPU is costly, and this benchmark summarizes those costs for different data sizes with regard to matrix multiplication. These costs are the price paid for the convenience of using the standard BLAS API that Nvidia's NVBLAS provides. The point is that no code changes are required (in Spark); one just needs to reference the BLAS implementation with a system variable. Naturally, a hardware-specific implementation will always be faster than the default. The benchmark results show that by comparing jCuda (by means of BIDMat) and NVBLAS. However, they also show that it is worth using NVBLAS for large matrices, because it can take advantage of several GPUs and will be faster despite the copying overhead. That is also a known point advertised by Nvidia.

By the way, I don't think that the column-major/row-major layout is an issue, because one can use transposed matrices to fit the required format. I believe that is just a software preference.

My suggestion with regard to your prototype would be to compare it with Spark's implementation of logistic regression (which does not take advantage of the GPU) and also with BIDMach's (which does take advantage of GPUs). That will give users a better understanding of your implementation's performance. Currently you compare it with Spark's example logistic regression implementation, which is intended as a reference for learning Spark rather than as a performance benchmark.

Best regards, Alexander


------------------------------------------------------
Rajesh R. Bordawekar
Research Staff Member
IBM T. J. Watson Research Center
bor...@us.ibm.com
Office: 914-945-2097
