Hi Rajesh,
FYI, we are developing our own version of BIDMach integration with
Spark, and achieving large gains over Spark MLlib for both CPU and GPU
computation. You can find the project here:
https://github.com/BIDData/BIDMach_Spark
I'm not sure I follow your comment: "However, I think comparing
end-to-end results would not be appropriate as we are affected by
Spark's runtime costs; specifically, a single Spark function to convert
RDD to arrays is very expensive and impacts our end-to-end performance
severely (from 200+ gain for the GPU kernel to 25+ for the Spark library
function). In contrast, BIDMach has a very light and efficient layer
between their GPU kernel and the user program."
RDDs can be defined over any base class. In our Spark implementation, we
use RDDs of our matrix objects and our own high-performance SerDe to
pull data directly from HDFS, wrapped as a SequenceFile RDD. We get
within 30% of the performance of BIDMach running on the native
filesystem; e.g. for MNIST digit data, we get 300-500 MB/s per node from
the filesystem. Spark helps with caching and with data and code
distribution, but we substitute our own code for the compute- and
I/O-intensive parts and process data in large chunks to get
near-roofline throughput overall. That said, I don't understand why it
would take any longer to convert any other RDD format (once) to an RDD
of matrices than to read the first RDD; converting and saving the latter
with binary lz4 should be much faster than reading text or
Java-serialized data. That read will always limit performance, but I
don't see why you wouldn't store the data in HDFS or the Spark
filesystem in the faster format.
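To make that concrete, here is a rough sketch (not our actual SerDe; the
block size, packing and codec choice are just illustrative) of converting
a generic RDD of float rows into packed binary blocks and saving them
once as an lz4-compressed SequenceFile, e.g. from the spark-shell:

    import java.nio.ByteBuffer
    import org.apache.hadoop.io.{BytesWritable, LongWritable}
    import org.apache.hadoop.io.compress.Lz4Codec
    import org.apache.spark.rdd.RDD

    // Pack groups of rows into contiguous byte blocks and write them once as
    // an lz4-compressed SequenceFile; later passes then read fixed-size
    // binary blocks that deserialize directly into matrices.
    def saveAsBinaryBlocks(rows: RDD[Array[Float]], ncols: Int, path: String): Unit = {
      val blocks = rows.mapPartitionsWithIndex { (part, it) =>
        it.grouped(10000).zipWithIndex.map { case (chunk, i) =>
          val buf = ByteBuffer.allocate(4 * ncols * chunk.size)
          chunk.foreach(row => row.foreach(v => buf.putFloat(v)))
          (new LongWritable((part.toLong << 32) | i), new BytesWritable(buf.array))
        }
      }
      blocks.saveAsSequenceFile(path, Some(classOf[Lz4Codec]))
    }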
Some things to keep in mind when using GPUs for machine learning in general:
1. Dense BLAS account for a relatively small amount of overall data
analysis in a web company (speaking as a half-timer at Yahoo right now).
Sparse BLAS are far more important, and we've spent most of our time
developing and improving support for sparse computation on GPUs. It's a
lesser-known fact that GPUs have a significant main-memory speed
advantage, which typically gives an order-of-magnitude speedup for the
sparse operations that dominate most algorithms.
2. To get full performance from the GPU you should do virtually all the
computation there: dense/sparse matrix operations, tensor operations,
random number generation, transcendentals, slicing/dicing, math
operators, sorting, merging, etc. We have all those primitives,
implemented for both CPU and GPU, dense and sparse matrices, and single
and double precision (while GPUs have significantly slower
double-precision arithmetic in most cases, *memory bandwidth* dominates
sparse computation, so sparse GPU double-precision arithmetic often has
a *larger* speedup over CPUs than single precision), and many of them
for integer and long element types. You can forget the CPU/GPU boundary
if you think of the GPU as "the computer" and the rest of the machine as
the I/O system.
3. We have found that a high-level API (in the style of numpy or Breeze,
rather than BLAS-level) is very important for programmer productivity.
BIDMach now has a very large suite of learning algorithms, and the cost
to develop, test and deploy them is very low. Virtually all algorithms
are written against abstract matrix types and operations and have been
tested to work on CPU or GPU, sparse or dense inputs, and single- or
double-precision arithmetic. With BIDMach_Spark we are working to erase
the boundary between single-node and cluster execution by running a
separate thread that synchronizes the local model in the background.
4. Memory management is a headache for GPUs. You can always do explicit
storage management, but complex machine learning algorithms are, well,
complex, and it's a distraction to have to figure out the lifetimes of
dozens of matrix objects. The alternative is to take a piece of Matlab
or scipy code with "a = b * c" operations and run it as is. We strongly
advocate the latter, which requires a memory management strategy. We
have found that caching works great for iterative ML algorithms, whereas
any simple version of GC doesn't. BIDMach was built from the ground up
using this caching scheme. I'm not sure how you're handling this, since
Spark was written assuming a GC to manage Breeze matrices.
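To illustrate the kind of caching I mean in point 4 (a toy sketch only,
not BIDMach's actual scheme, which keys cached results more finely on the
operands and operator), the idea is to recycle result matrices by shape,
so an iterative "a = b * c" loop allocates each temporary exactly once:

    import scala.collection.mutable

    // Toy matrix cache: iterative ML loops create the same-shaped temporaries
    // on every pass, so recycling by shape avoids both explicit lifetime
    // management and GC / cudaMalloc churn.
    object MatCache {
      private val pool = mutable.HashMap.empty[(Int, Int, String), Array[Float]]

      def result(nrows: Int, ncols: Int, tag: String): Array[Float] =
        pool.getOrElseUpdate((nrows, ncols, tag), new Array[Float](nrows * ncols))
    }

    // A hypothetical elementwise multiply that reuses its output buffer
    // across iterations instead of allocating a fresh matrix each time.
    def emult(a: Array[Float], b: Array[Float], nrows: Int, ncols: Int): Array[Float] = {
      val out = MatCache.result(nrows, ncols, "emult")
      var i = 0
      while (i < out.length) { out(i) = a(i) * b(i); i += 1 }
      out
    }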
Lastly, the single most important factor for improving Spark performance
is fast distributed model updates. Spark has very limited support for
this (via aggregation on the driver node), and it is already the
bottleneck for many Spark algorithms. In our experience, Spark's batch
algorithms cannot match the statistical efficiency (number of passes
over the data for a given accuracy) of a state-of-the-art minibatch
implementation such as VW or BIDMach, e.g. for logistic regression on
newsgroup classification. We have tried with Spark's SGD logistic
regression and LBFGS, but have not been able to get close. It would be
good to see what accuracies you are finding. Most competing systems,
e.g. parameter-server systems, are largely optimized to do such
minibatch model updates. We recently published a rooflined sparse
allreduce, which you can think of as an optimized parameter server that
uses the original nodes to synchronize models. It's a pure-Java
implementation that gives near-network-limit performance on EC2 and
allows distributed model updates on a timescale of fractions of a second:
Huasha Zhao and John Canny, *Kylix: A Sparse Allreduce for Commodity
Clusters*, Proc. Int. Conference on Parallel Processing (ICPP 2014).
PDF: http://www.cs.berkeley.edu/%7Ejfc/papers/14/Kylix.pdf
This is not integrated with our BIDMach_Spark system (yet), but that's
the main goal of the project. With it in place, Spark should be able to
achieve near-optimal multi-node speedups for most state-of-the-art ML
algorithms. So far we have implemented a batch algorithm (K-Means),
which doesn't require minibatch updating, to validate I/O and model and
code distribution in Spark (this requires every BIDMach class to be
serializable, including all the GPU matrix classes, which we have done);
it gives the numbers I mentioned above.
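To make the minibatch-update pattern concrete, here is a rough
single-node sketch of minibatch SGD for logistic regression; the
syncModel call marks where a distributed implementation would run a
sparse allreduce on the model every fraction of a second (it is only a
placeholder here, not Kylix's API):

    // Minibatch SGD for binary logistic regression on dense arrays (the real
    // feature matrices would usually be sparse). syncModel marks where a
    // distributed implementation would exchange model updates.
    def train(data: Seq[(Array[Double], Double)],    // (features, label in {0,1})
              dim: Int, batchSize: Int, lr: Double, passes: Int): Array[Double] = {
      val w = new Array[Double](dim)
      def syncModel(model: Array[Double]): Unit = () // allreduce placeholder

      for (_ <- 1 to passes; batch <- data.grouped(batchSize)) {
        val grad = new Array[Double](dim)
        for ((x, y) <- batch) {
          val margin = (0 until dim).map(i => w(i) * x(i)).sum
          val p = 1.0 / (1.0 + math.exp(-margin))          // predicted probability
          for (i <- 0 until dim) grad(i) += (p - y) * x(i) // logistic gradient
        }
        for (i <- 0 until dim) w(i) -= lr * grad(i) / batch.size
        syncModel(w)  // one (distributed) model update per minibatch
      }
      w
    }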
-John
On 1/22/2016 7:47 AM, Rajesh Bordawekar wrote:
Hi Alexander,
We, at IBM Watson Research, are also working on GPU acceleration of
Spark, but we have taken an approach that is complementary to
Ishizaki-san's direction. Our focus is to develop runtime
infrastructure to enable multi-node, multi-GPU exploitation in the
Spark environment. The key goal of our approach is to enable
**transparent** invocation of GPUs, without requiring the user to
change a single line of code. Users may need to add a Spark
configuration flag to direct the system's GPU usage (the exact
semantics are currently being debated).
Currently, we have LBFGS-based logistic regression model building and
prediction implemented in a multi-node, multi-GPU environment (the
model building is done on a single node). We are using our own
implementation of LBFGS as the baseline for the GPU code. The GPU code
uses cuBLAS (I presume that's what you meant by NVBLAS) wherever
possible, and indeed, we arrange the execution so that cuBLAS operates
on larger matrices. We are using JNI to invoke CUDA from Scala and we
have not seen any performance degradation due to JNI-based invocation.
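As a rough illustration of why we batch (sketched here with Breeze
rather than our actual GPU code): computing the logistic gradient for a
whole minibatch as matrix products gives the underlying BLAS, or cuBLAS
behind it, large operands to work on instead of one small dot product
per example:

    import breeze.linalg.{DenseMatrix, DenseVector}
    import breeze.numerics.sigmoid

    // Logistic-regression gradient over a whole minibatch, phrased as two
    // matrix-vector products instead of a loop of per-example dot products.
    // Larger operands give the underlying BLAS much more work per call.
    def logisticGradient(X: DenseMatrix[Double],   // batchSize x numFeatures
                         y: DenseVector[Double],   // labels in {0, 1}
                         w: DenseVector[Double]): DenseVector[Double] = {
      val preds = sigmoid(X * w)                   // all predictions at once
      (X.t * (preds - y)) / X.rows.toDouble        // average gradient
    }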
We are in the process of implementing an ADMM-based distributed
optimization function, which would build the model in parallel (it
currently uses LBFGS as its per-node kernel, which can be replaced by
any other kernel). The ADMM function would also be accelerated in a
multi-node, multi-GPU environment. We are planning to shift to
Datasets/DataFrames soon and to support other logistic regression
kernels such as quasi-Newton-based approaches.
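For context, consensus ADMM alternates a local solve on each node with a
global averaging step and a dual update; a minimal sketch of the loop
(illustrative only, with the per-node LBFGS solve abstracted as a
callback, not our actual implementation) looks like this:

    // Consensus ADMM skeleton. localSolve(i, target, rho) should approximately
    // minimize f_i(x) + (rho/2)*||x - target||^2 on node i, e.g. with a few
    // LBFGS iterations; z is the consensus model, u the scaled duals.
    def admm(numNodes: Int, dim: Int, rho: Double, iters: Int,
             localSolve: (Int, Array[Double], Double) => Array[Double]): Array[Double] = {
      val x = Array.fill(numNodes)(new Array[Double](dim))   // per-node models
      val u = Array.fill(numNodes)(new Array[Double](dim))   // scaled dual variables
      var z = new Array[Double](dim)                         // consensus model

      for (_ <- 1 to iters) {
        // x-update: each node pulls toward z - u_i while fitting its own data
        for (i <- 0 until numNodes)
          x(i) = localSolve(i, Array.tabulate(dim)(j => z(j) - u(i)(j)), rho)

        // z-update: average of x_i + u_i (plus a proximal step if regularized)
        z = Array.tabulate(dim)(j =>
          (0 until numNodes).map(i => x(i)(j) + u(i)(j)).sum / numNodes)

        // u-update: dual ascent on the consensus constraint x_i = z
        for (i <- 0 until numNodes; j <- 0 until dim) u(i)(j) += x(i)(j) - z(j)
      }
      z
    }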
We have also enabled the Spark MLlib ALS algorithm to run on a
multi-node, multi-GPU system (the ALS code also uses cuBLAS/cuSPARSE).
Next, we will be covering additional functions for GPU exploitation,
e.g. word2vec (CBOW and skip-gram with negative sampling), GloVe, etc.
Regarding comparison to BIDMat/BIDMach, we have studied it in detail
and have been using it as a guide on integrating GPU code with Scala.
However, I think comparing end-to-end results would not be appropriate
as we are affected by Spark's runtime costs; specifically, a single
Spark function to convert RDD to arrays is very expensive and impacts
our end-to-end performance severely (from 200+ gain for the GPU kernel
to 25+ for the Spark library function). In contrast, BIDMach has a
very light and efficient layer between their GPU kernel and the user
program.
Finally, we are building a comprehensive multi-node, multi-GPU resource
management and discovery component in Spark. We are planning to
augment the existing Spark resource management UI to include GPU
resources.
Please let me know if you have questions/comments! I will be attending
Spark Summit East and can meet in person to discuss any details.
-regards,
Rajesh
----- Forwarded by Randy Swanberg/Austin/IBM on 01/21/2016 09:31 PM -----
From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Kazuaki Ishizaki <ishiz...@jp.ibm.com>, "dev@spark.apache.org"
<dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>
Cc: John Canny <ca...@berkeley.edu>, "Evan R. Sparks"
<evan.spa...@gmail.com>, Xiangrui Meng <men...@gmail.com>, Sam
Halliday <sam.halli...@gmail.com>
Date: 01/21/2016 01:16 PM
Subject: RE: Using CUDA within Spark / boosting linear algebra
------------------------------------------------------------------------
Hi Kazuaki,
Indeed, moving data to/from the GPU is costly, and this benchmark
summarizes the costs of moving different data sizes with regard to
matrix multiplication. These costs are paid for the convenience of
using the standard BLAS API that Nvidia NVBLAS provides. The point is
that no code changes are required in Spark; one just needs to reference
the BLAS implementation with a system variable. Naturally, a
hardware-specific implementation will always be faster than the default.
The benchmark results show that by comparing jCuda (by means of BIDMat)
with NVBLAS. However, they also show that it is worth using NVBLAS for
large matrices, because it can take advantage of several GPUs and will
be faster despite the copying overhead. That is also a known point
advertised by Nvidia.
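For anyone who wants to check which BLAS Spark actually picked up, a
quick probe through netlib-java (which MLlib's linear algebra goes
through) can be run in the spark-shell; the JVM property in the comment
is netlib-java's standard switch, but treat the exact NVBLAS setup
(nvblas.conf, library preloading) as something to verify against
Nvidia's documentation:

    import com.github.fommil.netlib.BLAS

    // F2jBLAS = the pure-Java fallback; NativeSystemBLAS = the system library
    // referenced by the environment, which is where NVBLAS can be slotted in.
    println(s"BLAS in use: ${BLAS.getInstance().getClass.getName}")

    // JVM switch (check the netlib-java docs for details):
    //   -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS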
By the way, I don't think that the column- vs. row-major format is an
issue, because one can use transposed matrices to fit the required
format. I believe that is just a software preference.
My suggestion with regard to your prototype would be to compare it with
Spark's implementation of logistic regression (which does not take
advantage of GPUs) and also with BIDMach's (which does). That would give
users a better understanding of your implementation's performance.
Currently you compare it with Spark's example logistic regression
implementation, which is intended as a reference for learning Spark
rather than as a performance benchmark.
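For reference, the non-example baseline I mean is MLlib's
LogisticRegressionWithLBFGS, along these lines (spark-shell sketch; the
data path is only a placeholder):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.util.MLUtils

    // Spark's production LBFGS-based logistic regression (not the examples/
    // code), run from the spark-shell where `sc` is already defined.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/data").cache()  // placeholder path
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(data)
    println(s"weights: ${model.weights}")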
Best regards, Alexander
------------------------------------------------------
Rajesh R. Bordawekar
Research Staff Member
IBM T. J. Watson Research Center
bor...@us.ibm.com
Office: 914-945-2097