I believe one of the higher-level goals of Spark MLlib should be to improve the efficiency of the ML algorithms that already exist. Currently MLlib has reasonable coverage of the important core algorithms, and the work to reach feature parity for the DataFrame-based API and model persistence is also important. Apache Spark needs to use higher-level BLAS3 and LAPACK routines instead of BLAS1 & BLAS2. For a long time we've used the concept of compute intensity (compute_intensity = FP_operations/Word) to help look at the performance of the underlying compute kernels (see the papers referenced below). It has been shown in many implementations that better performance, better scalability, and a huge reduction in memory pressure can be achieved by using higher-level BLAS3 or LAPACK routines, in both single-node and distributed computations.

I performed a survey of some of Apache Spark's ML algorithms. Unfortunately most of them are implemented with BLAS1 or BLAS2 routines, which have very low compute intensity. BLAS1 and BLAS2 routines require a lot more memory bandwidth and will not achieve peak performance on x86, GPUs, or any other processor.

Apache Spark 2.1.0 ML routines & BLAS routines:

* ALS (Alternating Least Squares matrix factorization): BLAS2 _SPR, _TPSV; BLAS1 _AXPY, _DOT, _SCAL, _NRM2
* Logistic regression classification: BLAS2 _GEMV; BLAS1 _DOT, _SCAL
* Generalized linear regression: BLAS1 _DOT
* Gradient-boosted tree regression: BLAS1 _DOT
* GraphX SVD++: BLAS1 _AXPY, _DOT, _SCAL
* Neural net multi-layer perceptron: BLAS3 _GEMM; BLAS2 _GEMV

Only the neural net multi-layer perceptron uses the BLAS3 matrix multiply (DGEMM). (The underscores are replaced by S, D, C, or Z for single-precision real, double-precision real, single-precision complex, and double-precision complex operations, respectively.) Refactoring the algorithms to use BLAS3 routines or higher-level LAPACK routines will require coding changes to use sub-block algorithms, but the performance benefits can be great.
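To make the compute-intensity argument concrete, here is a back-of-envelope sketch (plain Python, illustrative only, not Spark code) of flops-per-word for representative BLAS routines, under the simplifying assumption of a cold cache where each operand array is moved once. The exact constants depend on blocking and cache behavior; the point is the asymptotic shape.

```python
# Approximate compute intensity (FP operations per word of memory traffic)
# for one routine from each BLAS level, assuming each array is moved once.

def axpy_intensity(n):
    # BLAS1 DAXPY, y = a*x + y: 2n flops; reads x and y (2n words), writes y (n)
    return (2 * n) / (3 * n)

def gemv_intensity(n):
    # BLAS2 DGEMV, y = A*x + y: 2n^2 flops; moves A (n^2), x (n), y twice (2n)
    return (2 * n * n) / (n * n + 3 * n)

def gemm_intensity(n):
    # BLAS3 DGEMM, C = A*B + C: 2n^3 flops; moves A and B (2n^2), C read+write (2n^2)
    return (2 * n ** 3) / (4 * n * n)

for n in (100, 1000, 10000):
    print(n, round(axpy_intensity(n), 2),
             round(gemv_intensity(n), 2),
             round(gemm_intensity(n), 2))
```

BLAS1 and BLAS2 intensity stays O(1) no matter how large the problem is, so those kernels are bound by memory bandwidth; BLAS3 intensity grows as O(n), which is why sub-block (tiled) DGEMM-based algorithms can run near a processor's peak flop rate.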
More at: https://blogs.oracle.com/BestPerf/entry/improving_algorithms_in_spark_ml

Background:
Brad Carlile. "Parallelism, compute intensity, and data vectorization." SuperComputing '93, November 1993. https://blogs.oracle.com/BestPerf/resource/Carlile-app_compute-intensity-1993.pdf
John McCalpin. "Memory Bandwidth and Machine Balance in Current High Performance Computers." 1995. https://www.researchgate.net/publication/213876927_Memory_Bandwidth_and_Machine_Balance_in_Current_High_Performance_Computers
-- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-mission-and-goals-tp20715p20754.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.