Finally, we solved this problem by building our own netlib-java native .so files on CentOS. They now load without any warning, but the performance is still far below what we get on a MacBook Pro.
The matrix has 6778 rows and 2487 columns. The MBP takes 10 seconds to get the PCA result, but CentOS takes 110 s. Even the MBP with the pure-Java BLAS implementation takes only 40 s.

The code caches the input matrix in memory, and only a little over 200 kB of data is read by the shuffle. The only difference I can see at http://localhost:4040/stages/ is in this stage (values from CentOS):

Stage Id  Description                       Submitted            Duration  Tasks: Succeeded/Total  Shuffle Read  Shuffle Write
14        aggregate at RowMatrix.scala:211  2014/06/13 12:18:12  36 s      3/3

The duration of the same stage on the Mac is only 10 s.

So why does RowMatrix.scala perform so differently on the Mac and on CentOS? Is it related to the native BLAS implementation?

I then reduced the matrix to half its size, and the duration dropped to 2 s on the Mac and 12 s on CentOS.

Is there any benchmark available for isolating the problem to either MLlib or netlib-java?

The CPUs are a 3740QM on the Mac and an E5620 on CentOS:
http://ark.intel.com/compare/47925,70847

The log files are attached:
centos.log <http://apache-spark-user-list.1001560.n3.nabble.com/file/n7551/centos.log>
mac.log <http://apache-spark-user-list.1001560.n3.nabble.com/file/n7551/mac.log>

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Native-library-can-not-be-loaded-when-using-Mllib-PCA-tp7042p7551.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
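
On the benchmark question, one rough way to separate hardware/JVM effects from native BLAS effects is to time a pure-Java kernel of similar shape on both machines. The sketch below is only an illustration and is not taken from MLlib or netlib-java: the class name GramianBench and its methods are hypothetical, and it assumes that the slow stage is dominated by accumulating a Gramian (A^T A) over the rows, which I have not verified against RowMatrix.scala. If the Mac/CentOS ratio of this pure-Java timing is close to the ratio seen in the PCA job, the gap is probably not caused by the native library.

```java
import java.util.Random;

// Hypothetical micro-benchmark: pure-Java Gramian (A^T A) accumulation.
// It uses no native code, so it gives a hardware/JVM baseline only.
public class GramianBench {

    // Accumulate the n x n Gramian of an m x n matrix given as an array of rows.
    public static double[][] gramian(double[][] rows) {
        int n = rows[0].length;
        double[][] g = new double[n][n];
        for (double[] r : rows) {
            for (int i = 0; i < n; i++) {
                double ri = r[i];
                double[] gi = g[i];
                for (int j = 0; j < n; j++) {
                    gi[j] += ri * r[j];
                }
            }
        }
        return g;
    }

    // Build a dense m x n matrix of uniform random values with a fixed seed.
    public static double[][] randomMatrix(int m, int n, long seed) {
        Random rnd = new Random(seed);
        double[][] a = new double[m][n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                a[i][j] = rnd.nextDouble();
            }
        }
        return a;
    }

    // Time one Gramian computation and return the elapsed milliseconds.
    public static long timeMillis(int m, int n) {
        double[][] a = randomMatrix(m, n, 42L);
        long start = System.nanoTime();
        double[][] g = gramian(a);
        long elapsed = (System.nanoTime() - start) / 1_000_000L;
        // Touch the result so the JIT cannot discard the computation.
        if (Double.isNaN(g[0][0])) {
            throw new IllegalStateException("unexpected NaN");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Small defaults keep a first run short; pass "6778 2487" for the full size.
        int m = args.length > 0 ? Integer.parseInt(args[0]) : 1000;
        int n = args.length > 1 ? Integer.parseInt(args[1]) : 500;
        timeMillis(m, n); // warm-up run so the JIT has compiled the hot loop
        System.out.println(m + " x " + n + ": " + timeMillis(m, n) + " ms");
    }
}
```

Run it with the same JVM version and options on both machines, for example "java GramianBench 6778 2487". It may also be worth printing BLAS.getInstance().getClass().getName() from com.github.fommil.netlib on each machine to confirm which implementation netlib-java actually picked up; as far as I know that call exists, but please check it against the netlib-java README.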