Yes, it's intriguing, though as you say, not readily available in the wild yet. I would also expect native BLAS to outperform f2j, so the interesting question is whether this is a win over native code or not. I suppose the upside is that eventually we may expect this API to be available in all JVMs, not just those with native libraries added at runtime.
I wonder if a short-term goal would be to ensure that these calls are simply abstracted away, which they should already be, so that it's easy to plug in this new 'BLAS' implementation. I'm sure it's possible to load it selectively via reflection, as that's what the current libraries do. There may also be additional code paths that could benefit from these operations but don't use them yet.

On Tue, Dec 15, 2020 at 8:30 AM Ludovic Henry <luhe...@microsoft.com.invalid> wrote:

> Hello,
>
> I’ve, over the past few days, looked into using the new Vector API [1] to
> accelerate some BLAS operations straight from Java. You can find a gist at
> [2] containing most of the changes in
> mllib-local/src/main/scala/org/apache/spark/ml/linalg/BLAS.scala.
>
> To measure performance, I’ve added a BLASBenchmark.scala [3] at
> mllib-local/src/test/scala/org/apache/spark/ml/linalg/BLASBenchmark.scala.
> I do see some promising speedups, especially compared to F2jBLAS. I’ve
> unfortunately not been able to install OpenBLAS locally and compare
> performance to native, but I would still expect native to be faster,
> especially on large inputs. See [4] for some f2j vs vector performance
> comparison.
>
> The primary blocker is that the Vector API is only available in incubator
> mode, starting with JDK 16. We can have an easy run-time check whether we
> can use the Vectorized BLAS. But, to compile the Vectorized BLAS class, we
> need JDK 16+. Spark 3.0+ does compile with JDK 16 (it works locally), but I
> don’t know how to selectively compile sources based on the JDK version used
> at compile-time.
>
> But much more importantly, I want to get your feedback before I keep
> exploring this idea further. Technically, it is feasible, and we’ll observe
> speed up whenever the native BLAS is not installed. Moreover, I am solely
> focusing on ML/MLLib for now. However, there is still graphx (I haven’t
> checked if there is anything vectorizable) and even supporting more
> explicit use of the Vector API in catalyst, which is a much bigger project.
>
> Thank you,
>
> Ludovic Henry
>
> [1] https://openjdk.java.net/jeps/338
> [2] https://gist.github.com/luhenry/6b24ac146a110143ad31736caf7250e6#file-blas-scala
> [3] https://gist.github.com/luhenry/6b24ac146a110143ad31736caf7250e6#file-blasbenchmark-scala
> [4] https://gist.github.com/luhenry/6b24ac146a110143ad31736caf7250e6#file-f2j-vs-vector-log
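
To make the run-time check and the reflection-based loading discussed in this thread a bit more concrete, here is a minimal Java sketch. It is not taken from Spark or from the gist: the `Blas` interface, the `ScalarBlas` fallback, and the class name `example.VectorizedBlas` are all hypothetical and only stand in for the real abstractions. The one real name is `jdk.incubator.vector.DoubleVector`, which is only loadable on JDK 16+ when the incubator module is enabled. The assumption is that the vectorized class would be compiled separately against JDK 16+ and referenced by name only, so the rest of the code still compiles on older JDKs.

```java
public class BlasSelector {
    /** Hypothetical, minimal slice of a BLAS-style interface (illustration only). */
    public interface Blas {
        double ddot(int n, double[] x, double[] y);
    }

    /** Pure-Java fallback, standing in for an f2j-style implementation. */
    public static final class ScalarBlas implements Blas {
        @Override
        public double ddot(int n, double[] x, double[] y) {
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                sum += x[i] * y[i];
            }
            return sum;
        }
    }

    /** Run-time check: is the incubating Vector API visible to this JVM? */
    public static boolean vectorApiAvailable() {
        try {
            Class.forName("jdk.incubator.vector.DoubleVector");
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /**
     * Tries to instantiate the named vectorized implementation via reflection,
     * so there is no compile-time dependency on it. Falls back to ScalarBlas
     * if the Vector API or the named class is missing or unusable.
     */
    public static Blas load(String vectorizedClassName) {
        if (vectorApiAvailable()) {
            try {
                Class<?> cls = Class.forName(vectorizedClassName);
                return (Blas) cls.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
                // Fall through to the scalar implementation.
            }
        }
        return new ScalarBlas();
    }

    public static void main(String[] args) {
        // "example.VectorizedBlas" is a hypothetical class name.
        Blas blas = load("example.VectorizedBlas");
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {4.0, 5.0, 6.0};
        System.out.println(blas.getClass().getSimpleName() + " ddot = " + blas.ddot(3, x, y));
    }
}
```

With a layout along these lines, callers only ever see the `Blas` interface, so swapping in a vectorized (or native) implementation would not touch the call sites. How to compile the vectorized class only on JDK 16+ remains the open build question raised in the original message.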