Hi PICCA,

On 2019-05-24 12:01, PICCA Frederic-Emmanuel wrote:
> What about IBM POWER9 with pocl?
>
> It seems that this is better than the latest NVIDIA GPU.
The typical workload in neural-network training consists of linear operations such as general matrix-matrix multiplication (GEMM) and convolution. I know nothing about pocl, but it's hard for a CPU to beat a GPU on these highly parallelizable linear operations. Try a 4096x4096 matrix multiplication and you will easily see the difference.

For example, with my CPU, an i5-7440HQ (a mid-range mobile chip), and my GPU, an Nvidia 940MX (junk), the junk GPU (CUDA) is roughly 100x faster than the CPU (MKL):

~ ❯❯❯ optirun ipython3
Python 3.7.3 (default, Apr 3 2019, 05:39:12)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch as th

In [2]: x = th.rand(4096, 4096)

In [3]: %time x@x
CPU times: user 1.65 s, sys: 38.7 ms, total: 1.69 s
Wall time: 449 ms
Out[3]:
tensor([[1015.7596, 1004.2767, 1001.6245,  ..., 1026.8447,  996.3105, 1002.7847],
        [1047.8833, 1014.3856, 1020.8246,  ..., 1055.3224, 1021.6126, 1031.0334],
        [1049.3168, 1027.7637, 1030.9961,  ..., 1054.3218, 1015.3804, 1031.6709],
        ...,
        [1039.6516, 1024.6678, 1021.1326,  ..., 1047.0674, 1015.1402, 1029.5969],
        [1020.1988,  994.0073, 1005.5823,  ..., 1015.6786,  990.2491, 1008.1358],
        [1022.9388,  991.9886,  990.4608,  ..., 1013.9000,  998.8676, 1007.8554]])

In [4]: x = x.cuda()

In [5]: %time x@x
CPU times: user 1.1 ms, sys: 174 µs, total: 1.27 ms
Wall time: 2.67 ms
Out[5]:
tensor([[1015.7591, 1004.2764, 1001.6254,  ..., 1026.8447,  996.3105, 1002.7841],
        [1047.8838, 1014.3846, 1020.8243,  ..., 1055.3209, 1021.6123, 1031.0328],
        [1049.3174, 1027.7644, 1030.9971,  ..., 1054.3210, 1015.3800, 1031.6727],
        ...,
        [1039.6511, 1024.6686, 1021.1323,  ..., 1047.0674, 1015.1404, 1029.5974],
        [1020.1982,  994.0067, 1005.5826,  ..., 1015.6784,  990.2482, 1008.1347],
        [1022.9395,  991.9879,  990.4588,  ..., 1013.9014,  998.8687, 1007.8544]],
       device='cuda:0')
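One caveat on the GPU number above: PyTorch launches CUDA kernels asynchronously, so %time can return before the multiplication has actually finished and under-report the wall time. If anyone wants to reproduce this fairly, something like the following sketch (the warm-up call and timing loop are mine, not from the transcript above) synchronizes before reading the clock:

import time
import torch as th

x = th.rand(4096, 4096)

# CPU timing: the matmul runs synchronously on the host (MKL/OpenBLAS,
# depending on how PyTorch was built).
t0 = time.perf_counter()
y = x @ x
t1 = time.perf_counter()
print(f"CPU: {t1 - t0:.4f} s")

if th.cuda.is_available():
    xg = x.cuda()
    xg @ xg                  # warm-up: the first call pays one-time setup costs
    th.cuda.synchronize()    # make sure the warm-up has actually finished
    t0 = time.perf_counter()
    yg = xg @ xg
    th.cuda.synchronize()    # CUDA launches are async; wait for the result
    t1 = time.perf_counter()
    print(f"GPU: {t1 - t0:.4f} s")

Even measured this way, a GPU should win by a wide margin on a matmul of this size; the synchronization just keeps the comparison honest.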