AWS CPU Instance Numpy Benchmarking Roundup
Update: Intel MKL 2020.1 has disabled the debug mode, so the `MKL_DEBUG_CPU_TYPE` trick below no longer works. rip. :(
In this post, we are going to focus on comparing AWS CPU instances for machine learning workloads, specifically numpy. Yes, numpy. It is common knowledge nowadays that GPUs handle machine learning and deep learning workloads much faster than CPU instances. However, depending on your workload, most of the preprocessing, postprocessing, and I/O will still rely on CPU performance, and in Python that usually means numpy.
There are reports indicating that AMD's EPYC 7000 series runs much faster than Intel's Xeon 8100 series in floating-point calculations.
Setup
However, local workstation performance might not translate onto AWS, especially for numpy. So here, we will compare Intel and AMD CPU instances across different numpy installs on AWS, specifically t3.2xlarge ($0.3328 per hour) and t3a.2xlarge ($0.3008 per hour), in terms of raw numpy matrix multiplication and matrix norm calculation. t3.2xlarge uses Intel Xeon Platinum 8000 series CPUs (Skylake-SP or Cascade Lake) and t3a.2xlarge uses the AMD EPYC 7571 (Zen 1). Both instance types have 8 threads, and both CPUs are now around 3 years old, so the performance difference between them cannot be extrapolated to current-generation local workstations.
We will be using `pip install numpy`, which usually installs the OpenBLAS version of numpy, and `pip install intel-numpy`, which installs the MKL version. And since Intel's MKL library blocks AMD CPUs from using the optimal compute path, we will be using the `export MKL_DEBUG_CPU_TYPE=5` trick to make MKL numpy perform better on AMD.
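Note that MKL reads `MKL_DEBUG_CPU_TYPE` when the library first loads, so the variable has to be set before numpy is imported. If exporting it in the shell is inconvenient, here is a minimal sketch of setting it from inside Python instead (effective only on MKL builds prior to 2020.1, per the update above):

```python
import os

# MKL checks MKL_DEBUG_CPU_TYPE at load time, so set it before importing
# numpy. "5" forces the AVX2 code path, sidestepping the vendor check that
# sends AMD CPUs down an unoptimized path. (No-op on MKL >= 2020.1 and on
# OpenBLAS-backed numpy builds.)
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

import numpy as np

# Quick sanity check that BLAS still produces correct results.
a = np.random.randn(200, 200)
b = np.random.randn(200, 200)
nrm = np.linalg.norm(a @ b)
print("norm of small test product:", nrm)
```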
We will also include our 7-year-old local machine with a 4-core/8-thread Intel Core i7-3770K and 32GB of DDR3 memory, because, why not.
Here’s the short script we used for this simple benchmark, taken from PugetSystems.
import time
import numpy as np

np.show_config()  # shows which BLAS/LAPACK backend numpy is linked against

np.random.seed(0)
n = 20000
A = np.random.randn(n, n).astype('float64')
B = np.random.randn(n, n).astype('float64')

start_time = time.time()
nrm = np.linalg.norm(A @ B)  # 20000x20000 matmul, then Frobenius norm
print("took {} seconds".format(time.time() - start_time))
print("norm = ", nrm)
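A single run can be noisy on burstable t3/t3a instances (CPU credits, shared tenancy), so a variant that warms up the BLAS thread pool and reports the best of several runs may give more stable numbers. A sketch with a smaller `n` so it finishes quickly:

```python
import time
import numpy as np

n = 2000  # smaller than the 20000 used above so each run stays fast
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

A @ B  # warm-up: lets the BLAS library spin up its thread pool

times = []
for _ in range(3):
    start = time.perf_counter()  # monotonic, higher resolution than time.time()
    C = A @ B
    times.append(time.perf_counter() - start)

print("best of 3: {:.3f} seconds".format(min(times)))
```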
Results
Here’s what we found. AWS’s Intel instance is the clear winner in numpy performance, and the gap widens further when using Intel MKL numpy from `pip install intel-numpy`. Even with the plain OpenBLAS numpy from `pip install numpy`, Intel is the clear winner on AWS. Interestingly enough, once the debug trick is applied we find almost no performance difference between MKL- and OpenBLAS-backed numpy on AMD’s CPU. And without the `MKL_DEBUG_CPU_TYPE=5` trick, AWS’s AMD t3a instances are clearly being throttled by Intel MKL numpy, running almost 10x slower than the Intel instance on MKL.
Conclusion
- Use Intel compute instances if any of the work involves numpy or libraries built on numpy, e.g., pandas, scipy, scikit-learn, etc.
- Use the Intel Python distribution, Intel MKL, or `pip install intel-numpy`. There are also Intel-optimized versions of Tensorflow/Keras and Pytorch, which should help CPU instance inference speed.
- For AMD systems, normal `openblas` numpy works fine on AWS. On a local workstation with the latest CPUs, it might be a different story.
The benchmark stdouts are as follows:
t3.2xlarge, Intel, openBLAS
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
took 104.94071435928345 seconds
norm = 2828386.9333149535
t3.2xlarge, Intel, MKL
mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
took 55.229004859924316 seconds
norm = 2828386.933314957
t3a.2xlarge, AMD, openBLAS
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
took 217.2462785243988 seconds
norm = 2828386.9333149535
t3a.2xlarge, AMD, MKL
mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
took 540.7750198841095 seconds
norm = 2828386.9333149106
t3a.2xlarge, AMD, MKL, DEBUG=5
ubuntu@:~$ export MKL_DEBUG_CPU_TYPE=5
ubuntu@:~$ python3 bench.py
mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
took 219.22319507598877 seconds
norm = 2828386.9333149563
local, Intel Core i7-3770K, MKL
blas_mkl_info:
libraries = ['mkl_rt']
library_dirs = ['B:/miniconda/envs/intelpython3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['B:/miniconda/envs/intelpython3\\Library\\include']
blas_opt_info:
libraries = ['mkl_rt']
library_dirs = ['B:/miniconda/envs/intelpython3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['B:/miniconda/envs/intelpython3\\Library\\include']
lapack_mkl_info:
libraries = ['mkl_rt']
library_dirs = ['B:/miniconda/envs/intelpython3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['B:/miniconda/envs/intelpython3\\Library\\include']
lapack_opt_info:
libraries = ['mkl_rt']
library_dirs = ['B:/miniconda/envs/intelpython3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['B:/miniconda/envs/intelpython3\\Library\\include']
took 231.50352239608765 seconds
norm = 2828386.9333149106
To cite this content, please use:
@article{leehanchung,
  author       = {Lee, Hanchung},
  title        = {AWS CPU Instance Numpy Benchmarking Roundup},
  year         = {2020},
  howpublished = {\url{https://leehanchung.github.io}},
  url          = {https://leehanchung.github.io/AWS-numpy-benchmarking}
}