Update: Intel MKL 2020.1 has removed the MKL_DEBUG_CPU_TYPE override described below. rip. :(

In this post, we are going to focus on comparing AWS CPU instances for machine learning workloads, specifically numpy. Yes, numpy. It is common knowledge nowadays that GPUs handle machine learning and deep learning workloads much faster than CPU instances. However, depending on your workload, most of the preprocessing, postprocessing, and I/O will still rely on CPU performance, and specifically on numpy.

There are reports indicating that AMD's EPYC 7600-series chips run much faster than Intel's Xeon 8100-series in floating-point calculations, such as the following.

EPYC vs Xeon

Setup

However, local workstation performance might not translate to AWS, especially for numpy. So here we will compare Intel and AMD CPU instances on AWS across different types of numpy installs, specifically t3.2xlarge ($0.3328 per hour) and t3a.2xlarge ($0.3008 per hour), in terms of raw numpy matrix multiplication and matrix norm calculation. t3.2xlarge uses Intel Xeon Platinum 8000-series CPUs (Skylake-SP or Cascade Lake) and t3a.2xlarge uses the AMD EPYC 7571 (Zen 1). Both instance types have 8 threads, and both CPUs are now around 3 years old, so the performance difference between them cannot be extrapolated to current-generation local workstations.

We will be using pip install numpy, which usually installs the OpenBLAS build of numpy, and pip install intel-numpy, which installs the MKL build. And since Intel's MKL library blocks AMD CPUs from using the optimal compute path, we will be using the export MKL_DEBUG_CPU_TYPE=5 trick to make Intel's MKL numpy perform better on AMD.
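The override can also be set from inside Python, but with one gotcha: MKL reads the variable when the library first loads, so it must be set before numpy is imported. A minimal sketch (and note the update at the top of this post: MKL 2020.1 removed this code path entirely):

```python
import os
import sys

# MKL reads MKL_DEBUG_CPU_TYPE when the library first loads, so the
# variable must be set before numpy (and thus MKL) is imported.
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

# Guard against the common mistake: if numpy is already imported, MKL
# has already chosen its code path and the variable has no effect.
assert "numpy" not in sys.modules, "set MKL_DEBUG_CPU_TYPE before importing numpy"
```

Exporting the variable in the shell before launching Python, as we do below, avoids the import-order issue entirely.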

We will also include our seven-year-old machine with a 4-core/8-thread Intel Core i7-3770K and 32GB of DDR3 memory, because, why not.

Here’s the short script we used for the simple benchmark, taken from PugetSystems.

import time
import numpy as np


np.show_config()   # report which BLAS backend this numpy build uses
np.random.seed(0)
n = 20000

# Two dense 20000 x 20000 float64 matrices, ~3.2 GB each
A = np.random.randn(n, n).astype('float64')
B = np.random.randn(n, n).astype('float64')

start_time = time.time()
nrm = np.linalg.norm(A @ B)
print("took {} seconds ".format(time.time() - start_time))
print("norm = ", nrm)
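One caveat with the script above: a single time.time() measurement also captures one-off costs such as BLAS thread-pool spin-up. A slightly more careful harness (a sketch with a helper name of our own choosing, not what produced the numbers below) would use time.perf_counter with a warm-up call and keep the best of a few repeats:

```python
import time


def timed(fn, warmup=1, repeats=3):
    """Best-of-N wall-clock time for fn(), with warm-up calls.

    The warm-up absorbs one-off costs (thread-pool creation, page
    faults) so the timed repeats measure steady-state compute.
    """
    for _ in range(warmup):
        fn()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# e.g. timed(lambda: np.linalg.norm(A @ B)) with the arrays above
```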

Results

Here’s what we found. AWS’s Intel instance is the clear winner in numpy performance, and the gap widens further when using Intel MKL numpy via pip install intel-numpy. Even with the plain OpenBLAS numpy from pip install numpy, Intel is the clear winner on AWS. Interestingly enough, we found almost no performance difference between MKL- and OpenBLAS-backed numpy on AMD’s CPU once the override is applied. And without the MKL_DEBUG_CPU_TYPE=5 trick, AWS’s AMD t3a instances are clearly throttled by Intel MKL numpy, running almost 10x slower than the Intel instance with MKL.

benchmark
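To put the timings in perspective, the wall-clock numbers below can be converted into effective GFLOPS: an n×n matrix multiply costs roughly 2n³ floating-point operations, and the norm adds only a negligible O(n²). A back-of-the-envelope helper (the function name is ours, for illustration):

```python
def matmul_gflops(n, seconds):
    """Effective GFLOPS for an n x n float64 matmul (~2 * n**3 flops)."""
    return 2 * n**3 / seconds / 1e9

# With n = 20000 and the measured times below, t3.2xlarge with MKL
# (~55.2 s) lands around 290 GFLOPS, while t3a.2xlarge under
# un-tricked MKL (~540.8 s) manages only about 30 GFLOPS.
```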

Conclusion

  • Use Intel compute instances if any of your work involves numpy or libraries built on numpy, e.g., pandas, scipy, scikit-learn, etc.
  • Use the Intel Python distribution, Intel MKL, or pip install intel-numpy. There are also Intel-optimized versions of Tensorflow/Keras and Pytorch, which should help with CPU instance inference speed.
  • For AMD systems, the normal OpenBLAS numpy works fine on AWS. On a local workstation with the latest CPUs, it might be a different story.

The benchmark stdout logs are as follows:

t3.2xlarge, Intel, openBLAS

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
took 104.94071435928345 seconds
norm =  2828386.9333149535

t3.2xlarge, Intel, MKL

mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
took 55.229004859924316 seconds
norm =  2828386.933314957

t3a.2xlarge, AMD, openBLAS

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
took 217.2462785243988 seconds 
norm =  2828386.9333149535

t3a.2xlarge, AMD, MKL

mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
took 540.7750198841095 seconds 
norm =  2828386.9333149106

t3a.2xlarge, AMD, MKL, MKL_DEBUG_CPU_TYPE=5

ubuntu@:~$ export MKL_DEBUG_CPU_TYPE=5
ubuntu@:~$ python3 bench.py 
mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
took 219.22319507598877 seconds 
norm =  2828386.9333149563

Local, Intel Core i7-3770K, MKL

blas_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = ['B:/miniconda/envs/intelpython3\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['B:/miniconda/envs/intelpython3\\Library\\include']
blas_opt_info:
    libraries = ['mkl_rt']
    library_dirs = ['B:/miniconda/envs/intelpython3\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['B:/miniconda/envs/intelpython3\\Library\\include']
lapack_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = ['B:/miniconda/envs/intelpython3\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['B:/miniconda/envs/intelpython3\\Library\\include']
lapack_opt_info:
    libraries = ['mkl_rt']
    library_dirs = ['B:/miniconda/envs/intelpython3\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['B:/miniconda/envs/intelpython3\\Library\\include']
took 231.50352239608765 seconds
norm =  2828386.9333149106

To cite this content, please use:

@misc{leehanchung,
    author = {Lee, Hanchung},
    title = {AWS CPU Instance Numpy Benchmarking Roundup},
    year = {2020},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/AWS-numpy-benchmarking}
}