Update: Intel MKL 2020.1 has disabled the `MKL_DEBUG_CPU_TYPE` debug mode described below. RIP. :(

In this post, we are going to focus on comparing AWS CPU instances for machine learning workloads, specifically numpy. Yes, numpy. It is common knowledge nowadays that GPUs handle machine learning and deep learning workloads much faster than CPU instances. However, depending on your workload, most of the preprocessing, postprocessing, and I/O will still rely on CPU performance, and in particular on numpy.

There are reports indicating that AMD's EPYC 7601-class CPUs run much faster than Intel's Xeon 8100-series in floating-point calculations.

# Setup

However, local workstation performance might not translate to AWS, especially for numpy. So here, we will compare Intel and AMD CPU instances across different types of numpy installs on AWS, specifically **t3.2xlarge** ($0.3328 per hour) and **t3a.2xlarge** ($0.3008 per hour), in terms of raw numpy matrix multiplication and matrix norm calculation. t3.2xlarge uses the Intel Xeon Platinum 8000 series (Skylake-SP or Cascade Lake) and t3a.2xlarge uses the AMD EPYC 7571 (Zen 1). Both instance types have 8 threads, and both CPUs are now around 3 years old, so the performance difference between them cannot be extrapolated to current-generation local workstations.

We will be using `pip install numpy`, which usually installs the `openblas` build of numpy, and `pip install intel-numpy`, which installs the `mkl` build. And since Intel's `mkl` library blocks AMD CPUs from using the optimal compute path, we will also use the `export MKL_DEBUG_CPU_TYPE=5` trick to make Intel's `mkl` numpy perform better on AMD.
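If exporting the variable in the shell is inconvenient (e.g., inside a managed notebook), one workaround is to set it from Python before numpy is first imported, since MKL reads the environment when the library is loaded. A minimal sketch; note this only has an effect on MKL builds of numpy, and per the update above, MKL 2020.1 removed the flag:

```python
import os

# MKL reads MKL_DEBUG_CPU_TYPE when the library is first loaded,
# so this must run before the first `import numpy`.
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

import numpy as np

np.show_config()  # confirm which BLAS backend this numpy build links against
```

If numpy was already imported earlier in the process, setting the variable afterwards does nothing; restart the interpreter instead.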

We will also include our 7-year-old machine with a 4-core/8-thread Intel Core i7-3770K and 32GB of DDR3 memory, because, why not.

Here’s the short script we used for this simple benchmark, taken from PugetSystems.

```python
import time
import numpy as np
np.show_config()
np.random.seed(0)
n = 20000
A = np.random.randn(n, n).astype('float64')
B = np.random.randn(n, n).astype('float64')
start_time = time.time()
nrm = np.linalg.norm(A @ B)
print("took {} seconds ".format(time.time() - start_time))
print("norm = ",nrm)
```
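The single 20000x20000 matmul above takes minutes and, on burstable t3/t3a instances, is also sensitive to CPU credits. For quick sanity checks, a smaller problem run several times with the best time kept gives a more stable signal. A hedged sketch of that variant (the size `n` and repeat count here are arbitrary choices, not from the original benchmark):

```python
import time

import numpy as np


def bench(n=2000, repeats=3):
    """Time an n x n float64 matmul + norm; return the best of `repeats` runs."""
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((n, n))
    best, nrm = float("inf"), 0.0
    for _ in range(repeats):
        start = time.time()
        nrm = np.linalg.norm(A @ B)
        best = min(best, time.time() - start)
    return best, nrm


best, nrm = bench()
print(f"best of 3: {best:.3f} seconds, norm = {nrm}")
```

Taking the minimum rather than the mean filters out runs polluted by noisy neighbors or thermal/credit throttling.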

# Results

Here’s what we found. AWS’s Intel instance is clearly the winner in numpy performance, and the gap widens further with Intel MKL numpy from `pip install intel-numpy` (55.2s vs. 104.9s). Even with the plain openblas numpy from `pip install numpy`, Intel is the clear winner on AWS (104.9s vs. 217.2s on AMD). Interestingly enough, we found almost no performance difference between MKL and openblas numpy on the AMD CPU once the debug trick is applied (219.2s vs. 217.2s). And without the debug=5 trick, AWS’s AMD t3a instance is clearly throttled by Intel MKL numpy, at 540.8s, almost 10x slower than the Intel MKL result.
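To put the wall times in perspective, the speedup ratios can be computed directly from the numbers in the stdout dumps below:

```python
# Wall-clock times (seconds) from the benchmark runs in this post.
times = {
    "t3 Intel, openblas": 104.94,
    "t3 Intel, MKL": 55.23,
    "t3a AMD, openblas": 217.25,
    "t3a AMD, MKL": 540.78,
    "t3a AMD, MKL + DEBUG=5": 219.22,
}

baseline = times["t3 Intel, MKL"]  # fastest configuration
for name, t in sorted(times.items(), key=lambda kv: kv[1]):
    print(f"{name:24s} {t:8.2f}s  {t / baseline:.1f}x the Intel MKL time")
```

This works out to roughly 1.9x for Intel openblas, 3.9-4.0x for AMD (openblas or MKL with the trick), and 9.8x for AMD on stock MKL.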

# Conclusion

- Use Intel compute instances if any of your work involves numpy or libraries based on numpy, e.g., pandas, scipy, scikit-learn, etc.
- Use the Intel Python distribution, Intel MKL, or `pip install intel-numpy`. There are also Intel-optimized versions of Tensorflow/Keras and Pytorch, which should help CPU inference speed.
- For AMD systems, the normal `openblas` numpy works fine on AWS. On a local workstation with the latest CPUs, it might be a different story.

### The benchmark stdouts are as follows:

#### t3.2xlarge, Intel, openBLAS

```
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
took 104.94071435928345 seconds
norm = 2828386.9333149535
```

#### t3.2xlarge, Intel, MKL

```
mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
took 55.229004859924316 seconds
norm = 2828386.933314957
```

#### t3a.2xlarge, AMD, openBLAS

```
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
took 217.2462785243988 seconds
norm = 2828386.9333149535
```

#### t3a.2xlarge, AMD, MKL

```
mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
took 540.7750198841095 seconds
norm = 2828386.9333149106
```

#### t3a.2xlarge, AMD, MKL, DEBUG=5

```
ubuntu@:~$ export MKL_DEBUG_CPU_TYPE=5
ubuntu@:~$ python3 bench.py
mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda1anaconda2anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda1anaconda2anaconda3/include']
took 219.22319507598877 seconds
norm = 2828386.9333149563
```

#### local, Intel Core i7-3770K, MKL

```
blas_mkl_info:
libraries = ['mkl_rt']
library_dirs = ['B:/miniconda/envs/intelpython3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['B:/miniconda/envs/intelpython3\\Library\\include']
blas_opt_info:
libraries = ['mkl_rt']
library_dirs = ['B:/miniconda/envs/intelpython3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['B:/miniconda/envs/intelpython3\\Library\\include']
lapack_mkl_info:
libraries = ['mkl_rt']
library_dirs = ['B:/miniconda/envs/intelpython3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['B:/miniconda/envs/intelpython3\\Library\\include']
lapack_opt_info:
libraries = ['mkl_rt']
library_dirs = ['B:/miniconda/envs/intelpython3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['B:/miniconda/envs/intelpython3\\Library\\include']
took 231.50352239608765 seconds
norm = 2828386.9333149106
```