For most machine learning workloads, Docker is helpful but does not guarantee parity between production, staging, and development environments. Hell, Docker is barely enough to guarantee reproducible numerical outputs. The reason is that the Docker daemon sits on top of the host operating system to do its virtualization. Thus, anything running in a container is still bound by the underlying drivers, the OS, and ultimately the hardware.

For example, AMD and Intel CPUs, while both x86-64, take different code paths and have their own respective BLAS libraries. More recently, Apple’s M1, an ARM-based CPU, has had trouble numerically replicating code written for x86-64. And specifically for deep learning, Tensorflow does NOT package CUDA and CuDNN with its binaries if you install from pip. That sometimes produces some fairly interesting errors due to compatibility issues. For example, I had BiDirectionalLSTM bugging out for ‘no reason’.
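One reason different code paths can change numerical output is that floating-point addition is not associative, so a kernel that accumulates in a different order can land on a different result. The snippet below is only a toy illustration of that effect in plain Python; the helper names are made up for this example, and it does not call any real BLAS library.

```python
import math


def sum_left_to_right(values):
    """Accumulate from the first element to the last."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_right_to_left(values):
    """Accumulate from the last element to the first."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total


values = [0.1, 0.2, 0.3]
forward = sum_left_to_right(values)
backward = sum_right_to_left(values)

# Same inputs, same math on paper, different bits in practice.
print(forward == backward)              # False
print(math.isclose(forward, backward))  # True: close, but not identical
```

The gap here is a single rounding step, but deep learning workloads chain millions of such accumulations, which is why bit-for-bit reproduction across CPUs is hard.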

While Tensorflow “generously” lists its Tested build configurations, it does not pin the patch versions. And of course, using the wrong patch version breaks things. So here, I am keeping track of the results of hours of installing and uninstalling drivers.

Known Combo of Tensorflow, CUDA, CuDNN, and nVidia Driver Versions

Known working CUDA, CuDNN, and Driver versions.

Tensorflow 2.5.0 | CUDA Version 11.2.1 | CuDNN Version 8.1.1 | Driver Version 470.25 (Windows 10)

Tensorflow 2.2.0 | CUDA Version 10.1.0 | CuDNN Version 7.6.5 | Driver Version 431.86 (Windows 10)
  • CUDA 11.2.1 and CuDNN 8.1.1 should also fix the Ampere GPU (RTX 30x0) issues for deep learning.
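A quick way to use the combinations above is to keep them in a small lookup and compare them against whatever is installed on a machine. This is a minimal sketch: `KNOWN_GOOD` and `find_mismatches` are hypothetical names for this example, the installed versions are typed in by hand, and exact string matching is a simplifying assumption (other patch versions may well work, they are just not listed here).

```python
# Known-good combinations copied from the list above (Windows 10).
KNOWN_GOOD = {
    "2.5.0": {"cuda": "11.2.1", "cudnn": "8.1.1", "driver": "470.25"},
    "2.2.0": {"cuda": "10.1.0", "cudnn": "7.6.5", "driver": "431.86"},
}


def find_mismatches(tf_version, installed):
    """Return {component: (expected, found)} for every component that differs.

    `installed` maps "cuda", "cudnn", and "driver" to version strings.
    A missing component is reported with `None` as the found value.
    """
    if tf_version not in KNOWN_GOOD:
        raise KeyError(f"No known-good combination recorded for Tensorflow {tf_version}")
    expected = KNOWN_GOOD[tf_version]
    return {
        name: (want, installed.get(name))
        for name, want in expected.items()
        if installed.get(name) != want
    }


# Example: the right CUDA and driver, but a different CuDNN patch version.
installed = {"cuda": "11.2.1", "cudnn": "8.0.5", "driver": "470.25"}
print(find_mismatches("2.5.0", installed))
```

An empty result means the machine matches a combination recorded here; anything else points at the component to reinstall first.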

Other Solutions

Pytorch packages the required parts of CUDA and CuDNN with its binaries, so this is less likely to be an issue, although Pytorch still treats Windows users as second-class citizens. And for Ubuntu-based machines, Lambda Stack is awesome and should be how people actually distribute software.

References:

Tensorflow Tested build configurations

BiDirectionalLSTM-accelerated LSTMs/GRUs crash randomly with: [ InternalError: [Derived] Failed to call ThenRnnBackward with model config ]


To cite this content, please use:

@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {Feature Selection and Dimensionality Reduction},
    year = {2021},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/2021-05-20-tensorflow-and-friends/}
}