This article explains how to install the GPU build of TensorFlow on CentOS 7 on Google Cloud Platform in order to speed up the training of deep learning models.

 

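Prerequisites

Sign up for Google Cloud Platform and create a VM instance with a GPU (recommended: 1 x NVIDIA Tesla K80): https://console.cloud.google.com

Choose CentOS 7 as the operating system of the VM instance

Install a Python distribution (Anaconda is recommended: https://www.anaconda.com)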

 

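Install the basic development packages

The CentOS 7 image on Google Cloud Platform does not include gcc and related packages by default. Add the EPEL repository manually first (the dkms package installed below comes from it), then install the kernel packages and gcc.

EPEL download location: Index of /pub/epel/7/x86_64/Packages/e
https://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/e/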

$ wget https://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/e/epel-release-7-11.noarch.rpm
$ sudo rpm -ivh epel-release-7-11.noarch.rpm
$ sudo yum install --enablerepo=epel dkms
$ sudo yum install kernel*
$ sudo yum install gcc*

Confirm that gcc was installed successfully

$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)

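Install NVIDIA CUDA

CUDA (Compute Unified Device Architecture) is an integrated technology from NVIDIA that lets developers use NVIDIA GPUs from the GeForce 8 series onward, as well as newer Quadro GPUs, for computation. (Source: Wikipedia)

With the CUDA platform, a program can offload its data computation to the GPU. GPUs matter a great deal in fields such as deep learning because of their strong parallel computing and floating-point performance.

In the CUDA programming model the CPU is the host and the GPU is the device: the CPU handles overall scheduling and program logic, while the GPU executes the highly threaded, data-parallel parts.

Visit the CUDA Toolkit 10.0 download page

CUDA Toolkit 10.0 Download | NVIDIA Developer
https://developer.nvidia.com/cuda-downloads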

$ wget https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.0.130-1.x86_64.rpm
$ sudo yum install cuda-repo-rhel7-10.0.130-1.x86_64.rpm
$ sudo yum install cuda-10-0
$ sudo yum upgrade

 

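Install NVIDIA cuDNN

cuDNN, short for NVIDIA CUDA® Deep Neural Network library, is a GPU-accelerated library that NVIDIA designed specifically for the basic operations of deep neural networks. It provides highly tuned implementations of standard routines, such as the forward and backward passes of convolution, pooling, normalization, and activation layers.

Visit the cuDNN download page (an NVIDIA Developer account is required, and the token at the end of the download URL below is tied to a login session)

cuDNN Download | NVIDIA Developer
https://developer.nvidia.com/rdp/cudnn-download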

$ wget -O cudnn-10.0-linux-x64-v7.3.1.20.tgz 'https://developer.download.nvidia.com/compute/machine-learning/cudnn/secure/v7.3.1/prod/10.0_2018927/cudnn-10.0-linux-x64-v7.3.1.20.tgz?YwkiKhzn58ta2p0EM_n3UhXORsYIskH0bpRiQPHkv8fH88vtZR6RWaqg_wLS1qYMUf3x6wZ5YykCIRpXP8pDUUrCKeay7xBR6rv5vf7T2zRYcnZEQvT_lMLYYASv6u7NIBEmzGtDpQtnCRXWFvkI-k16Wt62i1XInOA-63FQlBtgx2OBFOn9bM21oS9RDb7F23jB3jGXoc6knvH_mUQ4Dg'

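Extract the downloaded cuDNN archive

$ tar -xvf cudnn-10.0-linux-x64-v7.3.1.20.tgz

Change into the extracted cuda directory

$ cd cuda

Copy the files from the cuda subdirectories into the /usr/local/cuda-xx.x installation directory (cuda-10.0 here)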

$ sudo cp include/* /usr/local/cuda-10.0/include/
$ sudo cp lib64/lib* /usr/local/cuda-10.0/lib64/
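A quick way to double-check the copy step is to look for the cuDNN header and libraries in the target directory. The following is only a minimal sketch: the function name find_cudnn_files is made up for illustration, and it assumes the cuDNN 7 layout used above (include/cudnn.h and lib64/libcudnn*).

```python
import glob
import os


def find_cudnn_files(cuda_home="/usr/local/cuda-10.0"):
    """Return the cuDNN header and library paths found under cuda_home.

    Hypothetical helper: it only checks that the files exist, not that
    their version matches the installed CUDA toolkit.
    """
    header = os.path.join(cuda_home, "include", "cudnn.h")
    libs = sorted(glob.glob(os.path.join(cuda_home, "lib64", "libcudnn*")))
    return {
        "header": header if os.path.isfile(header) else None,
        "libs": libs,
    }
```

Calling find_cudnn_files() after the two cp commands should report a header path and a non-empty list of libraries; a None header or an empty list suggests that the copy did not reach the expected directory.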

Edit the /etc/profile file

$ sudo vi /etc/profile

Add the following lines directly below the line export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL

# cuda
export PATH=/usr/local/cuda-10.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH
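Whether the two export lines are in effect can be checked from Python before TensorFlow is involved. This is a small sketch under the assumption that CUDA lives in /usr/local/cuda-10.0 as in this article; the helper name missing_cuda_entries is hypothetical.

```python
import os

CUDA_HOME = "/usr/local/cuda-10.0"  # install prefix used in this article


def missing_cuda_entries(env, cuda_home=CUDA_HOME):
    """Return (variable, directory) pairs that are not configured in env.

    env is any mapping of environment variables, for example os.environ.
    """
    expected = [
        ("PATH", cuda_home + "/bin"),
        ("LD_LIBRARY_PATH", cuda_home + "/lib64"),
    ]
    missing = []
    for var, directory in expected:
        # Environment path lists are colon-separated on Linux.
        entries = env.get(var, "").split(":")
        if directory not in entries:
            missing.append((var, directory))
    return missing
```

An empty result from missing_cuda_entries(os.environ) means both entries are present. Changes to /etc/profile only apply to new login shells, or to the current shell after running source /etc/profile.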

 

Check the NVIDIA GPU driver

Install the pciutils package

$ sudo yum install pciutils

Confirm that the NVIDIA GPU is visible on the PCI bus

$ /usr/sbin/lspci | grep -i nvidia
00:04.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
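The same check can be scripted, for example as part of an instance setup script. The sketch below mirrors the lspci | grep -i nvidia pipeline; the function names are illustrative, and the subprocess call uses only options available in the Python 3.6 shown later in this article.

```python
import subprocess


def nvidia_devices(lspci_output):
    """Return the lspci lines that mention NVIDIA, ignoring case (like grep -i)."""
    return [line.strip() for line in lspci_output.splitlines()
            if "nvidia" in line.lower()]


def list_nvidia_devices(lspci="/usr/sbin/lspci"):
    """Run lspci and filter its output.

    Raises FileNotFoundError if pciutils is not installed.
    """
    result = subprocess.run([lspci], stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return nvidia_devices(result.stdout)
```

An empty list from list_nvidia_devices() would suggest that the VM instance was created without a GPU attached.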

Load the nvidia kernel module

$ sudo modprobe nvidia

Check that the NVIDIA GPU driver is working

$ nvidia-smi
Wed Oct 31 10:29:20 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    77W / 149W |      0MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
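For scripted checks, the table above is awkward to parse. nvidia-smi also offers a CSV query mode (--query-gpu together with --format=csv,noheader), which the following sketch uses. The choice of fields and the function names are assumptions made for illustration.

```python
import csv
import io
import subprocess

QUERY_FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_rows(csv_text):
    """Turn nvidia-smi CSV output (without header) into one dict per GPU."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(zip(QUERY_FIELDS, (cell.strip() for cell in row)))
            for row in reader if row]


def query_gpus():
    """Query nvidia-smi for QUERY_FIELDS.

    Raises FileNotFoundError if nvidia-smi is not installed and
    subprocess.CalledProcessError if it cannot talk to the driver.
    """
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader"]
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return parse_gpu_rows(result.stdout)
```

On the instance used in this article, query_gpus() would be expected to return a single entry describing the Tesla K80.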

 

Install tensorflow-gpu

Install the tensorflow-gpu package with conda

$ conda install tensorflow-gpu

Run TensorFlow

$ python
Python 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2018-10-31 10:32:04.575694: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2018-10-31 10:32:06.566120: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-31 10:32:06.566530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-10-31 10:32:06.566555: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-10-31 10:32:08.775430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-31 10:32:08.775524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2018-10-31 10:32:08.775535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2018-10-31 10:32:08.775807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10759 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
>>> sess.run(hello)
b'Hello, TensorFlow!'
>>> from tensorflow.python.client import device_lib
>>> print(device_lib.list_local_devices())
2018-10-31 10:33:16.061424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-10-31 10:33:16.061475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-31 10:33:16.061484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2018-10-31 10:33:16.061491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2018-10-31 10:33:16.061606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/device:GPU:0 with 10759 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 14806055166838811165
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 11281927373
locality {
bus_id: 1
links {
}
}
incarnation: 14891267943998682329
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7"
]
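Reading the device listing by eye works for a one-off check. For a training script it can be handy to fail early when no GPU is visible. The sketch below filters the same device_lib.list_local_devices() result by device_type; the two helper names are hypothetical, and TensorFlow is imported lazily so that the filtering function can be used on its own.

```python
def gpu_device_names(devices):
    """Return the names of all entries whose device_type is "GPU"."""
    return [d.name for d in devices if d.device_type == "GPU"]


def local_gpu_device_names():
    """List the GPUs visible to TensorFlow on this machine."""
    # Imported here so the module loads even where TensorFlow is absent.
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())
```

With the listing above, local_gpu_device_names() would contain "/device:GPU:0"; an empty list means TensorFlow only sees the CPU.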

At this point tensorflow-gpu is installed successfully, and the GPU can be used to speed up the training of deep learning models.

 

Problems and solutions

Problem 1. /usr/sbin/lspci: No such file or directory

$ /usr/sbin/lspci | grep -i nvidia
-bash: /usr/sbin/lspci: No such file or directory

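The pciutils package is not installed. Solution: find out which package provides lspci, then install that package with sudo yum install pciutils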

$ yum provides lspci
pciutils-3.5.1-3.el7.x86_64 : PCI bus related utilities
Repo        : base
Matched from:
Filename    : /usr/sbin/

Problem 2. nvidia-smi: command not found

$ nvidia-smi
-bash: nvidia-smi: command not found

The NVIDIA packages are not installed correctly. For the fix, see the NVIDIA CUDA and NVIDIA cuDNN installation sections above.

Problem 3. failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2018-10-31 09:57:58.058038: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2018-10-31 09:57:58.111365: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2018-10-31 09:57:58.111509: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:150] kernel driver does not appear to be running on this host (instance-1): /proc/driver/nvidia/version does not exist

The NVIDIA packages are not installed correctly. The fix is the same as for Problem 2.
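The log for this problem points at /proc/driver/nvidia/version, which only exists while the NVIDIA kernel driver is loaded. A minimal sketch of that check follows; the function name driver_status is made up for illustration.

```python
import os

NVIDIA_VERSION_FILE = "/proc/driver/nvidia/version"


def driver_status(version_file=NVIDIA_VERSION_FILE):
    """Return the first line of the driver version file, or None if it is absent."""
    if not os.path.exists(version_file):
        return None
    with open(version_file) as f:
        return f.readline().strip()
```

If driver_status() returns None, the kernel driver is not running, which matches the message in the log above. Loading the module with sudo modprobe nvidia, or reinstalling the NVIDIA packages, is the next thing to try.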

Problem 4. gcc: command not found

$ gcc --version
-bash: gcc: command not found

The CentOS minimal installation does not include the gcc development packages. For the fix, see the basic development packages section above.

 

