Setting Up GPU Environment for Model Training

AUTHOR: Andre Foote

In a previous post, I laid out the process of setting up TensorFlow for CPU based model training. However, as this graph shows, the GPU is significantly faster at this process. That’s why I’m documenting my attempt to install the GPU based version of TensorFlow on an HPC.

Effectiveness of GPU vs CPU in Model Training

First Steps:

  1. Start dangerously modifying graphic card drivers straight away.
  2. Have your boss wisely recommend the merits of using Docker.
  3. Reluctantly begin researching Docker.
  4. Gradually realize the merits of using Docker.

Install the necessary prerequisites, Docker, and Nvidia-Docker which takes care of setting up the Nvidia host driver environment inside the Docker containers and a few other things.

Nvidia-Docker has the additional prerequisite of requiring an installation of an Nvidia driver and CUDA. CUDA is the API that allows the use of GPU for general computing.

Before installing new drivers, check to see whether they are already installed.

Nvidia Driver check:
$cat /proc/driver/nvidia/version    OR    nvidia-smi

CUDA check:
$cat /usr/local/cuda/version.txt    OR    nvcc --version

If these drivers do not work, remove them completely before attempting fresh installs. I used: sudo apt-get remove --purge cuda*

sudo apt-get remove --purge nvidia*

The process of installing these drivers can be found here. Complete steps 2.1-2.4 and then 3.6. For my OS, I followed the recommended specifications from TensorFlow’s own website found here. Rebooting after installing any Nvidia graphics driver is essential, don’t forget. Then test that the drivers are working using the commands mentioned above.

The next step is to pull a blank Ubuntu image using the command docker pull ubuntu

To run the image: sudo nvidia-docker run -it -p 6006:6006 -v /sharedfolder:/root/sharedfolder ubuntu:latest bash

To save the image after any modifications you make, use the docker commit command, note that images and containers can take up a lot of space. Use df -h and du -h to check how much space is left on your drive.Once inside the docker container. I began to setup the necessary libraries, packages and software:

apt update; apt install sudo -y; apt install python; apt install python-pip; apt install nano; apt install git; apt update

pip install tensorflow-gpu

#Replace the original models folder

git clone

apt install protobuf-compiler python-pil python-lxml python-tk;
pip install --user Cython;
pip install --user contextlib2;
pip install --user jupyter;
pip install --user matplotlib;

git clone
cd cocoapi/PythonAPI
cp -r pycocotools ../../models/research/

#Download the CUDA and CUDNN drivers for the ubuntu image as well and install them

dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
apt-key adv --fetch-keys
apt update
apt install cuda-10-0
cat /proc/driver/nvidia/version
tar -xzvf cudnn-10.0-linux-x64-v7.5.0.56.tgz
cp cuda/include/cudnn.h /usr/local/cuda/include
cp cuda/include/cudnn.h /usr/local/cuda-10.0/include
cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
cp cuda/lib64/libcudnn* /usr/local/cuda-10.0/lib64
chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda-10.0/lib64/libcudnn*

#Configuring paths

export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

python build
python install
protoc object_detection/protos/*.proto --python_out=.
export PYTHONPATH=$PYTHONPATH:pwd:pwd/slim


python object_detection/ --logtostderr --train_dir=object_detection/training/ --pipeline_config_path=object_detection/training/ssd_mobilenet_v1_pets.config

#Exporting model into a usable state (Replace #### with the number of the latest model.ckpt file)

python \
--input_type image_tensor \
--pipeline_config_path "training/ssd_mobilenet_v1_pets.config" \
--trained_checkpoint_prefix training/model.ckpt-#### \
--output_directory frozenModel

You can now test this model using tensorflow’s own object detection tutorial code. You’ll have to change the paths in boxes 5 and 9.

Post Views: 27