Setting Up GPU Environment for Model Training
In a previous post, I laid out the process of setting up TensorFlow for CPU-based model training. However, as this graph shows, the GPU is significantly faster at this process. That’s why I’m documenting my attempt to install the GPU-based version of TensorFlow on an HPC.
- Start dangerously modifying graphic card drivers straight away.
- Have your boss wisely recommend the merits of using Docker.
- Reluctantly begin researching Docker.
- Gradually realize the merits of using Docker.
Nvidia-Docker additionally requires an Nvidia driver and CUDA to be installed. CUDA is the API that allows the GPU to be used for general-purpose computing.
Before installing new drivers, check to see whether they are already installed.
Nvidia driver check:
cat /proc/driver/nvidia/version
or
nvidia-smi
CUDA check:
cat /usr/local/cuda/version.txt
or
nvcc --version
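The same checks can be scripted. The following is a minimal sketch, assuming Python 3 is available on the host; the helper names `tool_status` and `check_gpu_stack` are my own and not part of any Nvidia tooling. It only reports whether `nvidia-smi` and `nvcc` are on the PATH and exit cleanly.

```python
import shutil
import subprocess


def tool_status(tool, args):
    """Return (ok, first_output_line) for a command-line tool.

    ok is False when the tool is not on the PATH, cannot be started,
    times out, or exits with a non-zero status.
    """
    path = shutil.which(tool)
    if path is None:
        return False, ""
    try:
        result = subprocess.run(
            [path] + args,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False, ""
    lines = result.stdout.strip().splitlines()
    return result.returncode == 0, (lines[0] if lines else "")


def check_gpu_stack():
    """Check the driver tool (nvidia-smi) and the CUDA compiler (nvcc)."""
    return {
        "nvidia-smi": tool_status("nvidia-smi", []),
        "nvcc": tool_status("nvcc", ["--version"]),
    }


if __name__ == "__main__":
    for name, (ok, info) in check_gpu_stack().items():
        print(name, "OK" if ok else "missing or failing", info)
```

If either tool is reported as missing or failing, that is a hint the driver or CUDA install needs attention before going further.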
If these drivers do not work, remove them completely before attempting fresh installs. I used:
sudo apt-get remove --purge 'cuda*'
sudo apt-get remove --purge 'nvidia*'
The process of installing these drivers can be found here. Complete steps 2.1-2.4 and then 3.6. For my OS, I followed the recommended specifications from TensorFlow’s own website found here. Don’t forget to reboot after installing any Nvidia graphics driver; it is essential. Then test that the drivers are working using the commands mentioned above.
The next step is to pull a blank Ubuntu image using the command
docker pull ubuntu
To run the image:
sudo nvidia-docker run -it -p 6006:6006 -v /sharedfolder:/root/sharedfolder ubuntu:latest bash
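To make the flags in that command easier to read, here is a small sketch that assembles the same command from named parts. The function `build_run_command` and its parameter names are hypothetical, chosen only for illustration. `-it` gives an interactive terminal, `-p` publishes a container port on the host (6006 is TensorBoard's default port, which is presumably why it is exposed here), and `-v` mounts a host folder inside the container.

```python
import shlex


def build_run_command(image="ubuntu:latest",
                      host_dir="/sharedfolder",
                      container_dir="/root/sharedfolder",
                      port=6006):
    """Assemble the nvidia-docker run command as an argument list."""
    return [
        "sudo", "nvidia-docker", "run",
        "-it",                                           # interactive terminal
        "-p", "{0}:{0}".format(port),                    # host_port:container_port
        "-v", "{}:{}".format(host_dir, container_dir),   # host_dir:container_dir
        image, "bash",                                   # image, then the command to run
    ]


if __name__ == "__main__":
    # Print a copy-pasteable version of the command.
    print(" ".join(shlex.quote(part) for part in build_run_command()))
```

With the defaults this reproduces the command above; changing `host_dir` or `port` shows which part of the command each setting controls.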
To save the image after any modifications you make, use the docker commit command. Note that images and containers can take up a lot of space. Use
df -h
to check how much space is left on your drive and
du -h
to see how much space a directory is using.
Once inside the Docker container, I began to set up the necessary libraries, packages and software:
apt update; apt install -y sudo python python-pip nano git
pip install tensorflow-gpu
#Replace the original models folder
git clone https://github.com/tensorflow/models
apt install protobuf-compiler python-pil python-lxml python-tk;
pip install --user Cython;
pip install --user contextlib2;
pip install --user jupyter;
pip install --user matplotlib;
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
make
cp -r pycocotools ../../models/research/
#Download the CUDA and CUDNN drivers for the ubuntu image as well and install them
dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt update
apt install cuda-10-0
#Extract the cuDNN archive (use the file name of the archive you downloaded)
tar -xzvf cudnn-10.0-linux-x64-v188.8.131.52.tgz
cp cuda/include/cudnn.h /usr/local/cuda/include
cp cuda/include/cudnn.h /usr/local/cuda-10.0/include
cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
cp cuda/lib64/libcudnn* /usr/local/cuda-10.0/lib64
chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
chmod a+r /usr/local/cuda-10.0/include/cudnn.h /usr/local/cuda-10.0/lib64/libcudnn*
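#Run the following from the models/research directory; compile the protos first so the generated files are included in the install
protoc object_detection/protos/*.proto --python_out=.
python setup.py build
python setup.py install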
python object_detection/train.py --logtostderr --train_dir=object_detection/training/ --pipeline_config_path=object_detection/training/ssd_mobilenet_v1_pets.config
#Exporting model into a usable state (replace #### with the number of the latest model.ckpt file and OUTPUT_DIRECTORY with the folder to export to)
python export_inference_graph.py \
--input_type image_tensor \
--pipeline_config_path "training/ssd_mobilenet_v1_pets.config" \
--trained_checkpoint_prefix training/model.ckpt-#### \
--output_directory OUTPUT_DIRECTORY
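Finding the number to put in place of #### by hand gets tedious. Below is a rough sketch that scans a training folder for the highest checkpoint step, assuming TensorFlow wrote its usual model.ckpt-<step>.index files there. The function name `latest_checkpoint_step` is my own; TensorFlow's `tf.train.latest_checkpoint` is an alternative if you would rather get the full checkpoint prefix.

```python
import os
import re
import tempfile

# Checkpoints are written as model.ckpt-<step>.index (plus .meta and .data-* files).
_CKPT_PATTERN = re.compile(r"^model\.ckpt-(\d+)\.index$")


def latest_checkpoint_step(train_dir):
    """Return the highest checkpoint step found in train_dir, or None."""
    steps = []
    for name in os.listdir(train_dir):
        match = _CKPT_PATTERN.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


if __name__ == "__main__":
    # Demo on a throwaway folder with fake checkpoint files.
    demo_dir = tempfile.mkdtemp()
    for fake in ("model.ckpt-100.index", "model.ckpt-2500.index", "checkpoint"):
        open(os.path.join(demo_dir, fake), "w").close()
    print("Latest step:", latest_checkpoint_step(demo_dir))
```

Pointing it at the training folder gives the step number to substitute into the --trained_checkpoint_prefix argument.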
You can now test this model using TensorFlow’s own object detection tutorial code. You’ll have to change the paths in boxes 5 and 9.
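Before running the tutorial, it may be worth confirming that TensorFlow inside the container actually sees the GPU. This is only a sketch: the function name `tensorflow_sees_gpu` is mine, and it relies on `tf.test.is_gpu_available()`, which exists in the TensorFlow 1.x releases that pair with CUDA 10.0 but is deprecated in later versions. It returns None rather than failing when TensorFlow or that call is unavailable.

```python
def tensorflow_sees_gpu():
    """Return True/False if TensorFlow can report GPU availability, else None."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return None
    check = getattr(tf.test, "is_gpu_available", None)
    if check is None:
        # This TensorFlow version no longer provides the call.
        return None
    return bool(check())


if __name__ == "__main__":
    result = tensorflow_sees_gpu()
    if result is None:
        print("Could not query TensorFlow for GPU availability.")
    else:
        print("GPU available to TensorFlow:", result)
```

A result of False with working drivers usually points at a mismatch between the installed TensorFlow build and the CUDA or cuDNN versions.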