Caffe

Caffe is a Deep Learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors.

The Caffe Docker Container

Caffe is available on JADE through the use of a Docker container. For more information on JADE’s use of containers, see Using Containerised Applications.

Using Caffe Interactively

All containerised applications are launched interactively in the same way, on one compute node at a time. The number of GPUs to use on that node is requested with the --gres option. To request an interactive session on a compute node, issue the following command from the login node:

# Request 2 GPUs for Caffe image version 17.04
srun --gres=gpu:2 --pty /jmain01/apps/docker/caffe 17.04

This command starts the container on a compute node and prints the following:

==================
== NVIDIA Caffe ==
==================

NVIDIA Release 17.04 (build 26740)

Container image Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
Copyright (c) 2014, 2015, The Regents of the University of California (Regents)
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

groups: cannot find name for group ID 1002
I have no name!@<container-id>:/home_directory$

Note

The group ID warning and no name warning can safely be ignored.

Note

Your home directory on the host (e.g. /jmain01/home/JAD00X/test/test1-test) is mapped to the /home_directory folder inside the container.

You can check this with the command:

ls /home_directory
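
As a rough illustration of this mapping, the sketch below converts a path under the host home directory into the path it would have inside the container. The function name to_container_path is hypothetical, the home directory is the example one quoted above, and the sketch assumes that everything beneath the home directory appears under /home_directory at the same relative location.

```shell
#!/bin/bash

# Hypothetical helper: translate a host path at or below the home directory
# into the matching path under /home_directory inside the container.
#   $1  host home directory, e.g. /jmain01/home/JAD00X/test/test1-test
#   $2  host path to translate
to_container_path() {
    local host_home="$1"
    local host_path="$2"

    case "$host_path" in
        "$host_home")
            echo "/home_directory"
            ;;
        "$host_home"/*)
            # Strip the host home prefix and re-root under /home_directory
            echo "/home_directory/${host_path#"$host_home"/}"
            ;;
        *)
            echo "error: $host_path is not under $host_home" >&2
            return 1
            ;;
    esac
}

to_container_path /jmain01/home/JAD00X/test/test1-test \
    /jmain01/home/JAD00X/test/test1-test/data/train.txt
```

Paths outside the home directory are rejected rather than guessed, since nothing here says whether other host directories are visible inside the container.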

You are now inside the container where Caffe is installed. Let’s check by asking for the version:

caffe --version

This prints something like:

caffe version 0.16.0

You can now begin training your network as normal:

caffe train -solver=my_solver.prototxt
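
The training command can be extended to use every GPU requested for the session or to resume from a saved solver state. The sketch below only prints the command line it would run. It assumes the caffe binary in the container accepts the -gpu and -snapshot flags of the upstream Caffe command-line tool; the helper name build_train_cmd and the snapshot file name are hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: print a `caffe train` command line.
#   $1  solver file (e.g. my_solver.prototxt)
#   $2  optional .solverstate snapshot to resume training from
build_train_cmd() {
    local solver="$1"
    local snapshot="${2:-}"

    # -gpu=all asks Caffe to train on every GPU visible to the session
    local cmd="caffe train -solver=${solver} -gpu=all"

    # Append the resume flag only when a snapshot was given
    if [ -n "$snapshot" ]; then
        cmd="${cmd} -snapshot=${snapshot}"
    fi
    echo "$cmd"
}

build_train_cmd my_solver.prototxt
build_train_cmd my_solver.prototxt my_model_iter_1000.solverstate
```

Printing the command first makes it easy to inspect before running it inside the container.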

Using Caffe in Batch Mode

Wrappers are provided for launching the containers in batch mode.

First, navigate to the folder you wish to launch your script from. This example uses the home directory:

cd ~

It is recommended that you put the commands to run inside the container in a script file, e.g. script.sh:

#!/bin/bash

# Prints out Caffe's version number
caffe --version

Don’t forget to make script.sh executable:

chmod +x script.sh
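
Creating script.sh and making it executable can be combined into a single step and checked before anything is submitted. The following is a minimal sketch of one way to do that; the function name write_script is hypothetical, and the script body is the same version check shown above.

```shell
#!/bin/bash

# Hypothetical helper: create script.sh in the given directory
# (default: current directory) and mark it executable.
write_script() {
    local target_dir="${1:-.}"

    # The quoted 'EOF' keeps the script body literal (no expansion here)
    cat > "${target_dir}/script.sh" <<'EOF'
#!/bin/bash

# Prints out Caffe's version number
caffe --version
EOF

    # Set the executable bit so the script can be run directly
    chmod +x "${target_dir}/script.sh"

    # Report success only if the executable bit is actually set
    if [ -x "${target_dir}/script.sh" ]; then
        echo "script.sh is ready in ${target_dir}"
    else
        echo "error: script.sh is not executable" >&2
        return 1
    fi
}

write_script .
```

Checking the executable bit up front catches a permissions problem on the login node instead of in the job output.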

Then create a Slurm batch script that is used to launch the code, e.g. batch.sh:

#!/bin/bash

# set the number of nodes
#SBATCH --nodes=1

# set max wallclock time
#SBATCH --time=01:00:00

# set name of job
#SBATCH -J JobName

# set number of GPUs
#SBATCH --gres=gpu:8

# mail alert at start, end and abort of execution
#SBATCH --mail-type=ALL

# send mail to this address
#SBATCH --mail-user=<your email address>


# Launch the commands within script.sh
/jmain01/apps/docker/caffe-batch -c ./script.sh

You can then submit the job using sbatch:

sbatch batch.sh

On successful submission, a job ID is given:

Submitted batch job 7800

The output appears in the Slurm standard output file named after the job ID (in this case slurm-7800.out). Its content is as follows:

==================
== NVIDIA Caffe ==
==================

NVIDIA Release 17.04 (build 26740)

Container image Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
Copyright (c) 2014, 2015, The Regents of the University of California (Regents)
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

caffe version 0.16.0
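
Because the output file is named after the job ID, the file name can be derived from the message sbatch prints on submission. The sketch below parses the message shown earlier; the helper names job_id_from_sbatch and slurm_output_file are hypothetical, and the message is hard-coded here so that the example runs without a scheduler.

```shell
#!/bin/bash

# Hypothetical helper: extract the job ID from sbatch's confirmation line.
# Prints nothing if the line does not have the expected form.
job_id_from_sbatch() {
    # Expected form: "Submitted batch job <id>"; the ID is the last field
    echo "$1" | awk '/^Submitted batch job [0-9]+$/ { print $NF }'
}

# Hypothetical helper: Slurm output file name for a job ID,
# following the slurm-<id>.out pattern described above.
slurm_output_file() {
    echo "slurm-$1.out"
}

# In a real session the message would come from: msg=$(sbatch batch.sh)
msg="Submitted batch job 7800"
job_id=$(job_id_from_sbatch "$msg")
echo "job ${job_id} writes to $(slurm_output_file "$job_id")"
```

Slurm also offers sbatch --parsable, which prints the job ID without the surrounding text and may make the parsing step unnecessary.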