Allocating more GPUs in an interactive session
By default, the gpu-interactive command on a gateway node allocates one GPU and four CPU cores to a session. If you examine the gpu-interactive command script, you will find that it actually calls the srun command of the SLURM system to allocate the session:
srun --gres=gpu:1 --cpus-per-task=4 --pty --mail-type=ALL bash
To have more CPUs and GPUs allocated, use the srun command with different --gres=gpu and --cpus-per-task parameters, e.g.:
srun --gres=gpu:2 --cpus-per-task=8 --pty --mail-type=ALL bash
The above command requests 2 GPUs and 8 CPU cores for an interactive session. The GPU and CPU time quotas will be deducted accordingly. The system is configured to allocate 7GB of RAM per CPU core. For example, a session with 4 CPUs will have 28GB of RAM, and one with 8 CPUs will have 56GB of RAM. Our servers with GTX1080Ti or RTX2080Ti GPUs support a maximum of 4 GPUs, 16 CPU cores and 112GB of RAM in a single session.
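Once the session starts, you can verify what has actually been granted. The following is a minimal sketch using standard SLURM and NVIDIA tools; the exact fields printed by scontrol depend on the SLURM version installed:
# Run these inside the interactive session
nvidia-smi -L                      # lists the GPUs visible to this session
scontrol show job $SLURM_JOB_ID    # shows the CPUs, GPUs and memory granted to the job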
Please make sure that the software you use supports multiple GPUs before requesting more than one GPU in a session; otherwise, your time quota will be wasted.
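For example, if you use TensorFlow in an Anaconda environment (as in the batch script example below), a quick check such as the following confirms that the framework sees both GPUs. This is a sketch assuming a TensorFlow 2.x environment named tensorflow; other frameworks have their own device-listing calls:
conda activate tensorflow
# Should list two GPU devices if both allocated GPUs are visible to TensorFlow
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"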
To prevent users from occupying too many resources, each user is limited to 2 GPUs and 8 CPU cores concurrently (with the limit for CS RPGs doubled in Phase 1, i.e., 4 GPUs and 16 CPU cores).
Running Batch Jobs
If your program does not require user interaction during execution, you can submit it to the system in batch mode. The system will schedule your job in the background, so you do not need to keep a terminal session open on a gateway node to wait for the output. To submit a batch job,
- Create a batch file, e.g., my-gpu-batch, with the following contents:
#!/bin/bash
# Tell the system the resources you need. Adjust the numbers according to your need, e.g.
#SBATCH --gres=gpu:1 --cpus-per-task=4 --mail-type=ALL
# If you use Anaconda, initialize it
. $HOME/anaconda3/etc/profile.d/conda.sh
conda activate tensorflow
# cd to your desired directory and execute your program, e.g.
cd _to_your_directory_you_need
_run_your_program_
- On a gateway node (gpugate1 or gpugate2), submit your batch job to the system with the following command:
sbatch my-gpu-batch
Note the job id displayed. The output of your program will be saved in slurm-<job id>.out.
An email will be sent to you when your job starts and ends.
Use "squeue -u $USER" to see the status of your jobs in the system queue.
To cancel a job, find its job id from the output of "squeue -u $USER" and use "scancel <job id>".
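Putting the steps together, a typical batch workflow looks like this (tail -f is an ordinary shell command for following a growing output file):
sbatch my-gpu-batch          # prints "Submitted batch job <job id>"
squeue -u $USER              # check whether the job is pending or running
tail -f slurm-<job id>.out   # follow the program output as it is written
scancel <job id>             # cancel the job if it is no longer needed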
The concurrent CPU and GPU limits (2 GPUs/8 CPUs for general users and 4 GPUs/16 CPUs for CS RPGs in Phase 1) also apply to batch jobs. If you need to run multiple batch jobs concurrently, please contact support@cs.hku.hk for a temporary increase of your concurrent limit.
Using RTX3090 GPUs in GPU farm Phase 1
A small number of RTX3090 GPUs, arranged in pairs connected with NVLink bridges, are available in GPU farm Phase 1.
A SLURM session with RTX3090 GPUs must request 2 GPUs, which are connected by an NVLink bridge, and supports system (not GPU) memory sizes of up to 224GB.
To start a session with RTX3090 GPUs, use the command line options '-p q3090' and '--gres=gpu:rtx3090:2' with the srun and sbatch commands. For example, the following srun command allocates a session with two RTX3090 GPUs and 8 CPU cores (and, implicitly, 224GB of system memory).
srun -p q3090 --gres=gpu:rtx3090:2 --cpus-per-task=8 --pty --mail-type=ALL bash
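Inside the session, you can confirm that the two allocated GPUs are connected by NVLink using a standard NVIDIA tool (the exact layout of the output depends on the driver version):
nvidia-smi topo -m           # NV# entries between the two GPUs indicate NVLink connections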
The default time limit of an RTX3090 session is 6 hours, i.e., the session will be closed and all running processes in it will be terminated 6 hours after it starts. To have a longer session (up to 48 hours), use the '-t HH:MM:SS' command line option with srun and sbatch. For example, the following #SBATCH directive instructs the sbatch command to run a job in a 12-hour session, and a reminder email will be sent when 80% of the time limit is reached:
#!/bin/bash
#SBATCH -p q3090 --gres=gpu:rtx3090:2 --cpus-per-task=8 -t 12:00:00 --mail-type=ALL,TIME_LIMIT_80
your_script_starts_here
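The same time limit option also works for interactive sessions. For instance, the following command (a sketch combining the options shown above) requests a 12-hour interactive RTX3090 session:
srun -p q3090 --gres=gpu:rtx3090:2 --cpus-per-task=8 -t 12:00:00 --mail-type=ALL,TIME_LIMIT_80 --pty bash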
Sessions with extended time limits, 4x RTX3090 GPUs, up to 32 CPU cores and 448GB of system memory can be arranged on request for users whose programs need more GPUs, time or system resources.
Further Information
Please visit the official site of the SLURM Workload Manager for further documentation on using SLURM.