Allocating more GPUs in an interactive session
By default, the gpu-interactive command on a gateway node allocates one GPU and four CPU cores to a session. If you examine the gpu-interactive command script, you will find that it actually calls the srun command of the SLURM system to allocate the session:
srun --gres=gpu:1 --cpus-per-task=4 --pty --mail-type=ALL bash
To have more CPUs and GPUs allocated, use the srun command with different --gres=gpu and --cpus-per-task parameters, e.g.:
srun --gres=gpu:2 --cpus-per-task=8 --pty --mail-type=ALL bash
The above command requests 2 GPUs and 8 CPU cores for an interactive session. The GPU and CPU time quotas will be deducted accordingly. The system is configured to allocate 7GB of RAM per CPU core. For example, a session with 4 CPUs will have 28GB of RAM, and one with 8 CPUs will have 56GB of RAM. Our servers with GTX1080Ti or RTX2080Ti GPUs support a maximum of 4 GPUs, 16 CPU cores and 112GB of RAM in a single session.
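Once the session starts, you can verify what has actually been granted. The following is a minimal sketch using standard SLURM and NVIDIA tools; the exact fields printed by scontrol depend on the SLURM version installed:
# Run these inside the interactive session
nvidia-smi -L                      # lists the GPUs visible to this session
scontrol show job $SLURM_JOB_ID    # shows the CPUs, GPUs and memory granted to the job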
Please make sure that the software you use supports multiple GPUs before requesting more than one GPU in a session; otherwise, your time quota will be wasted.
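For example, if you use TensorFlow in an Anaconda environment (as in the batch script example below), a quick check such as the following confirms that the framework sees both GPUs. This is a sketch assuming a TensorFlow 2.x environment named tensorflow; other frameworks have their own device-listing calls:
conda activate tensorflow
# Should list two GPU devices if both allocated GPUs are visible to TensorFlow
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"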
To prevent users from occupying too many resources, each user is limited to 2 GPUs and 8 CPU cores concurrently (with the limit for CS RPGs doubled in Phase 1, i.e., 4 GPUs and 16 CPU cores).
Running Batch Jobs
If your program does not require user interaction during execution, you can submit it to the system in batch mode. The system will schedule your job in the background, so you do not need to keep a terminal session open on a gateway node to wait for the output. To submit a batch job,
- Create a batch file, e.g., my-gpu-batch, with the following contents:
#!/bin/bash
# Tell the system the resources you need. Adjust the numbers according to your need, e.g.
#SBATCH --gres=gpu:1 --cpus-per-task=4 --mail-type=ALL
# If you use Anaconda, initialize it
. $HOME/anaconda3/etc/profile.d/conda.sh
conda activate tensorflow
# cd to your desired directory and execute your program, e.g.
cd _to_your_directory_you_need
_run_your_program_
- On a gateway node (gpugate1 or gpugate2), submit your batch job to the system with the following command:
sbatch my-gpu-batch
Note the job id displayed. The output of your program will be saved in slurm-<job id>.out.
An email will be sent to you when your job starts and ends.
Use "squeue -u $USER" to see the status of your jobs in the system queue.
To cancel a job, find its job id from the output of "squeue -u $USER" and use "scancel <job id>".
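Putting the steps together, a typical batch workflow looks like this (tail -f is an ordinary shell command for following a growing output file):
sbatch my-gpu-batch          # prints "Submitted batch job <job id>"
squeue -u $USER              # check whether the job is pending or running
tail -f slurm-<job id>.out   # follow the program output as it is written
scancel <job id>             # cancel the job if it is no longer needed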
The concurrent CPU and GPU limits (2 GPUs/8 CPUs for general users and 4 GPUs/16 CPUs for CS RPGs in Phase 1) also apply to batch jobs. If you need to run multiple batch jobs concurrently, please contact support@cs.hku.hk for a temporary increase of your concurrent limit.
Using RTX3090 GPUs in GPU farm Phase 1
A small number of RTX3090 GPUs, arranged in pairs connected with NVLink bridges, are available in GPU farm Phase 1.
A SLURM session with RTX3090 GPUs must request 2 GPUs, which are connected by an NVLink bridge, and supports system (not GPU) memory sizes of up to 224GB.
To start a session with RTX3090 GPUs, use the command line options '-p q3090' and '--gres=gpu:rtx3090:2' with the srun and sbatch commands. For example, the following srun command allocates a session with two RTX3090 GPUs and 8 CPU cores (and, implicitly, 224GB of system memory).
srun -p q3090 --gres=gpu:rtx3090:2 --cpus-per-task=8 --pty --mail-type=ALL bash
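Inside the session, you can confirm that the two allocated GPUs are connected by NVLink using a standard NVIDIA tool (the exact layout of the output depends on the driver version):
nvidia-smi topo -m           # NV# entries between the two GPUs indicate NVLink connections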
The default time limit of an RTX3090 session is 6 hours, i.e., the session will be closed and all running processes in it will be terminated 6 hours after it starts. To have a longer session (up to 48 hours), use the '-t HH:MM:SS' command line option with srun and sbatch. For example, the following #SBATCH directive instructs the sbatch command to run a job in a 12-hour session, and a reminder email will be sent when 80% of the time limit is reached:
#!/bin/bash
#SBATCH -p q3090 --gres=gpu:rtx3090:2 --cpus-per-task=8 -t 12:00:00 --mail-type=ALL,TIME_LIMIT_80
your_script_starts_here
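The same time limit option also works for interactive sessions. For instance, the following command (a sketch combining the options shown above) requests a 12-hour interactive RTX3090 session:
srun -p q3090 --gres=gpu:rtx3090:2 --cpus-per-task=8 -t 12:00:00 --mail-type=ALL,TIME_LIMIT_80 --pty bash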
Sessions with extended time limits, 4x RTX3090 GPUs, up to 32 CPU cores and 448GB of system memory can be arranged on request for users whose programs need more GPUs, time or system resources.
Further Information
Please visit the official site of the SLURM Workload Manager for further documentation on using SLURM.