Using the HKU CS GPU Farm (Quick Start)

Applying for an Account

Users of the Faculty of Engineering, including members of the Department of Computer Science, are eligible to use GPU Farm Phase 2. Please visit https://intranet.cs.hku.hk/gpufarm_acct_cas/ for application. The username of the account will be the same as your HKU Portal ID. A new password will be set for the account.

Charges apply to users who are not members of the Department of Computer Science and are not taking any designated course offered by the Department that requires the use of GPUs. See https://www.cs.hku.hk/research/major-equipment-charges for details.

An email will be sent to you after your account is created.

Phase 1 accounts for Staff and Students of the Department of Computer Science

The following users are also eligible to use GPU Farm Phase 1:

  • staff of the Department of Computer Science;
  • PhD and MPhil students of the Department of Computer Science;
  • students of BASc(FinTech), BEng(CE), BEng(CompSc), BEng(DS&E), MSc(CompSc), MSc(ECom&IComp) and MSc(FTDA)

Students of project courses (COMP4801, COMP4805, COMP7704 and COMP7705) may also apply for an additional Phase 1 account for their projects.

Please visit https://intranet.cs.hku.hk/gpufarm_acct/ for application. The username and password of the account will be the same as your CS account.

Due to the high utilization of GPU Farm Phase 1, new users are recommended to apply for a Phase 2 account first.

Accessing the GPU Farm

To access the GPU farm, you need to be connected to the HKU network (e.g., when you are using the wired network in CS laboratories and offices, or connected to HKU Wifi or HKUVPN). Use SSH to connect to one of the gateway nodes:

Gateway nodes for GPU Farm Phase 1: gpugate1.cs.hku.hk or gpugate2.cs.hku.hk
Gateway nodes for GPU Farm Phase 2: gpu2gate1.cs.hku.hk or gpu2gate2.cs.hku.hk

Login with your username and password (or your SSH key if you have uploaded your public key during account application), e.g.:

ssh <your_cs_username>@gpugate1.cs.hku.hk # for Phase 1 (use 'ssh -X' to enable X11 forwarding if your local computer runs Linux)
ssh <your_portal_id>@gpu2gate1.cs.hku.hk # for Phase 2

These gateway nodes provide access to the actual GPU compute nodes of the farm. You can also transfer data to your home directory on the GPU farm by using SFTP to the gateway nodes.
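
For example, from your local computer you can copy files to and from your GPU farm home directory with sftp or scp (a minimal sketch; the file names below are placeholders and a Phase 1 gateway is shown):

# interactive SFTP session to a gateway node
sftp <your_cs_username>@gpugate1.cs.hku.hk
# or one-off copies with scp, e.g.
scp mydata.tar.gz <your_cs_username>@gpugate1.cs.hku.hk:~/
scp <your_cs_username>@gpugate1.cs.hku.hk:~/results.txt .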

The home directory of your GPU farm account is separate from those of other CS servers such as the academy cluster (X2GO).

To facilitate X11 forwarding in interactive mode on the GPU compute nodes, an SSH key pair (id_rsa and id_rsa.pub) and the authorized_keys file are generated in the ~/.ssh directory when your GPU farm account is created. You are free to replace the key pair with your own, and to add your own public keys to the ~/.ssh/authorized_keys file.
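
For example, to log in with your own key pair (a minimal sketch using standard OpenSSH tools; the key type and comment are just examples, and a Phase 1 gateway is shown):

# on your local computer: generate a new key pair
ssh-keygen -t ed25519 -C "gpu_farm_key"
# append the new public key to ~/.ssh/authorized_keys of your farm account
ssh-copy-id -i ~/.ssh/id_ed25519.pub <your_cs_username>@gpugate1.cs.hku.hk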

Using GPUs in Interactive Mode

After logging on to a gateway node, you can log in to a server node with actual GPUs attached. To have an interactive session, use the gpu-interactive command to run a bash shell on a GPU node. An available GPU compute node will be selected and allocated to you, and you will be logged on to the node automatically. The GPU compute nodes are named gpu-comp-x for Phase 1 and gpu2-comp-x for Phase 2. Note the change of host name in the command prompt when you actually log on to a GPU node, e.g.,

tmchan@gpugate1:~$ gpu-interactive 
tmchan@gpu-comp-1:~$

You can verify that a GPU is allocated to you with the nvidia-smi command, e.g.:

tmchan@gpu-comp-1:~$ nvidia-smi

Sat Feb 16 17:22:06 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:06:00.0 Off |                  N/A |
| 28%   38C    P0    56W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

With the gpu-interactive command, 1 GPU, 4 CPU cores and 28 GB of RAM are allocated to you.

You can now install and run software as you would on a normal Linux server.

Note that you do not have sudo privileges. Do not use commands such as 'sudo pip' or 'sudo apt-get' to install software.

The time limits (quotas) for GPU and CPU time start counting once you have logged on to a GPU compute node, and continue until you log out of the GPU compute node:

tmchan@gpu-comp-1:~$ exit
tmchan@gpugate1:~$

All your processes running on the GPU node will be terminated when you exit from the gpu-interactive command.

Accessing Your Session with Another Terminal

After you are allocated a GPU compute node with gpu-interactive, you may access the same node with another SSH session. What you need is the actual IP address of the GPU compute node you are in. Run 'hostname -I' on the GPU compute node to find out its IP address. The output will be an IP address 10.XXX.XXX.XXX, e.g.,

tmchan@gpu-comp-1:~$ hostname -I
10.21.5.225

Then using another terminal on your local desktop/notebook, SSH to this IP address:

ssh -X <your_cs_username>@10.XXX.XXX.XXX

These additional SSH sessions will terminate when you exit the gpu-interactive command.

Note: Do not use more than one gpu-interactive (or srun) at the same time if you just want to access your current GPU session from a second terminal, since those commands will start a new independent session and allocate an additional GPU to you, i.e., your GPU time quota will be doubly deducted. Also, you cannot access the GPUs of your previous sessions.
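
Your GPU sessions are managed by SLURM, so before starting a new one you can check from a gateway node whether you already have a session allocated (a quick check using SLURM's standard squeue command):

# list your own running and pending SLURM jobs
squeue -u $USER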

Software Installation

After logging on to a GPU compute node using the gpu-interactive command (or SLURM's native srun command), you can install software that uses the GPUs, such as Anaconda, into your home directory.

The following NVIDIA libraries are preinstalled:

  • NVIDIA driver for GPU

  • CUDA
    Phase 1: 10.1 (in /usr/local/cuda-10.1) and 10.0 (in /usr/local/cuda-10.0)
    Phase 2: 11.0 (in /usr/local/cuda-11.0) and 10.1 (in /usr/local/cuda-10.1)

  • cuDNN
    Phase 1: 7.6.5
    Phase 2: 7.6.5 and 8.0.5
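
A quick way to check which CUDA toolkit versions are installed on the node you are using (assuming the default install locations listed above):

# list the installed CUDA toolkit directories
ls -d /usr/local/cuda*
# show the version of the default CUDA toolkit
/usr/local/cuda/bin/nvcc --version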


You may install software on your account from repositories such as Anaconda, or by compiling from the source code. Running 'sudo' and 'apt' is not supported.

Examples

Below are some example steps of software installation:

Note: make sure you are on a GPU compute node (the host name in the prompt shows gpu-comp-x or gpu2-comp-x) before installing and running your software.

Anaconda (including Jupyter)


# if you are on a gateway node, log in to a GPU node first
gpu-interactive
# download installer, check for latest version from www.anaconda.com
wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh
# run the installer,
# and allow the installer to update your shell profile (.bashrc) to automatically initialize conda
bash Anaconda3-2023.09-0-Linux-x86_64.sh
# log out and log in to the GPU node again to activate the change in .bashrc
exit
# run gpu-interactive again when you are back to the gateway node
gpu-interactive
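
Back on the GPU node, you can confirm that conda has been initialized for your shell, e.g.:

# the installer's changes to .bashrc should now be in effect
conda --version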
 

 

Install PyTorch in a dedicated Conda environment

# If you are on a gateway node, log in to a GPU node first
gpu-interactive
# Create a new environment; you may change the Python version 3.11 to another version if needed
conda create -n pytorch python=3.11
# Activate the new environment
conda activate pytorch
# Then use a web browser to visit https://pytorch.org/. In the INSTALL PYTORCH section, select:
#   Your OS: Linux
#   Package: Conda
#   Language: Python
#   Compute Platform: CUDA 11.x or CUDA 12.x
# Then run the command displayed in "Run this command", e.g.,
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
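
As a quick test (analogous to the TensorFlow check in the next example), you can verify that PyTorch detects the GPU:

# print the PyTorch version and whether a CUDA GPU is visible
python -c 'import torch; print("PyTorch version:", torch.__version__, "and GPU is", "available" if torch.cuda.is_available() else "NOT AVAILABLE")'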

 

Install TensorFlow in a dedicated Conda environment

# If you are on a gateway node, log in to a GPU node first
gpu-interactive
# Create a new environment
conda create -n tensorflow python=3.11
# Activate the new environment
conda activate tensorflow
# install TensorFlow, CUDA and other supporting libraries
pip install tensorflow[and-cuda]
# print the version of Tensorflow as a test, expected output: Tensorflow version: 2.13.1 and GPU is available 
python -c 'import tensorflow as tf; print("Tensorflow version:", tf.__version__, "and GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")'
# To use Jupyter Lab within the environment, install the IPython kernel
conda install ipykernel
ipython kernel install --user --name=kernel_for_tensorflow
# and choose the kernel named kernel_for_tensorflow in Jupyter Lab 

See the TensorFlow site for supported versions of Python, CUDA and cuDNN for various TensorFlow versions.

Install Jupyter Kernel for an environment

# Suppose your conda environment is named my_env. To use Jupyter Lab within the environment, install the IPython kernel
conda activate my_env
conda install ipykernel
ipython kernel install --user --name=kernel_for_my_env
# and then choose kernel_for_my_env in Jupyter Lab
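
To review the kernels registered for your account, or to remove one you no longer need, you can use Jupyter's kernelspec command:

# list the registered kernels
jupyter kernelspec list
# remove a kernel that is no longer needed, e.g.
jupyter kernelspec uninstall kernel_for_my_env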


Using CUDA

When you install common tools such as PyTorch or TensorFlow, the installation instructions include steps to install the supporting CUDA libraries. Usually there is no need to install or compile the CUDA toolkit separately.

In case a separate CUDA toolkit is needed, it is available in /usr/local/cuda on all GPU nodes. To avoid conflicts, CUDA is not added to the PATH variable of user accounts by default. If you need to develop with CUDA (e.g., using nvcc), you can add the following line to your ~/.bashrc:

PATH=/usr/local/cuda/bin:$PATH
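
After opening a new shell (or running 'source ~/.bashrc'), you can check that nvcc is now found:

# show the version of the CUDA compiler picked up from /usr/local/cuda
nvcc --version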

Various versions of CUDA (without nvcc) are available with Conda. If you need a version of CUDA that is not provided by the system, you can install it with Conda (after activating your conda environment). For example,

conda install -c nvidia cudnn=8.0.4 cudatoolkit=11.0 # for Tensorflow 2.4
conda install cudatoolkit=9.0 # for legacy tools

See NVIDIA's CUDA toolkit and cuDNN conda package pages for available conda packages. When the conda package for a particular version of cuDNN (e.g., 7.4) is not available, you may try the next minor version update (e.g., 7.6.0).

Running Jupyter Lab without Starting a Web Browser

Running jupyter-lab starts a web browser by default. While this is convenient when the software is run on a local computer, running a web browser on a compute node of the GPU farm not only consumes the memory and CPU power of your session, but the responsiveness of the web browser will also degrade, especially if you are connecting remotely from outside HKU. We recommend running jupyter-lab on a GPU compute node without starting a web browser, and accessing it with the web browser of your local computer. The steps below show how to do it:

  1. Login a GPU compute node from a gateway node with gpu-interactive:
    gpu-interactive

  2. Find out the IP address of the GPU compute node:
    hostname -I 
    (The output will be an IP address 10.XXX.XXX.XXX)

  3. Start Jupyter Lab with the --no-browser option and note the URL displayed at the end of the output:
    jupyter-lab --no-browser --FileContentsManager.delete_to_trash=False
    The output will look something like this:
    ...
    Or copy and paste one of these URLs:
       http://localhost:8888/?token=b92a856c2142a8c52efb0d7b8423786d2cca3993359982f1

    Note the actual port no. of the URL. It may sometimes be 8889, 8890, or 8891, etc.

  4. On your local desktop/notebook computer, start another terminal and run SSH with port forwarding to the IP address you obtained in step 2:
    ssh -L 8888:localhost:8888 <your_gpu_acct_username>@10.XXX.XXX.XXX
    (Change 8888 to the actual port no. you saw in step 3.)
    Note: The ssh command in this step should be run on your local computer. Do not log in to the gateway node.

  5. On your local desktop/notebook computer, start a web browser. Copy the URL from step 3 to it.

Remember to shut down your Jupyter Lab instance and quit your gpu-interactive session after use. Leaving a Jupyter Lab instance idle on a GPU node will exhaust your GPU time quota.

Notes on file deletion: If you start Jupyter Lab without --FileContentsManager.delete_to_trash=False, files deleted with Jupyter Lab will be moved to the Trash Bin (~/.local/share/Trash) instead of being actually deleted. Your disk quota may eventually be used up by the Trash Bin. To empty the trash and release the disk space used, use the following command:

rm -rf ~/.local/share/Trash/*

Using tmux for Unstable Network Connections

To avoid disconnection due to unstable Wi-Fi or VPN connections, you may use the tmux command on gpugate1 or gpugate2, which can keep a terminal session running even when you are disconnected.

Note that tmux should be run on gpugate1 or gpugate2. Do not run tmux on a GPU node after running gpu-interactive or srun: all tmux sessions on a GPU node will still be terminated when your gpu-interactive/srun session ends.

There are many on-line tutorials on the web showing how to use this command, e.g.,
https://medium.com/actualize-network/a-minimalist-guide-to-tmux-13675fb160fa

Please see these pages for details, especially on the use of detach and attach functions.
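
A minimal example of the typical workflow (the session name 'gpu' below is arbitrary):

# on gpugate1 or gpugate2: start a named tmux session
tmux new -s gpu
# inside the tmux session, start your GPU session as usual
gpu-interactive
# detach from the tmux session with the key sequence Ctrl-b then d
# if your connection drops, log in to the same gateway node again and re-attach:
tmux attach -t gpu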

Cleaning up Files to Free up Disk Space

The following guidelines help to free up disk space when you are running out of disk quota:

  • Emptying Trash. When you run Jupyter Lab without the --FileContentsManager.delete_to_trash=False option, or use other GUIs to manipulate files, files you delete will be moved to the Trash Bin (~/.local/share/Trash) instead of being actually deleted. To empty the trash and release the disk space used, use the following command:
    rm -rf ~/.local/share/Trash/*

  • Remove installation files after software installation. For example, the Anaconda installation file Anaconda3-20*.*-Linux-x86_64.sh, which has a size of over 500 MB, can be deleted after installation. Also check whether you have downloaded the files multiple times and delete redundant copies.

  • Clean up conda installation packages. Run the following command to remove conda installation files cached in your home directory:
    conda clean --tarballs
  • Clean up cached pip installation packages:
    pip cache purge
  • Clean up intermediate files generated by the software packages you are using. Study the documentation of the individual packages for details.

 

Further Information

See the Advanced Use page for information on using multiple GPUs in a single session and running batch jobs, and the official site of the SLURM Workload Manager for further documentation on using SLURM.

You may also contact support@cs.hku.hk for other questions.
