9.7 Dongguan Data Center

9.7.1 User manual of the Slurm cluster

Cluster introduction

  • Applications and groups supported at present are listed in the following table.
| Group | Application | Computing Contact |
| ----- | ----------- | ----------------- |
| LQCD  | Lattice QCD | SUN Wei           |
  • Resources in the Slurm cluster are listed in the following table.
| Partition  | Num. of Nodes | Computing resources per node                | Memory per node | Role                                  |
| ---------- | ------------- | ------------------------------------------- | --------------- | ------------------------------------- |
| dgpublic   | 245           | 48 CPU cores                                 | 250GB           | computing nodes with x86 CPUs         |
| dggpu      | 20            | 48 CPU cores, 4 NVIDIA V100 PCI-e GPU cards  | 488GB           | computing nodes with GPU cards        |
| dgarm      | 98            | 96 CPU cores                                 | 250GB           | computing nodes with ARM CPUs         |
| dgvfarm    | 210           | 52 CPU cores                                 | 224GB           | static virtual machines with x86 CPUs |
| dgloginX86 | 2             | -                                            | -               | login nodes                           |
| dgloginArm | 2             | -                                            | -               | login nodes                           |
  • Available partition, account and QOS for each group are listed in the following table.
| Partition | Account | QOS    | Group |
| --------- | ------- | ------ | ----- |
| dgpublic  | lqcd    | normal | LQCD  |
| dggpu     | lqcd    | normal | LQCD  |
| dgarm     | lqcd    | normal | LQCD  |
| dgvfarm   | lqcd    | normal | LQCD  |
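
  • After logging in (see Step 2), the partitions listed above can be inspected with the standard Slurm command sinfo, for example:

# list all partitions and their node states
$ sinfo
# list a specific partition, e.g. dgpublic
$ sinfo -p dgpublic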

Step 1: Apply for your user name in the cluster & authentication

  • Visit the application web page: https://login.csns.ihep.ac.cn

    • Users from IHEP can log in with IHEP SSO (i.e. IHEP mail address & password)
    • Users not from IHEP should register first, and log in with the registered user name
  • Apply for a group after logging in with your user name

    • Profile -> Group Information -> Apply for Group
  • Apply for cluster authentication

    • Send an email to the group computing contact person & the cluster administrator, so that cluster authentication can be granted and a dedicated data directory can be created.

    • Dedicated directories are under /dg_hpc

      • The structure of /dg_hpc directories is /dg_hpc/<group_name>/<user_name>

      • E.g.: if there is a user zhangsan in the LQCD group, then his dedicated data directory would be /dg_hpc/LQCD/zhangsan

    • **Attention!!**
      • If cluster authentication is not granted, the error message Invalid account or account/partition combination specified will be reported.
      • Please keep the job running environment, including job scripts and input/output files, under /dg_hpc.
      • Home directories can be used for backup.
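
    • Once cluster authentication is granted, the dedicated directory can be verified from a login node (see Step 2); for example, for the user zhangsan above:

      # the directory should exist and be owned by your user name
      $ ls -ld /dg_hpc/LQCD/zhangsan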

Step 2: Log in to the cluster

  • The login nodes are: acn[099-100].csns.ihep.ac.cn, cn[249-250].csns.ihep.ac.cn

    • IHEP users can log in to lxslc7 first, and then ssh to one of the login nodes listed above

      $ ssh <user_name>@lxslc7.ihep.ac.cn
      
      # on lxslc7, ssh to one of the following login nodes
      $ ssh <user_name>@acn099.csns.ihep.ac.cn
      $ ssh <user_name>@acn100.csns.ihep.ac.cn
      $ ssh <user_name>@cn249.csns.ihep.ac.cn
      $ ssh <user_name>@cn250.csns.ihep.ac.cn
      
    • Non-IHEP users can log in through a VPN tunnel after applying for VPN access

      • VPN application web page : https://login.csns.ihep.ac.cn
        • Profile -> Group Information -> Apply for VPN group
      • Ask the group computing contact person to apply for VPN access

Step 3: Get job scripts ready

  • Slurm job script sample - CPU only
$ cat cpu_job_sample.sh
#! /bin/bash

#======= Part 1 : Job parameters ======
#SBATCH --partition=dgpublic
#SBATCH --account=lqcd
#SBATCH --qos=normal
#SBATCH --ntasks-per-node=32
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=4G
#SBATCH --job-name=lqcdcpu
#SBATCH --output=/dg_hpc/LQCD/zhangsan/job_output/lqcd/lqcd_cpu_job_%j.out

#======= Part 2 : Job workload ======
echo "Lqcd job ${SLURM_JOB_ID} on cpu worker node ${SLURM_JOB_NODELIST} job starting..."
date

set -e

# sofeware workload comes here

echo "Lqcd job ${SLURM_JOB_ID} ended."
date
  • Slurm job script sample - CPU & GPU
$ cat gpu_job_sample.sh
#! /bin/bash

#======= Part 1 : Job parameters ======
#SBATCH --partition=dggpu
#SBATCH --account=lqcd
#SBATCH --qos=normal
#SBATCH --gpus=v100:4
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=40G
#SBATCH --job-name=lqcdgpu
#SBATCH --output=/dg_hpc/LQCD/zhangsan/job_output/lqcd/lqcd_gpu_job_%j.out

#======= Part 2 : Job workload ======
echo "Lqcd job ${SLURM_JOB_ID} on gpu worker node ${SLURM_JOB_NODELIST} job starting..."
date

set -e

# sofeware workload comes here

echo "Lqcd job ${SLURM_JOB_ID} ended."
date
  • If jobs are going to run LQCD software, see Section 9.7.2, Instructions for LQCD software

Step 4: Submit jobs

# sbatch will return a job ID
# replace <job_script.sh> with the filename of your job script
$ sbatch <job_script.sh>
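
After submission, sbatch prints the ID of the new job; a typical (illustrative) output line:

Submitted batch job 123456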

Step 5: Query jobs

# 1. jobs running or pending
# replace <user_name> with your user name
$ squeue -u <user_name>
# replace <job_id> with your job id
$ squeue -j <job_id>

# 2. jobs completed or failed

# replace <user_name> with your user name
# by default, jobs started since the beginning of the current day
$ sacct -u <user_name>
# jobs started since YYYY-MM-DD, e.g.: 2021-05-03
$ sacct -u <user_name> --starttime=YYYY-MM-DD

# query a specific job with its job id
# replace <job_id> with a job id
$ sacct -j <job_id>

Step 6: Get job result files

  • There are two types of output files:
    • Type 1: the output file of the job script, i.e. the file specified by #SBATCH --output
    • Type 2: the output files of the workload software, whose paths are specified by the user.
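
  • Both types of files can be inspected directly from a login node; for example, with the sample CPU job script above (replace <job_id> with your job ID):

# view the file written by #SBATCH --output
$ cat /dg_hpc/LQCD/zhangsan/job_output/lqcd/lqcd_cpu_job_<job_id>.out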

Step 7: Cancel jobs

# cancel a job with its job ID
$ scancel <job_id>
# cancel all jobs submitted by a user
$ scancel -u <user_name>

Q&A


Q1 Why can't I submit a job? The error Invalid account or account/partition combination specified is reported

A1 Three possible reasons:

  1. Your user name is not in a valid group; please apply for the group.
  2. Your user name has not been granted permission to run jobs in the cluster; please get the permission granted (see Step 1).
  3. The values of #SBATCH --partition, #SBATCH --qos or #SBATCH --account are invalid; please set valid values.
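
The account/partition/QOS combinations granted to your user name can be checked with the standard Slurm accounting command, for example:

# replace <user_name> with your user name
$ sacctmgr show association user=<user_name> format=account,partition,qos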

Q2 Why have the owners of some files/directories under my home directory been changed to root?

A2 Home directories use GlusterFS as the backend storage system, and this is a known issue of GlusterFS; please contact the cluster administrator.


Q3 Why did my job fail as soon as it was submitted?

A3 Two possible reasons:

  1. The value of the #SBATCH --output option must be a writable file path, not a directory; please confirm that the file path is correct.
  2. The job's workload software failed; please confirm that the software runs correctly.
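
One common cause of reason 1 is that the directory part of the --output path does not exist; it can be created before submitting the job, e.g. for the sample path used above:

# create the output directory first
$ mkdir -p /dg_hpc/LQCD/zhangsan/job_output/lqcd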

Q4 Why did my job fail with an 'oom' error?

A4 'oom' means out of memory; please increase the value of the memory option in the job script, e.g. #SBATCH --mem-per-cpu.
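
The memory actually used by a completed job can be compared with the amount requested, for example:

# ReqMem is the requested memory, MaxRSS the peak memory used by each job step
$ sacct -j <job_id> --format=JobID,JobName,State,ReqMem,MaxRSS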


9.7.2 Instructions for LQCD software

Environment setting

  • Set up the module environment after logging in to the Dongguan cluster for the first time:
    mkdir -p ~/privatemodules
    ln -sf /dg_hpc/LQCD/modulefiles/lqcd ~/privatemodules/lqcd
    echo "module load use.own" >> ~/.bashrc
    source ~/.bashrc
    
  • Use module av to check the software currently available on the x86, ARM and GPU compute nodes
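
  • A typical session on an x86 node might look as follows (the module name is one of those used in the sample scripts below; the list printed by module av depends on the node type):

    # list the modules currently available on this node
    $ module av
    # load, e.g., the double-precision x86 Chroma build
    $ module load lqcd/x86/chroma/double-qphix-qdpxx-intel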

Chroma software usage

  • Gauge configurations generated by IHEP: /dg_hpc/LQCD/gongming/productions
  • Sample code for Chroma usage: /dg_hpc/LQCD/sunwei/examples
  • Dependencies of the Chroma package
    • x86: chroma + qphix + qdp++ + qmp
    • arm: chroma + qopqdp + qdp++ + qmp
    • gpu: chroma + quda + qdp-jit + qmp

NOTE: The Dongguan cluster uses an RoCE (RDMA over Converged Ethernet) network, so the network interface must be specified when running MPI programs. The following job scripts can be used with Open MPI (ARM, GPU) and Intel MPI (x86).


  • x86
#!/bin/bash
#SBATCH --partition=dgpublic
#SBATCH --account=lqcd
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --mem-per-cpu=2G
#SBATCH --job-name=lqcd-job
set -e


module load lqcd/x86/chroma/double-qphix-qdpxx-intel


MPFLAGS="-env FI_PROVIDER=verbs " # specify RoCE with intel MPI

ncore=48
export OMP_NUM_THREADS=$ncore
QPHIX_PARAM="-by 4 -bz 4 -pxy 1 -pxyz 0 -c ${ncore} -sy 1 -sz 1 -minct 2"

mpirun ${MPFLAGS} -np 1 chroma ${QPHIX_PARAM} -geom 1 1 1 1 -i input.xml -o output.xml &> log

  • arm
#!/bin/bash
#SBATCH --partition=dgarm
#SBATCH --account=lqcd
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=96
#SBATCH --mem-per-cpu=2G
#SBATCH --job-name=lqcd-job
set -e

module load lqcd/arm/chroma/double-qopqdp-qdpxx

MPFLAGS="--mca pml ucx -x UCX_NET_DEVICES=mlx5_bond_1:1 "

mpirun ${MPFLAGS} -np 96 chroma -geom 1 1 1 96 -i input.xml -o output.xml &> log # setup -geom according to the lattice size

  • gpu
#!/bin/bash
#SBATCH --partition=dggpu
#SBATCH --account=lqcd
#SBATCH --qos=normal
#SBATCH --job-name=lqcd-job
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=50G
#SBATCH --gres=gpu:v100:1
set -e

module load lqcd/gpu/chroma/double-cuda11-qdpjit 

export OMP_NUM_THREADS=1
export QUDA_RESOURCE_PATH=${PWD}

MPFLAGS="--mca pml ucx -x UCX_NET_DEVICES=mlx5_bond_1:1 "

mpirun ${MPFLAGS} -np 1 chroma -geom 1 1 1 1 -ptxdb ./ptxdb -i input.xml -o output.xml &> log

If one needs to use Chroma without MPI, then module load lqcd/gpu/chroma/double-cuda11-qdpjit-nompi or module load lqcd/gpu/chroma/single-cuda11-qdpjit-nompi can be used for double and single precision, respectively.
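
A minimal sketch of such a non-MPI, single-GPU run, assuming the same input/output file names as the MPI example above:

module load lqcd/gpu/chroma/double-cuda11-qdpjit-nompi

export OMP_NUM_THREADS=1
export QUDA_RESOURCE_PATH=${PWD}

# without MPI, chroma is started directly instead of through mpirun
chroma -ptxdb ./ptxdb -i input.xml -o output.xml &> log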
