3.2.2 Slurm Cluster User Manual

3.2.2.1 Usage of the Slurm CPU Cluster

Introduction to CPU Resources

  • Users of the Slurm CPU cluster are from the following groups:
Group Application Contact person
mbh Black hole Yanrong Li
bio Biology Lina Zhao
cac Chemistry Lina Zhao
nano Nanophysics Lina Zhao
heps Accelerator design for HEPS Yi Jiao / Zhe Duan
cepcmpi Accelerator design for CEPC Yuan Zhang / Yiwei Wang
alicpt Ali experiment Hong Li
bldesign Beamline studies for HEPS Haifeng Zhao
raq Quantum Chemistry, Molecular dynamics Jianhui Lan
  • Each group uses separate resources. The following table lists the computing resources, partitions and QOS (job queues) for each group.
Partition QOS Account / Group Worker nodes
mbh,mbh16 regular mbh 16 nodes, 256 CPU cores
cac regular cac 8 nodes, 384 CPU cores
nano regular nano 7 nodes, 336 CPU cores
biofastq regular bio 12 nodes, 288 CPU cores
heps regular,advanced heps 34 nodes, 1224 CPU cores
hepsdebug hepsdebug heps 1 node, 36 CPU cores
cepcmpi regular cepcmpi 36 nodes, 1696 CPU cores
ali regular alicpt 16 nodes, 576 CPU cores
bldesign blregular bldesign 3 nodes, 108 CPU cores
raq regular raq 12 nodes, 672 CPU cores
  • Resource limits for each QOS are shown in the following table (a command for checking these limits on the cluster is sketched after the table).
QOS Max Running Time for each job Priority Maximum number of submitted jobs
regular 60 days low 4000 jobs per user, 8000 jobs per group
advanced 60 days high -, -
hepsdebug 30 minutes medium 100 jobs per user, -
blregular 30 days low 200 jobs per user, 1000 jobs per group
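
A sketch for checking these limits directly on the cluster, assuming the sacctmgr client is available on the login nodes (the format fields are standard sacctmgr QOS fields):

# Show the configured limits of a QOS, e.g. the QOS "regular"
$ sacctmgr show qos regular format=Name,Priority,MaxWall,MaxSubmitJobsPerUser,MaxSubmitJobsPerAccount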

Step 0 AFS account application and cluster grant

  • Users who already have an AFS account and the cluster grant can skip this step.

  • New users should first apply for an AFS account.

  • For users who have an AFS account but have not yet been granted access to the cluster:

    • Send an email to the group administrator to request the grant.
      • If the account has not been granted, a job submission error like the following may be encountered:
    sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

    • After being granted by the group administrator and the Slurm cluster administrator, jobs can be submitted and run in the cluster (a quick way to check your grant is sketched below).
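
    A minimal sketch of such a check, assuming the sacctmgr client is available on the login nodes; it queries the Slurm accounting database for the accounts, partitions and QOS associated with your user.

    # List the Slurm accounts, partitions and QOS associated with your user
    # Replace <AFS_user_name> with your AFS account
    $ sacctmgr show association user=<AFS_user_name> format=Account,User,Partition,QOS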

Step1 Get job script ready

  1. Edit the script with your preferred editor, for example, vim.

    • The sample job script can be found under the following directory.
    /cvmfs/slurm.ihep.ac.cn/slurm_sample_script
    

    Note: The sample job script file is stored in the CVMFS filesystem. CVMFS usage is shown with the following commands.

    # ssh to the login nodes with your AFS account.
    # Replace <AFS_user_name> with your AFS account.
    > ssh <AFS_user_name>@lxslc7.ihep.ac.cn
    
    # Change to the sample script directory.
    > cd /cvmfs/slurm.ihep.ac.cn/slurm_sample_script
    
    # Check the sample script.
    > ls -lht
    -rw-rw-r-- 1 cvmfs cvmfs 2.2K Aug 2 10:30 slurm_sample_script_cpu.sh
    
    • Get your own job script according to the sample job script.
    # Content and comments of the sample job script.
    > cat slurm_sample_script_1.sh
    
    #! /bin/bash
    
    #=====================================================
    #===== Modify the following options for your job =====
    #=====    DON'T remove the #! /bin/bash lines    =====
    #=====      DON'T comment #SBATCH lines          =====
    #=====        of partition,account and           =====
    #=====                qos                        =====
    #=====================================================
    
    # Specify the partition name from which resources will be allocated  
    #SBATCH --partition=mbh
    
    # Specify which experiment group you belong to.
    # This is for the accounting, so if you belong to many experiments,
    # write the experiment which will pay for your resource consumption
    #SBATCH --account=mbh
    
    # Specify which qos(job queue) the job is submitted to.
    #SBATCH --qos=regular
    
    #=====================================================
    #===== Modify the following options for your job =====
    #=====   The following options are not mandatory =====
    #=====================================================
    
    # Specify your job name, e.g.: "test_job" is my job name
    #SBATCH --job-name=test_job
    
    # Specify how many cores you will need, e.g.: 16
    #SBATCH --ntasks=16
    
    # Specify how much memory per CPU is requested
    # The unit is MB, e.g.: the following line asks for 1 GB per CPU
    #SBATCH --mem-per-cpu=1GB
    
    # Specify the output file path of your job 
    # Attention!! It's an output file name, not a directory
    # Also you must have write access to this file
    # An output file name example is: job-%j.out, where %j will be replaced with the job id
    #SBATCH --output=/path/to/your/output/file
    
    #=================================
    #===== Run your job commands =====
    #=================================
    
    #=== You can define your variables following the normal bash style
    VAR1="value1"
    VAR2="value2"
    
    #=== Run your job commands in the following lines
    # srun is necessary; with srun, Slurm will allocate the resources you are asking for
    
    # You can run an executable script with srun
    # modify /path/to/your/script to a real file path name
    srun /path/to/your/script
    
    # Or if your program is written with MPI, you can run it with mpiexec 
    # First, run a simple command with srun
    srun -l hostname
    
    # later, you can run your MPI program with mpiexec
    # The output will be written under the path specified by the --output option
    # modify /path/to/your/mpi_program to your real program file path
    mpiexec /path/to/your/mpi_program
    

    Some explanations of the sample job script:

    1. Normally, a job script consists of two parts:

      • Job parameters: lines starting with #SBATCH that specify parameter values.

      • Job workload: normally executable files with options and arguments, e.g. executable scripts, MPI programs, etc.

    2. The job parameters partition, account and qos are mandatory; otherwise job submission will fail. A minimal header containing only these mandatory options is sketched below.
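
    A minimal sketch of such a header; the partition/account/qos values below are only an example (here for the mbh group) and must be replaced with your own group's values from the tables above.

    #! /bin/bash

    # Mandatory job parameters (example values, adapt to your group)
    #SBATCH --partition=mbh
    #SBATCH --account=mbh
    #SBATCH --qos=regular

    # Job workload: run your executable through srun
    srun /path/to/your/script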

Step2 Job submission

When your job script is ready, it's time to submit the job with the following command:

# log into lxslc7.ihep.ac.cn
$ ssh <AFS_user_name>@lxslc7.ihep.ac.cn

# Type the command sbatch to submit a job
$ sbatch slurm_sample_script_1.sh
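
Two optional sbatch features may be useful here; both are standard sbatch behaviour, and the option values below are only illustrative. --test-only validates the script and prints an estimated start time without actually queuing the job, and options given on the command line override the corresponding #SBATCH lines in the script.

# Validate the job script without submitting it
$ sbatch --test-only slurm_sample_script_1.sh

# Override #SBATCH options from the command line, e.g. ask for 32 cores this time
$ sbatch --partition=mbh --account=mbh --qos=regular --ntasks=32 slurm_sample_script_1.sh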

Step3 Job Query

  • To query a single job.

    Once a job is submitted, the sbatch command returns a job id. Users can query the job with this job id using the following command:

     # Use the command sacct to check the status of a single job
     # where <job_id> stands for the job id returned by the sbatch command
     $ sacct -j <job_id>
    
  • To query jobs submitted by a user.

    To query all jobs submitted by a user since 00:00 of the current day, one should type the following command:

     # <AFS_user_name> can be replaced with user name
     $ sacct -u <AFS_user_name>
    
  • To query jobs submitted after a specified date.

    To query all jobs submitted by a user after a specified day, one should type the following command:

    # <AFS_user_name> can be replaced with user name
    # --starttime specifies the query start time point with the format of 'YYYY-MM-DD'
    $ sacct -u <AFS_user_name> --starttime='YYYY-MM-DD'
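
Two further query variants may be helpful; both use standard sacct/squeue options, and the format fields below are standard sacct fields.

# Choose the columns displayed by sacct
$ sacct -u <AFS_user_name> --format=JobID,JobName,Partition,Account,State,Elapsed,ExitCode

# Show only the pending and running jobs of a user
$ squeue -u <AFS_user_name>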
    

Step4 Job result

Once the submitted job is done, one can get the output results.

  • If the output file is not specified, the default output file is saved in the working directory from which the job was submitted, and the default output file name is <job_id>.out, where <job_id> is the job id.
    • For example, if the job id is 1234, the output file name is 1234.out (an example of inspecting this file is sketched below).
  • If the output file is specified, the output can be found in the specified file.
  • If the job workload redirects its output, please check the redirected output files for the job results.
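
As a simple sketch, the output file can be inspected with standard shell tools; the job id 1234 below is just the example from above.

# View the whole output file of job 1234
$ cat 1234.out

# Or follow the output of a still-running job
$ tail -f 1234.out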

Step5 Job cancellation

To cancel a submitted job, one can type the following command.

# Use scancel command to cancel a job, where <job_id> is the job id returned by sbatch
$ scancel <job_id>
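
scancel also accepts filters; the following sketch shows two common variants (use them with care, since they can cancel many jobs at once).

# Cancel all jobs of a user
$ scancel -u <AFS_user_name>

# Cancel only the jobs of a user that are still pending (not yet running)
$ scancel -u <AFS_user_name> --state=PENDING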

Step6 Cluster status query

To check partition names of the Slurm cluster, or to query resource status of partitions, one can type the following command:

# Use sinfo command to query resource status
$ sinfo
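
sinfo can also be restricted to a single partition or expanded to a per-node listing; a short sketch (the partition name mbh is just an example):

# Show only one partition, e.g. mbh
$ sinfo -p mbh

# Show a long, per-node listing
$ sinfo -N -l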

3.2.2.2 Usage of the Slurm GPU Cluster

Introduction to GPU Resources

  • Authorized groups that can access the GPU cluster are listed in the following table.
Group Applications Contact person
lqcd Lattice QCD Ying Chen / Ming Gong
gpupwa Partial Wave Analysis Beijiang Liu / Liaoyuan Dong
junogpu Neutrino Analysis Wuming Luo
mlgpu Machine Learning apps of BESIII Yao Zhang
higgsgpu GPU acceleration for CEPC software Gang Li
bldesign Beamline applications for HEPS experiment Haifeng Zhao
ucasgpu Machine Learning for UCAS Xiaorui Lv
pqcd Perturbative QCD calculation Zhao Li
cmsgpu Machine Learning apps of CMS Huaqiao Zhang, Mingshui Chen
neuph Theory of Neutrino and Phenomenology Yufeng Li
atlasgpu Machine Learning apps of ATLAS Contact of ATLAS
lhaasogpu Machine Learning apps of LHAASO Contact of LHAASO
herdgpu Machine Learning apps of HERD Contact of HERD
qc Quantum Computing Contact of CC
  • The GPU cluster is divided into several resource partitions; each partition has different QOS (queues) and groups, as shown in the following table.
Partition QOS Group Resource limitation Nodes
lgpu long lqcd QOS long
- Run time of jobs <= 30 days
- Total number of submitted jobs (running + queued) <= 64
- Memory requested per CPU per job <= 40 GB
- 1 worker node, 384 GB memory per node
- 8 NVIDIA V100 NVLink GPU cards, 36 CPU cores in total
gpu normal, debug lqcd, gpupwa, junogpu, mlgpu, higgsgpu QOS normal
- Run time of jobs <= 48 hours
- Total number of submitted jobs (running + queued) per group <= 512, total GPU card number per group <= 128
- Total number of submitted jobs (running + queued) per user <= 96, total GPU card number per user <= 64
- Memory requested per CPU per job <= 40 GB
QOS debug
- Run time of jobs <= 15 minutes
- Total number of jobs (running + queued) per group <= 256, total GPU card number per group <= 64
- Total number of jobs (running + queued) per user <= 24, total GPU card number per user <= 16
- Memory requested per CPU per job <= 40 GB
- The priority of QOS debug is higher than the priority of QOS normal
- 23 worker nodes, 384 GB memory per node
- 182 NVIDIA V100 NVLink GPU cards, 840 CPU cores in total
ucasgpu ucasnormal ucasgpu QOS ucasnormal
- Run time of jobs <= 48 hours
- Total number of submitted jobs (running + queued) per group <= 200, total GPU card number per group <= 40
- Total number of submitted jobs (running + queued) per user <= 18, total GPU card number per user <= 6
- Memory requested per CPU per job <= 40 GB
- 1 worker node, 384 GB memory per node
- 8 NVIDIA V100 NVLink GPU cards, 36 CPU cores in total
pqcdgpu pqcdnormal pqcd QOS pqcdnormal
- Run time of jobs <= 72 hours
- Total number of submitted jobs (running + queued) per group <= 100, total GPU card number per group <= 100
- Total number of submitted jobs (running + queued) per user <= 20, total GPU card number per user <= 20
- Memory requested per CPU per job <= 32 GB
- 1 worker node, 192 GB memory per node
- 5 NVIDIA V100 PCI-e GPU cards, 20 CPU cores in total

Explanations about QOS debug :

  • debug is suitable for the following types of jobs:
    • to test codes under development
    • short run time
  • For example, test jobs from the mlgpu and higgsgpu groups are recommended to be submitted to the QOS debug.
  • For other groups, e.g. gpupwa, statistics show that 75% of its jobs finish within one hour; it is recommended to submit such short jobs to the QOS debug as well. An example job header for the QOS debug is sketched below.
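
A minimal sketch of the #SBATCH header for such a debug job; the account value mlgpu is only an example and must be replaced with your own group.

#! /bin/bash
# Submit a short test job to the QOS debug of the gpu partition
#SBATCH --partition=gpu
#SBATCH --qos=debug
#SBATCH --account=mlgpu
#SBATCH --gres=gpu:v100:1
#SBATCH --time=0:15:00

# Workload: replace with your own test commands
srun -l hostname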

Step1 Apply for your computing account

  • Users whose accounts are already granted can skip this step.

  • New users should first apply for an AFS computing account.

  • For users who have not yet been granted access to the cluster:

    • Send an email to the group administrator to request the cluster grant.

    • After being granted by the group administrator and the Slurm cluster administrator, jobs can be submitted and run in the cluster.

      • Ungranted users will encounter the following error:
      sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
      

Step2 Prepare your executable programs

  • Software of the lqcd group can be stored in the dedicated AFS directory /afs/ihep.ac.cn/soft/lqcd/; currently the storage quota of this directory is 100 GB (a command to check the current usage is sketched after this list).
  • Users from higgsgpu, junogpu, gpupwa, mlgpu and bldesign can install their software under /hpcfs; the directory paths for each group can be found in Step3.
  • Users from other groups can install their software under /scratchfs, or under a dedicated data directory of their experiment.
  • If there are any special software requirements, please contact the cluster admin.
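
To see how much of the AFS quota is already used, the standard OpenAFS fs command can be used; this is a sketch assuming the AFS client tools are installed on the login nodes.

# Show quota and current usage of the AFS volume holding the lqcd software directory
$ fs listquota /afs/ihep.ac.cn/soft/lqcd/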

Step3 Prepare your storage I/O directory

  • There is a dedicated I/O directory for GPU cluster users from the above mentioned groups.

    • Directory path for the group lqcd users: /hpcfs/lqcd/qcd/
    • Directory path for the group gpupwa users: /hpcfs/bes/gpupwa/
    • Directory path for the group junogpu: /hpcfs/juno/junogpu/
    • Directory path for the group mlgpu : /hpcfs/bes/mlgpu/
    • Directory path for the group higgsgpu: /hpcfs/cepc/higgs/
    • Directory path for the group bldesign : /hpcfs/heps/bldesign/
    • Directory path for the group ucasgpu: /hpcfs/cepc/ucas/
  • Input/output files should be stored under the user's private sub-directory. Taking the lqcd group as an example, if a user zhangsan has input/output files, they can be put under the directory /hpcfs/lqcd/qcd/zhangsan/ (creating such a sub-directory is sketched after this list).

  • Users who do not find their group's data directory listed above can use /scratchfs as their data directory.
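
A minimal sketch of creating such a private sub-directory; the lqcd path is taken from the list above, and $USER expands to your AFS user name on the login nodes.

# Create your private sub-directory under the group's I/O directory
$ mkdir -p /hpcfs/lqcd/qcd/$USER

# Optionally restrict access so that only you and your group can read it
$ chmod 750 /hpcfs/lqcd/qcd/$USER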

Step4 Prepare your job script

  • A job script is a bash script consisting of two parts:

    • Part 1: job parameters. Lines in this part start with #SBATCH and specify the resource partition, QOS (job queue), number of required resources (CPU/GPU/memory), job name, output file path, etc.
    • Part 2: job workload, for example executable scripts, programs, etc.

    Attention!!

    • Do not put any command between the line starting with #! and the lines starting with #SBATCH; otherwise the job parameters will be parsed incorrectly, the job may be allocated the wrong resources, and it may eventually fail.
    • Blank lines or comment lines are allowed between the #! line and the #SBATCH lines.
  • A job script sample is shown below.

#! /bin/bash

######## Part 1 #########
# Script parameters     #
#########################

# Specify the partition name from which resources will be allocated, mandatory option
#SBATCH --partition=gpu

# Specify the QOS, mandatory option
#SBATCH --qos=normal

# Specify which group you belong to, mandatory option
# This is for the accounting, so if you belong to many groups,
# write the experiment which will pay for your resource consumption
#SBATCH --account=lqcd

# Specify your job name; optional, but it is strongly recommended to set one
#SBATCH --job-name=gres_test

# Specify how many cores you will need, default is one if not specified
#SBATCH --ntasks=2

# Specify the output file path of your job
# Attention!! Your afs account must have write access to the path
# Or the job will be FAILED!
#SBATCH --output=/home/cc/duran/job_output/gpujob-%j.out

# Specify the memory per CPU to use, in MB; if not specified, Slurm will allocate all available memory
#SBATCH --mem-per-cpu=2048

# Specify how many GPU cards to use
#SBATCH --gres=gpu:v100:2

######## Part 2 ######
# Script workload    #
######################

# Replace the following lines with your real workload
# For example to list the allocated hosts and sleep 3 minutes
srun -l hostname  
sleep 180
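
To verify that the requested GPU cards are actually visible inside the job, a check like the following can be appended to the workload part of the script; it assumes nvidia-smi is installed on the GPU worker nodes.

# Optional checks inside the job workload:
# list the GPU cards allocated to this job
srun nvidia-smi -L

# Slurm exports the indices of the allocated cards in this variable
srun bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'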

More Information

  • Specifications of the --partition, --account, --qos options for each group
Group Job types --partition --account (normally the same as the group) --qos
lqcd long jobs lgpu lqcd long
lqcd,gpupwa,higgsgpu,mlgpu,junogpu normal jobs gpu lqcd,gpupwa,higgsgpu,mlgpu,junogpu normal
lqcd,gpupwa,higgsgpu,mlgpu,junogpu debug jobs gpu lqcd,gpupwa,higgsgpu,mlgpu,junogpu debug
bldesign normal jobs gpu bldesign blnormal
bldesign debug jobs gpu bldesign bldebug
ucasgpu normal jobs ucasgpu ucasgpu ucasnormal
pqcd normal jobs pqcdgpu pqcd pqcdnormal
  • The #SBATCH --mem-per-cpu option is used to specify the required memory size. If this parameter is not given, the default is 4 GB per CPU core; the maximum is 32 GB per CPU core. Please specify the memory size according to your actual requirements.

Explanation of the option #SBATCH --time

  • Jobs may spend less time in the queue if the --time option is specified.
  • This is especially true for jobs from the gpupwa group, whose job count is quite large.
  • To use the --time option, the following lines can be adapted and added to the job script:
# Tell Slurm how long the job will take to finish, e.g. 2 hours in the following line
#SBATCH --time=2:00:00

# For jobs that will run for more than 24 hours, use the following time format
# e.g. : this job will run for 1 day and 8 hours
#SBATCH --time=1-8:00:00
  • For less experienced users, run time statistics of historical jobs can be used as a reference:
Group Run time Probability
gpupwa <= 1 hour 90.43%
lqcd <= 32 hours 90.37%
junogpu <= 12 hours 91.24%
  • Jobs from the mlgpu and higgsgpu groups are small; it is recommended to use the QOS debug, and the --time option can be omitted for now.
  • If a job runs longer than the time specified with --time, the scheduling system will automatically terminate the overtime job.
  • Sample job scripts can be found under the following path:
/cvmfs/slurm.ihep.ac.cn/slurm_sample_script

Some comments

  • Sample job scripts are stored in the CVMFS filesystem; access it with the following commands:
# log into the lxslc7 nodes with your AFS account
$ ssh <AFS_user_name>@lxslc7.ihep.ac.cn

# Go to the directory where sample job scripts could be found
$ cd /cvmfs/slurm.ihep.ac.cn/slurm_sample_script

# List the sample job scripts
$ ls -lht
-rw-rw-r-- 1 cvmfs cvmfs 1.4K Aug 12 18:31 slurm_sample_script_gpu.sh

Step5 Submit your job

  • Log into the login nodes via ssh.
# Issue ssh command to log in.
# Replace <AFS_user_name> with your user name.
$ ssh <AFS_user_name>@lxslc7.ihep.ac.cn
  • The command to submit a job:
# command to submit a job
$ sbatch <job_script.sh>

# <job_script.sh> is the name of the script, e.g: v100_test.sh, then the command is:
$ sbatch v100_test.sh

# There will be a jobid returned as a message if the job is submitted successfully

Step6 Check job status

  • The command to show job status is shown below.
# command to check the job queue
$ squeue

# command to check the jobs submitted by user
$ sacct -u <AFS_user_name>
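
For more detail on a single job (requested resources, node list, reason for pending, etc.), scontrol can be used while the job is still queued or running; a short sketch:

# Show the full record of one job
# <jobid> is the id returned by sbatch
$ scontrol show job <jobid>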

Step7 Cancel your job

  • The command to cancel a job is listed below.
# command to cancel the job
# <jobid> can be found using the command sacct
$ scancel <jobid>
