3.2.2 Slurm Cluster User Manual

3.2.2.1 Usage of the Slurm CPU Cluster

Introduction to CPU Resources

  • Users of the Slurm CPU cluster are from the following groups:
Group Application Contact person
mbh Black hole Yanrong Li
bio Biology Lina Zhao
cac Chemistry Lina Zhao
nano Nanophysics Lina Zhao
heps Accelerator design for HEPS Yi Jiao / Zhe Duan
cepcmpi Accelerator design for CEPC Yuan Zhang / Yiwei Wang
alicpt Ali experiment Hong Li
bldesign Beamline studies for HEPS Haifeng Zhao
raq Quantum Chemistry, Molecular dynamics Jianhui Lan
  • Each group uses separate resources. The following table lists the computing resources, partitions and QOS (job queues) for each group.
Partition QOS Account / Group Worker nodes
mbh,mbh16 regular mbh 16 nodes, 256 CPU cores
cac regular cac 8 nodes, 384 CPU cores
nano regular nano 7 nodes, 336 CPU cores
biofastq regular bio 12 nodes, 288 CPU cores
heps regular,advanced heps 34 nodes, 1224 CPU cores
hepsdebug hepsdebug heps 1 node, 36 CPU cores
cepcmpi regular cepcmpi 36 nodes, 1696 CPU cores
ali regular alicpt 16 nodes, 576 CPU cores
bldesign blregular bldesign 3 nodes, 108 CPU cores
raq regular raq 12 nodes, 672 CPU cores
  • Resource limits for each QOS are shown in the following table (a command for checking these limits on the cluster is sketched after the table).
QOS Max Running Time for each job Priority Maximum number of submitted jobs
regular 60 days low 4000 jobs per user, 8000 jobs per group
advanced 60 days high -, -
hepsdebug 30 minutes medium 100 jobs per user, -
blregular 30 days low 200 jobs per user, 1000 jobs per group
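
A sketch for checking these limits directly on the cluster, assuming the sacctmgr client is available on the login nodes (the format fields are standard sacctmgr QOS fields):

# Show the configured limits of a QOS, e.g. the QOS "regular"
$ sacctmgr show qos regular format=Name,Priority,MaxWall,MaxSubmitJobsPerUser,MaxSubmitJobsPerAccount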

Step 0 AFS account application and cluster grant

  • Users who already have an AFS account and the cluster grant can skip this step.

  • New users should first apply for an AFS account.

  • For users who have an AFS account but have not yet been granted access to the cluster:

    • Send an email to the group administrator to request the grant.
      • If the account has not been granted, a job submission error like the following may be encountered:
    sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

    • After being granted by the group administrator and the Slurm cluster administrator, jobs can be submitted and run in the cluster (a quick way to check your grant is sketched below).
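
    A minimal sketch of such a check, assuming the sacctmgr client is available on the login nodes; it queries the Slurm accounting database for the accounts, partitions and QOS associated with your user.

    # List the Slurm accounts, partitions and QOS associated with your user
    # Replace <AFS_user_name> with your AFS account
    $ sacctmgr show association user=<AFS_user_name> format=Account,User,Partition,QOS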

Step1 Get job script ready

  1. Edit the script with your preferred editor, for example, vim.

    • The sample job script can be found under the following directory.
    /cvmfs/slurm.ihep.ac.cn/slurm_sample_script
    

    Note: The sample job script file is stored in the CVMFS filesystem. CVMFS usage is shown with the following commands.

    # ssh to the login nodes with your AFS account.
    # Replace <AFS_user_name> with your AFS account.
    > ssh <AFS_user_name>@lxslc7.ihep.ac.cn
    
    # Change to the sample script directory.
    > cd /cvmfs/slurm.ihep.ac.cn/slurm_sample_script
    
    # Check the sample script.
    > ls -lht
    -rw-rw-r-- 1 cvmfs cvmfs 2.2K Aug 2 10:30 slurm_sample_script_cpu.sh
    
    • Get your own job script according to the sample job script.
    # Content and comments of the sample job script.
    > cat slurm_sample_script_1.sh
    
    #! /bin/bash
    
    #=====================================================
    #===== Modify the following options for your job =====
    #=====    DON'T remove the #! /bin/bash lines    =====
    #=====      DON'T comment #SBATCH lines          =====
    #=====        of partition,account and           =====
    #=====                qos                        =====
    #=====================================================
    
    # Specify the partition name from which resources will be allocated  
    #SBATCH --partition=mbh
    
    # Specify which experiment group you belong to.
    # This is for the accounting, so if you belong to many experiments,
    # write the experiment which will pay for your resource consumption
    #SBATCH --account=mbh
    
    # Specify which qos(job queue) the job is submitted to.
    #SBATCH --qos=regular
    
    #=====================================================
    #===== Modify the following options for your job =====
    #=====   The following options are not mandatory =====
    #=====================================================
    
    # Specify your job name, e.g.: "test_job" is my job name
    #SBATCH --job-name=test_job
    
    # Specify how many cores you will need, e.g.: 16
    #SBATCH --ntasks=16
    
    # Specify how much memory per CPU is requested
    # The unit is MB, e.g.: the following line asks for 1 GB per CPU
    #SBATCH --mem-per-cpu=1GB
    
    # Specify the output file path of your job 
    # Attention!! It's an output file name, not a directory
    # Also you must have write access to this file
    # An output file name example is: job-%j.out, where %j will be replaced with the job id
    #SBATCH --output=/path/to/your/output/file
    
    #=================================
    #===== Run your job commands =====
    #=================================
    
    #=== You can define your variables following the normal bash style
    VAR1="value1"
    VAR2="value2"
    
    #=== Run your job commands in the following lines
    # srun is necessary; with srun, Slurm will allocate the resources you are asking for
    
    # You can run an executable script with srun
    # modify /path/to/your/script to a real file path name
    srun /path/to/your/script
    
    # Or if your program is written with MPI, you can run it with mpiexec 
    # First, run a simple command with srun
    srun -l hostname
    
    # later, you can run your MPI program with mpiexec
    # The output will be written under the path specified by the --output option
    # modify /path/to/your/mpi_program to your real program file path
    mpiexec /path/to/your/mpi_program
    

    Some explanations of the sample job script:

    1. Normally, a job script consists of two parts:

      • Job parameters: lines starting with #SBATCH that specify parameter values.

      • Job workload: normally executable files with options and arguments, e.g. executable scripts, MPI programs, etc.

    2. The job parameters partition, account and qos are mandatory; otherwise job submission will fail. A minimal header containing only these mandatory options is sketched below.
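
    A minimal sketch of such a header; the partition/account/qos values below are only an example (here for the mbh group) and must be replaced with your own group's values from the tables above.

    #! /bin/bash

    # Mandatory job parameters (example values, adapt to your group)
    #SBATCH --partition=mbh
    #SBATCH --account=mbh
    #SBATCH --qos=regular

    # Job workload: run your executable through srun
    srun /path/to/your/script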

Step2 Job submission

When your job script is ready, it's time to submit the job with the following command:

# log into lxslc7.ihep.ac.cn
$ ssh <AFS_user_name>@lxslc7.ihep.ac.cn

# Type the command sbatch to submit a job
$ sbatch slurm_sample_script_1.sh
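
Two optional sbatch features may be useful here; both are standard sbatch behaviour, and the option values below are only illustrative. --test-only validates the script and prints an estimated start time without actually queuing the job, and options given on the command line override the corresponding #SBATCH lines in the script.

# Validate the job script without submitting it
$ sbatch --test-only slurm_sample_script_1.sh

# Override #SBATCH options from the command line, e.g. ask for 32 cores this time
$ sbatch --partition=mbh --account=mbh --qos=regular --ntasks=32 slurm_sample_script_1.sh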

Step3 Job Query

  • To query a single job.

    Once a job is submitted, the sbatch command returns a job id. Users can query the job with this job id using the following command:

     # Use the command sacct to check the status of a single job
     # where <job_id> stands for the job id returned by the sbatch command
     $ sacct -j <job_id>
    
  • To query jobs submitted by a user.

    To query all jobs submitted by a user since 00:00 of the current day, one should type the following command:

     # <AFS_user_name> can be replaced with user name
     $ sacct -u <AFS_user_name>
    
  • To query jobs submitted after a specified date.

    To query all jobs submitted by a user after a specified day, one should type the following command:

    # <AFS_user_name> can be replaced with user name
    # --starttime specifies the query start time point with the format of 'YYYY-MM-DD'
    $ sacct -u <AFS_user_name> --starttime='YYYY-MM-DD'
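
Two further query variants may be helpful; both use standard sacct/squeue options, and the format fields below are standard sacct fields.

# Choose the columns displayed by sacct
$ sacct -u <AFS_user_name> --format=JobID,JobName,Partition,Account,State,Elapsed,ExitCode

# Show only the pending and running jobs of a user
$ squeue -u <AFS_user_name>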
    

Step4 Job result

Once the submitted job is done, one can get the output results.

  • If the output file is not specified, the default output file is saved in the working directory from which the job was submitted, and the default output file name is <job_id>.out, where <job_id> is the job id.
    • For example, if the job id is 1234, the output file name is 1234.out (an example of inspecting this file is sketched below).
  • If the output file is specified, the output can be found in the specified file.
  • If the job workload redirects its output, please check the redirected output files for the job results.
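
As a simple sketch, the output file can be inspected with standard shell tools; the job id 1234 below is just the example from above.

# View the whole output file of job 1234
$ cat 1234.out

# Or follow the output of a still-running job
$ tail -f 1234.out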

Step5 Job cancellation

To cancel a submitted job, one can type the following command.

# Use scancel command to cancel a job, where <job_id> is the job id returned by sbatch
$ scancel <job_id>
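
scancel also accepts filters; the following sketch shows two common variants (use them with care, since they can cancel many jobs at once).

# Cancel all jobs of a user
$ scancel -u <AFS_user_name>

# Cancel only the jobs of a user that are still pending (not yet running)
$ scancel -u <AFS_user_name> --state=PENDING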

Step6 Cluster status query

To check partition names of the Slurm cluster, or to query resource status of partitions, one can type the following command:

# Use sinfo command to query resource status
$ sinfo
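
sinfo can also be restricted to a single partition or expanded to a per-node listing; a short sketch (the partition name mbh is just an example):

# Show only one partition, e.g. mbh
$ sinfo -p mbh

# Show a long, per-node listing
$ sinfo -N -l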

3.2.2.2 Usage of the Slurm GPU Cluster

Introduction to GPU Resources

  • Authorized groups that can access the GPU cluster are listed in the following table.
Group Applications Contact person
lqcd Lattice QCD Ying Chen / Ming Gong
gpupwa Partial Wave Analysis Beijiang Liu / Liaoyuan Dong
junogpu Neutrino Analysis Wuming Luo
mlgpu Machine Learning apps of BESIII Yao Zhang
higgsgpu GPU acceleration for CEPC software Gang Li
bldesign Beamline applications for HEPS experiment Haifeng Zhao
ucasgpu Machine Learning for UCAS Xiaorui Lv
pqcd Perturbative QCD calculation Zhao Li
cmsgpu Machine Learning apps of CMS Huaqiao Zhang, Mingshui Chen
neuph Theory of Neutrino and Phenomenology Yufeng Li
atlasgpu Machine Learning apps of ATLAS Contact of ATLAS
lhaasogpu Machine Learning apps of LHAASO Contact of LHAASO
herdgpu Machine Learning apps of HERD Contact of HERD
qc Quantum Computing Contact of CC
  • The GPU cluster is divided into several resource partitions; each partition has different QOS (queues) and groups, as shown in the following table.
Partition QOS Group Resource limitation Nodes
lgpu long lqcd QOS long
- Run time of jobs <= 30 days
- Total number of submitted jobs (running + queued) <= 64
- Memory requested per CPU per job <= 40 GB
- 1 worker node, 384 GB memory per node
- 8 NVIDIA V100 NVLink GPU cards, 36 CPU cores in total
gpu normal, debug lqcd, gpupwa, junogpu, mlgpu, higgsgpu QOS normal
- Run time of jobs <= 48 hours
- Total number of submitted jobs (running + queued) per group <= 512, total GPU card number per group <= 128
- Total number of submitted jobs (running + queued) per user <= 96, total GPU card number per user <= 64
- Memory requested per CPU per job <= 40 GB
QOS debug
- Run time of jobs <= 15 minutes
- Total number of jobs (running + queued) per group <= 256, total GPU card number per group <= 64
- Total number of jobs (running + queued) per user <= 24, total GPU card number per user <= 16
- Memory requested per CPU per job <= 40 GB
- The priority of QOS debug is higher than the priority of QOS normal
- 23 worker nodes, 384 GB memory per node
- 182 NVIDIA V100 NVLink GPU cards, 840 CPU cores in total
ucasgpu ucasnormal ucasgpu QOS ucasnormal
- Run time of jobs <= 48 hours
- Total number of submitted jobs (running + queued) per group <= 200, total GPU card number per group <= 40
- Total number of submitted jobs (running + queued) per user <= 18, total GPU card number per user <= 6
- Memory requested per CPU per job <= 40 GB
- 1 worker node, 384 GB memory per node
- 8 NVIDIA V100 NVLink GPU cards, 36 CPU cores in total
pqcdgpu pqcdnormal pqcd QOS pqcdnormal
- Run time of jobs <= 72 hours
- Total number of submitted jobs (running + queued) per group <= 100, total GPU card number per group <= 100
- Total number of submitted jobs (running + queued) per user <= 20, total GPU card number per user <= 20
- Memory requested per CPU per job <= 32 GB
- 1 worker node, 192 GB memory per node
- 5 NVIDIA V100 PCI-e GPU cards, 20 CPU cores in total

Explanations about QOS debug :

  • debug is suitable for the following types of jobs:
    • to test codes under development
    • short run time
  • For example, test jobs from the mlgpu and higgsgpu groups are recommended to be submitted to the QOS debug.
  • For other groups, e.g. gpupwa, statistics show that 75% of its jobs finish within one hour; it is recommended to submit such short jobs to the QOS debug as well. An example job header for the QOS debug is sketched below.
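
A minimal sketch of the #SBATCH header for such a debug job; the account value mlgpu is only an example and must be replaced with your own group.

#! /bin/bash
# Submit a short test job to the QOS debug of the gpu partition
#SBATCH --partition=gpu
#SBATCH --qos=debug
#SBATCH --account=mlgpu
#SBATCH --gres=gpu:v100:1
#SBATCH --time=0:15:00

# Workload: replace with your own test commands
srun -l hostname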

Step1 Apply for your computing account

  • Users whose accounts are already granted can skip this step.

  • New users should first apply for an AFS computing account.

  • For users who have not yet been granted access to the cluster:

    • Send an email to the group administrator to request the cluster grant.

    • After being granted by the group administrator and the Slurm cluster administrator, jobs can be submitted and run in the cluster.

      • Ungranted users will encounter the following error:
      sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
      

Step2 Prepare your executable programs

  • Software of the lqcd group can be stored in the dedicated AFS directory /afs/ihep.ac.cn/soft/lqcd/; currently the storage quota of this directory is 100 GB (a command to check the current usage is sketched after this list).
  • Users from higgsgpu, junogpu, gpupwa, mlgpu and bldesign can install their software under /hpcfs; the directory paths for each group can be found in Step3.
  • Users from other groups can install their software under /scratchfs, or under a dedicated data directory of their experiment.
  • If there are any special software requirements, please contact the cluster admin.
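
To see how much of the AFS quota is already used, the standard OpenAFS fs command can be used; this is a sketch assuming the AFS client tools are installed on the login nodes.

# Show quota and current usage of the AFS volume holding the lqcd software directory
$ fs listquota /afs/ihep.ac.cn/soft/lqcd/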

Step3 Prepare your storage I/O directory

  • There is a dedicated I/O directory for GPU cluster users from the above mentioned groups.

    • Directory path for the group lqcd users: /hpcfs/lqcd/qcd/
    • Directory path for the group gpupwa users: /hpcfs/bes/gpupwa/
    • Directory path for the group junogpu: /hpcfs/juno/junogpu/
    • Directory path for the group mlgpu : /hpcfs/bes/mlgpu/
    • Directory path for the group higgsgpu: /hpcfs/cepc/higgs/
    • Directory path for the group bldesign : /hpcfs/heps/bldesign/
    • Directory path for the group ucasgpu: /hpcfs/cepc/ucas/
  • Input/output files should be stored under the user's private sub-directory. Taking the lqcd group as an example, if a user zhangsan has input/output files, they can be put under the directory /hpcfs/lqcd/qcd/zhangsan/ (creating such a sub-directory is sketched after this list).

  • Users who do not find their group's data directory listed above can use /scratchfs as their data directory.
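
A minimal sketch of creating such a private sub-directory; the lqcd path is taken from the list above, and $USER expands to your AFS user name on the login nodes.

# Create your private sub-directory under the group's I/O directory
$ mkdir -p /hpcfs/lqcd/qcd/$USER

# Optionally restrict access so that only you and your group can read it
$ chmod 750 /hpcfs/lqcd/qcd/$USER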

Step4 Prepare your job script

  • A job script is a bash script consisting of two parts:

    • Part 1: job parameters. Lines in this part start with #SBATCH and specify the resource partition, QOS (job queue), number of required resources (CPU/GPU/memory), job name, output file path, etc.
    • Part 2: job workload, for example executable scripts, programs, etc.

    Attention!!

    • Do not put any command between the line starting with #! and the lines starting with #SBATCH; otherwise the job parameters will be parsed incorrectly, the job may be allocated the wrong resources, and it may eventually fail.
    • Blank lines or comment lines are allowed between the #! line and the #SBATCH lines.
  • A job script sample is shown below.

#! /bin/bash

######## Part 1 #########
# Script parameters     #
#########################

# Specify the partition name from which resources will be allocated, mandatory option
#SBATCH --partition=gpu

# Specify the QOS, mandatory option
#SBATCH --qos=normal

# Specify which group you belong to, mandatory option
# This is for the accounting, so if you belong to many groups,
# write the experiment which will pay for your resource consumption
#SBATCH --account=lqcd

# Specify your job name; optional, but it is strongly recommended to set one
#SBATCH --job-name=gres_test

# Specify how many cores you will need, default is one if not specified
#SBATCH --ntasks=2

# Specify the output file path of your job
# Attention!! Your afs account must have write access to the path
# Or the job will be FAILED!
#SBATCH --output=/home/cc/duran/job_output/gpujob-%j.out

# Specify the memory per CPU to use, in MB; if not specified, Slurm will allocate all available memory
#SBATCH --mem-per-cpu=2048

# Specify how many GPU cards to use
#SBATCH --gres=gpu:v100:2

######## Part 2 ######
# Script workload    #
######################

# Replace the following lines with your real workload
# For example to list the allocated hosts and sleep 3 minutes
srun -l hostname  
sleep 180
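
To verify that the requested GPU cards are actually visible inside the job, a check like the following can be appended to the workload part of the script; it assumes nvidia-smi is installed on the GPU worker nodes.

# Optional checks inside the job workload:
# list the GPU cards allocated to this job
srun nvidia-smi -L

# Slurm exports the indices of the allocated cards in this variable
srun bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'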

More Information

  • Specifications of the --partition, --account, --qos options for each group
Group Job types --partition --account (normally the same as the group) --qos
lqcd long jobs lgpu lqcd long
lqcd,gpupwa,higgsgpu,mlgpu,junogpu normal jobs gpu lqcd,gpupwa,higgsgpu,mlgpu,junogpu normal
lqcd,gpupwa,higgsgpu,mlgpu,junogpu debug jobs gpu lqcd,gpupwa,higgsgpu,mlgpu,junogpu debug
bldesign normal jobs gpu bldesign blnormal
bldesign debug jobs gpu bldesign bldebug
ucasgpu normal jobs ucasgpu ucasgpu ucasnormal
pqcd normal jobs pqcdgpu pqcd pqcdnormal
  • The #SBATCH --mem-per-cpu option is used to specify the required memory size. If this parameter is not given, the default is 4 GB per CPU core; the maximum is 32 GB per CPU core. Please specify the memory size according to your actual requirements.

Explanation of the option #SBATCH --time

  • Jobs may spend less time in the queue if the --time option is specified.
  • This is especially true for jobs from the gpupwa group, whose job count is quite large.
  • To use the --time option, the following lines can be adapted and added to the job script:
# Tell Slurm how long the job will take to finish, e.g. 2 hours in the following line
#SBATCH --time=2:00:00

# For jobs that will run for more than 24 hours, use the following time format
# e.g. : this job will run for 1 day and 8 hours
#SBATCH --time=1-8:00:00
  • For less experienced users, run time statistics of historical jobs can be used as a reference:
Group Run time Probability
gpupwa <= 1 hour 90.43%
lqcd <= 32 hours 90.37%
junogpu <= 12 hours 91.24%
  • Jobs from the mlgpu and higgsgpu groups are small; it is recommended to use the QOS debug, and the --time option can be omitted for now.
  • If a job runs longer than the time specified with --time, the scheduling system will automatically terminate the overtime job.
  • Sample job scripts can be found under the following path:
/cvmfs/slurm.ihep.ac.cn/slurm_sample_script

Some comments

  • Sample job scripts are stored in the CVMFS filesystem; access it with the following commands:
# log into the lxslc7 nodes with your AFS account
$ ssh <AFS_user_name>@lxslc7.ihep.ac.cn

# Go to the directory where sample job scripts could be found
$ cd /cvmfs/slurm.ihep.ac.cn/slurm_sample_script

# List the sample job scripts
$ ls -lht
-rw-rw-r-- 1 cvmfs cvmfs 1.4K Aug 12 18:31 slurm_sample_script_gpu.sh

Step5 Submit your job

  • Log into the login nodes via ssh.
# Issue ssh command to log in.
# Replace <AFS_user_name> with your user name.
$ ssh <AFS_user_name>@lxslc7.ihep.ac.cn
  • The command to submit a job:
# command to submit a job
$ sbatch <job_script.sh>

# <job_script.sh> is the name of the script, e.g: v100_test.sh, then the command is:
$ sbatch v100_test.sh

# There will be a jobid returned as a message if the job is submitted successfully

Step6 Check job status

  • The command to show job status is shown below.
# command to check the job queue
$ squeue

# command to check the jobs submitted by user
$ sacct -u <AFS_user_name>
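
For more detail on a single job (requested resources, node list, reason for pending, etc.), scontrol can be used while the job is still queued or running; a short sketch:

# Show the full record of one job
# <jobid> is the id returned by sbatch
$ scontrol show job <jobid>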

Step7 Cancel your job

  • The command to cancel a job is listed below.
# command to cancel the job
# <jobid> can be found using the command sacct
$ scancel <jobid>
