7.2 LHAASO
Overview
The LHAASO experiment currently operates two computing clusters: the Beijing local computing cluster and the Daocheng Haizishan computing cluster. The Beijing local computing cluster is the central site for LHAASO experimental data processing, responsible for the storage, processing, analysis, and long-term preservation of LHAASO experimental data; the Daocheng Haizishan computing cluster is mainly used for rapid preprocessing and compression of online data.
Since the Daocheng computing cluster is mainly used by special users, this manual mainly covers the usage of the Beijing local computing cluster.
7.2.1 Computing Services
Apply for a computing cluster account
Please refer to Chapter 2. After obtaining the consent of the person in charge of LHAASO experimental computing, users can obtain an afs account.
Job submission and management
Please refer to section [3.2.1](../../local-cluster/jobs/HTCondor/README.md) for instructions on job submission. In addition, the computing platform also allows LHAASO users to run Hadoop jobs; please refer to section 3.2.3 for instructions on Hadoop job submission.
7.2.1.1 Beijing computing cluster job submission
- Job submission
$ hep_sub -g lhaaso jobfile
- Job query
$ hep_q -g lhaaso -u userid
- Job deletion
$ hep_rm -g lhaaso jobid
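For reference, a minimal jobfile might look like the following sketch (the paths and executable name are hypothetical); the script typically needs to be executable (chmod +x jobfile) before submission:
#!/bin/bash
# Minimal job script: move to the work directory and run the analysis program
cd /lhaasofs/user/username/work
./run.exe input.dat output.root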
Note: -g lhaaso is not a required option. If the primary group of your cluster account is lhaaso, the option can be omitted. To check a cluster account's primary group:
$id username
For example, to view the information of the km2amc account:
$ id km2amc
uid=13014(km2amc) gid=580(lhaaso) groups=580(lhaaso),340(lhaasorun)
Here, gid shows the primary group of the km2amc account, which in this case is lhaaso, so the -g lhaaso option is not needed when submitting jobs.
If you want to change the primary group of your cluster account to lhaaso, please send an email application to ihep_computing_service@ihep.ac.cn and copy Mr. Chaoyong Wu (wucy@ihep.ac.cn).
7.2.1.1.1 Advanced usage of job submission
For jobs that need to be submitted in batches, the following methods can greatly improve submission efficiency.
Example: when the raw data directory contains many sub-files, generating one job script per file and then submitting each script produces a huge number of scripts and can overload the cluster.
Solution: use parameterized batch submission to reduce the number of scripts to just a few.
Two methods are provided here: non-Cluster submission and Cluster submission. The Cluster method is now supported by hep_sub and can greatly improve job submission speed.
For example, suppose the raw directory contains many dat files, each of which needs to be submitted as a separate job:
Raw_1.dat Raw_2.dat Raw_3.dat ...
Method 1) Submit in a non-Cluster way
1) First, write the run script (test.sh) that takes an external input variable:
$ cat test.sh
#!/bin/bash
rawfile=/path/to/rawdata      # set to your raw data directory
outfile=/path/to/output       # set to your output directory
./run.exe ${rawfile}/${1} ${outfile}/${1}.root
2) Then submit the jobs with parameters (subjob.sh):
$ cat subjob.sh
#!/bin/bash
run=1
run1=3
while [ $run -le $run1 ]
do
    hep_sub test.sh -g lhaaso -argu "Raw_${run}.dat"
    run=`expr $run + 1`
done
Method 2) Submit using Cluster
1) First, write the run script (test.sh) that takes an external input variable:
$ cat test.sh
#!/bin/bash
procid=$1                     # ProcId passed in by HTCondor, starting from 0
raw_id=$(expr $1 + 1)         # map ProcId 0,1,2,... to file index 1,2,3,...
input="Raw_${raw_id}.dat"
rawfile=/path/to/rawdata      # set to your raw data directory
outfile=/path/to/output       # set to your output directory
./run.exe ${rawfile}/${input} ${outfile}/${1}.root
2) Submit the job in Cluster mode (subjob.sh):
$hep_sub test.sh -g lhaaso -argu "%{ProcId}" -n 3
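With -n 3, hep_sub submits one cluster of three jobs, and HTCondor substitutes each job's ProcId (0, 1, 2) for %{ProcId}. test.sh receives this value as $1 and maps it to the file index 1, 2, 3, so the three jobs process Raw_1.dat, Raw_2.dat, and Raw_3.dat respectively.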
7.2.2 Storage Service
The computing cluster provides different storage systems for different experimental data usage scenarios. According to the data access scenario, storage is currently divided into four categories: home directory, software directory, experimental data storage directory, and user script storage directory.
Data Storage Scenario | Purpose | Directory Name | Space Quota / File Count Quota | Access Method | Command to View Usage |
---|---|---|---|---|---|
Home directory | afs account configuration files, scripts | /afs/ihep.ac.cn/users/a-z/username | 500 MB / none | Direct access | fs listquota /afs/ihep.ac.cn/users/a-z/username |
Software directory | Public software | /cvmfs/lhaaso.ihep.ac.cn/anysw | - | Direct access | Only experiment contacts have permission to publish software |
Experimental data storage | Personal experimental data | /eos/user/a-z/username | 1 TB / 250,000 | No direct access; access via xrootd | eos quota |
Experimental data storage | Experiment raw data, reconstruction data, part of the analysis data, simulation data, and calibration data | /eos/lhaaso | - | No direct access; access via xrootd | |
User script storage | Personal job scripts, configuration files, job .o and .e output, etc. | /lhaasofs/user/username | 200 GB / none | Direct access | lfs quota -u username /lhaasofs |
User script storage | Experimental data that cannot be accessed via xrootd, such as some simulation job data | /lhaasofs/data | - | Direct access | |
When a cluster account is created, users normally only have /afs/ihep.ac.cn/users/a-z/username and /eos/user/a-z/username. To apply for a /lhaasofs/user/username directory, please contact Songzhan Chen (chensz@ihep.ac.cn) and (zham@ihep.ac.cn). After the application is approved, the directory will be created.
7.2.2.1 /eos/lhaaso experimental data directory description
Directory Name | Space Quota / File Count Quota | Command to View Usage | Storage Purpose | Person in Charge |
---|---|---|---|---|
/eos/lhaaso/raw | none/none | eos quota /eos/lhaaso/raw | Raw data | Computing Center |
/eos/lhaaso/spt | none/none | eos quota /eos/lhaaso/spt | Single particle data of each experiment | Computing Center |
/eos/lhaaso/decode | none/none | eos quota /eos/lhaaso/decode | Decoded data | LHAASO group |
/eos/lhaaso/rec | none/none | eos quota /eos/lhaaso/rec | Reconstructed data | LHAASO group |
/eos/lhaaso/cal | none/none | eos quota /eos/lhaaso/cal | Calibration parameter data | LHAASO group |
/eos/lhaaso/monitor | none/none | eos quota /eos/lhaaso/monitor | Data quality monitoring data and moon shadow and Crab monitoring data | LHAASO group |
Under each of the above directories, subdirectories are created for the detector arrays (such as km2a, wcda, wcdapls, and wfcta), each managed by the corresponding experiment personnel. Data are stored in a year/month/day/classification directory hierarchy. Under each experiment directory, the person in charge creates workshell and software directories to hold, respectively, the scripts and original programs that produce the corresponding data, together with concise usage instructions, so that the processing can be repeated and reused later. The person in charge reports status and problems regularly.
7.2.2.2 Tape data
The Castor/stager server currently used by LHAASO is nslhaaso.ihep.ac.cn, and the tape library server is tplhaaso01.ihep.ac.cn. The tape backup directory is /castor/ihep.ac.cn/lhaaso/raw, currently organized by year/month, i.e. yyyy/mm.
7.2.2.3 EOS instructions
The EOS file system is a distributed file system for EB-scale data storage, based on the xrootd framework. For details, please refer to section 3.3.5 EOS file storage.
1) Accessing EOS files
EOS is accessed via the xrootd protocol, with URLs of the form:
root://EOS_MGM_URL//filepath
Notes:
- xrootd uses port 1094 by default
- "//" is required after the server address
- filepath must be an absolute path
Here EOS_MGM_URL is the server address of the EOS instance: the Beijing EOS cluster address is root://eos01.ihep.ac.cn, and the Daocheng EOS cluster address is root://lhmteos01.lhaaso.ihep.ac.cn. The environment variable is already configured on the login and computing nodes and can be checked with the following command:
Beijing Cluster:
$ echo $EOS_MGM_URL
root://eos01.ihep.ac.cn
If it does not exist, please set it yourself:
$export EOS_MGM_URL=root://eos01.ihep.ac.cn
Daocheng cluster:
$ echo $EOS_MGM_URL
root://lhmteos01.lhaaso.ihep.ac.cn
If it does not exist, please set it yourself:
$export EOS_MGM_URL=root://lhmteos01.lhaaso.ihep.ac.cn
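With the variable set, standard xrootd clients can use it directly. For example, copying a file to local disk with xrdcp (the file path here is hypothetical):
$ xrdcp ${EOS_MGM_URL}//eos/user/u/username/myfile.root /tmp/myfile.root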
Note: files under /eos cannot be operated on directly with ordinary Linux commands; use the eos command instead.
eos command and system command comparison table
EOS command (recommended) | Linux command (do not use) | Description |
---|---|---|
eos ls | ls | View file list |
eos cp | cp | copy files |
eos mv, eos file rename | mv | Move or rename a file |
eos cp /eos/user/myfile - \| cat | cat | View file content |
eos cp /eos/user/myfile - \| tail | tail | View the end of a file |
eos mkdir | mkdir | Create folder |
eos touch | touch | Create file |
eos newfind -f /eos/mypath | None | Find a list of all files in a directory (including subdirectories) |
eos newfind -d /eos/mypath | None | Find a list of all directories (including subdirectories) under a certain directory |
eos ln | ln | Create soft link |
eos quota | None | View personal space usage (/eos/user/a-z/username) |
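A short example session with these commands (the username and paths are hypothetical):
$ eos mkdir /eos/user/u/username/test
$ eos cp /lhaasofs/user/username/myfile.root /eos/user/u/username/test/
$ eos ls /eos/user/u/username/test
$ eos quota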
2) Accessing EOS in jobs
Using xrootd to read and write files in a job requires that the physics software support the xrootd protocol. ROOT currently supports xrootd, but note that there are three ways to create a TFile object:
- Constructor: TFile(PATHNAME) does not support xrootd
- new: new TFile(PATHNAME) does not support xrootd
- Open method: TFile::Open(PATHNAME) supports xrootd
For example, a file in ROOT format can be opened directly with TFile::Open:
TFile *filein = TFile::Open("root://eos01.ihep.ac.cn//eos_absolute_path_filein_name.root");
Files in non-ROOT format can also be read and written directly with ROOT's TFile class; just append the parameter "?filetype=raw" to the file name. For example:
void rawfile(){
    int size;
    char buf[1024];
    // Open a non-ROOT file on EOS via xrootd; "?filetype=raw" marks it as raw data
    TFile *rf = TFile::Open("root://eos01.ihep.ac.cn//eos/user/c/chyd/set.log?filetype=raw");
    size = rf->GetSize();          // file size in bytes
    printf("size is %d\n", size);
    memset(buf, 0, 1024);
    rf->ReadBuffer(buf, 1024);     // read the first 1024 bytes into buf
    printf("%s\n", buf);
    rf->Close();
}
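Assuming the macro above is saved as rawfile.C, it can be run non-interactively from a login node:
$ root -l -b -q rawfile.C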
3) eoshadd command
hadd supports merging files through xrootd. The computing platform therefore provides eoshadd, a tool that merges files on /eos via xrootd; with it, users do not need to add the xrootd prefix and can keep the original hadd usage. The tool is located at /cvmfs/common.ihep.ac.cn/software/customized_script/storage/eos/eoshadd, and is used as follows:
$ eoshadd /eos/target_root_file /eos/dir1/\*.root /eos/dir2/\*.root
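For comparison, merging the same files with plain hadd requires the xrootd prefix on every /eos input file (the paths here are hypothetical):
$ hadd /tmp/target_root_file.root root://eos01.ihep.ac.cn//eos/dir1/file1.root root://eos01.ihep.ac.cn//eos/dir2/file2.root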
7.2.3 LHAASO submission job example
1) Application: access files on the /eos disk via xrootd
- Make sure to use TFile::Open() to open files when using ROOT
- For files on the EOS disk, add the EOS_MGM_URL="root://eos01.ihep.ac.cn/" prefix to the input parameters
- Remember to call TFile::Close() to close the file
2) Job script: batch submission
- hep_sub test.sh -g lhaaso -argu "%{ProcId}" -n number
7.2.4 Notes
1) Do not keep too many files (tens of thousands or more), whether data or scripts, in a single directory. It is recommended to create subdirectories according to some rule and distribute the files among them, keeping the number of files in any single directory below 1000.
2) Avoid operations such as ls * or rm * in jobs. If such an operation is really needed, generate a file list for the relevant directory on the login node in advance, and access the data by reading that list. On the login node, if you only need file name information, /bin/ls is faster than the ls command. To list an /eos directory, use "eos ls absolute_path", which is faster still.
3) Write the job script as a template, and pass the file names, data directories, and other program parameters to the script as arguments when submitting with hep_sub, e.g. hep_sub my_job.sh -argu "aaa" "bbb", instead of generating many similar scripts such as my_job_aaa.sh, my_job_bbb.sh.
4) Write generated data files directly to /eos (the files can be read and written via xrootd); they can be moved to the desired directory later with eos file rename.
5) For hadd operations, avoid "hadd *.root"; when the directory contains many files this puts heavy pressure on the file system. It is recommended to generate a list of the files to be merged in advance and then merge them with hadd. For ROOT files on /eos, hadd supports xrootd mode, so it is enough to prefix the absolute file name with root://eos01.ihep.ac.cn/.
6) For small personal program files (a few MB) such as my_program that are not stored on cvmfs, use the eos cp command in the job script to copy the program to the /tmp/ directory of the running node, checking first whether it already exists and copying only if it does not, and then run /tmp/my_program my_para; see the sketch after this list.
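A minimal sketch of the pattern in note 6, assuming hypothetical file names and paths:
#!/bin/bash
# Copy my_program to node-local /tmp only if it is not already there, then run it
if [ ! -e /tmp/my_program ]; then
    eos cp /eos/user/u/username/my_program /tmp/my_program
    chmod +x /tmp/my_program    # restore the execute bit after copying
fi
/tmp/my_program my_para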
7.2.5 Daocheng cluster use
Note: the current personal directories at Daocheng will be completely cleared once WCDA starts more formal rapid reconstruction; only the two public directories, wcdarec and wcdaplsrec, will (in principle) be retained.
1) Daocheng computing cluster description
At present, the Daocheng computing cluster has 1748 CPU cores, 1.63 PB of storage space, and 2 login nodes.
2) Steps for submitting jobs on the Daocheng cluster
a) Log in to a Daocheng login node with your afs account
Node name: lhmtlogin01.lhaaso.ihep.ac.cn, lhmtlogin02.lhaaso.ihep.ac.cn
Login method: afs account name, password
The wcdarec account has been created and the path is /eos/daocheng/user/w/wcdarec.
b) Copy job scripts
Currently the Daocheng compute nodes mount two directories: the afs directory and the /eos/daocheng storage directory. Therefore, the relevant job scripts need to be copied to /eos/daocheng with the rsync command.
c) Submit jobs
Since the afs directory is not writable, jobs submitted from the afs directory must redirect their output and error files to another location.
Jobs can be submitted directly under the /eos/daocheng directory.
d) Copy results back
Currently, results can only be copied back to the local cluster via rsync.
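For example, running on a Beijing login node (the paths are hypothetical, and ssh access to the Daocheng login node is assumed):
$ rsync -av username@lhmtlogin01.lhaaso.ihep.ac.cn:/eos/daocheng/user/u/username/output/ /lhaasofs/user/username/output/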
3) Job submission and query commands
Submit a job:
$hep_sub -site daocheng -g lhaasorun -os SL7 myjob.sh
Query job:
$hep_q -site daocheng -u username
Delete job:
$hep_rm -site daocheng jobid
View job held reason:
$hep_q -site daocheng -u username -hold
7.2.5.1 XCache cache proxy service
To reduce access latency, the computing center provides an XCache caching proxy service for LHAASO users to access Daocheng EOS data. The server address is lhaasocache.ihep.ac.cn. Usage is very simple: when accessing Daocheng data, just replace the Daocheng EOS address with the cache address, e.g.
root://lhmteos01.lhaaso.ihep.ac.cn//eos/daocheng/user/l/lihaibo/bill.root
Replace with:
root://lhaasocache.ihep.ac.cn//eos/daocheng/user/l/lihaibo/bill.root
1) Log in to a Beijing node and access Daocheng EOS cluster data through ROOT
Log in to lxlogin and execute:
$ root -l root://lhaasocache.ihep.ac.cn//eos/daocheng/user/l/lihaibo/bill.root
2) Browse the data of the Daocheng EOS cluster
Log in to lxlogin and execute:
$ xrdfs root://lhaasocache.ihep.ac.cn/
After entering the xrdfs command line, type help to see the available commands; for example, to list a directory:
[lhmteos01.lhaaso.ihep.ac.cn:1094] /> ls /eos/daocheng/raw
note:
- The cache proxy service currently provides read-only access: files on the Daocheng EOS cluster can be viewed but not modified.
- Since this is a cache proxy service, the first access to a file is not faster. As data access becomes more frequent, more data is cached and the cache hit rate rises, so file access speed will improve significantly.
FAQ
- Do not source files that reside on a storage file system (such as /workfs) in ~/.bashrc, because if that file system hangs while you are logging in to the cluster, the login will fail or take a very long time. For example, this is not recommended:
cat ~/.bashrc
source /workfs/ybj/username
- Some jobs show "X" status after being deleted. To delete jobs in this state, use:
hep_rm -forcex jobid