10 FAQ
10.1 How to get help
There are several ways for users to seek help when they have problems with the computing platform.
a. Telephone support (answered during working hours): 88236855
b. Email: helpdesk@ihep.ac.cn or ihep_computing_service@ihep.ac.cn
c. Web helpdesk: http://helpdesk.ihep.ac.cn
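10.2 Login and account problems
1. After logging in to lxlogin, I cannot edit files and the error ".Xauthority does not exist OR unauthorized" appears.
After logging in to lxlogin, obtain a fresh Kerberos ticket and AFS token, delete ~/.Xauthority, and log out (replace huqb with your own account name):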
-bash-4.2$ kinit huqb
Password for huqb@IHEPKRB5:
-bash-4.2$ aklog -d
Authenticating to cell ihep.ac.cn (server afsdb1.ihep.ac.cn).
Trying to authenticate to user's realm IHEPKRB5.
Getting tickets: afs/ihep.ac.cn@IHEPKRB5
Using Kerberos V5 ticket natively
About to resolve name huqb to id in cell ihep.ac.cn.
Id 10517
Set username to AFS ID 10517
Setting tokens. AFS ID 10517 @ ihep.ac.cn
-bash-4.2$ rm -f ~/.Xauthority
-bash-4.2$ exit
Then log in again.
Alternatively, the home directory quota may be exhausted; check it with the following command:
fs listquota ~
Volume Name Quota Used %Used Partition
u07.huqb 500000 245490 49% 13%
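The %Used column can also be checked from a script. The following is a minimal sketch, assuming the `fs listquota` output has the two-line layout shown above (a header line, then one data line whose fourth field is the percentage used). The function names `afs_quota_used` and `afs_quota_warn` and the default threshold of 90 are illustrative choices, not part of the cluster tooling.

```shell
# Print the %Used value (4th field of the data line) as a plain number.
# Reads `fs listquota` output from stdin; assumes the layout shown above.
afs_quota_used() {
    awk 'NR == 2 { print $4 + 0 }'
}

# Print a warning when usage reaches a threshold.
# $1 is the threshold in percent (default 90, an arbitrary illustrative value).
afs_quota_warn() {
    used=$(afs_quota_used)
    threshold=${1:-90}
    if [ "${used:-0}" -ge "$threshold" ]; then
        echo "AFS quota nearly exhausted: ${used}% used"
    else
        echo "AFS quota OK: ${used}% used"
    fi
}

# Usage on a login node (requires the AFS client):
#   fs listquota ~ | afs_quota_warn
```

If the warning appears, clean up files in the home directory before trying to edit files again.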
2. I forgot my password. How can I reset it?
Log in to https://login.ihep.ac.cn with the email address registered for your cluster account, change the password, and then use the new password to log in to the computing environment.
3. My password is correct, but I cannot log in.
Send account questions to ihep_computing_service@ihep.ac.cn.
4. My account has expired. What should I do?
Send the expired account information and an extension request to the software administrator of your experiment group, and cc ihep_computing_service@ihep.ac.cn. Once the software administrator has replied, the account manager will extend the account.
5. How do I log in to the clusters with SSH aliases?
Edit ~/.ssh/config and define the SSH aliases:
Host lx
Hostname lxlogin.ihep.ac.cn
User user
Port 22
Host lhmtlogin
Hostname lhmtlogin.lhaaso.ihep.ac.cn
User user
Port 22
You can then log in to the clusters as follows:
#### lxlogin.ihep.ac.cn
[user@localhost ~] ssh lx
#### lhmtlogin.lhaaso.ihep.ac.cn
[user@localhost ~] ssh lhmtlogin
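Before connecting, it can help to confirm what an alias resolves to. The sketch below relies on `ssh -G`, which prints the effective client configuration for a host without opening a connection (available in OpenSSH 6.8 and later). The helper name `ssh_alias_target` is hypothetical.

```shell
# Print the host name, user and port that an SSH alias resolves to.
# `ssh -G` only prints the effective configuration; it does not connect.
# $1 is the alias; $2 is an optional config file (default: ~/.ssh/config).
ssh_alias_target() {
    ssh -G -F "${2:-$HOME/.ssh/config}" "$1" |
        awk '$1 == "hostname" || $1 == "user" || $1 == "port" { print $1, $2 }'
}

# Usage:
#   ssh_alias_target lx
#   ssh_alias_target lhmtlogin
```

If the reported hostname is just the alias itself, the Host entry was not matched, which usually points to a typo in ~/.ssh/config.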
10.3 Issues related to jobs
1. Why does my job sit in 'I' or 'Idle' for a long time?
a) Fair share: if you have run many jobs recently, your priority is lowered. Please be patient; your job will run when your priority rises again.
b) The IHEP computing cluster is busy most of the time. When the queue holds a large number of jobs, or many high-priority official jobs, your jobs are matched slowly.
c) Your job requests special resources that are scarce, e.g. long runtime or large memory. In this case, please check that your job's resource requests are correct and precise.
d) In any other case, please contact the cluster admins or submit a ticket to the helpdesk (helpdesk.ihep.ac.cn).
2. My jobs need Scientific Linux 5, but the login machines only run Scientific Linux 6 or 7. How can I run my SL5 job?
For security reasons, the IHEP cluster does not provide an SL5 login machine; an SL5 container is provided instead. Please find more information in the [Container User Manual](../local-cluster/jobs/container/README.md).
3. Why does my job sit in 'H' or 'Held'?
When the job server detects a problem with your job, the job is placed in the 'H' or 'Held' state. You can check the hold reason with either of the following commands:
```shell
$ hep_q -i $JobID -hold
# or
$ hep_q -u $User -hold
```
If the hold reason mentions 'permission denied' on /afs or /workfs, your job is attempting to write data to a file under /afs or /workfs. In this case, delete the job, change the path the job writes to, and re-submit it.
If the hold reason looks like "Job has gone over memory limit of xxxx megabytes", your job used too much physical memory. In this case, delete the job, reduce its memory consumption, and re-submit it with a proper memory request.
If in doubt, please leave the job in its current state and submit the problem details to the [helpdesk](http://helpdesk.ihep.ac.cn).
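The decision above can be summarized as a small lookup. This is only a sketch: the function name `hold_advice` is hypothetical, and the match patterns are assumptions based on the two hold reasons quoted above, so real hold messages may be worded differently.

```shell
# Map a hold-reason string to the action suggested in this FAQ.
# $1 is the hold reason text copied from the hep_q output.
hold_advice() {
    case "$1" in
        *[Pp]"ermission denied"*)
            echo "write path: change the output path (not /afs or /workfs) and re-submit" ;;
        *"gone over memory limit"*)
            echo "memory: reduce memory use, then re-submit with a proper memory request" ;;
        *)
            echo "unknown: leave the job as it is and contact the helpdesk" ;;
    esac
}

# Usage:
#   hold_advice "Job has gone over memory limit of xxxx megabytes"
```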
4. What information should I provide to the cluster admins if I have questions about my job?
Please provide as much information as you can: the job ID, the approximate submission time, the error log, the job script, and so on. Please also leave the job and its files as they are; this helps administrators follow up on your problem.
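One way to gather these items is to copy them into a single directory that can be attached to a ticket. The sketch below is illustrative only: the function name `collect_job_report`, the directory name `job_report_<id>` and the `summary.txt` file are hypothetical, not a cluster convention. Files are copied, so the originals stay untouched.

```shell
# Collect job details into one directory for a help ticket.
# Arguments: job id, approximate submission time, then log/script files.
collect_job_report() {
    if [ "$#" -lt 2 ]; then
        echo "usage: collect_job_report JOB_ID SUBMIT_TIME [FILE...]" >&2
        return 1
    fi
    job_id=$1
    submitted=$2
    shift 2
    report_dir="job_report_${job_id}"
    mkdir -p "$report_dir" || return 1
    {
        printf 'Job ID: %s\n' "$job_id"
        printf 'Submitted (approx.): %s\n' "$submitted"
        printf 'Attached files:\n'
    } > "$report_dir/summary.txt"
    # Copy each existing file and record its name; missing files are skipped.
    for f in "$@"; do
        if [ -f "$f" ]; then
            cp "$f" "$report_dir/" &&
                printf '  %s\n' "$(basename "$f")" >> "$report_dir/summary.txt"
        fi
    done
    echo "$report_dir"
}

# Usage (file names are examples):
#   collect_job_report "$JobID" "this morning, around 10:00" job.sh job.err job.out
```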
5. How do I check the resources being used by my job?
Depending on the target cluster, please refer to [HTCondor job](../local-cluster/jobs/HTCondor/README.md), [Slurm job](../local-cluster/jobs/slurm/README.md) or [Hadoop job](../local-cluster/jobs/hadoop/README.md).
6. My ATLAS job was held after submission.
The error message is:
```shell
Hold reason: Failed to convert environment to target syntax for starter (opsys=LINUX): ERROR: Missing '=' after environment variable ': Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz'.
```
This is caused by an ATLAS environment variable whose value contains several spaces. The solution is to unset this variable:
```shell
$ unset "ALRB_infoProc"
```
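If jobs are submitted from a wrapper script, the same fix can be applied defensively. The following is a minimal sketch; the function name `drop_alrb_infoproc` is hypothetical, and it only unsets the variable in the shell where it is called.

```shell
# Unset ALRB_infoProc if it is set in the current shell.
# Call this in the shell (or wrapper script) that submits the job.
drop_alrb_infoproc() {
    if [ -n "${ALRB_infoProc+set}" ]; then
        echo "Unsetting ALRB_infoProc (value was: ${ALRB_infoProc})"
        unset ALRB_infoProc
    else
        echo "ALRB_infoProc is not set; nothing to do"
    fi
}

# Usage: run it after setting up the ATLAS environment and before submitting.
#   drop_alrb_infoproc
```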
10.4 Issues related to disk storage
1. Why can't I write to my directory?
Every user's personal directory and every group's public directory has a maximum available capacity. When the used space exceeds this capacity, the person responsible for the directory receives an email reminder to clean up files in it as soon as possible. A directory that exceeds its capacity will be locked; you then need to contact the computing center staff to unlock it before cleaning up the files.
2. My file was accidentally deleted. Can I recover it?
Some EOS storage directories have a recycle bin; files can be restored from it with the 'eos recycle restore' command. You can also contact Li Haibo (88236883) or Bi Yujiang (88236838) for recovery. If the file has a backup, it can be recovered through the backup recovery service.
3. Which directories are backed up? How do I recover files from a backup?
The directories that are backed up are listed under "Backup service - backup directory of each application". You can use Amanda for data recovery.
4. How do I view the storage space I have used?
AFS users can run 'fs quota /afs/ihep.ac.cn/users/u/user' to view the quota usage of the account 'user'. EOS users can use the 'eos quota' command.