Re: Memory problem in supercomputers ( No.1 ) |
- Date: 2021/06/16 19:47
- Name: Naoya Yamaguchi
- Hi,
Your problem might be close to http://www.openmx-square.org/forum/patio.cgi?mode=view&no=2619
In particular, as mentioned in the above thread, once several MD steps have finished, you can continue the calculation by restarting from the `*.dat#` file.
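As a minimal sketch of such a restart (the file name and process count below are placeholders, not taken from your actual setup), only the input file passed to openmx changes:
========
# Hypothetical example: the first run used "input.dat"; after it stops
# mid-optimization, OpenMX leaves "input.dat#" with the latest structure
# and restart settings, so the next job reads that file instead.
mpirun -np <nprocs> openmx input.dat# -nt 2 > input.std
========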
Regards, Naoya Yamaguchi
|
Re: Memory problem in supercomputers ( No.2 ) |
- Date: 2021/06/17 08:55
- Name: T. Ozaki
- Hi,
I would think that the optimization of a system of about 700 atoms should be doable on 16 nodes. How many cores and how much memory does each node of the machine have? Why did you specify 152 MPI processes, which is not a multiple of the 16 nodes? A simple prescription is to reduce the number of threads from 4 to 2 or 1.
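For reference, a minimal sketch of how the thread count is usually set in a hybrid run (the process count and file name are placeholders; OpenMX takes the thread count via the -nt option and OMP_NUM_THREADS):
========
# Illustrative only: same MPI layout, OpenMP threads reduced from 4 to 2
export OMP_NUM_THREADS=2
mpirun -np <nprocs> openmx input.dat -nt 2 > input.std
========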
If you share your shell script and input file, we may be able to give you more specific suggestions.
Regards,
TO
|
Re: Memory problem in supercomputers ( No.3 ) |
- Date: 2021/07/21 10:44
- Name: Ninomiya
- Dear Prof. Ozaki and Prof. Yamaguchi,
Thank you for your attention and help.
I'm using the SQUID supercomputer at Osaka University (http://www.hpc.cmc.osaka-u.ac.jp/system/manual/squid-use/). Each node has 76 cores and 248 GB of memory available (http://www.hpc.cmc.osaka-u.ac.jp/system/manual/squid-use/jobclass/).
I have computed a supercell system with Atoms.Number = 687 on this supercomputer. I am a beginner with OpenMX and supercomputers, so I may have made some mistakes in the calculation conditions. As advised, I reduced the number of threads to 2 and submitted 160 MPI processes, which is a multiple of the 16 nodes. The shell script and input file were set up as follows:
shell script (http://www.hpc.cmc.osaka-u.ac.jp/system/manual/squid-use/jobscript/);
========
#!/bin/bash
#PBS -q SQUID
#PBS --group=
#PBS -l elapstim_req=05:00:00
#PBS -T intmpi
#PBS -b 16
#PBS -v OMP_NUM_THREADS=2
module load BaseCPU/2021
module load BaseApp/2021
module load OpenMX/3.9
cd $PBS_O_WORKDIR
mpirun ${NQSV_MPIOPTS} -np 160 openmx 6.dat## -nt 2 > 6.std
========
input file;
========
System.CurrrentDirectory   ./
System.Name                6
level.of.stdout            1
level.of.fileout           1

DATA.PATH   /system/apps/rhel8/cpu/OpenMX/intel2020u4/3.9/openmx3.9/work/DFT_DATA19

Species.Number   2
<Definition.of.Atomic.Species
 O   O6.0-s2p2d1   O_PBE19
 Ti  Ti7.0-s3p2d1  Ti_PBE19
Definition.of.Atomic.Species>

Atoms.Number   687
Atoms.SpeciesAndCoordinates.Unit   Ang
<Atoms.SpeciesAndCoordinates
Atoms.SpeciesAndCoordinates>

Atoms.UnitVectors.Unit   Ang
<Atoms.UnitVectors
 10.9272740   0.0000000   0.0000000
 -4.5171250  13.5419169   0.0000000
 44.8710433  51.1749099  47.2458760
Atoms.UnitVectors>

scf.XcType                 GGA-PBE
scf.SpinPolarization       off
scf.ElectronicTemperature  300.0
scf.energycutoff           150.0
scf.maxIter                100
scf.EigenvalueSolver       band
scf.Kgrid                  2 2 1
scf.Mixing.Type            rmm-diisk
scf.Init.Mixing.Weight     0.01
scf.Min.Mixing.Weight      0.01
scf.Max.Mixing.Weight      0.2
scf.Mixing.History         40
scf.Mixing.StartPulay      20
scf.Mixing.EveryPulay      1
scf.Kerker.factor          5.0
scf.criterion              1.0e-6
scf.maxIter                100

MD.Type                EF
MD.Opt.DIIS.History    3
MD.Opt.StartDIIS       5
MD.Opt.EveryDIIS       200
MD.maxIter             100
MD.Opt.criterion       0.0003
========
The status of the job during the calculation was as follows.
========
ReqName    Queue  Pri  STT  S  Memory    CPU        Elapse  R  H  M  Jobs
---------  -----  ---  ---  -  --------  ---------  ------  -  -  -  ----
openmx.sh  SC64     0  RUN  -  772.23G   248398.16     850  Y  Y  Y    16
========
When I set the number of threads to 2 and ran the calculation, it stopped after about 20 minutes with the following error.
========
Exceeded per-job memory size warning limitq
========
Should I restart several times in a short period of time? I would appreciate your advice on shell scripts and input files.
Best regards, Ninomiya,
|
Re: Memory problem in supercomputers ( No.4 ) |
- Date: 2021/06/17 13:05
- Name: T. Ozaki
- Hi,
To me, the resources you are using look sufficient to run the calculation.
(1) Can I assume that you could successfully run openmx with other input files on SQUID? If so, please let us know where openmx crashed. Just at the first SCF step, or after several geometry optimization steps? This can be checked by looking at the stdout or err file from the queuing system.
(2) You seem not to specify the following options:
#PBS -l cpunum_job
#PBS -l memsz_job
ref.: http://www.hpc.cmc.osaka-u.ac.jp/system/manual/squid-use/jobscript/
In this case, are the maximum values automatically set?
(3) Do you know how many MPI processes are allocated to each node by the option #PBS -l cpunum_job?
(4) As for the input file, I think a gamma-point calculation is enough, i.e., scf.Kgrid 1 1 1. Also, decreasing scf.Mixing.History from 40 to 30 may reduce the memory requirement.
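As a concrete sketch of point (4), only these two lines of your input would change (everything else stays as it is):
========
# gamma-point only, and a shorter mixing history to reduce memory
scf.Kgrid            1 1 1
scf.Mixing.History   30
========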
Regards,
TO
|
Re: Memory problem in supercomputers ( No.5 ) |
- Date: 2021/07/21 10:44
- Name: Ninomiya
- Dear Prof. Ozaki,
Thank you for your attention and help.
(1) Yes, I was able to finish the calculation for the model with 48 atoms.
In the .std file, it stops at the following part.
========
******************* MD= 1 SCF=31 *******************
<Poisson> Poisson's equation using FFT...
<Set_Hamiltonian> Hamiltonian matrix for VNA+dVH+Vxc...
<Band> Solving the eigenvalue problem...
 KGrids1: -0.25000  0.25000
 KGrids2: -0.25000  0.25000
 KGrids3:  0.00000
========
(2), (3) Yes. If #PBS -l cpunum_job and #PBS -l memsz_job are not specified, they are automatically set to the maximum values, i.e., the upper limits of the submitted job class.
(4) Thanks for your advice. I will change the conditions and try to calculate.
Best regards, Ninomiya,
|
Re: Memory problem in supercomputers ( No.6 ) |
- Date: 2021/06/17 17:25
- Name: Naoya Yamaguchi
- Dear Ninomiya-san,
Your job script specifies `6.dat##` as the input file; is the calculation with the above error a restart calculation? If so, did the first (`6.dat`) and second (`6.dat#`) calculations finish normally?
Regards, Naoya Yamaguchi
|
Re: Memory problem in supercomputers ( No.7 ) |
- Date: 2021/06/17 18:06
- Name: T. Ozaki
- Hi,
By the following question:
> (3) Do you know how many MPI processes is allocated to each node by the option: #PBS -l cpunum_job?
I intended to ask how we can know the number of MPI processes allocated to each node.
With the following specification:
#PBS -b 16
#PBS -v OMP_NUM_THREADS=2
mpirun ${NQSV_MPIOPTS} -np 160 openmx 6.dat## -nt 2 > 6.std
can we regard this as 160 MPI processes / 16 nodes = 10 MPI processes per node? If so, you are using 10 MPI processes x 2 OMP threads = 20 CPU cores per node.
Then, according to your explanation, leaving the keyword #PBS -l cpunum_job unspecified allocates 76 CPU cores per node, so 76 - 20 = 56 CPU cores would sit idle and may waste memory unexpectedly. It might be better to explicitly specify the number of CPU cores with #PBS -l cpunum_job; in your case, it should be 20.
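A minimal sketch of the corresponding header, assuming the 10 MPI processes x 2 threads per node layout discussed above (the remaining #PBS lines stay as in your original script):
========
#PBS -b 16
#PBS -l cpunum_job=20
#PBS -v OMP_NUM_THREADS=2
========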
Also, does "openmx 6.dat## -nt 2 > 6.std" work as intended, since # tends to be treated as a comment character in shell scripts?
Regards,
TO
|
Re: Memory problem in supercomputers ( No.8 ) |
- Date: 2021/07/21 10:45
- Name: Ninomiya
- Dear Prof. Yamaguchi,
Yes, the calculation with the above error is a restart calculation. The first (`6.dat`) and second (`6.dat#`) calculations did not finish normally; each run restarts from the calculation that was terminated with the error.
Best regards, Ninomiya,
|
Re: Memory problem in supercomputers ( No.9 ) |
- Date: 2021/06/19 18:40
- Name: Naoya Yamaguchi
- Dear Ninomiya-san,
I think that, as Prof. Ozaki says, you should review the job script. Basically, you should choose the numbers of MPI processes and OpenMP threads so that their product equals the number of cores in the allocated nodes. Otherwise, you might need to set additional options appropriately, such as cpunum_job or cpunum-lhost (https://www.hpc.nec/documents/nqsv/pdfs/g2ad03-NQSVUG-Operation.pdf).
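For example, on 16 SQUID nodes with 76 cores each, one consistent choice (a sketch only, not a tested script) is 38 MPI processes per node x 2 OpenMP threads, i.e. 608 MPI processes in total:
========
#PBS -b 16
#PBS -l cpunum_job=76
#PBS -v OMP_NUM_THREADS=2
# 16 nodes x 38 processes/node = 608 MPI processes; 38 x 2 threads = 76 cores/node
mpirun ${NQSV_MPIOPTS} -np 608 openmx 6.dat -nt 2 > 6.std
========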
Regards, Naoya Yamaguchi
|
Re: Memory problem in supercomputers ( No.10 ) |
- Date: 2021/07/21 10:45
- Name: Ninomiya
- Dear Prof. Ozaki and Prof. Yamaguchi,
Thank you for your attention and help. I will review the job script.
Best regards, Ninomiya,
|