Re: convergence problems with OpenMX 3.6 ( No.1 ) |
- Date: 2011/11/11 21:01
- Name: T.Ozaki
- Hi,
Could you try -O0 and -O1 as the compiler optimization option, and report again? Also, could you tell us the error related to the memory leak that you noticed?
Thank you for your cooperation in advance.
Regards,
TO
|
Re: convergence problems with OpenMX 3.6 ( No.2 ) |
- Date: 2011/11/11 22:18
- Name: N.Kolchenko <nkolchenko@mail.ru>
- Hi,
Yes, with the -O0 or -O1 option the results are correct.
It seems that the problem could be named "Band solver in MPI mode" (or perhaps more precisely, "systems with PBC in MPI mode").
Without MPI the "Band solver" works normally. With MPI (and the -O2 or -O3 compiler option) the abnormal results mentioned above (input_examples 5, 6, 8, 14) are obtained for both the 3.6 and 3.5(!) versions.
Regards,
NK
|
Re: convergence problems with OpenMX 3.6 ( No.3 ) |
- Date: 2011/11/11 23:33
- Name: T.Ozaki
- Hi,
We have never seen such an abnormal result on our computational facilities. I would like to know more details about what kind of computational environment (processor, OS version, compiler version, linked libraries and their versions, etc.) causes this kind of problem.
Also, the error may be related to the other report at http://www.openmx-square.org/forum/patio.cgi?mode=view&no=1309
Thank you very much for your cooperation in advance.
Regards,
TO
|
Re: convergence problems with OpenMX 3.6 ( No.4 ) |
- Date: 2011/11/12 00:03
- Name: Denis Music <music@mch.rwth-aachen.de>
- Hello again,
Thanks a lot for the input! I recompiled the code with the -O1 option, and the only error message (as before) is:

Memory_Leak_test.c(299): warning #181: argument is incompatible with corresponding format string conversion
    printf("Used_VSZ (kbyte) = %6d\n", (long int)(Used_VSZ));fflush(stdout);
Memory_Leak_test.c(300): warning #181: argument is incompatible with corresponding format string conversion
    printf("Used_RSS (kbyte) = %6d\n", (long int)(Used_RSS));fflush(stdout);

Now the convergence nightmare is gone. It's all much better (acceptable in my opinion):

 1 input_example/Benzene.dat    Elapsed time(s)=  15.85  diff Utot= 0.000000005260  diff Force= 0.000000000005
 2 input_example/C60.dat        Elapsed time(s)=  97.06  diff Utot= 0.000000060208  diff Force= 0.000000002300
 3 input_example/CO.dat         Elapsed time(s)=  41.77  diff Utot= 0.000000005357  diff Force= 0.000000013463
 4 input_example/Cr2.dat        Elapsed time(s)=  27.63  diff Utot= 0.000000006281  diff Force= 0.000000000003
 5 input_example/Crys-MnO.dat   Elapsed time(s)=  61.89  diff Utot= 0.000000003682  diff Force= 0.000000000003
 6 input_example/GaAs.dat       Elapsed time(s)= 249.43  diff Utot= 0.000000006088  diff Force= 0.074246673815
 7 input_example/Glycine.dat    Elapsed time(s)=  62.08  diff Utot= 0.000000024529  diff Force= 0.000000000092
 8 input_example/Graphite4.dat  Elapsed time(s)=  27.09  diff Utot= 0.000000032393  diff Force= 0.000000000126
 9 input_example/H2O-EF.dat     Elapsed time(s)=  30.96  diff Utot= 0.000000013254  diff Force= 0.000000000341
10 input_example/H2O.dat        Elapsed time(s)=  53.20  diff Utot= 0.000000013228  diff Force= 0.000000012034
11 input_example/HMn.dat        Elapsed time(s)= 205.25  diff Utot= 0.000000000457  diff Force= 0.000000000112
12 input_example/Methane.dat    Elapsed time(s)=  16.92  diff Utot= 0.000000006360  diff Force= 0.000000001502
13 input_example/Mol_MnO.dat    Elapsed time(s)=  61.72  diff Utot= 0.000000002437  diff Force= 0.000000000366
14 input_example/Ndia2.dat      Elapsed time(s)=  25.68  diff Utot= 0.000000017584  diff Force= 0.000000000002
With best regards, Denis
|
Re: convergence problems with OpenMX 3.6 ( No.5 ) |
- Date: 2011/11/12 01:23
- Name: Denis Music <music@mch.rwth-aachen.de>
- Hi,
Some info about the computational environment I use: Scientific Linux 6.1, Intel Xeon, acml-3-6-0-gnu-64bit, Intel MPI.
There's another problem that I have discovered. It seems that the MD time-step sequence is somehow incoherent. Please have a look at the following (a melt at high temperatures):

time= 14.000 (fs) Energy= -2383.47584 (Hatree) Temperature= 7805.914
time= 14.000 (fs) Energy= -2383.47549 (Hatree) Temperature= 7808.225
time= 26.000 (fs) Energy= -2382.48904 (Hatree) Temperature= 5190.235
time= 15.000 (fs) Energy= -2383.03728 (Hatree) Temperature= 7067.631
time= 14.000 (fs) Energy= -2383.47575 (Hatree) Temperature= 7808.423
time= 29.000 (fs) Energy= -2382.98479 (Hatree) Temperature= 6005.790
time= 15.000 (fs) Energy= -2383.03634 (Hatree) Temperature= 7070.145
time= 27.000 (fs) Energy= -2382.62941 (Hatree) Temperature= 5450.185
time= 29.000 (fs) Energy= -2382.98222 (Hatree) Temperature= 6005.147
time= 27.000 (fs) Energy= -2382.63118 (Hatree) Temperature= 5450.336
time= 15.000 (fs) Energy= -2383.03635 (Hatree) Temperature= 7070.085
time= 30.000 (fs) Energy= -2383.15085 (Hatree) Temperature= 6282.041
time= 30.000 (fs) Energy= -2383.15253 (Hatree) Temperature= 6281.631
time= 23.000 (fs) Energy= -2382.01161 (Hatree) Temperature= 4595.027
time= 29.000 (fs) Energy= -2382.98249 (Hatree) Temperature= 6005.417
time= 27.000 (fs) Energy= -2382.63166 (Hatree) Temperature= 5478.975
time= 30.000 (fs) Energy= -2383.15240 (Hatree) Temperature= 6282.250
time= 28.000 (fs) Energy= -2382.80883 (Hatree) Temperature= 5723.596
time= 30.000 (fs) Energy= -2383.15002 (Hatree) Temperature= 6281.650
time= 31.000 (fs) Energy= -2383.31392 (Hatree) Temperature= 6548.080
time= 31.000 (fs) Energy= -2383.31410 (Hatree) Temperature= 6547.894
time= 28.000 (fs) Energy= -2382.81090 (Hatree) Temperature= 5723.647
time= 30.000 (fs) Energy= -2383.15069 (Hatree) Temperature= 6281.873
time= 28.000 (fs) Energy= -2382.80936 (Hatree) Temperature= 5781.812
time= 31.000 (fs) Energy= -2383.31127 (Hatree) Temperature= 6548.545
time= 31.000 (fs) Energy= -2383.30871 (Hatree) Temperature= 6547.873
time= 29.000 (fs) Energy= -2382.98308 (Hatree) Temperature= 6006.163
time= 32.000 (fs) Energy= -2383.45997 (Hatree) Temperature= 6799.884
time= 32.000 (fs) Energy= -2383.45767 (Hatree) Temperature= 6799.894

This started to happen with every run. We actually have a new Bull HPC-Cluster at RWTH Aachen University, where I work (essentially more CPUs, an upgrade to Scientific Linux 6.1, a new batch system, Platform LSF 8, etc. with respect to the old cluster). I had installed the previous version of the code (3.5) on our old cluster and it ran perfectly well. Now this problem occurs with the old version (3.5) reinstalled on the new cluster as well as with the new version of the code (3.6). Any thoughts? With best regards, Denis
|
Re: convergence problems with OpenMX 3.6 ( No.6 ) |
- Date: 2011/11/12 02:19
- Name: N.Kolchenko <nkolchenko@mail.ru>
- Hi,
I've never seen such results before, although I use the -O2 level as a rule (for programs with MPI too). I agree that a dependence of calculation results on the compiler optimization level is a very strange and potentially dangerous situation for any code (not only OpenMX). It would be very interesting to clear up the nature of the effect, and perhaps to reproduce it with a simple example program (provided, of course, that it is not an internal compiler, MKL, or MVAPICH bug), in order to eliminate it in the future.
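Since a simple example program was mentioned: floating-point addition is not associative, so any optimization that reassociates sums (vectorized reductions, -ffast-math, or the value-unsafe floating-point models some compilers enable at high optimization levels) may legitimately change numerical results. Here is a toy C illustration of the mechanism only (certainly not the OpenMX bug itself):

!====float_order.c (toy illustration)====
#include <stdio.h>

int main(void)
{
  double big = 9007199254740992.0;  /* 2^53: the spacing of doubles here is 2 */
  double small = 1.0;

  double a = (big + small) - big;   /* big + small rounds back to big -> 0 */
  double b = big + (small - big);   /* small - big is exact here      -> 1 */

  printf("(big + small) - big = %g\n", a);
  printf("big + (small - big) = %g\n", b);
  return 0;
}
!========================================

The two expressions are mathematically identical, yet print 0 and 1; a compiler allowed to reorder a long reduction can therefore shift a result without being "wrong" in its own terms.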
Computational environment:
processor - Intel EM64T Xeon X54xx (Harpertown), 3000 MHz
OS - CentOS 5.6
compiler - Intel Compiler XE 12.0
libraries - MKL (12.0.3), MVAPICH-1.2rc1
More detailed information about the hardware is available on the site of the Joint Supercomputer Center (mvs100k). (But I'm afraid it may be obsolete...)
The results described by Atsushi M. Ito are similar in hardware and software environment and in system type (PBC!!), but were obtained with drastically different compiler options. With the -Dnompi option I have ALWAYS obtained normal Utot values for ANY type of system.
Regards,
NK
|
Re: convergence problems with OpenMX 3.6 ( No.7 ) |
- Date: 2011/11/14 17:17
- Name: T.Ozaki
- Hi,
To address the issue, we have released a patch. Please take a look at http://www.openmx-square.org/forum/patio.cgi?mode=view&no=1351
It would be very nice if you could test it and report your results on the forum.
Thank you very much for your cooperation in advance.
Regards,
TO
|
Re: convergence problems with OpenMX 3.6 ( No.8 ) |
- Date: 2011/11/14 22:25
- Name: Denis Music <music@mch.rwth-aachen.de>
- Hello,
Thanks a lot for your efforts. It's really great that you managed to come up with a new patch so quickly. I've recompiled the code as follows:

option I:  CC = mpicc -openmp -O0 -I/usr/local/include -I$(HOME)/include
option II: CC = mpicc -openmp -O3 -I/usr/local/include -I$(HOME)/include

Energy-wise it's fine (the "minor" memory-leak issue still remains), but I can't get rid of the time-step sequence problem. With best regards, Denis
|
Re: convergence problems with OpenMX 3.6 ( No.9 ) |
- Date: 2011/11/15 06:32
- Name: N.Kolchenko <nkolchenko@mail.ru>
- Hello,
Thanks a lot for your weekend efforts.
Test calculations of GaAs were done for two platforms:
(1) CC = mpicc -openmp -O2 ; LIB = MKL(12.0.3)
(2) CC = gcc -O2 -fopenmp -Dnompi ; LIB = MKL(11.1).
At first sight the results are very good:

         GaAs.out (patch3.6.1)       (1)                   (2)
Uele.      -24.777812622570      -24.777812557532      -24.777812557533
Ukin.      199.855515255774      199.855515277725      199.855515277724
UH0.     -1803.476854443712    -1803.476854443708    -1803.476854443711
UH1.         0.005710466096        0.005710465894        0.005710465894
Una.       -41.108102136840      -41.108102160460      -41.108102160460
Unl.      -150.724995804425     -150.724995801476     -150.724995801480
Uxc0.      -19.367445896556      -19.367445900142      -19.367445900142
Uxc1.      -19.367445820102      -19.367445823685      -19.367445823685
Ucore.    1644.494082264319     1644.494082264319     1644.494082264319
Utot.     -189.689536115446     -189.689536121534     -189.689536121541

(The first 7 digits after the point of Utot are identical.) One can guess that GaAs.out (patch) was calculated with another library, maybe ATLAS (see, especially, the neutral-atom potential and the exchange-correlation energies)...
Regards,
NK
|
Re: convergence problems with OpenMX 3.6 ( No.10 ) |
- Date: 2011/11/15 13:17
- Name: T.Ozaki
- Hi,
Thank you very much for reporting your benchmark calculations.
As for the issue raised by Dr. Music:
> I don't get rid of the time-step sequence problem.
I think this is a problem related to the MPI environment. It can happen when the MPI library you use for job submission differs from the one you used for compilation. In other words, the same MPI library should be used for both compilation and execution.
In environments where many different MPI libraries are installed on the same machine, such an inconsistency easily happens.
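As a quick sanity check, one can compare the MPI standard version recorded in the mpi.h seen at compile time against the version reported by the library actually linked at run time. A minimal, hypothetical helper (not part of OpenMX; note it only detects mismatched MPI standard levels, so two different vendor libraries at the same level would still pass):

!====mpi_check.c (hypothetical helper)====
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int ver, subver, rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_version(&ver, &subver);  /* version of the library linked at run time */
  if (rank == 0)
    printf("compiled against MPI %d.%d, running with MPI %d.%d\n",
           MPI_VERSION, MPI_SUBVERSION, ver, subver);
  MPI_Finalize();
  return 0;
}
!==========================================

Compile it with the same mpicc used for OpenMX and submit it through the same batch system; differing numbers would confirm the inconsistency.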
Regards,
TO
|
Re: convergence problems with OpenMX 3.6 ( No.11 ) |
- Date: 2011/11/15 16:03
- Name: Denis Music <music@mch.rwth-aachen.de>
- Hello everybody,
Thanks a lot for all your ideas and kind support. All problems are gone now after switching to the Intel MKL library instead of acml. The code runs really well. Regards, Denis
|
Re: convergence problems with OpenMX 3.6 ( No.12 ) |
- Date: 2011/11/16 16:55
- Name: N.Kolchenko <nkolchenko@mail.ru>
- Dear developers,
I didn't want to mix everything in one thread, but now it seems that the problem with the Utot divergence is fixed, and one can pay attention to another question.
There are a few warnings during compilation:
(1)
Memory_Leak_test.c(299): warning #181: argument is incompatible with corresponding format string conversion
    printf("Used_VSZ (kbyte) = %6d\n", (long int)(Used_VSZ));fflush(stdout);
Memory_Leak_test.c(300): warning #181: argument is incompatible with corresponding format string conversion
    printf("Used_RSS (kbyte) = %6d\n", (long int)(Used_RSS));fflush(stdout);
It's the same thing reported by D.Music (reply No.4). It shouldn't cause a problem at runtime.
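For what it is worth, this warning is only a format-specifier mismatch: the argument is cast to long int but printed with %6d, which expects int. A sketch of the obvious fix (my guess, not an official patch) is to use the matching length modifier:

!====hypothetical fix====
/* %6d expects int; the cast yields long int, so use %6ld */
printf("Used_VSZ (kbyte) = %6ld\n", (long int)(Used_VSZ)); fflush(stdout);
printf("Used_RSS (kbyte) = %6ld\n", (long int)(Used_RSS)); fflush(stdout);
!========================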
I'm not so sure about:
(2)
!===mpicc compiler===========
liberi-091216/source/eri.c(133): warning #167: argument of type "double **" is incompatible with parameter of type "const double **"
    ERI_LL_Overlap_d(solver, p, dp, alp1, dalp1, alp2, dalp2, cx);
liberi-091216/source/eri.c(133): warning #167: argument of type "double **" is incompatible with parameter of type "const double **"
    ERI_LL_Overlap_d(solver, p, dp, alp1, dalp1, alp2, dalp2, cx);
liberi-091216/source/eri.c(228): warning #167: argument of type "double *" is incompatible with parameter of type "double (*)[3][2]"
    ERI_Integral_GL_d(solver, I4, dI4, glF, glG, dglF, dglG,
liberi-091216/source/eri.c(228): warning #167: argument of type "double **" is incompatible with parameter of type "const double **"
    ERI_Integral_GL_d(solver, I4, dI4, glF, glG, dglF, dglG,
liberi-091216/source/eri.c(228): warning #167: argument of type "double **" is incompatible with parameter of type "const double **"
    ERI_Integral_GL_d(solver, I4, dI4, glF, glG, dglF, dglG,
!==========================
The same for gcc (originally output partly in Russian; translated here):

!===gcc=====================
liberi-091216/source/eri.c: In function 'ERI_Overlap':
liberi-091216/source/eri.c:133: warning: passing argument 5 of 'ERI_LL_Overlap_d' from incompatible pointer type
liberi-091216/source/eri.h:527: note: expected 'const double **' but argument is of type 'double **'
liberi-091216/source/eri.c:133: warning: passing argument 7 of 'ERI_LL_Overlap_d' from incompatible pointer type
liberi-091216/source/eri.h:527: note: expected 'const double **' but argument is of type 'double **'
liberi-091216/source/eri.c: In function 'ERI_Integral':
liberi-091216/source/eri.c:229: warning: passing argument 3 of 'ERI_Integral_GL_d' from incompatible pointer type
liberi-091216/source/eri.h:571: note: expected 'double (*)[3][2]' but argument is of type 'double *'
liberi-091216/source/eri.c:229: warning: passing argument 6 of 'ERI_Integral_GL_d' from incompatible pointer type
liberi-091216/source/eri.h:571: note: expected 'const double **' but argument is of type 'double **'
liberi-091216/source/eri.c:229: warning: passing argument 7 of 'ERI_Integral_GL_d' from incompatible pointer type
liberi-091216/source/eri.h:571: note: expected 'const double **' but argument is of type 'double **'
!================================================================
That is indeed the case:
!====in eri.h==================
void ERI_LL_Overlap_d(
  ERI_t *ptr,
  double *p, double *dp[3],
  const double *a1, const double *da1[3],
  const double *a2, const double *da2[3],
  double x
);
!====in eri.c==================
......
double *dgam1, *dgam2, *dalp1[3], *dalp2[3], *dp[3], *dF[3];
......
ERI_LL_Overlap_d(solver, p, dp, alp1, dalp1, alp2, dalp2, cx);
!==============================
!====in eri.h==================
void ERI_Integral_GL_d(
  ERI_t *ptr,
  double I4[2],          /* (OUT) */
  double dI4[4][3][2],
  const double *F1,      /* (IN) Overlap matrix */
  const double *F2,      /* (IN) */
  const double *dF1[3],  /* (IN) Overlap matrix */
  const double *dF2[3],  /* (IN) */
  double R,              /* (IN) Displacement of two expansion centers */
  double theta,
  double phi,
  double cx12,
  double cx34,
  double delta,
  double omega,          /* (IN) screening parameter */
  int lmax1
);
!====in eri.c==================
double *dI4,  /* (OUT) derivatives */
...........
double *glF, *glG, *dglF[3], *dglG[3];
............
ERI_Integral_GL_d(solver, I4, dI4, glF, glG, dglF, dglG,
                  R[0], R[1], R[2], cx12, cx34, 1e-10, scr, lmax_gl);
!========================================
I guess that the intent(in) attribute (sorry for the Fortran terminology) of the formal parameters should be preserved, and that one need only change the declarations of the input arguments. (I think the flat pointer "double *dI4" is a slip of the pen.)
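For what it is worth, warning #167 here is the classic C gotcha that "double **" does not implicitly convert to "const double **" (unlike "double *" to "const double *"), because that conversion could be used to smuggle a const object behind a non-const pointer. A minimal, hypothetical sketch of the situation and two common remedies (my illustration, not the actual liberi code):

!====const_gotcha.c (hypothetical illustration)====
#include <stdio.h>

/* callee promises not to modify the pointed-to doubles,
   as ERI_LL_Overlap_d does for da1/da2 */
static double sum_first(const double *a[3])
{
  return a[0][0] + a[1][0] + a[2][0];
}

int main(void)
{
  double x = 1.0, y = 2.0, z = 3.0;
  double *da[3] = { &x, &y, &z };   /* non-const locals, as in eri.c */

  /* sum_first(da); would draw exactly this warning */

  /* remedy 1: an explicit cast at the call site,
     keeping the const-correct prototype */
  double s1 = sum_first((const double **)da);

  /* remedy 2: const-qualify the local array from the start */
  const double *ca[3] = { &x, &y, &z };
  double s2 = sum_first(ca);

  printf("%g %g\n", s1, s2);
  return 0;
}
!==================================================

The dI4 case is different: the header declares "double dI4[4][3][2]" (i.e. "double (*)[3][2]") while eri.c passes a flat "double *", which, as noted above, looks like a genuine slip of the pen rather than a const issue.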
Regards,
NK
|
Re: convergence problems with OpenMX 3.6 ( No.13 ) |
- Date: 2011/11/16 14:42
- Name: T.Ozaki
- Dear Dr. Kolchenko,
Thank you very much for reminding us of the warnings during compilation. Since we know that those parts do not affect the conventional calculations supported by OpenMX Ver. 3.6, we will not fix them immediately, but will fix them in the next release.
Thank you very much for your cooperation.
Best regards,
Taisuke Ozaki
|