Re: convergence problems with OpenMX 3.6 ( No.1 ) |
- Date: 2011/11/11 21:01
- Name: T.Ozaki
- Hi,
Could you try -O0 and -O1 as the compiler optimization option, and report again? Also, could you tell us the error related to the memory leak that you noticed?
Thank you for your cooperation in advance.
Regards,
TO
|
Re: convergence problems with OpenMX 3.6 ( No.2 ) |
- Date: 2011/11/11 22:18
- Name: N.Kolchenko <nkolchenko@mail.ru>
- Hi,
Yes, with the -O0 or -O1 option the results are correct.
It seems that the problem could be named "Band solver in MPI mode" (or perhaps more precisely, "systems with PBC in MPI mode").
Without MPI the "Band solver" works normally. With MPI (and the -O2 or -O3 compiler option) the abnormal results mentioned above (input_examples 5, 6, 8, 14) are obtained for both the 3.6 and 3.5(!) versions.
Regards,
NK
|
Re: convergence problems with OpenMX 3.6 ( No.3 ) |
- Date: 2011/11/11 23:33
- Name: T.Ozaki
- Hi,
We have never seen such an abnormal result on our computational facilities. I would like to know more details about what kind of computational environment (processor, OS version, compiler version, linked libraries and their versions, etc.) causes this kind of problem.
Also, the error may be related to the other report at http://www.openmx-square.org/forum/patio.cgi?mode=view&no=1309
Thank you very much for your cooperation in advance.
Regards,
TO
|
Re: convergence problems with OpenMX 3.6 ( No.4 ) |
- Date: 2011/11/12 00:03
- Name: Denis Music <music@mch.rwth-aachen.de>
- Hello again,
Thanks a lot for the input! I recompiled the code with the -O1 option, and the only error message (as before) is:

Memory_Leak_test.c(299): warning #181: argument is incompatible with corresponding format string conversion
    printf("Used_VSZ (kbyte) = %6d\n", (long int)(Used_VSZ));fflush(stdout);
Memory_Leak_test.c(300): warning #181: argument is incompatible with corresponding format string conversion
    printf("Used_RSS (kbyte) = %6d\n", (long int)(Used_RSS));fflush(stdout);

Now the convergence nightmare is gone. It's all much better (acceptable in my opinion):

 1 input_example/Benzene.dat    Elapsed time(s)=  15.85  diff Utot= 0.000000005260  diff Force= 0.000000000005
 2 input_example/C60.dat        Elapsed time(s)=  97.06  diff Utot= 0.000000060208  diff Force= 0.000000002300
 3 input_example/CO.dat         Elapsed time(s)=  41.77  diff Utot= 0.000000005357  diff Force= 0.000000013463
 4 input_example/Cr2.dat        Elapsed time(s)=  27.63  diff Utot= 0.000000006281  diff Force= 0.000000000003
 5 input_example/Crys-MnO.dat   Elapsed time(s)=  61.89  diff Utot= 0.000000003682  diff Force= 0.000000000003
 6 input_example/GaAs.dat       Elapsed time(s)= 249.43  diff Utot= 0.000000006088  diff Force= 0.074246673815
 7 input_example/Glycine.dat    Elapsed time(s)=  62.08  diff Utot= 0.000000024529  diff Force= 0.000000000092
 8 input_example/Graphite4.dat  Elapsed time(s)=  27.09  diff Utot= 0.000000032393  diff Force= 0.000000000126
 9 input_example/H2O-EF.dat     Elapsed time(s)=  30.96  diff Utot= 0.000000013254  diff Force= 0.000000000341
10 input_example/H2O.dat        Elapsed time(s)=  53.20  diff Utot= 0.000000013228  diff Force= 0.000000012034
11 input_example/HMn.dat        Elapsed time(s)= 205.25  diff Utot= 0.000000000457  diff Force= 0.000000000112
12 input_example/Methane.dat    Elapsed time(s)=  16.92  diff Utot= 0.000000006360  diff Force= 0.000000001502
13 input_example/Mol_MnO.dat    Elapsed time(s)=  61.72  diff Utot= 0.000000002437  diff Force= 0.000000000366
14 input_example/Ndia2.dat      Elapsed time(s)=  25.68  diff Utot= 0.000000017584  diff Force= 0.000000000002
With best regards, Denis
|
Re: convergence problems with OpenMX 3.6 ( No.5 ) |
- Date: 2011/11/12 01:23
- Name: Denis Music <music@mch.rwth-aachen.de>
- Hi,
Some info about the computational environment I use: Scientific Linux 6.1, Intel Xeon, acml-3-6-0-gnu-64bit, Intel MPI.
There's another problem that I have discovered. It seems that the MD time-step sequence is somehow incoherent. Please have a look at the following (a melt at high temperatures):

time= 14.000 (fs) Energy= -2383.47584 (Hatree) Temperature= 7805.914
time= 14.000 (fs) Energy= -2383.47549 (Hatree) Temperature= 7808.225
time= 26.000 (fs) Energy= -2382.48904 (Hatree) Temperature= 5190.235
time= 15.000 (fs) Energy= -2383.03728 (Hatree) Temperature= 7067.631
time= 14.000 (fs) Energy= -2383.47575 (Hatree) Temperature= 7808.423
time= 29.000 (fs) Energy= -2382.98479 (Hatree) Temperature= 6005.790
time= 15.000 (fs) Energy= -2383.03634 (Hatree) Temperature= 7070.145
time= 27.000 (fs) Energy= -2382.62941 (Hatree) Temperature= 5450.185
time= 29.000 (fs) Energy= -2382.98222 (Hatree) Temperature= 6005.147
time= 27.000 (fs) Energy= -2382.63118 (Hatree) Temperature= 5450.336
time= 15.000 (fs) Energy= -2383.03635 (Hatree) Temperature= 7070.085
time= 30.000 (fs) Energy= -2383.15085 (Hatree) Temperature= 6282.041
time= 30.000 (fs) Energy= -2383.15253 (Hatree) Temperature= 6281.631
time= 23.000 (fs) Energy= -2382.01161 (Hatree) Temperature= 4595.027
time= 29.000 (fs) Energy= -2382.98249 (Hatree) Temperature= 6005.417
time= 27.000 (fs) Energy= -2382.63166 (Hatree) Temperature= 5478.975
time= 30.000 (fs) Energy= -2383.15240 (Hatree) Temperature= 6282.250
time= 28.000 (fs) Energy= -2382.80883 (Hatree) Temperature= 5723.596
time= 30.000 (fs) Energy= -2383.15002 (Hatree) Temperature= 6281.650
time= 31.000 (fs) Energy= -2383.31392 (Hatree) Temperature= 6548.080
time= 31.000 (fs) Energy= -2383.31410 (Hatree) Temperature= 6547.894
time= 28.000 (fs) Energy= -2382.81090 (Hatree) Temperature= 5723.647
time= 30.000 (fs) Energy= -2383.15069 (Hatree) Temperature= 6281.873
time= 28.000 (fs) Energy= -2382.80936 (Hatree) Temperature= 5781.812
time= 31.000 (fs) Energy= -2383.31127 (Hatree) Temperature= 6548.545
time= 31.000 (fs) Energy= -2383.30871 (Hatree) Temperature= 6547.873
time= 29.000 (fs) Energy= -2382.98308 (Hatree) Temperature= 6006.163
time= 32.000 (fs) Energy= -2383.45997 (Hatree) Temperature= 6799.884
time= 32.000 (fs) Energy= -2383.45767 (Hatree) Temperature= 6799.894

This started to happen with every run. We actually have a new Bull HPC-Cluster at RWTH Aachen University, where I work (essentially more CPUs, an upgrade to Scientific Linux 6.1, a new batch system, Platform LSF 8, etc. with respect to the old cluster). I had installed the previous version of the code (3.5) on our old cluster and it ran perfectly well. Now this problem occurs with the old version (3.5) reinstalled on the new cluster as well as with the new version of the code (3.6). Any thoughts? With best regards, Denis
|
Re: convergence problems with OpenMX 3.6 ( No.6 ) |
- Date: 2011/11/12 02:19
- Name: N.Kolchenko <nkolchenko@mail.ru>
- Hi,
I've never seen such results before, although I use the -O2 level as a rule (for programs with MPI too). I agree that a dependence of calculation results on the compiler optimization level is a very strange and potentially dangerous situation for any code (not only OpenMX). It would be very interesting to clear up the nature of the effect, and perhaps to reproduce it with a simple example program (provided, of course, that it is not an internal compiler, MKL, or MVAPICH bug), in order to eliminate it in the future.
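Since a simple example program was mentioned: floating-point addition is not associative, so any optimization that reassociates sums (vectorized reductions, -ffast-math, or the value-unsafe floating-point models some compilers enable at high optimization levels) may legitimately change numerical results. Here is a toy C illustration of the mechanism only (certainly not the OpenMX bug itself):

!====float_order.c (toy illustration)====
#include <stdio.h>

int main(void)
{
  double big = 9007199254740992.0;  /* 2^53: the spacing of doubles here is 2 */
  double small = 1.0;

  double a = (big + small) - big;   /* big + small rounds back to big -> 0 */
  double b = big + (small - big);   /* small - big is exact here      -> 1 */

  printf("(big + small) - big = %g\n", a);
  printf("big + (small - big) = %g\n", b);
  return 0;
}
!========================================

The two expressions are mathematically identical, yet print 0 and 1; a compiler allowed to reorder a long reduction can therefore shift a result without being "wrong" in its own terms.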
Computational environment:
processor - Intel EM64T Xeon X54xx (Harpertown), 3000 MHz
OS - CentOS 5.6
compiler - Intel Compiler XE 12.0
libraries - MKL (12.0.3), MVAPICH-1.2rc1
More detailed information about the hardware is available on the site of the Joint Supercomputer Center (mvs100k). (But I'm afraid it may be obsolete...)
The results described by Atsushi M. Ito are similar in hardware and software environment and in system type (PBC!!), but were obtained with drastically different compiler options. With the -Dnompi option I have ALWAYS obtained normal Utot values for ANY type of system.
Regards,
NK
|
Re: convergence problems with OpenMX 3.6 ( No.7 ) |
- Date: 2011/11/14 17:17
- Name: T.Ozaki
- Hi,
To address the issue, we have released a patch. Please take a look at http://www.openmx-square.org/forum/patio.cgi?mode=view&no=1351
It would be very nice if you could test it and report your results on the forum.
Thank you very much for your cooperation in advance.
Regards,
TO
|
Re: convergence problems with OpenMX 3.6 ( No.8 ) |
- Date: 2011/11/14 22:25
- Name: Denis Music <music@mch.rwth-aachen.de>
- Hello,
Thanks a lot for your efforts. It's really great that you managed to come up with a new patch so quickly. I've recompiled the code as follows:

option I:  CC = mpicc -openmp -O0 -I/usr/local/include -I$(HOME)/include
option II: CC = mpicc -openmp -O3 -I/usr/local/include -I$(HOME)/include

Energy-wise it's fine (the "minor" memory-leak issue still remains), but I can't get rid of the time-step sequence problem. With best regards, Denis
|
Re: convergence problems with OpenMX 3.6 ( No.9 ) |
- Date: 2011/11/15 06:32
- Name: N.Kolchenko <nkolchenko@mail.ru>
- Hello,
Thanks a lot for your weekend efforts.
Test calculations of GaAs were done for two platforms:
(1) CC = mpicc -openmp -O2 ; LIB = MKL(12.0.3)
(2) CC = gcc -O2 -fopenmp -Dnompi ; LIB = MKL(11.1).
At first sight the results are very good:

         GaAs.out (patch3.6.1)       (1)                   (2)
Uele.      -24.777812622570      -24.777812557532      -24.777812557533
Ukin.      199.855515255774      199.855515277725      199.855515277724
UH0.     -1803.476854443712    -1803.476854443708    -1803.476854443711
UH1.         0.005710466096        0.005710465894        0.005710465894
Una.       -41.108102136840      -41.108102160460      -41.108102160460
Unl.      -150.724995804425     -150.724995801476     -150.724995801480
Uxc0.      -19.367445896556      -19.367445900142      -19.367445900142
Uxc1.      -19.367445820102      -19.367445823685      -19.367445823685
Ucore.    1644.494082264319     1644.494082264319     1644.494082264319
Utot.     -189.689536115446     -189.689536121534     -189.689536121541

(The first 7 digits after the point of Utot are identical.) One can guess that GaAs.out (patch) was calculated with another library, maybe ATLAS (see, especially, the neutral-atom potential and the exchange-correlation energies)...
Regards,
NK
|
Re: convergence problems with OpenMX 3.6 ( No.10 ) |
- Date: 2011/11/15 13:17
- Name: T.Ozaki
- Hi,
Thank you very much for reporting your benchmark calculations.
As for the issue raised by Dr. Music:
> I don't get rid of the time-step sequence problem.
I think this is a problem related to the MPI environment. It can happen when the MPI library you use for job submission differs from the one you used for compilation. In other words, the same MPI library should be used for both compilation and execution.
In environments where many different MPI libraries are installed on the same machine, such an inconsistency easily happens.
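As a quick sanity check, one can compare the MPI standard version recorded in the mpi.h seen at compile time against the version reported by the library actually linked at run time. A minimal, hypothetical helper (not part of OpenMX; note it only detects mismatched MPI standard levels, so two different vendor libraries at the same level would still pass):

!====mpi_check.c (hypothetical helper)====
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int ver, subver, rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_version(&ver, &subver);  /* version of the library linked at run time */
  if (rank == 0)
    printf("compiled against MPI %d.%d, running with MPI %d.%d\n",
           MPI_VERSION, MPI_SUBVERSION, ver, subver);
  MPI_Finalize();
  return 0;
}
!==========================================

Compile it with the same mpicc used for OpenMX and submit it through the same batch system; differing numbers would confirm the inconsistency.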
Regards,
TO
|
Re: convergence problems with OpenMX 3.6 ( No.11 ) |
- Date: 2011/11/15 16:03
- Name: Denis Music <music@mch.rwth-aachen.de>
- Hello everybody,
Thanks a lot for all your ideas and kind support. All problems are gone now after switching to the Intel MKL library instead of acml. The code runs really well. Regards, Denis
|
Re: convergence problems with OpenMX 3.6 ( No.12 ) |
- Date: 2011/11/16 16:55
- Name: N.Kolchenko <nkolchenko@mail.ru>
- Dear developers,
I didn't want to mix everything in one thread, but now it seems that the problem with the Utot divergence is fixed, and one can pay attention to another question.
There are a few warnings during compilation:
(1)
Memory_Leak_test.c(299): warning #181: argument is incompatible with corresponding format string conversion
    printf("Used_VSZ (kbyte) = %6d\n", (long int)(Used_VSZ));fflush(stdout);
Memory_Leak_test.c(300): warning #181: argument is incompatible with corresponding format string conversion
    printf("Used_RSS (kbyte) = %6d\n", (long int)(Used_RSS));fflush(stdout);
It's the same thing reported by D.Music (reply No.4). It shouldn't cause a problem at runtime.
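For what it is worth, this warning is only a format-specifier mismatch: the argument is cast to long int but printed with %6d, which expects int. A sketch of the obvious fix (my guess, not an official patch) is to use the matching length modifier:

!====hypothetical fix====
/* %6d expects int; the cast yields long int, so use %6ld */
printf("Used_VSZ (kbyte) = %6ld\n", (long int)(Used_VSZ)); fflush(stdout);
printf("Used_RSS (kbyte) = %6ld\n", (long int)(Used_RSS)); fflush(stdout);
!========================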
I'm not so sure about:
(2)
!===mpicc compiler===========
liberi-091216/source/eri.c(133): warning #167: argument of type "double **" is incompatible with parameter of type "const double **"
    ERI_LL_Overlap_d(solver, p, dp, alp1, dalp1, alp2, dalp2, cx);
liberi-091216/source/eri.c(133): warning #167: argument of type "double **" is incompatible with parameter of type "const double **"
    ERI_LL_Overlap_d(solver, p, dp, alp1, dalp1, alp2, dalp2, cx);
liberi-091216/source/eri.c(228): warning #167: argument of type "double *" is incompatible with parameter of type "double (*)[3][2]"
    ERI_Integral_GL_d(solver, I4, dI4, glF, glG, dglF, dglG,
liberi-091216/source/eri.c(228): warning #167: argument of type "double **" is incompatible with parameter of type "const double **"
    ERI_Integral_GL_d(solver, I4, dI4, glF, glG, dglF, dglG,
liberi-091216/source/eri.c(228): warning #167: argument of type "double **" is incompatible with parameter of type "const double **"
    ERI_Integral_GL_d(solver, I4, dI4, glF, glG, dglF, dglG,
!==========================
The same for gcc (originally output partly in Russian; translated here):

!===gcc=====================
liberi-091216/source/eri.c: In function 'ERI_Overlap':
liberi-091216/source/eri.c:133: warning: passing argument 5 of 'ERI_LL_Overlap_d' from incompatible pointer type
liberi-091216/source/eri.h:527: note: expected 'const double **' but argument is of type 'double **'
liberi-091216/source/eri.c:133: warning: passing argument 7 of 'ERI_LL_Overlap_d' from incompatible pointer type
liberi-091216/source/eri.h:527: note: expected 'const double **' but argument is of type 'double **'
liberi-091216/source/eri.c: In function 'ERI_Integral':
liberi-091216/source/eri.c:229: warning: passing argument 3 of 'ERI_Integral_GL_d' from incompatible pointer type
liberi-091216/source/eri.h:571: note: expected 'double (*)[3][2]' but argument is of type 'double *'
liberi-091216/source/eri.c:229: warning: passing argument 6 of 'ERI_Integral_GL_d' from incompatible pointer type
liberi-091216/source/eri.h:571: note: expected 'const double **' but argument is of type 'double **'
liberi-091216/source/eri.c:229: warning: passing argument 7 of 'ERI_Integral_GL_d' from incompatible pointer type
liberi-091216/source/eri.h:571: note: expected 'const double **' but argument is of type 'double **'
!================================================================
That is indeed the case:
!====in eri.h==================
void ERI_LL_Overlap_d(
  ERI_t *ptr,
  double *p, double *dp[3],
  const double *a1, const double *da1[3],
  const double *a2, const double *da2[3],
  double x
);
!====in eri.c==================
......
double *dgam1, *dgam2, *dalp1[3], *dalp2[3], *dp[3], *dF[3];
......
ERI_LL_Overlap_d(solver, p, dp, alp1, dalp1, alp2, dalp2, cx);
!==============================
!====in eri.h==================
void ERI_Integral_GL_d(
  ERI_t *ptr,
  double I4[2],          /* (OUT) */
  double dI4[4][3][2],
  const double *F1,      /* (IN) Overlap matrix */
  const double *F2,      /* (IN) */
  const double *dF1[3],  /* (IN) Overlap matrix */
  const double *dF2[3],  /* (IN) */
  double R,              /* (IN) Displacement of two expansion centers */
  double theta,
  double phi,
  double cx12,
  double cx34,
  double delta,
  double omega,          /* (IN) screening parameter */
  int lmax1
);
!====in eri.c==================
double *dI4,  /* (OUT) derivatives */
...........
double *glF, *glG, *dglF[3], *dglG[3];
............
ERI_Integral_GL_d(solver, I4, dI4, glF, glG, dglF, dglG,
                  R[0], R[1], R[2], cx12, cx34, 1e-10, scr, lmax_gl);
!========================================
I guess that the intent(in) attribute (sorry for the Fortran terminology) of the formal parameters should be preserved, and that one need only change the declarations of the input arguments. (I think the flat pointer "double *dI4" is a slip of the pen.)
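For what it is worth, warning #167 here is the classic C gotcha that "double **" does not implicitly convert to "const double **" (unlike "double *" to "const double *"), because that conversion could be used to smuggle a const object behind a non-const pointer. A minimal, hypothetical sketch of the situation and two common remedies (my illustration, not the actual liberi code):

!====const_gotcha.c (hypothetical illustration)====
#include <stdio.h>

/* callee promises not to modify the pointed-to doubles,
   as ERI_LL_Overlap_d does for da1/da2 */
static double sum_first(const double *a[3])
{
  return a[0][0] + a[1][0] + a[2][0];
}

int main(void)
{
  double x = 1.0, y = 2.0, z = 3.0;
  double *da[3] = { &x, &y, &z };   /* non-const locals, as in eri.c */

  /* sum_first(da); would draw exactly this warning */

  /* remedy 1: an explicit cast at the call site,
     keeping the const-correct prototype */
  double s1 = sum_first((const double **)da);

  /* remedy 2: const-qualify the local array from the start */
  const double *ca[3] = { &x, &y, &z };
  double s2 = sum_first(ca);

  printf("%g %g\n", s1, s2);
  return 0;
}
!==================================================

The dI4 case is different: the header declares "double dI4[4][3][2]" (i.e. "double (*)[3][2]") while eri.c passes a flat "double *", which, as noted above, looks like a genuine slip of the pen rather than a const issue.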
Regards,
NK
|
Re: convergence problems with OpenMX 3.6 ( No.13 ) |
- Date: 2011/11/16 14:42
- Name: T.Ozaki
- Dear Dr. Kolchenko,
Thank you very much for reminding us of the warnings during compilation. Since we know that those parts do not affect the conventional calculations supported by OpenMX Ver. 3.6, we will not fix them immediately, but will fix them in the next release.
Thank you very much for your cooperation.
Best regards,
Taisuke Ozaki
|