In the band calculation, a triple parallelization is made for three loops: spin multiplicity, k-points, and eigenstates, where the spin multiplicity is one for the spin-unpolarized and non-collinear calculations, and two for the spin-polarized calculation, respectively. The priority of parallelization is in order of spin multiplicity, k-points, and eigenstates. In addition, when the number of processes used in the parallelization exceeds (spin multiplicity)(the number of k-points), OpenMX uses an efficient way in which finding the Fermi level and calculating the density matrix are performed by just one diagonalization at each k-point. For the other cases, twice diagonalizations are performed at each k-point for saving the size of used memory in which the second diagonalization is performed to calculate the density matrix after finding the Fermi level. In Fig. 25 (c) we see a good speed-up ratio as a function of processes in the elapsed time for a spin-unpolarized calculation of carbon diamond consisting of 64 carbon atoms with 333 k-points. The input file 'DIA64_Band.dat' is found in the directory 'work'. In this case the spin multiplicity is one, and the number of k-points used for the actual calculation is (3*3*3-1)/2+1=14, since the k-points in the half Brillouin zone is taken into account for the collinear calculation, and the -point is included when all the numbers of k-points for a-, b-, and c-axes are odd. So it is found that the speed-up ratio exceeds the ideal one in the range of processes over 14, which means the algorithm in the parallelization is changed to the efficient scheme. As well as the cluster calculation, OpenMX Ver. 3.9 employs ELPA [39] to solve the eigenvalue problem in the band calculation, which is a highly parallelized eigevalue solver. Either ELPA1 or ELPA2 can be chosen by the following keyword:
scf.eigen.lib elpa1 # elpa1|elpa2, default=elpa1The default choice is ELPA1. Our benchmark calculations suggest that ELPA1 and ELPA2 are comparable to each other with respect to the computational speed, while we do not show the benchmark calculations here.