Taisuke Ozaki (ISSP), Masahiro Fukuda (ISSP), and Gengping Jiang (Wuhan Univ. of Sci. and Tech.)
First-principles electronic structure calculations based on the density functional theories (DFT) [1,2] have been playing a versatile role in a wide variety of materials sciences to deeply understand physical and chemical properties of existing materials and even to design novel materials having a desired property before actual experiments [3,4,5,6]. In recent years more complicated materials with secondary structures such as heterointerfaces [7,8] and dislocations  have been becoming a scope of application by DFT calculations. Since these complicated structures cannot be easily modeled by a small unit cell, development of efficient DFT methods in accordance with development of massively parallel computers is crucial to realize such large-scale DFT calculations. Among efficient DFT methods [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,26,25,27,28,29,30,31,32,33,34], O() methods whose computational cost scales linearly as a function of the number of atoms have enabled us to extend the applicability of DFT to large-scale systems [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34]. Nevertheless applications of the O() methods to metallic systems have been still limited because of the fundamental difficulty of a truncation scheme in real space, which is an idea commonly adopted in most of the O() methods [12,13], in realizing the O() methods  as discussed in the Appendix. A theoretically proper approach to go beyond the truncation scheme is to take account of the contribution from the external region beyond the truncated region via a self energy in Green function formalisms [26,27,28,29,30]. Another straightforward approach is to use a relatively large cutoff radius in the truncation scheme in order to reach a sufficient accuracy. The latter approach might be suited to the divide-conquer (DC) method [31,32] among the O() methods proposed so far in the following twofold aspects: (i) There is a way to reduce the computational cost in the framework of the DC method by lowering the dimension of matrices with an introduction of a Krylov subspace [33,34]. (ii) The computational time of the DC method can be reduced by a massive parallelization, since the calculations in the DC method are performed nearly independently for each atom [35,36]. With the two aspects the improvement of the DC method can be a promising direction to develop an accurate, efficient, and robust O() method applicable to not only insulators and semi-conductors, but also metals. Along this line, the DC method based on the Krylov subspace has been proven to be an efficient and accurate O() method by a wide variety of applications such as dynamics of Li ions in a lithium ion battery [37,38,39] and structure optimization of semicoherent heterointerfaces in steel [7,8]. However, there exist drawbacks in the generation of the Krylov subspace . Since the Krylov subspace is generated at the first self-consistent field (SCF) step and kept unchanged during the subsequent SCF calculation, calculated quantities such as the electron density and total energy depend on the initial guess of electron density or Hamiltonian matrix elements. If the Krylov subspace is regenerated every SCF step to avoid the dependency on the initial guess, the computational efficiency must be largely degraded. In addition the iterative calculations in the generation of the Krylov subspace tend to suffer from numerical round-off error, leading to an uncontrollable behavior in the SCF calculation if the Krylov subspace is regenerated every SCF step . Therefore, a more robust approach needs to be developed to improve the DC method, which overcomes the drawbacks inherent in the DC method based on the Krylov subspace.
In this paper, we focus on localized single-particle natural orbitals (LNOs) calculated by a low-rank approximation to perform a coarse graining of basis functions, and apply the LNOs to the DC method, which will be referred to as the DC-LNO method hereafter, to reduce the dimension of matrices without sacrificing the accuracy. The DC-LNO method overcomes the drawbacks in the DC method based on the Krylov subspace, while taking account of reduction of the prefactor in the computational cost and a simple algorithm with less communication leading to a high parallel efficiency. A series of benchmark calculations clearly demonstrates that the DC-LNO method is an accurate, efficient, and robust O() method applicable to not only insulators and semi-conductors, but also metals in step with recent development of massively parallel computers.
The paper is organized as follows: In the section II we propose a method to generate LNOs, and present the DC-LNO method as an extension of the DC method. In the section III the implementations of the method are discussed in detail. In the section IV a series of benchmark calculations is presented. In the section V we summarize the theory of the DC-LNO method and numerical aspects.
We consider an extension of a divide-conquer (DC) approach [31,32] by introducing a coarse graining of basis functions as shown in Fig. 1. For each atom the Kohn-Sham (KS) Hamiltonian and overlap matrices are truncated within a given cutoff radius, and the resultant truncated cluster problem is solved atom by atom, leading to the O() scaling in the computational cost. As expected the error of truncation can be systematically reduced in exchange for the increase of computational cost as the cutoff radius increases. The number of atoms in a truncated cluster exceeds 300 atoms in many cases in order to attain a sufficient accuracy as discussed later on. To reduce the computational cost we introduce a coarse graining of basis functions that the original basis functions, pseudo-atomic orbitals (PAOs) in our case [40,41], are replaced by localized single-particle natural orbitals (LNOs) in the long range (yellow) region to represent the Hamiltonian and overlap matrices, while the PAO functions remain unchanged in the short range (orange) region. In the following subsections a method of generating LNOs and a DC method with LNOs are discussed in detail.
We present a method of generating LNOs based on a low-rank approximation via a local eigendecomposition of a projection operator. The method might be applicable to any local basis functions, and not limited to the application to the DC method we discuss in the paper. Even in the conventional O() calculations, the LNOs can be easily obtained by a non-iterative calculation using the density matrix and overlap matrix elements. Since a smaller number of LNOs well reproduce dispersion of occupied bands, they can be an alternative basis set for a compact representation for the Hamiltonian and overlap matrices. Though in this subsection we present a method of calculating LNOs in a general form starting from Bloch functions to clarify a mathematical basis of LNOs, it should be clearly noticed in this place that the density matrix is directly calculated in the DC-LNO method without calculating the Bloch functions. The general formulation presented here provides a theoretical basis of three step algorithm which will be discussed in the end of the subsection.
Under the Born-von Karman boundary condition
we expand a Kohn-Sham (KS) orbital
indexed with a k-vector in the first Brillouin zone and band index , using PAOs [33,34],
being a real function, as
The computational procedure to generate LNOs is summarized as follows: (i) Calculation of . For each atom the matrix is calculated by Eq. (17), where the summation over is limited within a finite range because of the locality of PAOs in real space. (ii) Diagonalization of . Since the matrix is nonsymmetric, the diagonalization in Eq. (20) is performed by a generalized eigenvalue solver for a nonsymmetric matrix such as DGEEV in LAPACK . (iii) Selection of . Eigenvectors whose eigenvalues are larger than or equal to the threshold value are selected as LNOs. Only the overlap and density matrices are required to calculate LNOs through the steps (i)-(iii) above. Therefore, either conventional O() methods or O() methods can be employed as eigenvalue solver as long as they generate the density matrix. As for LNOs other than those in the central cell with , it is apparent from the derivation that one can obtain by parallel translation of with the lattice vector . The method can also be extended to choose another energy window. In the projection operator defined by Eq. (7), the Fermi-Dirac function is introduced to choose the occupied space. However, one can choose a proper energy window in the definition of the projection operator depending on what we discuss, which enables us to focus on specific bands such as localized -bands near the Fermi level. In this sense, LNOs can be utilized like Wannier functions (WFs) [49,50], while WFs are obtained through a unitary transformation of Bloch functions rather than the low-rank approximation, and they are orthonormal each other unlike LNOs. It is also worth pointing out that our method shares the basic idea based on the projection with other methods such as the quasiatomic orbitals scheme [51,52,53], a projection method , and a method via selected columns of the density matrix [55,56].
It might be possible for the method we present in the paper to be applied for other localized basis functions such as finite element methods [57,58] and finite difference methods [59,60,61]. In those cases one may introduce spatial partitioning methods such as Voronoi tessellation to decompose basis functions rather than focusing on basis functions on a single grid. A similar procedure can be applied for a set of partitioned basis functions.
Here we consider an extension of the DC method [31,32] using LNOs discussed in the previous subsection.
Our theoretical basis to formulate the O() DC method is that the total energy and atomic forces
in the KS framework can be calculated by using electron density , density matrix ,
and energy density matrix defined by
The overall procedure of the DC method with LNOs is summarized as follows: (i) Calculation of LNOs. LNOs are calculated by Eq. (20) for all atoms in the central cell with . At every SCF step LNOs are updated, leading to self-consistent determination of LNOs. (ii) Construction of and . For each atom the KS and overlap matrices for a truncated cluster associated with the atom are constructed by Eq. (26). (iii) Diagonalization of by making use of a parallel eigenvalue solver. (iv) Finding a common chemical potential to conserve the total number of electrons in the system using Eq. (27), as discussed in Ref.  in detail. (v) Calculation of density matrix and energy density matrix using Eqs. (24), (25), and (27). (vi) Calculation of electron density using Eq. (23).
We have implemented the DC-LNO method into the OpenMX DFT software package  which is based on norm-conserving pseudopotentials (PPs) [67,68] and optimized pseudo-atomic orbitals (PAOs) [40,41] as basis set. All the benchmark calculations were performed with a computational condition of a production level. The basis functions used are C6.0-s2p2d1, Si7.0-s2p2d1, Ti7.0-s2p2d1, O6.0-s2p2d1, Li8.0-s3p2, Al7.0-s2p2d1, and Fe5.5-s3p2d2 for carbon, silicon, titanium, oxygen, lithium, aluminum, and iron, respectively, where in the abbreviation of basis functions such as C6.0-s2p2d1, C stands for the atomic symbol, 6.0 the cutoff radius (Bohr) in the generation by the confinement scheme, and s2p2d1 means the employment of two, two, and one optimized radial functions for the -, -, and -orbitals, respectively. The radial functions were optimized by a variational optimization method . These basis functions we used can be regarded as double zeta plus polarization basis sets if we follow the terminology of Gaussian basis functions. As valence electrons in the PPs we included and , and , , , , and , and , , , and , and , and , , , and states for carbon, silicon, titanium, oxygen, lithium, aluminum, and iron, respectively. All the PPs and PAOs we used in the study were taken from the database (2013) in the OpenMX website , which were benchmarked by the delta gauge method . Real space grid techniques are used for the numerical integrations and the solution of the Poisson equation using FFT with the energy cutoff of 300 Ryd . We used a generalized gradient approximation (GGA) proposed by Perdew, Burke, and Ernzerhof to the exchange-correlation functional . An electronic temperature of 300 K is used to count the number of electrons by the Fermi-Dirac function for all the systems we considered.
The short and long range regions depicted in Fig. 1 are determined as follows: (i) We first pick up atoms in a sphere with a given cutoff radius . (ii) Among the atoms selected by the step (i) we distinguish the first neighboring atoms (FNAs) having non-zero overlap with the central atom in terms of basis functions, and remaining atoms other than FNAs are called the second neighboring atoms (SNAs), where the number of FNAs and SNAs are and , respectively. (iii) The short range region is determined by adjusting a cutoff radius so that the number of atoms in a sphere with a radius of can be as close as possible to , where the parameter can vary from 0 to 1, and we will discuss the choice of later on. (iv) The long range region consists of remaining atoms other than atoms selected by the step (iii). If we assign FNAs to atoms in the short range region, the total energy does not converge to the numerically exact one calculated by the conventional diagonalization method even if the cutoff radius increases systematically. This is because the error with the low-rank approximation by Eq. (22) keeps increasing as the cutoff radius increases. To avoid the situation, we add a buffer region consisting of about atoms as described by the step (iii) above, which guarantees the convergence of the total energy and other quantities as a function of the cutoff radius . Throughout the study, we used of for all the systems by taking the accuracy into account more than the efficiency, and did not adjust the parameter, while a smaller value, which well balances both the accuracy and efficiency, can be employed for some systems.
The way of parallelization for the DC-LNO method on parallel computers will be discussed together with its benchmark calculations later on.
In order to investigate to what extent LNOs can span occupied spaces, we compare band dispersions of gapped and metallic systems calculated with PAOs and LNOs. Figure 2(a) and (b) show band dispersions of diamond and silicon calculated by a conventional O() diagonalization method with PAOs and LNOs. For both the cases the SCF calculations were performed by using PAOs. For the case of LNOs the band dispersions were calculated with the LNOs after the SCF calculations with PAOs. The number of LNOs per atom is 4 for both carbon and silicon atoms. It is found that in both the cases the band dispersions of occupied space are well reproduced with LNOs compared to those calculated by PAOs, while a large difference can be seen in conduction bands between PAOs and LNOs as expected. The good agreement between PAOs and LNOs in describing the occupied bands implies that the low-rank approximation by Eq. (22) is reasonably valid. As shown in Table I we see that the first four eigenvalues of the matrix are actually dominant for both diamond and silicon, justifying the low-rank approximation. The largest and the next three eigenvalues correspond to a -orbital and -like orbitals deformed by contribution of -orbitals, respectively. It is also noted that the eigenvalues can be negative, which is related to a negative value of Mulliken populations for delocalized orbitals [71,72]. As well as the gapped systems, similar calculations were performed for metals, lithium in the body centered cubic (BCC) structure and aluminum in the face centered cubic (FCC) structure as shown in Figs. 3 and 4, respectively. The band dispersions calculated with the minimal LNOs are reasonably compared to those by PAOs, while the use of the five and nine LNOs for Li and Al atoms fully reproduce the band dispersions including conduction bands as shown in Figs. 3(b) and 4(b), respectively. One can confirm again in Table I that eigenvalues for the minimal LNOs are dominant even for metals, while the magnitude of the subsequent eigenvalues is relatively large compared to those of the gapped systems. Thus, we conclude that LNOs can be regarded as a compact basis set spanning well the occupied space for both gapped and metallic systems.
As a first step of validation of the DC-LNO method with respect to computational accuracy and efficiency, we show in Fig. 5 the absolute error in the total energy of gapped and metallic systems calculated by the DC and the DC-LNO methods, where the reference energies were calculated by the conventional O() method with dense k-points for the Brillouin zone sampling as given in the caption of Figs. 2-4. For all the cases the threshold value in Eq. (22) was set to be 0.1 which gives us the minimal LNOs corresponding to orbitals of valence electrons. In the gapped systems, diamond, silicon, and rutile TiO, the absolute error decreases almost exponentially as the number of atoms in a truncated cluster increases. The overall behaviors of the error between the DC and DC-LNO methods are similar to each other, while the error by the DC-LNO method is accidentally much smaller than that by the DC method in the case of large truncated clusters of diamond. As well as the gapped systems the absolute error for metals calculated by the DC-LNO method decreases as increasing the number of atoms in a truncated cluster in a similar way to the DC method. The relatively large oscillating behavior observed in the metals might be related to long range characteristics of the off-diagonal Green functions as discussed in the Appendix. For all the cases including metals it is found that a truncated cluster including about 300 atoms is required to attain the millihartree accuracy corresponding to the error less than a few millihartree/atom in the total energy. From the comparison between the DC and DC-LNO methods, we see that the computational accuracy does not degrade largely even if basis functions for atoms in the long range region are approximated by LNOs, and that thereby the computational accuracy can be controlled mainly by the size of truncated cluster just like for the DC method. Since controlling only the single parameter allows us to balance the computational accuracy and efficiency, it is expected that the feature makes the DC-LNO method easy to use for a wide variety of applications.
Since the total number of basis functions to represent the Hamiltonian of the truncated cluster is
reduced by introducing LNOs while keeping the accuracy,
it is expected that the computational time can be substantially reduced.
In the whole procedure of the DC-LNO method the calculation of LNOs and construction of Hamiltonian
and overlap matrices occupy a small fraction of the whole computational time, typically less than ,
and thereby the computational time is mainly governed by solving of the eigenvalue problem
for each atom .
Noting that the computational time to solve the eigenvalue problem scales as the third power
of the dimension of the matrices, the ratio of computational time between the DC-LNO and DC methods
for elemental systems might be estimated by
In Fig. 7 we further show timing results of the DC-LNO method as a function of the number of atoms in the unit cell of Si crystal. It is confirmed from the linear increase of the computational time that the DC-LNO method is a linear scaling approach in practice. As a comparison the computational time is also shown for the conventional diagonalization method . The crossing point between the two methods in the computational time is located around 100 atoms when 1280 CPU cores is used. Over 100 atoms the DC-LNO method is much faster than the conventional diagonalization method. The reason why the crossing point is located at such a small number of atoms is partly due to a better parallel efficiency of the DC-LNO method as discussed in the next subsection.
To minimize the computational time on massively parallel computers we introduce a multilevel parallelization using message passing interface (MPI). In our implementation there are three levels for the parallelization, i.e., atom level, spin level, and diagonalization level as explained below: (i) Parallelization in the atom level. If the number of MPI processes is smaller than that of atoms, only the parallelization in the atom level is taken into account. The allocation of atoms to MPI processes is performed by a bisection method which is based on a projection of atoms onto a principal axis calculated from an inertia tensor and a modified binary tree of MPI processes to minimize memory usage and amount of MPI communications . (ii) Parallelization in the spin level. If the number of MPI processes exceeds that of atoms, and the spin polarized calculation is performed, the parallelization in the spin level is introduced on top of the parallelization in the atom level, where a loop for the spin index is further parallelized. (iii) Parallelization in the diagonalization level. If the number of MPI processes is larger than the product of the number of atoms and the multiplicity of spin index, corresponding to 1, 2, and 1 for non-spin polarized, spin-polarized, and non-collinear calculations, respectively, a parallelization in the diagonalization level is further taken into account on top of both the parallelizations in the atom and spin levels. The parallelization in the diagonalization level is made by employing a parallel eigenvalue solver ELPA . It is noted that the parallelization in the diagonalization level requires a considerable amount of MPI communications, while the parallelizations in the atom and spin levels have less MPI communications. So, one would expect a high parallel efficiency in the atom and spin levels, while the parallelization in the diagonalization level might be limited up to several tens of MPI processes. To achieve a better scaling for the parallelization in the diagonalization level, it is important to allocate CPU cores in the same computer node as MPI processes to avoid the inter-node communication as much as possible. We have implemented the multilevel parallelization so that amount of the inter-node communication can be minimized especially for the parallelization in the diagonalization level. In Fig. 8 the speed-up ratio in the MPI parallelization of the DC-LNO method is shown for non-spin polarized calculations of a diamond supercell containing 64 atoms. Since the multiplicity of spin index is 1, we see a nearly ideal behavior up to 64 MPI processes. Beyond 64 MPI processes the parallelization in the diagonalization level is taken into account on top of the parallelization in the atom level. A superlinear speed-up is observed at 128 and 256 MPI processes, which might be due to an effective use of cache by the reduction of memory usage, and a good scaling is achieved up to 1280 MPI processes at which the parallel efficiency is calculated to be 70% using the elapsed time at 1 MPI process as reference. Since each computer node has 20 CPU cores in this case, it would be reasonable to observe the good scaling up to 1280 (=6420) MPI processes. Thus, we see that the multilevel parallelization is very effective to minimize the computational time in accordance with a recent development of massively parallel computers.
To verify the accuracy of forces on atoms calculated by the DC-LNO method, results of NVE molecular dynamics (MD) simulations are shown in Fig. 9. We see that the sum of the kinetic energy of atomic nuclei (Kinetic) and the internal total energy (), being a conserved quantity, is reasonably conserved as a function of time, and the fluctuation is about one tenth of the kinetic energy or the internal total energy. It should be noted that the approximate conservation of the sum is achieved for not only Si being a semi-conductor, but also Al being a metal. Thus, it can be concluded that the accuracy of forces on atoms calculated by the DC-LNO method is sufficient for practical purposes, while it was remarked in the Sec. II that the forces on atoms in the DC-LNO method are not calculated variationally. The sufficient accuracy of the calculated forces is achieved by the use of large cutoff radii in constructing the truncation clusters, which is realized by both the introduction of LNOs and the massive parallelization with the multilevel parallelism.
To further demonstrate the applicability of the DC-LNO method for MD simulations, we show radial distribution functions (RDFs) in liquid phases of silicon, aluminum, lithium, and SiO in Fig. 10. Since the electronic structures exhibit metallic features in the liquid phases of silicon, aluminum, and lithium, the MD simulations can be considered as a severe benchmark to validate the applicability of the DC-LNO method to metals. The cutoff radii we used correspond to truncated clusters consisting of about 300 atoms in the ideal bulk structures. It turns out that in all the cases the DC-LNO method reproduces well the results by the conventional diagonalization method, and that the obtained RDFs are well compared to other computational results [75,76,77,78]. The considerable agreement between the DC-LNO and conventional methods strongly implies that a sufficient accuracy in reproducing at least RDF for MD simulations can be attainable with a cutoff radius resulting in truncated clusters consisting of about 300 atoms for not only insulators but also metals. Thus, adjusting the cutoff radius so that the number of atoms in a truncated cluster can be 300 atoms would be a compromise to balance the computational accuracy and efficiency, while the difference between the DC and DC-LNO methods in terms of the computational efficiency may not be significant for truncated clusters of this size. It is crucial to minimize the elapsed time for realization of long time MD simulations. With the computational condition the elapsed time per SCF step for silicon is 1.5 (sec.) on average using 1280 MPI processes on the same machine used for the calculations shown in Fig. 8.
We have presented an efficient O() method based on the DC approach and a coarse graining of basis functions by localized single-particle natural orbitals (LNOs) for large-scale DFT calculations. A straightforward way to attain sufficient accuracy in the DC method is to employ a relatively large cutoff radius for the truncation of a system, which is the most fundamental parameter in most of O() methods to control the computational accuracy and efficiency. We have adopted the rather brute force approach, and attempted to decrease the computational cost by introducing LNOs as basis functions in the long range region of the truncated cluster, and to minimize the elapsed time in the computation with the help of a multilevel parallelization. The method of generating LNOs is based on a low-rank approximation to the projection operator for the occupied space by a local eigendecomposition at each atomic site, and the band structure calculations with PAOs and LNOs clearly show that the resultant LNOs span well the occupied space of not only gapped systems but also metals. It is also worth mentioning that the computational cost of generating LNOs is almost negligible thank to the independent calculation at each atomic site. By replacing PAOs with LNOs in the long range region of the truncated cluster in the DC method, the computational cost of the DC method can be reduced without largely sacrificing the accuracy. Nothing that the DC-LNO method holds the simple algorithm of the original DC method suited to the parallel calculation, we have implemented a multilevel parallelization using MPI by taking account of the atom level, spin level, and diagonalization level. It was demonstrated that the speed-up of the DC-LNO method by the multilevel parallelization can be expected up to a specific number of MPI processes which corresponds to the product of the number of atoms, the multiplicity of spin index, and the number of CPU cores in a single computer node. For example, if a spin-polarized calculation is performed for a system consisting of 1000 atoms on a parallel computer with 20 CPU cores per node, a high parallel efficiency might be expected up to 40000 MPI processes. As a validation of the applicability of the DC-LNO method, we have performed MD simulations for liquid phases of an insulator, semi-conductor, and metals, and confirmed that the RDFs calculated by the DC-LNOs are in good agreement with those by the conventional diagonalization method, which may lead to its various applications to structural determinations of amorphous and liquid structures of complicated materials [79,80,81,82,83]. Considering the simplicity and robustness of the algorithm, we conclude that the DC-LNO method is an efficient and accurate approach to large-scale DFT calculations for a wide variety of materials including metals.
This work was supported by Priority Issue (Creation of new functional devices and high-performance materials to support next-generation industries) to be tackled by using Post 'K' Computer, MEXT, Japan. Part of the computation in the study was performed using the computational facility of the Japan Advanced Institute of Science and Technology.
As an example we show asymptotic behaviors of the off-diagonal Green functions for an one-dimensional (1D) tight-binding (TB) model with a single -orbital on each site, and relate the asymptotic behaviors to electronic structures in gapped and metallic systems. The analysis interprets evidently the oscillating behavior of the error in the total energy of the metallic systems as shown in Fig. 5, and the rapid convergence in a high electronic temperature [33,24].
Let us consider an orthogonal chain model with the nearest neighbor interaction and the on-site energy
as defined by
We now relate the asymptotic behaviors of Green functions
to the calculation of density matrix which is defined by Eq. (24). Introducing the Matsubara expansion
of the Fermi-Dirac function, and changing the integration path with the Cauchy theorem, one has 
Although the matrix elements of the Green function associated with only the central atom in the calculation of each truncated cluster are calculated in our implementation, it is possible to generalize the single central atom to a cluster consisting of several atoms. An optimum size of the central cluster might be adjusted in order to reduce the computational cost as discussed in the Appendix of Ref. . However, symmetries in bulks, as can be confirmed in charge density and Mulliken populations, tend to be violated by introducing the central cluster instead of the single central atom. Therefore, we choose the single central atom rather than the central cluster in order to preserve the symmetrical properties.