Principal Component Analysis and Related Methods for Investigating the Dynamics of Biological Macromolecules

Kitao, Akio

doi:10.3390/j5020021

Open Access Review


by Akio Kitao

School of Life Science and Technology, Tokyo Institute of Technology, Tokyo 152-8550, Japan

J 2022, 5(2), 298-317; https://0-doi-org.brum.beds.ac.uk/10.3390/j5020021

Submission received: 11 May 2022 / Revised: 15 June 2022 / Accepted: 17 June 2022 / Published: 20 June 2022

(This article belongs to the Special Issue Advance in Molecular Thermodynamics)


Abstract


Principal component analysis (PCA) is used to reduce the dimensionalities of high-dimensional datasets in a variety of research areas. For example, biological macromolecules, such as proteins, exhibit many degrees of freedom, allowing them to adopt intricate structures and exhibit complex functions by undergoing large conformational changes. Therefore, molecular simulations of and experiments on proteins generate a large number of structure variations in high-dimensional space. PCA and many PCA-related methods have been developed to extract key features from such structural data, and these approaches have been widely applied for over 30 years to elucidate macromolecular dynamics. This review mainly focuses on the methodological aspects of PCA and related methods and their applications for investigating protein dynamics.

Keywords:

principal component analysis; collective variables; molecular dynamics; energy landscape; solvent effects; linear response theory; independent component analysis

1. Historical Overview

Principal component analysis (PCA) is a widely used multivariate analysis approach, originally proposed about 100 years ago [1,2], that has found increasing application for reducing the dimensionality of high-dimensional datasets since digital computers became widely available. This reduction is enabled by a linear transformation from the original variables to new collective variables, so that a small number of “principal components” dominate the features of the dataset. PCA is now regarded as an unsupervised machine learning technique.

The structures of proteins and other biological macromolecules are well characterized by a set of multidimensional variables, such as atomic coordinates and dihedral angles, and information regarding the dynamics of these molecules is typically obtained as a time series of high-dimensional data or an ensemble of experimentally determined structures. Although such large ensembles of high-dimensional data contain useful information, they are not easily interpretable. Therefore, extracting important features from high-dimensional data is essential to understand the dynamics of biological macromolecules.

Similar to the increased application of PCA in other areas, the use of PCA in analyzing protein dynamics has gradually become more common as the performance of computers has improved, making the molecular simulations of proteins more accessible. The first molecular dynamics (MD) simulation of a small folded protein, bovine pancreatic trypsin inhibitor (BPTI), in vacuum was conducted in 1977 [3], and the first protein normal mode analysis (NMA) for BPTI was performed in 1983 [4,5,6]. NMA is a harmonic approximation of protein dynamics at a potential energy minimum, and clearly shows that the low-frequency normal modes of proteins are collective motions of the atoms spread over the entire protein (namely, global motions) and that the lowest normal mode frequency is a few cm⁻¹. Since the vibrational frequencies of bond stretching modes are higher than this by three orders of magnitude, the amplitudes of the lowest and highest modes also differ by three orders of magnitude, indicating the highly anisotropic nature of proteins even within the range of vibrational motions. This high anisotropy may be partly attributed to the highly packed structures of folded native proteins, whose packing densities are comparable to that of a face-centered cubic lattice [7]. In highly packed structures, local motions uncorrelated with the surroundings are limited to small amplitudes because of possible collisions, while groups of atoms such as protein domains or loops can move in concert in certain directions largely without altering atomic packing. In 1981, Karplus and Kushick proposed a method to estimate the configurational entropy of macromolecules from NMA, MD and Monte Carlo (MC) simulations [8]. That publication also showed that simulations with (NMA) and without harmonic approximation (MD and MC) can be connected by PCA.
The length of the first reported protein MD was 8.8 ps [3], which roughly corresponds to one period of the lowest-frequency normal mode of typical small globular proteins and thus was insufficiently long to sample large-amplitude motions of the protein. However, increasing simulation lengths allowed investigation of the quasi-harmonic features of butane and BPTI, mainly focusing on quasi-harmonic frequencies deduced from PCA [9,10]. Later, projecting simulation trajectories onto collective coordinates was shown to be very useful for characterizing dominant protein dynamics, but the early stages of this endeavor used low-frequency normal modes for the projected collective variables [11]. Since normal modes are determined based only on one energy minimum, they are not necessarily the best choice to investigate the anharmonic nature of protein dynamics. In contrast, PCA determines principal coordinates as the collective coordinates, which incorporate anharmonic features included in the MD or MC trajectory. Longer and more realistic MD simulations in solution were performed from the 1980s to the early 1990s, allowing the PCA of MD trajectories. In the early 1990s, the anisotropic and anharmonic nature of native protein dynamics was elucidated by PCA, focusing on principal components (PCs), defined as the projections onto the principal coordinates [12,13,14,15]. PCA was also shown to be useful for analyzing simulation trajectories of protein folding/unfolding dynamics [16,17]. The past three decades have seen the frequent use of PCA to investigate the dynamic behavior of biopolymers, as well as many important methodological improvements and the elucidation of simulated dynamic features [18,19,20,21,22,23]. Since PCA employs a variance–covariance matrix for dimensionality reduction, it is useful for characterizing large-amplitude conformational changes in molecules, such as protein domain motion and folding. However, PCA may not be sensitive enough to detect localized, small-amplitude but functionally important motions, such as backrub motion [24], peptide-plane flip [25], side-chain flip and path-preserving motions [26].

This review provides an overview of PCA and related methods and their applications for investigating protein dynamics, focusing mainly on methodological aspects. In addition, some basic concepts and important findings obtained during the early years of this field are revisited for the benefit of non-experts, and the latest progress in PCA-related research is reviewed. The PCA applications described below illustrate cases in which macromolecular dynamics cannot be well characterized without PCA.

2. Basic Concept behind PCA

The investigation of macromolecular dynamics by PCA requires the selection of certain degrees of freedom of the target molecule that characterize the dynamics well. Consider a vector of general coordinates of a target molecule or molecules, $q$, and suppose that $q = \{q_{i}\}$ is a column vector consisting of $f$ variables ($i = 1, \dots, f$). Then $〈q〉 = \{〈q_{i}〉\}$ is the average and $Δ q = q - 〈q〉$ is the deviation from the average. To explicitly indicate the index of the $m$th data point among $M$ ($m = 1, \dots, M$), we use the expression $Δ q_{m}$. When an MD trajectory is considered, the expression $Δ q (t_{m})$ is employed to specify a coordinate set at time $t_{m}$. To conduct PCA, a variance–covariance matrix $C$ is introduced:

C = \frac{1}{M} \sum_{m = 1}^{M} Δ q_{m} Δ q_{m}^{T} = 〈Δ q Δ q^{T}〉 = \{〈(q_{i} - 〈q_{i}〉) (q_{j} - 〈q_{j}〉)〉\},

(1)

where $Δ q^{T}$ represents the transpose of $Δ q$ and $〈\dots〉$ denotes the simple average over a given dataset. The matrix $C = \{C_{i j}\}$ is a positive semidefinite matrix whose eigenvalues are non-negative. By introducing an $f \times M$ matrix of the whole dataset:

Q = \{Δ q_{1} \dots Δ q_{M}\},

(2)

$C$ can be obtained as a matrix product:

C = \frac{1}{M} Q Q^{T},

(3)

where $Q^{T}$ denotes the transpose of $Q$. By solving the standard eigenvalue problem with the orthonormal condition:

C V = V λ,

(4)

V V^{T} = V^{T} V = I,

(5)

we obtain $V$ and $λ$, which are the eigenvector and eigenvalue matrices, respectively; $I$ is the unit matrix. $λ$ is a diagonal matrix whose αth diagonal element $λ_{α}$ ($α = 1, \dots, f$) is the variance of the αth PC, and the αth column vector $v_{α}$ of $V$ is the corresponding eigenvector. Typically, the $λ_{α}$ are sorted in descending order such that the first PC shows the largest variance $λ_{1}$ and the corresponding column vector $v_{1}$ of $V = (v_{1} \dots v_{f})$ is the eigenvector of the first PC.

Projection of $Δ q_{m}$ onto $V$ provides:

σ_{m} = V^{T} Δ q_{m},

(6)

where $σ_{m}$ is a column vector of the projections (principal components). The overall linear transformation of $Q$ using $V$ gives a projection matrix $Σ = (σ_{1} \dots σ_{M})$ onto the PCs:

Σ = V^{T} Q .

(7)

The $f \times M$ matrix $Σ$ represents the matrix of principal components, which are collective variables defined as linear combinations of the original coordinates, and the elements of $V$ are the coefficients of the transformation.
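As a minimal sketch of Equations (1) to (7), the following NumPy code builds $Q$ from a data array, diagonalizes $C$ and projects the data onto the eigenvectors. The function name `pca_covariance`, the row-per-snapshot layout of `X` and the synthetic Gaussian data are assumptions made for illustration only; they are not taken from any particular simulation package.

```python
import numpy as np

def pca_covariance(X):
    """PCA of X with shape (M, f): M snapshots, f coordinates.

    Returns the variances (descending), the eigenvectors as columns of V,
    and the f x M matrix of projections Sigma.
    """
    M = X.shape[0]
    Q = (X - X.mean(axis=0)).T        # f x M matrix of deviations, Eq. (2)
    C = Q @ Q.T / M                   # variance-covariance matrix, Eq. (3)
    lam, V = np.linalg.eigh(C)        # symmetric eigenproblem, Eq. (4)
    order = np.argsort(lam)[::-1]     # sort so that the first PC has the largest variance
    lam, V = lam[order], V[:, order]
    Sigma = V.T @ Q                   # principal components, Eq. (7)
    return lam, V, Sigma

rng = np.random.default_rng(0)
# Synthetic anisotropic toy data (an assumption, not simulation output).
scales = np.array([5.0, 2.0, 1.0, 0.5, 0.2, 0.1])
X = rng.normal(size=(2000, 6)) * scales
lam, V, Sigma = pca_covariance(X)
print(lam)
```

In this sketch the mean square of each row of `Sigma` reproduces the corresponding eigenvalue, which is the sense in which $λ_{α}$ is the variance of the αth PC.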

PCA can be conducted by solving the standard eigenvalue problem (Equation (4)), which requires the diagonalization of $C$ (an $f \times f$ matrix). PCA can also be performed by the singular value decomposition (SVD) of $Q$ (an $f \times M$ matrix). SVD directly provides a decomposition of $Q$ into three matrices:

Q = \sqrt{M} V λ^{1 / 2} U^{T},

(8)

where $U^{T}$ is an $M \times M$ matrix of normalized projections defined as:

U^{T} = \frac{1}{\sqrt{M}} λ^{- 1 / 2} Σ = \frac{1}{\sqrt{M}} λ^{- 1 / 2} V^{T} Q .

(9)

In this case, $λ^{1 / 2}$ is an $f \times M$ matrix whose non-diagonal elements are zero. From Equations (4), (5) and (9), it is straightforward to obtain the condition:

U^{T} U = I .

(10)

To quantify the usefulness of PCA for a target dataset, the PC contribution to the total variance is examined, which is defined as:

χ_{α} = \frac{λ_{α}}{\sum_{α' = 1}^{f} λ_{α'}} .

(11)

If the contributions from a small number of PCs to the total variance are dominant, PCA is useful for dimensionality reduction because most of the total variance originates from these PCs. Typically, folded proteins are highly anisotropic in nature and PCA is thus useful for analyzing protein dynamics [18,19,20,22].
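The equivalence of the two routes (diagonalizing $C$ versus SVD of $Q$) and the contributions of Equation (11) can be checked numerically. The sketch below uses toy data and NumPy's thin SVD (`full_matrices=False`), so the returned `Ut` has shape $f \times M$ rather than the $M \times M$ form used in the text; this is a convenience choice of the example, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
M, f = 500, 4
X = rng.normal(size=(M, f)) * np.array([3.0, 1.0, 0.3, 0.1])  # toy data
Q = (X - X.mean(axis=0)).T              # f x M deviations, Eq. (2)

# Route 1: eigenvalues of C = Q Q^T / M, reordered to descending.
lam_eig = np.linalg.eigvalsh(Q @ Q.T / M)[::-1]

# Route 2: thin SVD of Q. The singular values s satisfy s = sqrt(M * lambda),
# which matches the factor sqrt(M) * lambda^(1/2) in Eq. (8).
V, s, Ut = np.linalg.svd(Q, full_matrices=False)
lam_svd = s**2 / M

chi = lam_svd / lam_svd.sum()           # contribution of each PC, Eq. (11)
cumulative = np.cumsum(chi)
print(chi, cumulative)
```

The SVD route never forms $C$ explicitly, which can be preferable when $f$ is large compared with $M$.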

It is also straightforward to consider hierarchy in the distribution of a target dataset, for example, the distribution of clusters and the distributions of data in each cluster. Suppose that the dataset of interest is clustered into $L$ groups, each consisting of $n_{l}$ data points. In this case, ${〈\dots〉}_{l}$ indicates the mean over the $n_{l}$ data points of the $l$th group and $f_{l} = n_{l} / M$ is the fraction of the $l$th group. The variance–covariance matrix can be divided into two terms as [27]:

\begin{array}{l} C = C^{JAM} + C^{intra} \\ = \{\sum_{l = 1}^{L} f_{l} 〈({〈q_{i}〉}_{l} - 〈q_{i}〉) ({〈q_{j}〉}_{l} - 〈q_{j}〉)〉 + \sum_{l = 1}^{L} f_{l} {〈(q_{i} - {〈q_{i}〉}_{l}) (q_{j} - {〈q_{j}〉}_{l})〉}_{l}\} \end{array}

(12)

The first term represents the variance–covariance originating from the distribution of the means of the groups and the second term is the $f_{l}$-weighted average of the intra-group variance–covariance. Although Equation (12) shows a two-level hierarchy, extension to multiple levels is straightforward. If the distribution of each group reflects fluctuations within a certain local energy minimum, the first term originates from jumping among energy minima. Using this formulation, the jumping-among-minima (JAM) model shows that the JAM motions that contribute to the first term dominate a small number of anharmonic large-amplitude motions, and that the second term is attributed to nearly harmonic fluctuations around local energy minima that are detected as Gaussian-like distributions [19,27]. This hierarchical view of protein collective dynamics is useful for understanding the boson peak and glass transition of proteins [28], and for identifying multiple conformers in nuclear magnetic resonance (NMR)-derived structure ensembles of proteins [29,30].
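The two-term decomposition of Equation (12) is an exact identity for any partition of the data, which the following sketch verifies on synthetic data. The cluster labels are supplied by hand here; how the groups are obtained (for example by a clustering algorithm) is outside the scope of this example, and the function name `split_covariance` is hypothetical.

```python
import numpy as np

def split_covariance(X, labels):
    """Split the covariance of X (M x f) into between-group and within-group parts."""
    M, f = X.shape
    mean_all = X.mean(axis=0)
    C_inter = np.zeros((f, f))
    C_intra = np.zeros((f, f))
    for lab in np.unique(labels):
        Xl = X[labels == lab]
        frac = len(Xl) / M                        # fraction of the data in this group
        d_mean = Xl.mean(axis=0) - mean_all       # shift of the group mean
        d_in = Xl - Xl.mean(axis=0)               # deviations inside the group
        C_inter += frac * np.outer(d_mean, d_mean)
        C_intra += frac * (d_in.T @ d_in) / len(Xl)
    return C_inter, C_intra

rng = np.random.default_rng(2)
# Two toy "minima" separated along the first coordinate (illustrative data).
a = rng.normal(size=(300, 3)) + np.array([4.0, 0.0, 0.0])
b = rng.normal(size=(700, 3)) - np.array([2.0, 0.0, 0.0])
X = np.vstack([a, b])
labels = np.array([0] * 300 + [1] * 700)

C_inter, C_intra = split_covariance(X, labels)
d = X - X.mean(axis=0)
C_total = d.T @ d / len(X)
print(np.abs(C_inter + C_intra - C_total).max())
```

In this toy case most of the variance along the first coordinate appears in the between-group term, mirroring the role of jumps among minima in the text.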

3. Error in PCA

A dataset to be analyzed by PCA may contain a statistically insufficient number of samples, which can result in instability of the obtained eigenvectors. This situation frequently occurs when standard MD simulations of macromolecules are conducted with all-atom models because the accessible simulation time scale tends to be shorter than the characteristic time scale of macromolecular movement. Hess showed that random diffusion in high-dimensional space can result in cosine-shaped projections of the first few dominant PCs, indicating that short simulation trajectories should be carefully treated [31]. However, using random matrix theory (RMT) [32,33] and apo Cox17 as a model protein, Palese showed that protein dynamics is not truly Brownian even on a short time scale (~1 ns), although PCA leads to cosine-shaped projections [34]. In this method, $C$ is considered to be the sum of the random component $C_{r}$ and non-random component $C_{n r}$. $C_{r}$ is determined by an iterative method based on RMT, providing the cleaned $C_{n r}$ without the random component and its eigenvectors. Palese also proposed random component analysis (RCA) as a random projection (RP) algorithm [35]. In RCA, PCA of the correlation matrix $P = \{C_{i j} / \sqrt{C_{i i} C_{j j}}\}$ is considered, and $P$ is replaced by a random symmetric correlation matrix $M$ as a dummy correlation. PCA of $M$ and projection onto the obtained eigenvectors provide the random components. RCA provides dimensionality reduction and cluster detection comparable to that of PCA. Ref. [36] introduced and examined a parameter that evaluates the overlap between the subspaces spanned by the dominant eigenvectors obtained from different trajectories, called the root mean squared inner product (RMSIP), and suggested that PCA of the concatenated equivalent trajectories achieves better reproducibility.
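A small sketch of an RMSIP calculation is given below. The formula used, the square root of the summed squared inner products between the first $n$ eigenvectors of two sets divided by $n$, is the commonly quoted definition and is an assumption here, since the text does not spell it out; the helper names and the synthetic data are likewise illustrative.

```python
import numpy as np

def rmsip(V1, V2, n=10):
    """Root mean squared inner product of the first n columns of V1 and V2."""
    overlap = V1[:, :n].T @ V2[:, :n]          # n x n matrix of inner products
    return np.sqrt((overlap**2).sum() / n)

def leading_modes(X):
    """Eigenvectors of the covariance of X (M x f), sorted by descending variance."""
    d = X - X.mean(axis=0)
    lam, V = np.linalg.eigh(d.T @ d / len(d))
    return V[:, np.argsort(lam)[::-1]]

rng = np.random.default_rng(3)
# Two independent toy "trajectories" drawn from the same anisotropic distribution.
scales = np.concatenate([[10.0, 8.0, 6.0, 5.0, 4.0], np.ones(25)])
Xa = rng.normal(size=(5000, 30)) * scales
Xb = rng.normal(size=(5000, 30)) * scales
Va, Vb = leading_modes(Xa), leading_modes(Xb)

overlap_same = rmsip(Va, Va, n=5)   # identical subspaces give 1
overlap_ab = rmsip(Va, Vb, n=5)     # well-sampled equivalent data give a value close to 1
print(overlap_same, overlap_ab)
```

Because RMSIP compares subspaces rather than individual vectors, it is insensitive to mixing among the leading eigenvectors, which is why it is a convenient measure of reproducibility.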

4. Relation with NMA

NMA is closely related to PCA, as mentioned in Section 1. To conduct NMA, the second-derivative matrix (Hessian) $F$ of the potential energy $E$ should be calculated at a certain conformation, typically at a local potential energy minimum conformation where the first derivatives $\partial E / \partial q_{i} = 0$ for all $i$:

F = \{f_{i j}\} = \{\partial^{2} E / \partial q_{i} \partial q_{j}\}

(13)

When Cartesian coordinates are used for Equation (13), $q$ should be mass-weighted Cartesian coordinates. To obtain normal mode frequencies and eigenvectors, the standard eigenvalue problem of $F$ is solved as:

F W = W ω^{2},

(14)

W^{T} W = W W^{T} = I .

(15)

The βth column vector of $W = (w_{1} \dots w_{f})$, $w_{β}$, represents the βth eigenvector. The eigenvalue matrix $ω^{2} = \{ω_{β}^{2}\}$ determines the angular frequencies of the normal modes, $ω_{β}$. Since the variance–covariance matrix $C$ is related to $F$ by $C = k_{B} T F^{- 1}$ ($k_{B}$: Boltzmann constant, $T$: absolute temperature), we obtain the relation for the harmonic system:

λ_{β} = k_{B} T / ω_{β}^{2}

(16)

Therefore, comparing $λ_{α}$ obtained by PCA of MD or MC trajectories to the NMA-derived $k_{B} T / ω_{α}^{2}$ is a straightforward way to examine the anharmonicity or quasi-harmonic features of protein dynamics.

$W$ is determined for a single potential energy minimum, while MD simulation can sample multiple energy minima. Therefore, $V$ obtained from an MD trajectory can be significantly different from $W$ calculated around a particular local energy minimum. This difference becomes larger as the MD length is increased. To account for the difference between the two sets of collective variables and to examine the anharmonicity of an energy surface, the variance expected from NMA along the αth PC is obtained by:

λ_{α}^{h a r} = k_{B} T \sum_{β} \frac{{(w_{β} \cdot v_{α})}^{2}}{ω_{β}^{2}} .

(17)

$λ_{α}^{h a r}$ is further used to define the anharmonicity observed in MD along the αth PC, namely, the anharmonicity factor:

μ_{α} = {(λ_{α} / λ_{α}^{h a r})}^{1 / 2} .

(18)

$μ_{α}$ is unity if the variance is equal to that expected from NMA, indicating that the energy surface along the αth PC is nearly harmonic [27,37]. For short MD trajectories up to 1 ns, the majority of PCs are harmonic and less than 1% of PCs show $μ_{α} > 2$, but these anharmonic motions dominantly contribute to the total variance [19,27].
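Equations (17) and (18) can be illustrated with a small numerical sketch. The Hessian below is a random positive-definite toy matrix in arbitrary units with $k_{B} T = 1$, and the "anharmonic" covariance is made by artificially inflating the variance of one mode by a factor of four; both are assumptions chosen so that the expected answer ($μ = 1$ and $μ = 2$) is known in advance.

```python
import numpy as np

def anharmonicity(lam, V, omega2, W, kT=1.0):
    """Anharmonicity factors mu_alpha from Eqs. (17) and (18).

    lam, V    : PCA variances and eigenvectors (columns of V)
    omega2, W : squared angular frequencies and normal modes (columns of W)
    """
    overlap2 = (W.T @ V) ** 2                                  # (w_beta . v_alpha)^2
    lam_har = kT * (overlap2 / omega2[:, None]).sum(axis=0)    # Eq. (17), sum over beta
    return np.sqrt(lam / lam_har)                              # Eq. (18)

def sorted_eigh(C):
    lam, V = np.linalg.eigh(C)
    order = np.argsort(lam)[::-1]
    return lam[order], V[:, order]

rng = np.random.default_rng(4)
kT = 1.0
A = rng.normal(size=(5, 5))
F = A @ A.T + 5.0 * np.eye(5)            # toy positive-definite Hessian
omega2, W = np.linalg.eigh(F)            # Eq. (14)

# Purely harmonic case: C = kT F^{-1}, so every mu should be 1.
C_har = kT * np.linalg.inv(F)
lam, V = sorted_eigh(C_har)
mu_harmonic = anharmonicity(lam, V, omega2, W, kT)

# Inflate the variance of the first PC fourfold: mu for that PC should be 2.
C_anh = C_har + 3.0 * lam[0] * np.outer(V[:, 0], V[:, 0])
lam2, V2 = sorted_eigh(C_anh)
mu_anharmonic = anharmonicity(lam2, V2, omega2, W, kT)
print(mu_harmonic, mu_anharmonic)
```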

NMA of proteins was originally conducted with full atomic models [4,5,6]. Over the past 20 years or so, NMA has been more widely used with coarse-grained models as well as with atomic models. Although this is an interesting topic related to PCA, NMA is beyond the scope of this review but is covered in reviews on NMA [38,39,40,41,42,43].

5. Solvent and Other Environmental Effects on Macromolecular Dynamics

To consider solvent effects on macromolecular dynamics, one of the simplest models is an independent Langevin equation for each PC $σ_{α} (t)$ with a harmonic potential:

{\ddot{σ}}_{α} (t) = - ω_{α}^{2} σ_{α} (t) - γ_{α} {\dot{σ}}_{α} (t) + R_{α} (t),

(19)

where the first term of the right-hand side is the harmonic force, and $γ_{α}$ and $R_{α} (t)$ indicate the Stokes friction coefficient and the random force acting on the αth PC, respectively.

The autocorrelation functions of $σ_{α} (t)$ and the velocity ${\dot{σ}}_{α} (t)$ are given by:

〈σ_{α} (t) σ_{α} (0)〉 = \frac{k_{B} T}{ω_{α}^{2}} [\frac{γ_{α} + ϖ_{α}}{2 ϖ_{α}} \exp (- \frac{γ_{α} - ϖ_{α}}{2} t) - \frac{γ_{α} - ϖ_{α}}{2 ϖ_{α}} \exp (- \frac{γ_{α} + ϖ_{α}}{2} t)],

(20)

〈{\dot{σ}}_{α} (t) {\dot{σ}}_{α} (0)〉 = k_{B} T [\frac{- γ_{α} + ϖ_{α}}{2 ϖ_{α}} \exp (- \frac{γ_{α} - ϖ_{α}}{2} t) + \frac{γ_{α} + ϖ_{α}}{2 ϖ_{α}} \exp (- \frac{γ_{α} + ϖ_{α}}{2} t)],

(21)

where $ϖ_{α} = \sqrt{γ_{α}^{2} - 4 ω_{α}^{2}}$. If $γ_{α} > 2 ω_{α}$, $ϖ_{α}$ is a real number and the correlation functions simply consist of two exponential decays, resulting in overdamped motion. If $γ_{α} < 2 ω_{α}$, $ϖ_{α}$ is imaginary, so the exponents consist of real and imaginary parts, with the former giving a single exponential decay and the latter providing the sum of cosine and sine terms, causing damped oscillation. Because of this relation, large-amplitude motions, i.e., low-frequency motions, tend to satisfy $γ_{α} > 2 ω_{α}$ and to be overdamped [12,15]. The density of states (DOS) for this system is obtained as the spectrum of the velocity correlation (Equation (21)):

S (ω) = \frac{k_{B} T}{π} \frac{γ_{α} ω^{2}}{{(ω_{α}^{2} - ω^{2})}^{2} + γ_{α}^{2} ω^{2}} .

(22)

As the ratio $γ_{α} / ω_{α}$ becomes larger, the peak of the spectrum decreases and the area of the spectrum is shifted to the high-frequency region, resulting in lowered spectral density around the peak $ω_{α}$ and seeming disappearance of the peak in the DOS [12]. This model also explains why the DOS of BPTI from neutron scattering is lower than that expected from the frequency distribution in the low-frequency region < ~20 cm⁻¹ [15].
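The correlation functions and spectrum of a single harmonic Langevin oscillator (Equations (20) to (22)) can be evaluated directly. The sketch below uses complex arithmetic so that the same expressions cover both the overdamped and underdamped regimes; it assumes reduced units with $k_{B} T = 1$ by default and does not handle the critically damped case $γ_{α} = 2 ω_{α}$, where $ϖ_{α} = 0$. Function names are illustrative.

```python
import numpy as np

def _rates(omega, gamma):
    """Return varpi and the two decay rates (gamma -/+ varpi) / 2."""
    w = np.sqrt(complex(gamma**2 - 4.0 * omega**2))   # imaginary if underdamped
    return w, (gamma - w) / 2.0, (gamma + w) / 2.0

def position_acf(t, omega, gamma, kT=1.0):
    """Eq. (20): autocorrelation of the PC."""
    w, slow, fast = _rates(omega, gamma)
    c = (gamma + w) / (2 * w) * np.exp(-slow * t) - (gamma - w) / (2 * w) * np.exp(-fast * t)
    return (kT / omega**2 * c).real

def velocity_acf(t, omega, gamma, kT=1.0):
    """Eq. (21): autocorrelation of the PC velocity."""
    w, slow, fast = _rates(omega, gamma)
    c = (-gamma + w) / (2 * w) * np.exp(-slow * t) + (gamma + w) / (2 * w) * np.exp(-fast * t)
    return (kT * c).real

def spectrum(freq, omega, gamma, kT=1.0):
    """Eq. (22): spectrum of the velocity correlation."""
    return kT / np.pi * gamma * freq**2 / ((omega**2 - freq**2) ** 2 + gamma**2 * freq**2)

t = np.linspace(0.0, 20.0, 2001)
over = position_acf(t, omega=1.0, gamma=5.0)    # gamma > 2*omega: monotonic decay
under = position_acf(t, omega=1.0, gamma=0.5)   # gamma < 2*omega: damped oscillation
peak_height = spectrum(1.0, omega=1.0, gamma=5.0)
print(over[:3], under[:3], peak_height)
```

At $ω = ω_{α}$ Equation (22) reduces to $k_{B} T / (π γ_{α})$, so the peak height falls as the friction grows, consistent with the discussion above.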

Assuming each PC is a harmonic Langevin oscillator, the values of $ω_{α}$ and $γ_{α}$ can be estimated from the MD-derived time correlation function or its spectrum in different ways, but the obtained results can be method dependent. Considering Equation (21), one way to calculate $γ_{α}$ is from the time derivative of the velocity autocorrelation function at $t = 0$ [12,15]:

γ_{α} = - \frac{1}{k_{B} T} {(\frac{d}{d t} 〈{\dot{σ}}_{α} (Δ t) {\dot{σ}}_{α} (0)〉)}_{Δ t = 0} .

(23)

If $γ_{α}$ is calculated as the numerical derivative from a difference in the velocity correlation functions between $t = 0$ and $Δ t$, the numerical error should be carefully considered [15]. From Equation (21), the first-order approximation of the derivative for small $Δ t$ is obtained as:

γ_{α} (Δ t) = - \frac{1}{k_{B} T} \frac{d}{d t} 〈{\dot{σ}}_{α} (Δ t) {\dot{σ}}_{α} (0)〉 = γ_{α} + (ω_{α}^{2} - γ_{α}^{2}) Δ t,

(24)

which indicates a parabolic behavior of the “apparent” friction coefficients deduced by this method as a function of $ω_{α}$, and such behavior is indeed observed [12,15]. From the parabolic feature of $γ_{α} (Δ t)$, the “true” friction coefficient $γ_{α} = γ_{α} (0)$ after correction based on Equation (24) can be considered to be almost constant for large-amplitude PCs for a given protein [15]. The value of $γ_{α}$ is quite dependent on protein size, consistent with the Stokes–Einstein law, $γ = 6 π a η / M \propto M^{- 2 / 3}$ ($a$: radius, $η$: viscosity, $M$: mass) [27].
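The behavior described by Equation (24) can be reproduced from the analytic velocity autocorrelation function of Equation (21). In the sketch below, the derivative at a small time `dt` is obtained by a central difference on the analytic function rather than from MD data, and the correction to $Δ t \to 0$ is a simple two-point linear extrapolation; both are illustrative choices, and the parameter values are arbitrary.

```python
import numpy as np

def velocity_acf(t, omega, gamma, kT=1.0):
    """Eq. (21), written with complex arithmetic to cover both damping regimes."""
    w = np.sqrt(complex(gamma**2 - 4.0 * omega**2))
    slow, fast = (gamma - w) / 2.0, (gamma + w) / 2.0
    c = (-gamma + w) / (2 * w) * np.exp(-slow * t) + (gamma + w) / (2 * w) * np.exp(-fast * t)
    return (kT * c).real

def apparent_friction(dt, omega, gamma, kT=1.0, h=1e-6):
    """-(1/kT) d/dt of the velocity ACF evaluated at time dt (central difference)."""
    slope = (velocity_acf(dt + h, omega, gamma, kT)
             - velocity_acf(dt - h, omega, gamma, kT)) / (2.0 * h)
    return -slope / kT

omega, gamma = 3.0, 2.0              # toy values in reduced units
dt1, dt2 = 0.01, 0.02
g1 = apparent_friction(dt1, omega, gamma)
g2 = apparent_friction(dt2, omega, gamma)

predicted = gamma + (omega**2 - gamma**2) * dt1       # Eq. (24)
extrapolated = g1 - (g2 - g1) / (dt2 - dt1) * dt1     # linear extrapolation to dt = 0
print(g1, predicted, extrapolated)
```

For these values the apparent friction exceeds the true one because $ω_{α} > γ_{α}$, and the extrapolation recovers $γ_{α}$ to within the neglected second-order term.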

The MD-derived friction coefficient does not necessarily originate from solvent and it can be estimated for MD conducted in vacuum. Ref. [44] fitted Equation (21) to MD-derived data for friction calculations and showed the frequency dependence of the friction coefficient. Ref. [44] also reported that friction in vacuum is directly proportional to the intra-protein interaction of the collective mode, whereas in solution it is also proportional to the accessible surface area of the mode. In Ref. [45], MD simulations of myoglobin in aqueous solution between 120 and 300 K showed that the friction coefficient increases linearly with temperature up to 300 K, independent of the glass transition temperature. This tendency cannot be well explained only by the Stokes–Einstein law, at least at 0 °C and higher, because the viscosity of liquid water decreases as a function of temperature. Thus, this tendency must have a different origin. Notably, the Langevin-based time correlation function should be fitted with care because the real time correlation will likely deviate for longer $t$ when a constant value of $γ$ is used without considering the time dependence, as in the generalized Langevin equation. The range of 0–5 ps was used for fitting the time correlation function in Ref. [44] and a very short (2–6 fs) $Δ t$ was used for the numerical derivative at $t = 0$ in Refs. [15,27].

The Langevin mode is a multidimensional version of harmonic Langevin oscillators, formulated as a natural extension of NMA to include solvent effects [46,47]. Considering the $f \times f$ friction matrix $Γ$ in addition to the Hessian $F$, the $f \times 2 f$ eigenvector matrix $S$ and the $2 f \times 2 f$ diagonal eigenvalue matrix $ξ = \{ξ_{α}\}$ are determined by the relations:

F S + Γ S ξ + S ξ^{2} = 0,

(25)

S^{T} Γ S + ξ S^{T} S + S^{T} S ξ = ξ,

(26)

where the matrix elements of $S$ and $ξ$ are complex numbers. The friction matrix $Γ$ can be modeled by diffusion tensors derived from the hydrodynamics of polymers in solution [48,49,50]. Equations (25) and (26) correspond to Equations (14) and (15) in NMA. It should be noted that the eigenvalue of the αth Langevin mode $ξ_{α}$ can have both real and imaginary parts, which determine the damping factor and oscillatory frequency, respectively. In this case, the mode is underdamped. If $ξ_{α}$ is a real number, the mode is overdamped. In the limit $Γ \to 0$, the equations for NMA are recovered as:

S = \frac{1}{\sqrt{2}} (W, W),

(27)

ξ = (\begin{matrix} \pm i ω & O \\ O & \pm i ω \end{matrix}) .

(28)
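The eigenvalues of Equation (25) can be computed with a standard linear-algebra device: a quadratic eigenvalue problem in $f$ dimensions is equivalent to an ordinary eigenvalue problem for a $2 f \times 2 f$ companion matrix. The sketch below uses this linearization, which is a generic numerical technique and not necessarily the procedure used in Refs. [46,47]; the normalization of Equation (26) is not applied, and the diagonal toy matrices are assumptions.

```python
import numpy as np

def langevin_mode_eigenvalues(F, Gamma):
    """Eigenvalues xi of F S + Gamma S xi + S xi^2 = 0.

    With the stacked vector (s, s*xi), the quadratic problem becomes the
    ordinary eigenproblem of the companion matrix [[0, I], [-F, -Gamma]].
    """
    f = F.shape[0]
    A = np.block([[np.zeros((f, f)), np.eye(f)],
                  [-F, -Gamma]])
    return np.linalg.eigvals(A)

# Toy Hessian with normal mode frequencies omega = 1 and 2.
F = np.diag([1.0, 4.0])

xi_free = langevin_mode_eigenvalues(F, np.zeros((2, 2)))   # no friction
xi_damped = langevin_mode_eigenvalues(F, 5.0 * np.eye(2))  # strong friction
print(np.sort_complex(xi_free), np.sort(xi_damped.real))
```

Without friction the eigenvalues are the purely imaginary pairs $\pm i ω$ of Equation (28); with the strong friction chosen here all four eigenvalues are real and negative, i.e., both modes are overdamped.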

A more general formulation based on the generalized Langevin equation and 3D reference interaction site model (3D-RISM) was proposed, called the Kim–Hirata theory [51,52]. This theory introduces the variance–covariance matrix obtained from the equilibrium free energy surface in solution determined by 3D-RISM and this matrix describes the force restoring an equilibrium conformation. Ref. [52] also proposed a protocol to evaluate friction based on the friction of an imaginary atom in solution from the site–site mode coupling theory [53,54], multiplied by the fraction of a protein atom contacting the solvent defined by the radial distribution function of solvent around the protein atom. The Kim–Hirata theory was further extended to analyze the temperature-dependent mean-square deviation of proteins [55].

6. Choice of Variables and Spaces for Better Representation of Macromolecular Dynamics in PCA

To successfully analyze macromolecular dynamics, a set of variables that represents the important features of the dynamics well should be employed for PCA. The Cartesian coordinates of atoms are frequently used. For a set of $N$ selected atoms of the molecules of interest, the number of variables is $f = 3N$. The use of all-atom coordinates, including hydrogens, is useful for characterizing the anharmonic nature of protein dynamics in direct comparison to NMA [12,15,27,37,56]. The selection of C_α atoms is useful for extracting a small number of large-amplitude motions, namely, “essential dynamics” [14]. Raw atomic coordinates from the original dataset typically reflect internal movements of the selected atoms, as well as overall translation and rotation. The translational and rotational components can be eliminated by the best fit of each dataset (typically a snapshot of a simulation trajectory) to a reference dataset so that the Eckart condition [57] is satisfied, for example, using the Kabsch method [58] or another method. To set the average $〈q〉$ as the origin of the coordinates, $〈q〉$ obtained after best fit should be used as the reference for the next round of best fit [12]. $〈q〉$ quickly converges within about five cycles with this procedure. Once translational and rotational components are completely eliminated, Cartesian PCA results in (3N − 6) positive eigenvalues for internal motions and six zero eigenvalues corresponding to PCs of the translation and rotation.
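The iterative best-fit procedure can be sketched as follows. The superposition uses the SVD form of the Kabsch algorithm with equal weights for all atoms, which is a simplifying assumption (a mass-weighted fit would be needed to satisfy the Eckart condition strictly), and the "trajectory" is a randomly rotated and translated toy structure with small noise. Function names are hypothetical.

```python
import numpy as np

def kabsch_fit(P, ref):
    """Superpose P (N x 3) onto ref (N x 3) by a least-squares rigid-body fit."""
    Pc = P - P.mean(axis=0)                  # remove translation
    Rc = ref - ref.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Rc)
    d = np.sign(np.linalg.det(U @ Vt))       # guard against improper rotations
    R = U @ np.diag([1.0, 1.0, d]) @ Vt      # rotation applied as Pc @ R
    return Pc @ R + ref.mean(axis=0)

def fit_to_average(traj, n_cycles=5):
    """Fit all frames to a reference, replace the reference by the mean, repeat."""
    ref = traj[0]
    for _ in range(n_cycles):
        fitted = np.array([kabsch_fit(x, ref) for x in traj])
        ref = fitted.mean(axis=0)
    return fitted, ref

rng = np.random.default_rng(5)
base = rng.normal(size=(20, 3))              # toy structure with 20 "atoms"
frames = []
for i in range(200):
    rot, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(rot) < 0:
        rot[:, 0] *= -1.0                    # make it a proper rotation
    noisy = base + 0.01 * rng.normal(size=base.shape)
    frames.append(noisy @ rot + rng.normal(size=3))
traj = np.array(frames)

fitted, mean_structure = fit_to_average(traj)
dev = (fitted - mean_structure).reshape(len(fitted), -1)
lam = np.linalg.eigvalsh(dev.T @ dev / len(dev))   # ascending eigenvalues
print(lam[:7])
```

After fitting, the six smallest eigenvalues of the Cartesian covariance are zero to numerical precision, in line with the statement above about the translational and rotational PCs.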

Internal coordinates are also used frequently in PCA. In all-atom models of macromolecules, dihedral angles mainly determine the overall conformation of each molecule because the contributions of bond length and angle changes are relatively small. Significant movements of dihedral angles result in protein atoms moving non-linearly. Therefore, PCA in dihedral angle space can provide different information on protein dynamics compared to Cartesian PCA. If the deviation of dihedral angles from the average, $Δ θ$, is directly used as $Δ q$, the deviation of atomic coordinates from the average, $Δ r$, is related to $Δ θ$ to first order as:

Δ r = L Δ θ,

(29)

where $L = d r / d θ$ is the Jacobian matrix evaluated for the average structure [59,60]. Omori et al. investigated the relation between the motions of atoms and dihedral angles and showed that the latter mutually move in a compensative manner, called “latent dynamics” [60]. The variance–covariance matrix of dihedral angles $C_{θ}$ was compared to ${\tilde{C}}_{θ}$ deduced from the atomic variance–covariance matrix $C_{r}$ as:

{\tilde{C}}_{θ} = {(L^{T} C_{r}^{- 1} L)}^{- 1} .

(30)

Using backbone atoms (N, C_α and C) and dihedral angles (ϕ, ψ, and ω) of small globular proteins, Omori et al. also showed that ${\tilde{C}}_{θ}$ precisely recovers the information of $C_{r}$ and contains higher-order dihedral correlations, but $C_{θ}$ does not [60]. Additionally, the mean-square atomic displacements tended to be minimized upon rotation of the dihedral angles, indicating the compensative nature of dihedral dynamics. However, such latent dynamics behavior was not seen in dihedral PCs of deca-alanine, a short peptide.

The non-Euclidean nature of dihedral angles is not sufficiently considered in a linear transformation. PCA with dihedral angles may also require careful treatment because the angles are discontinuous between 180° and −180°, also called a periodic boundary. Stock and coworkers proposed using dihedral angles differently in their dihedral angle PCA (dPCA), which uses cosines and sines of each dihedral angle $θ_{l}$ [61,62,63]:

q = \{q_{2 l - 1}, q_{2 l}\} = \{\cos θ_{l}, \sin θ_{l}\},

(31)

where $l$ denotes the dihedral angle index. This treatment can be considered as using the real and imaginary parts of $\exp i θ_{l}$. Originally, Stock and coworkers focused on main chain ϕ and ψ [62], but it is straightforward to include other dihedral angles in this framework. In the case of short peptides, more rugged free energy landscapes were observed in the space spanned by the first few PCs in dPCA compared to the results of Cartesian PCA [61,62,63]. Based on the analysis of the folding dynamics of the villin headpiece 35 (HP35), a small protein, and the native dynamics of BPTI up to the millisecond time scale, they reported that Cartesian PCA failed to capture important features of the free energy landscape and that dPCA gave a better representation of the landscape [64].
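The effect of the transformation in Equation (31) at the periodic boundary can be seen in a few lines. In this sketch a single synthetic dihedral angle fluctuates narrowly around 180°, so its wrapped values jump between the two ends of the interval; the function name `dpca_features` and the noise level are illustrative assumptions.

```python
import numpy as np

def dpca_features(theta):
    """Map dihedral angles (M x n, in radians) to interleaved (cos, sin) pairs."""
    feats = np.empty((theta.shape[0], 2 * theta.shape[1]))
    feats[:, 0::2] = np.cos(theta)    # q_{2l-1}
    feats[:, 1::2] = np.sin(theta)    # q_{2l}
    return feats

rng = np.random.default_rng(6)
raw = np.pi + 0.1 * rng.normal(size=(1000, 1))     # small fluctuation around 180 degrees
theta = (raw + np.pi) % (2.0 * np.pi) - np.pi      # wrapped to [-pi, pi)

feats = dpca_features(theta)
raw_variance = theta.var()                 # inflated by jumps across the boundary
feature_variance = feats.var(axis=0).sum() # reflects the true, small fluctuation
print(raw_variance, feature_variance)
```

The variance of the wrapped angle is dominated by the artificial jumps of nearly 2π, whereas the (cos, sin) representation yields a small variance that matches the actual spread of the angle.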

Instead of projecting straight lines in the extrinsic tangent space of a mean, PCA for Riemannian manifolds was proposed based on geodesics of the intrinsic metric [65]. GeoPCA is a tool for dihedral angle-based principal component geodesics, in which angular data are projected on a sphere composed of the first two principal component geodesics [66]. GeoPCA was validated by using it to cluster a set of RNA conformations derived from a database comprising 73 RNA structures. Dihedral principal geodesic analysis (dPGA) was applied to reduce the dimension of the protein structure ensemble and the result was compared to the results obtained using PCA and dPCA [67].

The n-dimensional torus is a product space of n circles and can be used to characterize dihedral dynamics. Torus-PCA (T-PCA) was proposed and applied to the RNA benchmark used in GeoPCA [66], demonstrating the validity of T-PCA [68]. Another approach, dPCA+, minimizes residual projection error by transforming the data such that the maximal gap of the sampling is shifted to the periodic boundary of a dihedral angle [69]. Interestingly, this transformation also minimized the error of the covariance matrix. dPCA+ was also used to examine the non-equilibrium process simulated by targeted MD (TMD), and the free energy profiles of deca-alanine obtained by unbiased MD, Jarzynski identity and second-order cumulant approximation were compared [70]. In addition, the landscapes from unbiased MD, TMD and reweighted TMD were investigated in that report.

Other variables have been introduced to conduct PCA of simulation trajectories. To visualize MD and MC trajectories, Abagyan and Argos introduced a distance measure between two conformers, defined based on dihedral angles [71]. Java Essential Dynamics (JED) can use internal distance pair coordinates (dpPCA) as an option [72]. Ernst et al. introduced contact distance-based PCA (conPCA), reciprocal distance-based (iconPCA) and PCA based on inter-C_α distances (C_αPCA) and compared each result to those obtained using dPCA [73]. For conPCA, the distance between the closest heavy atom of each residue is considered as a contact if it is less than 4.5 Å and the residue pair of the contact is separated by more than three residues along the sequence [73,74]. Thus, conPCA can consider side chains in contact with each other but excludes information regarding local fluctuation along the sequence. Using 300 μs HP35 and 1 ms BPTI MD trajectories and examining the resolution of the free energy landscape and the decay of autocorrelation functions, Ernst et al. showed that distance-based PCAs, particularly C_αPCA, tend to be versatile, but they exhibit fewer landscape details than dPCA does [73]. Recently, Ogata proposed grid-based PCA (GBPCA), which considers a grid system consisting of cubes with 5 Å edges [75]. This method uses a unit vector of mass-weighted averages of atoms in each cube to calculate the correlations to be diagonalized and was applied to bulk water, bulk methane and hydrated proteins.

This review focuses on the PCA of structural data $Q$, but it is worth mentioning that PCA is also used for analyzing multiple spectra measured under different conditions [76]. Consider a set of spectra obtained as a function of frequency or wavelength and measured at different temperatures, pH values, times, etc. $Q$ can be constructed with $i = 1, \dots, f$ for frequency or wavelength and $m = 1, \dots, M$ for the conditions. For example, rapid scanning wavelength stopped-flow kinetics experiments on liver alcohol dehydrogenase (LADH), whose reaction is spectrally complex, were analyzed by PCA, and the absorbing species were identified [77,78]. Yuan et al. analyzed Fourier transform near-infrared (FT-NIR) spectra of bovine serum albumin (BSA) at temperatures ranging from 45 to 85 °C by PCA and evolving factor analysis (EFA) [79]. The contributions from the first PCs obtained for two frequency ranges were both greater than 99%, indicating that most of the spectral variations are explained by the first PCs. PCA and EFA also revealed the temperatures at which structural changes occur. Sakurai and Goto conducted PCA of the pH dependence of heteronuclear single quantum coherence (HSQC) spectra of β-lactoglobulin measured by NMR spectroscopy and identified three conformational transitions at different pH values [80]. These results validate the application of PCA for characterizing various condition-dependent spectra and for investigating structural changes. The same approach can be applied under a combination of multiple conditions, such as temperature-dependent kinetics.

As shown in Section 2, PCA and SVD provide the same information, $V$, $U$ and $\lambda$, from $Q$, but SVD directly determines these matrices without calculating $C$. Thus, SVD is also employed for analyzing spectra in a manner similar to PCA. In spectral analysis, $U$ quantifies condition-dependent components while $V$ describes the condition-independent basis sets. By focusing on dominant components in $\lambda$, SVD can act as a mechanism-independent noise filter for spectra. Thus, SVD is regarded as an automated procedure for modeling spectroscopic datasets [81,82] and is also used for the analysis of protein dynamics. For example, Hofrichter et al. measured time-dependent optical absorption spectra from 3 ns to 100 ms after the photolysis of the CO complex of hemoglobin and identified three significant basis spectra and five exponential relaxations from the time course of their amplitudes by SVD, which enabled the analysis of ligand rebinding kinetics [83]. Moffat and coworkers developed a method to analyze time-dependent difference electron density by SVD and analyzed structural intermediates of photoactive yellow protein (PYP) [84,85,86]. These works indicate the usefulness of SVD in spectral analysis.
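The relation between the two routes, and the use of SVD as a noise filter, can be checked numerically on synthetic spectra. This is a minimal sketch under stated assumptions: the data matrix is arranged as frequencies × conditions, the factorization is written as $Q = V S U^{T}$ with $\lambda = S^{2}/M$ (the exact convention of Section 2 is assumed, not quoted), and the two Gaussian "basis spectra" are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
f, M = 200, 30                       # f frequencies, M conditions (hypothetical sizes)
x = np.linspace(0.0, 1.0, f)

# Two synthetic basis spectra whose amplitudes change linearly with the condition
basis = np.stack([np.exp(-((x - 0.3) / 0.05) ** 2),
                  np.exp(-((x - 0.7) / 0.08) ** 2)], axis=1)    # shape (f, 2)
t = np.linspace(0.0, 1.0, M)
amps = np.stack([1.0 - t, t], axis=1)                            # shape (M, 2)
clean = basis @ amps.T                                           # shape (f, M)
noisy = clean + 0.02 * rng.standard_normal((f, M))

# Subtract the mean over conditions from each frequency channel
Q = noisy - noisy.mean(axis=1, keepdims=True)

# Route 1: PCA through the variance-covariance matrix C
C = Q @ Q.T / M
lam = np.linalg.eigvalsh(C)[::-1]            # eigenvalues in descending order

# Route 2: SVD of Q itself, without forming C
V, S, Ut = np.linalg.svd(Q, full_matrices=False)
lam_svd = S ** 2 / M                         # same eigenvalues as route 1

# Noise filter: keep only the k dominant components
k = 1
Q_filtered = (V[:, :k] * S[:k]) @ Ut[:k, :]
```

Here one component is enough because, after mean subtraction, the noise-free spectra vary along a single direction; for real data the number of retained components has to be judged from the singular values.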

7. The Fluctuation–Dissipation Theorem, Linear Response Theory and PCA

The fluctuation–dissipation theorem states that the linear response of a given system to an external perturbation is expressed in terms of fluctuation properties of the system in thermal equilibrium [87,88]. In a time-independent form, the linear response theory (LRT) shows that a perturbation $f$ applied to a system results in a response $\Delta q_{R}$ mediated by the variance–covariance matrix $C$ as:

$$\Delta q_{R} \propto C f. \tag{32}$$

Using $C$ obtained from MD simulations of an unliganded protein and $f$ mimicking the protein–ligand interaction, LRT was shown to reproduce the response of the liganded protein [89]. Additionally, dihedral LRT based on the variance–covariance derived using Equation (30) was shown to better predict the ligand-bound form of ferric-binding protein [59]. For the time response of myoglobin upon CO binding, time-independent and time-dependent LRT agreed with ultraviolet resonance Raman spectroscopy and time-resolved X-ray crystallography, suggesting that the primary response can be described by LRT [90]. Hirata proposed a theory to evaluate a response function based on the aforementioned Kim–Hirata theory [91].

If LRT is considered in the PC space, the expected response (Equation (32)) in PC space, $\sigma_{R}$, is obtained in a manner similar to Equation (6) as:

$$\sigma_{R} \propto V^{T} \Delta q_{R}. \tag{33}$$

Additionally, considering Equations (4) and (32), we obtain:

$$\sigma_{R} \propto \lambda V^{T} f = \lambda f_{PC}. \tag{34}$$

If $f$ is applied as an isotropic random perturbation, the perturbation in PC space, $f_{PC} = V^{T} f$, is also isotropic. However, the response $\sigma_{R}$ is scaled by $\lambda$, meaning that the response to the perturbation force acting along the $\alpha$th PC is proportional to $\lambda_{\alpha}$ [92]. Equation (34) also indicates that random perturbations are expected to cause highly anisotropic responses in the protein because proteins fluctuate in a highly anisotropic manner in equilibrium. Transform and relax sampling (TRS) enhances anisotropic protein movements, implicitly expecting the response in Equation (34) but without actually calculating $C$ [92]. TRS is carried out as cycles of transform, relax and sampling stages. In the transform stage, the protein is perturbed by random forces during MD; in the relax stage, the protein is relaxed during MD without perturbation; finally, conventional MD is conducted as the sampling stage. TRS successfully simulated open-close motions of domains of multi-domain proteins several times within a simulation time of 20 ns. Additionally, folding–unfolding transitions of the “mini-protein” chignolin were observed many times during a 100 ns simulation.
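The scaling in Equation (34) can be verified numerically. The sketch below builds a hypothetical variance–covariance matrix from an assumed set of eigenvalues and random orthonormal eigenvectors, applies isotropic random forces, and compares the spread of the response along each PC with the spread of the force; all numerical values are illustrative assumptions, and the proportionality constant in Equation (32) is set to one.

```python
import numpy as np

rng = np.random.default_rng(2)
n_coord = 6                                           # number of coordinates (hypothetical)
lam = np.array([10.0, 3.0, 0.5, 0.1, 0.05, 0.01])     # assumed, strongly anisotropic eigenvalues
V, _ = np.linalg.qr(rng.standard_normal((n_coord, n_coord)))  # random orthonormal eigenvectors
C = V @ np.diag(lam) @ V.T                            # variance-covariance matrix, C V = V lam

forces = rng.standard_normal((5000, n_coord))         # isotropic random perturbations, one per row
dq = forces @ C                                       # Eq. (32) row-wise; C is symmetric
f_pc = forces @ V                                     # f_PC = V^T f
sigma = dq @ V                                        # Eq. (33): sigma_R = V^T dq_R

# Spread of the response along each PC relative to the spread of the force
ratio = sigma.std(axis=0) / f_pc.std(axis=0)
print(ratio)                                          # compare with lam
```

Although the forces have the same spread along every PC, the response along each PC is scaled by the corresponding eigenvalue, which is the anisotropy that TRS is described as exploiting.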

Linear response path following (LRPF) simulates global conformational changes in proteins upon ligand binding by periodically updating a linear response (LR) force with three phases of MD, enabling non-linear transformation to a target direction [93]. In the first phase, the LR force is obtained by computing $C$ and a mean force acting from ligands during equilibrium MD. Biased MD subsequently induces conformational change in the second phase, then the final MD re-equilibrates the system without bias. LRPF predicted an inward-facing form of mitochondrial ADP/ATP carrier (AAC), a membrane transporter, starting from an outward-facing form determined by X-ray crystallography [94].

8. Non-Gaussianity and Non-Linearity in PCA

PCA performs best if the distribution of a dataset is a multidimensional Gaussian that depends only on the mean and variance–covariance (second moments), which are the only quantities considered to determine collective variables in PCA. However, a small number of dominant PCs show non-Gaussian distributions in protein dynamics. These non-Gaussian collective variables are believed to be important for protein function [12,13,14,15] and indicate a limitation of using second moments in determining new collective variables. Since non-Gaussian distributions cannot be well characterized with the mean and second moments only, one solution may be to consider higher-order moments to determine collective variables other than PCs. Independent component analysis (ICA) separates non-Gaussian signals that are independent from Gaussian noise and typically considers mutual information (MI) to quantify the correlations and determine the optimal coordinate transformation to minimize the MI [95,96,97]. Typical ICA employs two preprocessing steps, “centering” and “whitening” [97]. “Centering” corresponds to the treatment already completed in PCA, as in Equation (1), in which the mean is subtracted. “Whitening” is typically the normalization of components by their standard deviations. Early applications of ICA for protein dynamics by full correlation analysis (FCA) used Cartesian coordinates [98], while the dihedral of Equation (31) was used for ICA in Ref. [99].
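The two preprocessing steps can be sketched as follows, assuming that whitening is carried out by dividing each PC by its standard deviation, as described above. The synthetic non-Gaussian data and the mixing matrix `A` are hypothetical; the sketch stops before the actual ICA rotation, which is method-specific.

```python
import numpy as np

rng = np.random.default_rng(3)
M, n_coord = 5000, 3
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.2]])                       # hypothetical mixing matrix
# Correlated, non-Gaussian (Laplace-distributed) data with a non-zero mean
q = rng.laplace(size=(M, n_coord)) @ A.T + np.array([1.0, -2.0, 0.5])

dq = q - q.mean(axis=0)                 # centering: subtract the mean
C = dq.T @ dq / M                       # variance-covariance matrix
lam, V = np.linalg.eigh(C)              # PCA
sigma = dq @ V                          # principal components
z = sigma / np.sqrt(lam)                # whitening: divide each PC by its standard deviation

cov_z = z.T @ z / M                     # identity matrix after whitening
excess_kurtosis = (z ** 4).mean(axis=0) - 3.0   # a fourth-order measure of non-Gaussianity
```

After whitening, all second-moment information has been removed from `z`, so any remaining structure, such as a non-zero excess kurtosis, has to be characterized by higher-order statistics.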

Most of the methods described in this review focus on extracting components that are as mutually independent as possible, whereas independent subspace analysis (ISA) determines significantly correlated collective motions. Similar to ICA, ISA characterizes non-Gaussian behaviors, using fourth-order cumulants [100]. In this method, ISA based on the subspace joint approximate diagonalization of eigenmatrices algorithm (SJADE) [101] extracts several independent subspaces, in each of which collective modes are significantly correlated while the other modes are independent. Application of this method successfully detected the modes with long-tailed non-Gaussian probability distributions [100].

Another limitation is the linear transformation of PCA, whereas protein dynamics can be highly non-linear in nature. This problem was partially discussed in Section 6, mainly in relation to the use of dihedral angles in PCA, and thus other PCA variants are discussed here. For example, Nguyen proposed the use of non-linear PCA (NLPCA) [102], enabled by non-linear mapping based on neural networks [103].

Another possible solution for incorporating non-linearity in PCA is the introduction of kernel methods [104]. In kernel PCA [104], a new “feature space” $F$, which is non-linearly related to the original space, is introduced and principal components in $F$ are considered. $\Delta q_{m}$ is non-linearly mapped to a function $\Phi(\Delta q_{m})$ that satisfies the condition $\langle \Phi(\Delta q_{m}) \rangle = 0$. Consider the variance–covariance in $F$ as:

$$C = \frac{1}{M} \sum_{m=1}^{M} \Phi(\Delta q_{m}) \Phi^{T}(\Delta q_{m}) = \langle \Phi \Phi^{T} \rangle, \tag{35}$$

and the eigenvalue problem shown in Equation (4). The $n$th eigenvector $v_{n}$ is defined as a linear combination of $\Phi(\Delta q_{m})$ with a coefficient matrix $\alpha = \{\alpha_{mn}\}$ as:

$$v_{n} = \sum_{m=1}^{M} \alpha_{mn} \Phi(\Delta q_{m}). \tag{36}$$

Here, we introduce the kernel representation $k(x, y) = \Phi(x) \Phi^{T}(y)$ and an $M \times M$ matrix $K$ as:

$$K = \{k_{mn}\} = \{k(\Delta q_{m}, \Delta q_{n})\}. \tag{37}$$

As shown in Ref. [104], $\alpha$ is obtained by the relation:

$$K \alpha = M \lambda \alpha, \tag{38}$$

which can be solved by diagonalizing $K$ using the condition:

$$\lambda \alpha^{T} \alpha = I. \tag{39}$$

Using $K$ and $\alpha$, principal components in $F$ space are obtained as the projection of $\Phi(\Delta q)$ onto the $n$th eigenvector by:

$$v_{n}^{T} \Phi(\Delta q) = \frac{1}{\lambda_{n}} \sum_{m=1}^{M} \alpha_{mn} k(\Delta q_{m}, \Delta q). \tag{40}$$

Equations (38) and (40) indicate that eigenvalues and principal components in $F$ space are determined from the kernel $K$ without directly solving Equation (35), which is typically difficult. It is worth mentioning that the use of $k(x, y) = x y^{T}$ recovers the original PCA and the use of $k(x, y) = (x y^{T})^{d}$ $(d > 1)$ considers higher moments. Additionally, the Gaussian kernel $k(x, y) = \exp\left(-\frac{\|x - y\|^{2}}{2 \sigma^{2}}\right)$ with the adjustable parameter $\sigma$ is frequently used in kernel methods. Kernel PCA can also be applied to the analysis of protein dynamics [21].
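A compact numerical sketch of kernel PCA is given below. It follows the commonly used formulation in which the kernel matrix is centered in feature space and the coefficients are normalized so that each eigenvector in $F$ has unit length; this normalization is an assumption of the sketch and may differ by constant factors from Equations (39) and (40). The function names `kernel_pca` and `gaussian_kernel`, the data and the value of `sigma` are illustrative. The linear kernel is included to check that ordinary PCA is recovered.

```python
import numpy as np

def kernel_pca(K, n_components):
    """Kernel PCA from an M x M kernel matrix; returns eigenvalues and projections."""
    M = K.shape[0]
    one = np.full((M, M), 1.0 / M)
    Kc = K - one @ K - K @ one + one @ K @ one     # enforce <Phi> = 0 in feature space
    mu, a = np.linalg.eigh(Kc)                     # eigenvalues of Kc correspond to M * lambda
    mu = mu[::-1][:n_components]
    a = a[:, ::-1][:, :n_components]
    alpha = a / np.sqrt(mu)                        # coefficients giving unit-norm eigenvectors in F
    return mu / M, Kc @ alpha                      # eigenvalues lambda_n and projections

def gaussian_kernel(X, sigma):
    """Gaussian kernel matrix exp(-|x - y|^2 / (2 sigma^2)) for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
dX = X - X.mean(axis=0)

# Linear kernel k(x, y) = x y^T: should reproduce ordinary PCA
lam_lin, proj_lin = kernel_pca(dX @ dX.T, 2)
lam_pca, V = np.linalg.eigh(dX.T @ dX / len(dX))
lam_pca, V = lam_pca[::-1][:2], V[:, ::-1][:, :2]
scores = dX @ V                                    # ordinary principal components

# Gaussian kernel with an arbitrarily chosen width
lam_rbf, proj_rbf = kernel_pca(gaussian_kernel(dX, sigma=2.0), 2)
```

With the linear kernel, the eigenvalues and the projections (up to the arbitrary sign of each eigenvector) agree with those of ordinary PCA, while only the $M \times M$ kernel matrix is ever diagonalized.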

“Diffusion maps” is a nonlinear dimensionality reduction method that embeds data points into a Euclidean space in which the Euclidean distance is equal to the “diffusion distance” and conducts the reduction by neglecting certain dimensions in the diffusion space [105,106,107]. Diffusion maps considers a Markov chain random walk on normalized data points and the probability of a one-step random walk between two data points (connectivity), which is proportional to a kernel function (typically a Gaussian kernel). For diffusion maps, the “diffusion matrix” is obtained by normalizing the rows of the kernel matrix, the eigenvectors of the diffusion matrix are calculated and the diffusion mapping is conducted by mapping the data points to the dominant eigenvectors in the diffusion space. Ferguson et al. used diffusion maps for a protein simulation for the first time to investigate trajectories of replica-exchange molecular dynamics (REMD) simulations of pro-microcin J25 (pro-MccJ25), the 21-residue uncyclized analog of antimicrobial peptide microcin J25 (MccJ25), which identified two global order parameters and three distinct pathways of conformational change [108]. One of the pathways was shown to correspond to a conformational change to left-handed lasso coil conformations. Kim et al. employed diffusion maps to characterize MD trajectories of Trp-cage miniprotein folding [109]. Recently, Trstanova et al. demonstrated the use of diffusion maps to identify metastable states as well as to formalize the locality within the metastable states in the analysis of molecular systems [110].
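The procedure described above (kernel matrix, row normalization, eigenvectors of the diffusion matrix) can be sketched in a few lines. This is a basic variant without density normalization; the function name `diffusion_map`, the kernel bandwidth `eps` and the synthetic arc-shaped data are assumptions made for illustration.

```python
import numpy as np

def diffusion_map(X, eps, n_components=2, t=1):
    """Basic diffusion map: Gaussian kernel, row-normalized diffusion matrix, dominant eigenvectors."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-d2 / eps)                      # connectivity between data points
    d = K.sum(axis=1)                          # row sums
    # The diffusion matrix P = D^-1 K is row-stochastic. Its eigenvectors are
    # obtained from the symmetric matrix D^-1/2 K D^-1/2, which has the same eigenvalues.
    S = K / np.sqrt(np.outer(d, d))
    w, u = np.linalg.eigh(S)
    w, u = w[::-1], u[:, ::-1]                 # descending order
    psi = u / np.sqrt(d)[:, None]              # right eigenvectors of P
    # Skip the trivial first eigenvector (eigenvalue 1, constant over all points)
    evals = w[1:n_components + 1]
    return evals, psi[:, 1:n_components + 1] * evals ** t

rng = np.random.default_rng(5)
theta = np.sort(rng.uniform(0.0, 1.5 * np.pi, 300))
# Hypothetical data: points along a curved arc in 3D with small out-of-plane noise
X = np.stack([np.cos(theta), np.sin(theta), 0.05 * rng.standard_normal(300)], axis=1)

evals, coords = diffusion_map(X, eps=0.1)
```

For this one-dimensional curved manifold, the first diffusion coordinate varies monotonically with the position along the arc, which a linear projection onto a single PC cannot achieve for a strongly curved path.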

9. Detecting Data Differences by PCA and Related Methods

PCA is also used to detect differences in a dataset or featurize the differences between datasets. Self-consistency of dipolar couplings analysis (SECONDA) is a PCA based on residual dipolar couplings (RDCs) measured by NMR and was developed to examine the effects of different alignment media used for RDC measurements [111]. The results showed that if the structure and dynamics of the target molecule are the same, PCA gives at most five nonzero eigenvalues, but additional nonzero eigenvalues are obtained if differences exist. Howe conducted Cartesian PCA of 49 NMR-determined structures of EF40, a 28-residue peptide, and demonstrated that the structures are clustered and outliers are detected [112]. Using the online tool PCA_NEST, systematic analysis of 24 pairs of enzyme structures determined by both solution NMR and X-ray crystallography revealed differences in the solution and crystal structures of the proteins [113]. Notably, the X-ray structures were shown to be a conformational state along the dominant PCs derived from NMR models, consistent with the expectation from LRT, since the environmental differences (in crystal or in solution) can be considered as a perturbation to the protein.

Sakuraba and Kono employed linear discriminant analysis with iterative procedure (LDA-ITER) to compare two trajectories obtained under different conditions [114]. LDA-ITER was developed to consider the trace ratio optimization problem in supervised learning and maximizes the ratio of the traces of two matrices by unitary transformation [115,116]. This method finds the axis of projection that separates two trajectories while keeping each trajectory well clustered. LDA-ITER was applied to the wild-type and R96H mutant of T4 lysozyme, as well as to the liganded and unliganded PDZ2 domains of human phosphatase hPTP1E. The results showed very clear separation of the two kinds of trajectories along the first dimension and a Gaussian-like distribution for each cluster. In contrast, the projections onto the first two PCs significantly overlapped. Another method, partial least squares discriminant analysis (PLS-DA) [117], gave less overlap in the 2D projection than PCA did, but LDA-ITER gave the best results. In addition, important differences were characterized on a residue-by-residue basis.

Relative principal component analysis (RPCA) also featurizes the differences between two datasets, but the transformation is determined by maximizing the Kullback–Leibler divergence between the probability distributions of the two states and by simultaneously diagonalizing the two variance–covariance matrices of the states with a single transformation matrix [118]. The application of RPCA to HIV-1 protease showed better performance compared to PCA and identified conformational hotspots.

10. Time Evolution of Collective Variables

The methods described in Section 4 and Section 5 consider the time evolution of collective variables based on physical models, but another type of approach investigates evolution from a phenomenological perspective. Time-independent component analysis (tICA) is a variation of ICA that focuses on the time independence between time 0 and $\tau$, instead of considering non-Gaussianity [119,120,121]. In addition to $C$ defined in Equation (1), tICA introduces the time-lagged variance–covariance matrix:

$$C(\tau) = \langle \Delta q(0) \Delta q^{T}(\tau) \rangle, \tag{41}$$

using the time lag parameter $\tau$. Although $C = C(0)$ is symmetric, $C(\tau)$ can be a non-symmetric matrix. In tICA, the generalized eigenvalue problem is solved with the normalization condition:

$$C(\tau) Y = C Y \zeta, \tag{42}$$

$$Y^{T} C Y = I, \tag{43}$$

where $\zeta = \{\zeta_{\alpha}\}$ and $Y = (y_{1} \dots y_{f})$ are eigenvalue and eigenvector matrices, respectively. Since matrix elements of $\zeta$ and $Y$ are generally obtained as complex numbers, the symmetrization $\bar{C}(\tau) = \frac{1}{2} (C(\tau) + C^{T}(\tau))$ is conducted in Refs. [120,121], and Equation (42) is modified as:

$$\bar{C}(\tau) Y = C Y \zeta, \tag{44}$$

such that $\zeta$ and $Y$ comprise real values. Combining Equation (42) or (44) with (43) gives:

$$Y^{T} C(\tau) Y = \zeta \quad \text{or} \quad Y^{T} \bar{C}(\tau) Y = \zeta, \tag{45}$$

respectively. Equations (43) and (45) indicate that the autocorrelation function of the $\alpha$th time-independent component at time 0 is $y_{\alpha}^{T} C y_{\alpha} = 1$ and that at time $\tau$ is $y_{\alpha}^{T} \bar{C}(\tau) y_{\alpha} = \zeta_{\alpha}$. If the autocorrelation function is a single exponential decay with characteristic time $T_{\alpha}$, we obtain the relation $\exp(-\tau / T_{\alpha}) = \zeta_{\alpha}$, which results in:

$$T_{\alpha} = -\frac{\tau}{\ln \zeta_{\alpha}}. \tag{46}$$
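Equations (41)–(46) translate directly into a short numerical procedure. The sketch below is a hedged illustration rather than the implementation used in the cited works: the function name `tica` is hypothetical, the generalized eigenvalue problem is solved by whitening with $C^{-1/2}$ using NumPy only, and the test trajectory is a synthetic mixture of two exponentially relaxing processes with assumed relaxation times.

```python
import numpy as np

def tica(dq, lag):
    """tICA for a mean-subtracted trajectory dq (n_frames, f) with the lag given in frames."""
    C0 = dq.T @ dq / len(dq)                          # C = C(0)
    Ct = dq[:-lag].T @ dq[lag:] / (len(dq) - lag)     # C(tau), Eq. (41)
    Cs = 0.5 * (Ct + Ct.T)                            # symmetrized time-lagged matrix
    # Solve Cs Y = C Y zeta (Eq. (44)) by whitening: W^T C W = I
    lam, V = np.linalg.eigh(C0)
    W = V / np.sqrt(lam)
    zeta, Z = np.linalg.eigh(W.T @ Cs @ W)
    order = np.argsort(zeta)[::-1]                    # slowest component first
    return zeta[order], (W @ Z)[:, order], C0, Cs     # Y satisfies Y^T C Y = I, Eq. (43)

rng = np.random.default_rng(6)
n_frames = 100000
T_true = np.array([200.0, 20.0])          # assumed relaxation times in frames
a = np.exp(-1.0 / T_true)
x = np.zeros((n_frames, 2))
noise = rng.standard_normal((n_frames, 2))
for i in range(1, n_frames):              # two independent, exponentially relaxing processes
    x[i] = a * x[i - 1] + np.sqrt(1.0 - a ** 2) * noise[i]

mix = np.array([[1.0, 0.5],
                [0.3, 2.0]])              # hypothetical linear mixing into observed coordinates
q = x @ mix.T
dq = q - q.mean(axis=0)

lag = 10
zeta, Y, C0, Cs = tica(dq, lag)
T_est = -lag / np.log(zeta)               # Eq. (46)
print(T_est)                              # estimates of the two relaxation times
```

The estimated relaxation times depend on the length of the trajectory and on the chosen lag, so in practice several lag times are usually compared.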

Dynamic component analysis (DCA) considers the time-lagged variance–covariance matrix for normalized PCs, but Equation (41) is recovered in the original coordinates [122], indicating the equivalence of tICA and DCA in this formulation. In practice, however, DCA uses the inter-residue distance (distance map) as the coordinates of PCA [122].

Relaxation mode analysis (RMA) [123,124,125,126,127] considers “whitened” signals as relaxation modes, with the $\alpha$th relaxation mode $\phi_{\alpha}(t)$ satisfying:

$$c(\tau) = \{\langle \phi_{\alpha}(0) \phi_{\beta}(\tau) \rangle\} = \{\delta_{\alpha \beta} \exp(-\kappa_{\alpha} \tau)\}, \tag{47}$$

where $\kappa = \{\kappa_{\alpha}\}$ and $\delta_{\alpha \beta}$ indicate the relaxation rate and Kronecker delta, respectively. Independence and single exponential decay of each relaxation mode are assumed in Equation (47). Using $c(\tau)$ values at time $\tau$ and $c = c(0)$, the relaxation rates are determined as:

$$c(\tau) y = \exp(-\kappa \tau) c y, \tag{48}$$

$$y^{T} c y = I. \tag{49}$$

These equations correspond to Equations (42) and (43) for tICA, indicating the equivalence of RMA and tICA, and that RMA in practice can be considered as tICA for the whitened signals. Similar to tICA, the symmetrization $\bar{c}(\tau) = \frac{1}{2} (c(\tau) + c^{T}(\tau))$ is also used in RMA. Both tICA and RMA can obtain $T_{\alpha}$ or $\kappa_{\alpha}$, which characterize a single exponential decay.

In principle, the results of these methods depend on the time lag parameter $\tau$. The rigid-body domain motion of lysine-, arginine- and ornithine-binding protein (LAO) was analyzed by tICA in the ligand-free open state during a 600 ns MD using a 10 ns lag time, and the IC vectors and relaxation rates were shown to be fairly robust on this time scale [120]. tICA of LAO backbone dynamics for 1 μs was conducted using a 1 ns lag time [121], whereas DCA of the folding/unfolding dynamics of the FiP35 WW domain for two 100 μs MD trajectories used a 10 ns lag time. RMA of a 10-residue peptide, chignolin, for a 750 ns MD simulation at 450 K in solution used a relatively short lag time (10 ps) [127]. A two-step RMA was proposed and used to analyze a 2 μs MD trajectory of hen egg-white lysozyme [128].

tICA projections of high-dimensional random walks were recently shown to resemble cosine functions, be strongly dependent on the lag time and be very similar to those of 1 μs MD trajectories of proteins, particularly for larger proteins [129]. Although the introduction of a lag time allows tICA to provide richer information than PCA, care must be taken in choosing the lag time.

Time-dependent PCA (TDPCA), proposed recently, conducts multiple PCAs for short segments taken from a single MD trajectory by shifting the time window and allowing overlap, which provides time-dependent eigenvalues and eigenvectors. This approach was applied to a bulk water model and a coarse-grained protein-G model [130].
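The sliding-window idea can be sketched as follows. This is only an illustration of conducting PCA on overlapping segments of a single trajectory, not the published TDPCA procedure; the function name `sliding_window_pca`, the window length, the stride and the synthetic trajectory are all assumptions.

```python
import numpy as np

def sliding_window_pca(q, window, stride):
    """PCA on overlapping segments of a trajectory q with shape (n_frames, f).

    Returns a list of (start_frame, eigenvalues, eigenvectors), sorted by descending eigenvalue.
    """
    results = []
    for start in range(0, len(q) - window + 1, stride):
        seg = q[start:start + window]
        dq = seg - seg.mean(axis=0)                # mean of the segment, not of the whole trajectory
        lam, V = np.linalg.eigh(dq.T @ dq / window)
        results.append((start, lam[::-1], V[:, ::-1]))
    return results

rng = np.random.default_rng(7)
q = 0.1 * rng.standard_normal((1000, 3))
# Hypothetical trajectory: large fluctuation along x in the first half, along y in the second
q[:500, 0] += 2.0 * rng.standard_normal(500)
q[500:, 1] += 2.0 * rng.standard_normal(500)

windows = sliding_window_pca(q, window=200, stride=100)   # stride < window gives overlap
first_pc_start = windows[0][2][:, 0]     # dominant eigenvector of the first window
first_pc_end = windows[-1][2][:, 0]      # dominant eigenvector of the last window
```

In this toy example the dominant eigenvector switches from the x direction in the early windows to the y direction in the late windows, a change that a single PCA over the whole trajectory would average out.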

tICA and PCA are widely used as dimension reduction methods in the Markov state model (MSM) [131,132,133]. Notably, kernel methods were combined with tICA to provide kernel tICA (ktICA) for MSM [134]. The use of tICA-related methods in MSM is described in detail in recent review articles on MSM [135,136].

11. Conclusions

This review reported the development of PCA and related methods for analyzing protein dynamics, from basic concepts to the latest advanced methods. Possible applications of PCA are now very broad and many variants of PCA have been developed for specific purposes. As is clear from this review, it is difficult to specify the best and most versatile PCA method. Rather, the most suitable method should be chosen based on the purpose of the simulation and analysis. In many of the examples described, improved methods provided better performance than “classical” PCA. Additionally, the choice of original coordinates is important, as shown in Section 6. In addition, different methods should be tested, either to select the best method or to examine the validity of the obtained conclusion.

It is worth mentioning that some of the advanced methods described in this review do not directly employ original coordinates for large molecules but rather use a two-step procedure. Specifically, standard PCA and dimensional reduction into a smaller subspace are conducted first, then more advanced component analysis is conducted. For example, FCA [98] and ISA [101] described in these references used 100 PCs. DCA described in [122] used dimensional reduction by PCA but the number of PCs employed is unclear to us. The top five PCs were used in kPCA [21]. Therefore, even in cases where more sophisticated methods than standard PCA are used, PCA can be used for dimensionality reduction and as a reference for the comparison of other PCA-related methods.

Author Contributions

Conceptualization, A.K.; investigation, A.K.; writing—original draft preparation, review and editing, A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by MEXT/JSPS KAKENHI Nos. JP19H03191, JP20H05439, JP21H05510, and JP22H04745, and by MEXT as a “Program for Promoting Researches on the Supercomputer Fugaku” (Application of Molecular Dynamics Simulation to Precision Medicine Using Big Data Integration System for Drug Discovery, JPMXP1020200201, and Biomolecular Dynamics in a Living Cell, JPMXP1020200101) to A.K.

Conflicts of Interest

The author declares no conflict of interest.

References

Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572.
Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 417–441.
McCammon, J.A.; Gelin, B.R.; Karplus, M. Dynamics of Folded Proteins. Nature 1977, 267, 585–590.
Go, N.; Noguti, T.; Nishikawa, T. Dynamics of a Small Globular Protein in Terms of Low-Frequency Vibrational-Modes. Proc. Natl. Acad. Sci. USA 1983, 80, 3696–3700.
Levitt, M.; Sander, C.; Stern, P.S. The normal modes of a protein: Native bovine pancreatic trypsin inhibitor. Int. J. Quant. Chem. 1983, 24, 181–199.
Brooks, B.; Karplus, M. Harmonic dynamics of proteins: Normal modes and fluctuations in bovine pancreatic trypsin inhibitor. Proc. Natl. Acad. Sci. USA 1983, 80, 6571–6575.
Richards, F.M. The interpretation of protein structures: Total volume, group volume distributions and packing density. J. Mol. Biol. 1974, 82, 1–14.
Karplus, M.; Kushick, J.N. Method for estimating the configurational entropy of macromolecules. Macromolecules 1981, 14, 325–332.
Levy, R.M.; Srinivasan, A.R.; Olson, W.K.; McCammon, J.A. Quasi-harmonic method for studying very low frequency modes in proteins. Biopolymers 1984, 23, 1099–1112.
Levy, R.M.; Rojas, O.D.; Friesner, R.A. Quasi-Harmonic Method for Calculating Vibrational-Spectra from Classical Simulations on Multidimensional Anharmonic Potential Surfaces. J. Phys. Chem. 1984, 88, 4233–4238.
Horiuchi, T.; Go, N. Projection of Monte Carlo and molecular dynamics trajectories onto the normal mode axes: Human lysozyme. Proteins Struct. Funct. Genet. 1991, 10, 106–116.
Kitao, A.; Hirata, F.; Gō, N. The effects of solvent on the conformation and the collective motions of protein: Normal mode analysis and molecular dynamics simulations of melittin in water and in vacuum. Chem. Phys. 1991, 158, 447–472.
García, A.E. Large-Amplitude Nonlinear Motions in Proteins. Phys. Rev. Lett. 1992, 68, 2696–2699.
Amadei, A.; Linssen, A.B.M.; Berendsen, H.J.C. Essential Dynamics of Proteins. Proteins Struct. Funct. Genet. 1993, 17, 412–425.
Hayward, S.; Kitao, A.; Hirata, F.; Go, N. Effect of solvent on collective motions in globular protein. J. Mol. Biol. 1993, 234, 1207–1217.
Maisuradze, G.G.; Liwo, A.; Scheraga, H.A. Principal component analysis for protein folding dynamics. J. Mol. Biol. 2009, 385, 312–329.
Maisuradze, G.G.; Liwo, A.; Senet, P.; Scheraga, H.A. Local vs global motions in protein folding. J. Chem. Theory Comput. 2013, 9, 2907–2921.
Hayward, S.; Go, N. Collective Variable Description of Native Protein Dynamics. Annu. Rev. Phys. Chem. 1995, 46, 223–250.
Kitao, A.; Go, N. Investigating protein dynamics in collective coordinate space. Curr. Opin. Struct. Biol. 1999, 9, 164–169.
Berendsen, H.J.C.; Hayward, S. Collective protein dynamics in relation to function. Curr. Opin. Struct. Biol. 2000, 10, 165–169.
David, C.C.; Jacobs, D.J. Principal component analysis: A method for determining the essential dynamics of proteins. Methods Mol. Biol. 2014, 1084, 193–226.
Kitao, A.; Takemura, K. High anisotropy and frustration: The keys to regulating protein function efficiently in crowded environments. Curr. Opin. Struct. Biol. 2017, 42, 50–58.
Sittel, F.; Stock, G. Perspective: Identification of collective variables and metastable states of protein dynamics. J. Chem. Phys. 2018, 149, 150901.
Davis, I.W.; Arendall, W.B., 3rd; Richardson, D.C.; Richardson, J.S. The backrub motion: How protein backbone shrugs when a sidechain dances. Structure 2006, 14, 265–274.
Hayward, S. Peptide-plane flipping in proteins. Protein Sci. 2001, 10, 2219–2227.
Nishima, W.; Qi, G.; Hayward, S.; Kitao, A. DTA: Dihedral transition analysis for characterization of the effects of large main-chain dihedral changes in proteins. Bioinformatics 2009, 25, 628–635.
Kitao, A.; Hayward, S.; Go, N. Energy landscape of a native protein: Jumping-among-minima model. Proteins Struct. Funct. Genet. 1998, 33, 496–517.
Joti, Y.; Kitao, A.; Go, N. Protein boson peak originated from hydration-related multiple minima energy landscape. J. Am. Chem. Soc. 2005, 127, 8705–8709.
Kitao, A.; Wagner, G. A space-time structure determination of human CD2 reveals the CD58-binding mode. Proc. Natl. Acad. Sci. USA 2000, 97, 2064–2068.
Kitao, A.; Wagner, G. Amplitudes and directions of internal protein motions from a JAM analysis of 15N relaxation data. Magn. Reson. Chem. 2006, 44, S130–S142.
Hess, B. Similarities between principal components of protein dynamics and random diffusion. Phys. Rev. E 2000, 62, 8438–8448.
Edelman, A.; Rao, N.R. Random matrix theory. Acta Numer. 2005, 14, 233–297.
Kwapień, J.; Drożdż, S. Physical approach to complex systems. Phys. Rep. 2012, 515, 115–226.
Palese, L.L. Random Matrix Theory in molecular dynamics analysis. Biophys. Chem. 2015, 196, 1–9.
Palese, L.L. A random version of principal component analysis in data clustering. Comput. Biol. Chem. 2018, 73, 57–64.
Cossio-Perez, R.; Palma, J.; Pierdominici-Sottile, G. Consistent Principal Component Modes from Molecular Dynamics Simulations of Proteins. J. Chem. Inf. Model. 2017, 57, 826–834.
Hayward, S.; Kitao, A.; Go, N. Harmonicity and anharmonicity in protein dynamics: A normal mode analysis and principal component analysis. Proteins Struct. Funct. Genet. 1995, 23, 177–186.
Bahar, I.; Rader, A.J. Coarse-grained normal mode analysis in structural biology. Curr. Opin. Struct. Biol. 2005, 15, 586–592.
Ma, J. Usefulness and limitations of normal mode analysis in modeling dynamics of biomolecular complexes. Structure 2005, 13, 373–380.
Normal Mode Analysis: Theory and Applications to Biological and Chemical Systems; Cui, Q., Bahar, I., Eds.; Chapman & Hall/CRC: Boca Raton, FL, USA, 2006.
Dykeman, E.C.; Sankey, O.F. Normal mode analysis and applications in biological physics. J. Phys. Condens. Matter 2010, 22, 423202.
Yamato, T.; Laprevote, O. Normal mode analysis and beyond. Biophys. Physicobiol. 2019, 16, 322–327.
Jacob, A.B.; Vladena, B.-H. Normal Mode Analysis: A Tool for Better Understanding Protein Flexibility and Dynamics with Application to Homology Models. In Homology Molecular Modeling; Rafael Trindade, M., de Moraes, F.R.M., Magnólia, C., Eds.; IntechOpen: Rijeka, Croatia, 2021.
Moritsugu, K.; Smith, J.C. Langevin model of the temperature and hydration dependence of protein vibrational dynamics. J. Phys. Chem. B 2005, 109, 12182–12194.
Moritsugu, K.; Smith, J.C. Temperature-dependent protein dynamics: A simulation-based probabilistic diffusion-vibration Langevin description. J. Phys. Chem. B 2006, 110, 5807–5816.
Lamm, G.; Szabo, A. Langevin Modes of Macromolecules. J. Chem. Phys. 1986, 85, 7334–7348.
Kottalam, J.; Case, D.A. Langevin Modes of Macromolecules—Applications to Crambin and DNA Hexamers. Biopolymers 1990, 29, 1409–1421.
Kirkwood, J.G.; Riseman, J. The Intrinsic Viscosities and Diffusion Constants of Flexible Macromolecules in Solution. J. Chem. Phys. 1948, 16, 565–573.
Kirkwood, J.G. The statistical mechanical theory of irreversible processes in solutions of flexible macromolecules. Visco-elastic behavior. Recl. Trav. Chim. Pays-Bas 1949, 68, 649–660.
Rotne, J.; Prager, S. Variational Treatment of Hydrodynamic Interaction in Polymers. J. Chem. Phys. 1969, 50, 4831–4837.
Kim, B.; Hirata, F. Structural fluctuation of protein in water around its native state: A new statistical mechanics formulation. J. Chem. Phys. 2013, 138, 054108.
Hirata, F.; Kim, B. Multi-scale dynamics simulation of protein based on the generalized Langevin equation combined with 3D-RISM theory. J. Mol. Liq. 2016, 217, 23–28.
Chong, S.-H.; Hirata, F. Dynamics of solvated ion in polar liquids: An interaction-site-model description. J. Chem. Phys. 1998, 108, 7339–7349.
Chong, S.-H.; Hirata, F. Dynamics of ions in liquid water: An interaction-site-model description. J. Chem. Phys. 1999, 111, 3654–3667.
Hirata, F. On the interpretation of the temperature dependence of the mean square displacement (MSD) of protein, obtained from the incoherent neutron scattering. J. Mol. Liq. 2018, 270, 218–226.
Hayward, S.; Kitao, A.; Go, N. Harmonic and anharmonic aspects in the dynamics of BPTI: A normal mode analysis and principal component analysis. Protein Sci. 1994, 3, 936–943.
Eckart, C. Some studies concerning rotating axes and polyatomic molecules. Phys. Rev. 1935, 47, 552–558.
Kabsch, W. Solution for Best Rotation to Relate 2 Sets of Vectors. Acta Crystallogr. A 1976, 32, 922–923.
Omori, S.; Fuchigami, S.; Ikeguchi, M.; Kidera, A. Linear response theory in dihedral angle space for protein structural change upon ligand binding. J. Comput. Chem. 2009, 30, 2602–2608.
Omori, S.; Fuchigami, S.; Ikeguchi, M.; Kidera, A. Latent dynamics of a protein molecule observed in dihedral angle space. J. Chem. Phys. 2010, 132, 115103.
Mu, Y.G.; Nguyen, P.H.; Stock, G. Energy landscape of a small peptide revealed by dihedral angle principal component analysis. Proteins 2005, 58, 45–52.
Altis, A.; Nguyen, P.H.; Hegger, R.; Stock, G. Dihedral angle principal component analysis of molecular dynamics simulations. J. Chem. Phys. 2007, 126, 244111.
Altis, A.; Otten, M.; Nguyen, P.H.; Hegger, R.; Stock, G. Construction of the free energy landscape of biomolecules via dihedral angle principal component analysis. J. Chem. Phys. 2008, 128, 245102.
Sittel, F.; Jain, A.; Stock, G. Principal component analysis of molecular dynamics: On the use of Cartesian vs. internal coordinates. J. Chem. Phys. 2014, 141, 014111.
Huckemann, S.; Ziezold, H. Principal component analysis for Riemannian manifolds, with an application to triangular shape spaces. Adv. Appl. Probab. 2006, 38, 299–319.
Sargsyan, K.; Wright, J.; Lim, C. GeoPCA: A new tool for multivariate analysis of dihedral angles based on principal component geodesics. Nucleic Acids Res. 2012, 40, e25, Erratum in Nucleic Acids Res. 2015, 43, 10571–10572.
Nodehi, A.; Golalizadeh, M.; Heydari, A. Dihedral angles principal geodesic analysis using nonlinear statistics. J. Appl. Stat. 2015, 42, 1962–1972. [Google Scholar] [CrossRef]
Eltzner, B.; Huckemann, S.; Mardia, K.V. Torus principal component analysis with applications to RNA structure. Ann. Appl. Stat. 2018, 12, 1332–1359. [Google Scholar] [CrossRef] [Green Version]
Sittel, F.; Filk, T.; Stock, G. Principal component analysis on a torus: Theory and application to protein dynamics. J. Chem. Phys. 2017, 147, 244101. [Google Scholar] [CrossRef]
Post, M.; Wolf, S.; Stock, G. Principal component analysis of nonequilibrium molecular dynamics simulations. J. Chem. Phys. 2019, 150, 204110. [Google Scholar] [CrossRef] [Green Version]
Abagyan, R.; Argos, P. Optimal protocol and trajectory visualization for conformational searches of peptides and proteins. J. Mol. Biol. 1992, 225, 519–532. [Google Scholar] [CrossRef]
David, C.C.; Singam, E.R.A.; Jacobs, D.J. JED: A Java Essential Dynamics Program for comparative analysis of protein trajectories. BMC Bioinform. 2017, 18, 271. [Google Scholar] [CrossRef] [Green Version]
Ernst, M.; Sittel, F.; Stock, G. Contact- and distance-based principal component analysis of protein dynamics. J. Chem. Phys. 2015, 143, 244114. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Heringa, J.; Argos, P. Side-chain clusters in protein structures and their role in protein folding. J. Mol. Biol. 1991, 220, 151–171. [Google Scholar] [CrossRef]
Ogata, K. Investigation of Cooperative Modes for Collective Molecules Using Grid-Based Principal Component Analysis. J. Phys. Chem. B 2021, 125, 1072–1084. [Google Scholar] [CrossRef]
Beattie, J.R.; Esmonde-White, F.W.L. Exploration of Principal Component Analysis: Deriving Principal Component Analysis Visually Using Spectra. Appl. Spectrosc. 2021, 75, 361–375. [Google Scholar] [CrossRef] [PubMed]
Cochran, R.N.; Horne, F.H. Strategy for resolving rapid scanning wavelength experiments by principal component analysis. J. Phys. Chem. 1980, 84, 2561–2567. [Google Scholar] [CrossRef]
Cochran, R.N.; Horne, F.H.; Dye, J.L.; Ceraso, J.; Suelter, C.H. Principal component analysis of rapid scanning wavelength stopped-flow kinetics experiments on the liver alcohol dehydrogenase catalyzed reduction of p-nitroso-N,N-dimethylaniline by 1,4-dihydronicotinamide adenine dinucleotide. J. Phys. Chem. 1980, 84, 2567–2575. [Google Scholar] [CrossRef]
Yuan, B.; Murayama, K.; Wu, Y.; Tsenkova, R.; Dou, X.; Era, S.; Ozaki, Y. Temperature-dependent near-infrared spectra of bovine serum albumin in aqueous solutions: Spectral analysis by principal component analysis and evolving factor analysis. Appl. Spectrosc. 2003, 57, 1223–1229. [Google Scholar] [CrossRef]
Sakurai, K.; Goto, Y. Principal component analysis of the pH-dependent conformational transitions of bovine beta-lactoglobulin monitored by heteronuclear NMR. Proc. Natl. Acad. Sci. USA 2007, 104, 15346–15351. [Google Scholar] [CrossRef] [Green Version]
Henry, E.R. The Use of Matrix Methods in the Modeling of Spectroscopic Data Sets. Biophys. J. 1997, 72, 652–673. [Google Scholar] [CrossRef] [Green Version]
Shrager, R.I.; Hendler, R.W. Titration of individual components in a mixture with resolution of difference spectra, pKs, and redox transitions. Anal. Chem. 1982, 54, 1147–1152. [Google Scholar] [CrossRef]
Hofrichter, J.; Sommer, J.H.; Henry, E.R.; Eaton, W.A. Nanosecond absorption spectroscopy of hemoglobin: Elementary processes in kinetic cooperativity. Proc. Natl. Acad. Sci. USA 1983, 80, 2235–2239. [Google Scholar] [CrossRef] [Green Version]
Schmidt, M.; Rajagopal, S.; Ren, Z.; Moffat, K. Application of Singular Value Decomposition to the Analysis of Time-Resolved Macromolecular X-Ray Data. Biophys. J. 2003, 84, 2112–2129. [Google Scholar] [CrossRef] [Green Version]
Rajagopal, S.; Schmidt, M.; Anderson, S.; Ihee, H.; Moffat, K. Analysis of experimental time-resolved crystallographic data by singular value decomposition. Acta Crystallogr. D 2004, 60, 860–871. [Google Scholar] [CrossRef]
Kostov, K.S.; Moffat, K. Cluster analysis of time-dependent crystallographic data: Direct identification of time-independent structural intermediates. Biophys. J. 2011, 100, 440–449. [Google Scholar] [CrossRef] [Green Version]
Kubo, R. The fluctuation-dissipation theorem. Rep. Prog. Phys. 1966, 29, 255–284. [Google Scholar] [CrossRef] [Green Version]
Des Cloizeaux, D. Linear Response, Generalized Susceptibility and Dispersion Theory. In Theory of Condensed Matter; Bassani, F., Caglioti, G., Ziman, J., Eds.; International Center for Theoretical Physics: Trieste, Italy, 1968; pp. 325–354. [Google Scholar]
Ikeguchi, M.; Ueno, J.; Sato, M.; Kidera, A. Protein structural change upon ligand binding: Linear response theory. Phys. Rev. Lett. 2005, 94, 078102. [Google Scholar] [CrossRef]
Yang, L.W.; Kitao, A.; Huang, B.C.; Go, N. Ligand-Induced Protein Responses and Mechanical Signal Propagation Described by Linear Response Theories. Biophys. J. 2014, 107, 1415–1425. [Google Scholar] [CrossRef] [Green Version]
Hirata, F. A molecular theory of the structural dynamics of protein induced by a perturbation. J. Chem. Phys. 2016, 145, 234106. [Google Scholar] [CrossRef]
Kitao, A. Transform and relax sampling for highly anisotropic systems: Application to protein domain motion and folding. J. Chem. Phys. 2011, 135, 045101, Erratum in J. Chem. Phys. 2011, 135, 119903. [Google Scholar] [CrossRef]
Tamura, K.; Hayashi, S. Linear Response Path Following: A Molecular Dynamics Method To Simulate Global Conformational Changes of Protein upon Ligand Binding. J. Chem. Theory Comput. 2015, 11, 2900–2917. [Google Scholar] [CrossRef]
Tamura, K.; Hayashi, S. Atomistic modeling of alternating access of a mitochondrial ADP/ATP membrane transporter with molecular simulations. PLoS ONE 2017, 12, e0181489. [Google Scholar] [CrossRef]
Jutten, C.; Herault, J. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Process. 1991, 24, 1–10. [Google Scholar] [CrossRef]
Comon, P. Independent component analysis, A new concept? Signal Process. 1994, 36, 287–314. [Google Scholar] [CrossRef]
Hyvärinen, A.; Oja, E. Independent component analysis: Algorithms and applications. Neural Netw. 2000, 13, 411–430. [Google Scholar] [CrossRef] [Green Version]
Lange, O.F.; Grubmüller, H. Full correlation analysis of conformational protein dynamics. Proteins 2008, 70, 1294–1312. [Google Scholar] [CrossRef]
Nguyen, P.H. Conformational states and folding pathways of peptides revealed by principal-independent component analyses. Proteins 2007, 67, 579–592. [Google Scholar] [CrossRef]
Sakuraba, S.; Joti, Y.; Kitao, A. Detecting coupled collective motions in protein by independent subspace analysis. J. Chem. Phys. 2010, 133, 185102. [Google Scholar] [CrossRef]
Theis, F.J. Towards a general independent subspace analysis. In Advances in Neural Information Processing Systems; Schölkopf, B., Platt, J., Hoffman, T.E., Eds.; MIT Press: Cambridge, MA, USA, 2007; Volume 19, pp. 1361–1368. [Google Scholar]
Nguyen, P.H. Complexity of free energy landscapes of peptides revealed by nonlinear principal component analysis. Proteins 2006, 65, 898–913. [Google Scholar] [CrossRef]
Kramer, M.A. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 1991, 37, 233–243. [Google Scholar] [CrossRef]
Schölkopf, B.; Smola, A.; Müller, K.-R. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Comput. 1998, 10, 1299–1319. [Google Scholar] [CrossRef] [Green Version]
Coifman, R.R.; Lafon, S.; Lee, A.B.; Maggioni, M.; Nadler, B.; Warner, F.; Zucker, S.W. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proc. Natl. Acad. Sci. USA 2005, 102, 7426–7431. [Google Scholar] [CrossRef] [Green Version]
Coifman, R.R.; Lafon, S. Diffusion maps. Appl. Comput. Harmon. Anal. 2006, 21, 5–30. [Google Scholar] [CrossRef] [Green Version]
de la Porte, J.; Herbst, B.M.; Hereman, W.; van der Walt, S.J. An Introduction to Diffusion Maps. In Proceedings of the 19th Symposium of the Pattern Recognition Association of South Africa (PRASA 2008), Cape Town, South Africa, 27–28 November 2008. [Google Scholar]
Ferguson, A.L.; Zhang, S.; Dikiy, I.; Panagiotopoulos, A.Z.; Debenedetti, P.G.; James Link, A. An experimental and computational investigation of spontaneous lasso formation in microcin J25. Biophys. J. 2010, 99, 3056–3065. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Kim, S.B.; Dsilva, C.J.; Kevrekidis, I.G.; Debenedetti, P.G. Systematic characterization of protein folding pathways using diffusion maps: Application to Trp-cage miniprotein. J. Chem. Phys. 2015, 142, 085101. [Google Scholar] [CrossRef] [Green Version]
Trstanova, Z.; Leimkuhler, B.; Lelièvre, T. Local and global perspectives on diffusion maps in the analysis of molecular systems. Proc. R. Soc. A Math. Phys. Eng. Sci. 2020, 476, 20190036. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Hus, J.C.; Brüschweiler, R. Principal component method for assessing structural heterogeneity across multiple alignment media. J. Biomol. NMR 2002, 24, 123–132. [Google Scholar] [CrossRef] [PubMed]
Howe, P.W. Principal components analysis of protein structure ensembles calculated using NMR data. J. Biomol. NMR 2001, 20, 61–70. [Google Scholar] [CrossRef]
Yang, L.W.; Eyal, E.; Bahar, I.; Kitao, A. Principal component analysis of native ensembles of biomolecular structures (PCA_NEST): Insights into functional dynamics. Bioinformatics 2009, 25, 606–614, Erratum in Bioinformatics 2009, 25, 2147–2147. [Google Scholar] [CrossRef]
Sakuraba, S.; Kono, H. Spotting the difference in molecular dynamics simulations of biomolecules. J. Chem. Phys. 2016, 145, 074116. [Google Scholar] [CrossRef] [Green Version]
Wang, H.; Yan, S.; Xu, D.; Tang, X.; Huang, T. Trace Ratio vs. Ratio Trace for Dimensionality Reduction. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, 17–22 June 2007; pp. 1–8. [Google Scholar]
Ngo, T.T.; Bellalij, M.; Saad, Y. The Trace Ratio Optimization Problem. SIAM Rev. 2012, 54, 545–569. [Google Scholar] [CrossRef] [Green Version]
Peters, J.H.; de Groot, B.L. Ubiquitin dynamics in complexes reveal molecular recognition mechanisms beyond induced fit and conformational selection. PLoS Comput. Biol. 2012, 8, e1002704. [Google Scholar] [CrossRef] [PubMed]
Ahmad, M.; Helms, V.; Kalinina, O.V.; Lengauer, T. Relative Principal Components Analysis: Application to Analyzing Biomolecular Conformational Changes. J. Chem. Theory Comput. 2019, 15, 2166–2178. [Google Scholar] [CrossRef] [PubMed]
Molgedey, L.; Schuster, H.G. Separation of a Mixture of Independent Signals Using Time-Delayed Correlations. Phys. Rev. Lett. 1994, 72, 3634–3637. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Naritomi, Y.; Fuchigami, S. Slow dynamics in protein fluctuations revealed by time-structure based independent component analysis: The case of domain motions. J. Chem. Phys. 2011, 134, 065101. [Google Scholar] [CrossRef]
Naritomi, Y.; Fuchigami, S. Slow dynamics of a protein backbone in molecular dynamics simulation revealed by time-structure based independent component analysis. J. Chem. Phys. 2013, 139, 215102. [Google Scholar] [CrossRef]
Mori, T.; Saito, S. Dynamic heterogeneity in the folding/unfolding transitions of FiP35. J. Chem. Phys. 2015, 142, 135101. [Google Scholar] [CrossRef]
Takano, H.; Miyashita, S. Relaxation Modes in Random Spin Systems. J. Phys. Soc. Jpn. 1995, 64, 3688–3698. [Google Scholar] [CrossRef]
Hirao, H.; Koseki, S.; Takano, H. Molecular Dynamics Study of Relaxation Modes of a Single Polymer Chain. J. Phys. Soc. Jpn. 1997, 66, 3399–3405. [Google Scholar] [CrossRef]
Koseki, S.; Hirao, H.; Takano, H. Monte Carlo Study of Relaxation Modes of a Single Polymer Chain. J. Phys. Soc. Jpn. 1997, 66, 1631–1637. [Google Scholar] [CrossRef]
Mitsutake, A.; Iijima, H.; Takano, H. Relaxation mode analysis of a peptide system: Comparison with principal component analysis. J. Chem. Phys. 2011, 135, 164102. [Google Scholar] [CrossRef] [PubMed]
Mitsutake, A.; Takano, H. Relaxation mode analysis and Markov state relaxation mode analysis for chignolin in aqueous solution near a transition temperature. J. Chem. Phys. 2015, 143, 124111. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Karasawa, N.; Mitsutake, A.; Takano, H. Two-step relaxation mode analysis with multiple evolution times applied to all-atom molecular dynamics protein simulation. Phys. Rev. E 2017, 96, 062408. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Schultze, S.; Grubmüller, H. Time-Lagged Independent Component Analysis of Random Walks and Protein Dynamics. J. Chem. Theory Comput. 2021, 17, 5766–5776. [Google Scholar] [CrossRef] [PubMed]
Morishita, T. Time-dependent principal component analysis: A unified approach to high-dimensional data reduction using adiabatic dynamics. J. Chem. Phys. 2021, 155, 134114. [Google Scholar] [CrossRef]
Pérez-Hernández, G.; Paul, F.; Giorgino, T.; De Fabritiis, G.; Noé, F. Identification of slow molecular order parameters for Markov model construction. J. Chem. Phys. 2013, 139, 015102. [Google Scholar] [CrossRef]
Scherer, M.K.; Trendelkamp-Schroer, B.; Paul, F.; Pérez-Hernández, G.; Hoffmann, M.; Plattner, N.; Wehmeyer, C.; Prinz, J.H.; Noé, F. PyEMMA 2: A Software Package for Estimation, Validation, and Analysis of Markov Models. J. Chem. Theory Comput. 2015, 11, 5525–5542. [Google Scholar] [CrossRef]
Harrigan, M.P.; Sultan, M.M.; Hernandez, C.X.; Husic, B.E.; Eastman, P.; Schwantes, C.R.; Beauchamp, K.A.; McGibbon, R.T.; Pande, V.S. MSMBuilder: Statistical Models for Biomolecular Dynamics. Biophys. J. 2017, 112, 10–15. [Google Scholar] [CrossRef] [Green Version]
Schwantes, C.R.; Pande, V.S. Modeling molecular kinetics with tICA and the kernel trick. J. Chem. Theory Comput. 2015, 11, 600–608. [Google Scholar] [CrossRef] [Green Version]
Husic, B.E.; Pande, V.S. Markov State Models: From an Art to a Science. J. Am. Chem. Soc. 2018, 140, 2386–2396. [Google Scholar] [CrossRef]
Wang, X.; Unarta, I.C.; Cheung, P.P.; Huang, X. Elucidating molecular mechanisms of functional conformational changes of proteins via Markov state models. Curr. Opin. Struct. Biol. 2021, 67, 69–77. [Google Scholar] [CrossRef]

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
