#### Appendix B.1. Short Outlines of the Principles and Applications of Multivariate (MV) Analysis Techniques Employed in Chemometrics and Metabolomics Investigations

For the benefit of readers who are unfamiliar with the MV statistical analysis techniques frequently employed in ‘state-of-the-art’ metabolomics and chemometrics investigations, an outline of some of the more commonly used approaches is provided below in Appendix B.1.1–Appendix B.1.6. Readers are referred to refs. [39,40] for further information.

#### Appendix B.1.1. Multivariate Analysis-of-Variance (MANOVA)

MANOVA represents a generalisation of ANOVA which extends to the study of several outcome variables simultaneously, and is generally applicable to datasets containing several inter-related outcome variables which cannot be satisfactorily addressed by consideration of only a single variable. Where the outcome variables are correlated, analysing them independently as single-variable systems usually provides highly unsatisfactory models, and an alternative analysis, which considers all response variables simultaneously, is therefore required. Hence, when applied to such systems, MANOVA serves to explore MV outcomes observed for all explanatory variables considered, for example age band/age, gender/gender score and their first-order interaction effects as possible contributors to the 3 molecularly-distinct VSC levels in this study. If any of the above 3 omnibus MV evaluations (specifically, the main factor effects of gender score/gender and age band/age, together with the gender score/gender × age band/age interaction source of variation) are statistically significant (Wilks’ lambda, employed here, is one example of such an omnibus test statistic), subsequent post-hoc tests may be performed to determine the sources of group differences.
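As an illustration, the omnibus Wilks’ lambda statistic mentioned above can be computed directly from the within- and between-group scatter matrices. The following NumPy sketch uses entirely synthetic data for a one-way, two-group design; the variable count and group means are assumptions for illustration, not values from this study:

```python
import numpy as np

def wilks_lambda(X, groups):
    """One-way MANOVA omnibus statistic: Lambda = det(E) / det(E + H),
    where E is the within-group and H the between-group SSCP matrix.
    Values near 1 indicate no group separation; values near 0, strong
    separation."""
    X = np.asarray(X, dtype=float)
    grand_mean = X.mean(axis=0)
    E = np.zeros((X.shape[1], X.shape[1]))
    H = np.zeros_like(E)
    for g in np.unique(groups):
        Xg = X[groups == g]
        gm = Xg.mean(axis=0)
        E += (Xg - gm).T @ (Xg - gm)            # within-group scatter
        d = (gm - grand_mean).reshape(-1, 1)
        H += len(Xg) * (d @ d.T)                # between-group scatter
    return np.linalg.det(E) / np.linalg.det(E + H)

# Synthetic example: 3 'VSC-like' variables, two groups whose means
# differ on the first two variables only (all values illustrative)
rng = np.random.default_rng(0)
group0 = rng.normal([1.0, 1.0, 0.5], 0.3, size=(20, 3))
group1 = rng.normal([1.6, 1.5, 0.5], 0.3, size=(20, 3))
X = np.vstack([group0, group1])
groups = np.array([0] * 20 + [1] * 20)
lam = wilks_lambda(X, groups)
```

In practice, the lambda value is converted to an approximate F-statistic for significance testing; statistical packages perform this step automatically.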

#### Appendix B.1.2. Principal Component Analysis (PCA)

PCA is an unsupervised MV analysis process which involves the computation of a relatively small number of principal components (PCs) from datasets usually containing large numbers of possible individual ‘predictor’ variables. Each PC consists of a linear combination of a series of such predictors which are correlated with each other, either positively or negatively, and either strongly or moderately so. The first, most prominent PC (PC1) accounts for the greatest amount of dataset variation; this is followed by the second PC (PC2), which contains a second set of correlated variables, is orthogonal to (i.e., completely uncorrelated with) PC1, and accounts for the second largest amount of dataset variation. Likewise for the third, fourth and fifth PCs (PC3, PC4 and PC5, respectively), and so on up to the nth PC. Hence, PCA is a technique that can condense medium-, large- or very large-sized datasets, sometimes with several thousand possible predictor variables, to a much smaller number of uncorrelated PCs, each one containing a combination of n correlated, possible predictor molecules. This strategy, and also that of factor analysis, has the ability to determine differential sources of, or explanations for, orthogonal PCs. Variables such as biomolecule concentrations which load strongly onto the same PC often provide similar or related information, diagnostic or otherwise, since they are correlated with each other and may therefore arise from the same source, for example the same metabolic pathway in metabolomics investigations. In effect, PCA alters the basis of complex, large predictor variable datasets, frequently by employing only the first 2, 3 or 4 PCs and disregarding the remainder.

PCA is often used for the purpose of exploratory data analysis, for generating valuable predictive models, and as a primary MV analysis approach for dataset ‘policing’, the latter including the detection of statistical ‘outlier’ samples, i.e., those which do not ‘fit in’ with, or appear ‘foreign’ to, the bulk of those remaining. The dimensionality reduction achieved via the projection of sample data points (e.g., individual VSC concentrations, as in this study) onto a small number (typically ≤5) of the first, most important PCs results in a much lower-dimensional dataset, whilst maintaining as much of the original dataset’s total variation as possible. PC1 can alternatively be viewed as a ‘direction’ which maximizes projected data variance, whilst PC2 may be perceived as a direction orthogonal to PC1 which likewise maximizes the variance of the data projected onto it.

In this investigation, which features a very simple dataset comprising a total of only 3 VSC variables, both H_{2}S and CH_{3}SH loaded strongly onto PC1, since they were strongly positively correlated, whereas (CH_{3})_{2}S, which was largely uncorrelated with both these PC1-loading VSCs, loaded onto PC2 alone. In PCA of autoscaled data, each eigenvalue represents the amount of (unit-scaled) variance captured by its PC, and the eigenvalues sum to the total number of original variables; hence, an eigenvalue close to an integer n is consistent with n strongly inter-correlated variables loading onto that PC. For the dataset explored here, this parameter was 1.81 (close to 2.00) for PC1, which contains 2 strongly-loading VSCs (H_{2}S and CH_{3}SH), whereas for PC2 the eigenvalue was 0.97, a value very close to 1.00, since only 1 VSC ((CH_{3})_{2}S alone) strongly loads thereon. In more complex PCA models with larger numbers of potential predictor variables, the significance of individual variables loading onto selected PCs is usually determined by consideration of their PC loadings vectors: values lower than or greater than the generally accepted threshold limits of –0.40 (for negative correlations with that PC) or +0.40 (for positive correlations with that PC), respectively, are viewed as significant. Corresponding squared cosine values of these loadings are also used for this purpose.
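The eigenvalue behaviour described above can be reproduced with a minimal NumPy sketch of correlation-matrix PCA, using synthetic data constructed to mimic the study’s correlation structure (two strongly correlated variables plus one independent variable; the variable names and coefficients are illustrative assumptions):

```python
import numpy as np

# Synthetic data: two strongly correlated 'VSC-like' variables plus one
# essentially independent one (all coefficients are illustrative)
rng = np.random.default_rng(1)
n = 200
h2s = rng.normal(size=n)
ch3sh = 0.9 * h2s + 0.4 * rng.normal(size=n)
dms = rng.normal(size=n)
X = np.column_stack([h2s, ch3sh, dms])

# Autoscaled (correlation-matrix) PCA: the eigenvalues of the
# correlation matrix sum to the number of variables
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]             # sort by variance explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
loadings = eigvecs * np.sqrt(eigvals)         # PC loadings
```

With this structure, the leading eigenvalue approaches 2 (the correlated pair loading onto PC1) while the second remains close to 1 (the single independent variable), mirroring the pattern of the 1.81 and 0.97 values reported above.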

In PCA, a Varimax rotation represents a modification of coordinates which maximizes the sum of the variances of the squared variable loadings. Therefore, this rotation procedure gives rise to loading coefficients (vectors) which are usually either high or close to zero, with only a limited number of intermediate values. This process is clearly suited to datasets containing only a few explanatory variables, such as the VSC dataset explored here, which has 3 variables and clear Pearson or partial correlations between only 2 of them. The major objective of this approach is to link each variable to a single PC only, so that the interpretation of PCA models is simplified.
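A Varimax rotation can be sketched in a few lines using the standard SVD-based iteration; the unrotated loadings matrix below is hypothetical and serves only to show that rotation preserves each variable’s communality (row sum of squared loadings) while redistributing the coefficients:

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    """Varimax rotation: find the orthogonal rotation of the loadings
    matrix L that maximises the sum of the variances of the squared
    loadings (classic SVD-based fixed-point iteration)."""
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr**3 - Lr @ np.diag((Lr**2).sum(axis=0)) / p))
        R = u @ vt                  # updated orthogonal rotation
        if s.sum() < d * (1 + tol):
            break                   # converged: no further improvement
        d = s.sum()
    return L @ R

# Hypothetical unrotated loadings: 3 variables on 2 PCs
L = np.array([[0.7, 0.5],
              [0.7, -0.5],
              [0.1, 0.6]])
L_rot = varimax(L)
```

Kaiser normalization can be layered on top of this by rescaling each row of the loadings matrix to unit length before rotation and restoring the original row lengths afterwards.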

Kaiser normalization is a method focused on the attainment of solution stability across all samples collected and analysed. With this approach, variable loadings are rescaled to unit length prior to rotation, and then rescaled back to their original sizes following it, i.e., equivalent weights are given to all variables during the rotation itself.

#### Appendix B.1.3. Principal Component Regression (PCR), Partial Least-Squares-Regression and -Discriminant Analysis (PLS-R and PLS-DA respectively), and Orthogonal Projections to Latent Structures-Discriminant Analysis (OPLS-DA)

Principal component regression (PCR) is a MV analysis technique built on PCA, and is employed for determining the statistical significance of independent explanatory (predictor) variables and estimating their regression coefficients in an OLS regression model designed to predict a quantitative dependent variable, e.g., actual participant age or gender score (0 for females and +1 for males) in the reported investigation. However, instead of regressing the dependent quantitative variable on the explanatory variables directly, sample PC scores of the PCs arising from the individual explanatory variables are employed as regressors in the regression equation. Typically, only a sub-set of the most important PCs is used for this purpose, so that PCR is a regularized system, which also serves as a class of shrinkage estimator.
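A minimal PCR sketch, assuming mean-centred data and synthetic collinear predictors (all values illustrative): the PCs are obtained by SVD, the response is regressed by OLS on a subset of the (orthogonal) PC scores, and the coefficients are mapped back to the original variables:

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """PCR sketch: centre X, take the leading PCs via SVD, regress y on
    the (orthogonal) PC scores by OLS, then map the coefficients back
    to the original variables."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    U, S, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    T = U[:, :n_components] * S[:n_components]        # PC scores
    # OLS on orthogonal scores is trivial since T.T @ T = diag(S**2)
    gamma = T.T @ (y - y_mean) / S[:n_components]**2
    beta = Vt[:n_components].T @ gamma                # back to x-space
    return beta, y_mean - x_mean @ beta

# Synthetic data with a strongly collinear predictor pair
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X[:, 1] = 0.95 * X[:, 0] + 0.2 * rng.normal(size=100)
y = X @ np.array([1.0, 1.0, 0.5]) + 0.1 * rng.normal(size=100)
beta, b0 = pcr_fit(X, y, n_components=2)   # discard the smallest PC
pred = X @ beta + b0
```

Discarding the low-variance PC (here, the ‘difference’ direction of the collinear pair) is precisely the shrinkage/regularization step that distinguishes PCR from direct OLS.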

PLS-R represents a technique related to both the PCA and PCR MV analysis strategies, and is focused on seeking maximum covariance hyperplanes between a quantitative response and an often large number of explanatory variables, which again are reduced in dimensionality in the form of potentially predictive orthogonal components. Indeed, it fits a linear regression model by projecting the predicted quantitative variable (y) and the observed variables (x_{i}) onto a new ‘space’. In view of this, PLS can be classified as a bilinear factor model. Partial least squares-discriminant analysis (PLS-DA) is a variant of PLS-R which is very commonly used in metabolomics investigations when the dependent response variable or score is qualitative rather than quantitative, for example positive or negative for a particular medical condition.

PLS-R and PLS-DA are utilised to determine fundamental relationships between the explanatory variable (x_{i}) and response variable (y) matrices, and serve as latent variable strategies for modelling the covariance structures within these two spaces. PLS models assist researchers by discovering the multidimensional direction in the x space that explains the maximal multidimensional variance direction in the y space. These techniques are particularly valuable when the matrix of potential explanatory x variables is larger in size than the total number of observations made (n), and particularly when there is multicollinearity (i.e., multi-correlation) amongst the x predictor variables, which is certainly not an unusual occurrence in metabolomics investigations. As we might expect, unless careful regularisation is applied, standard multiple regression (OLS) approaches completely fail to provide successful predictive models in such situations.
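The covariance-maximising direction that PLS seeks can be illustrated for a single component and a single response: in the NIPALS formulation, the first-component weight vector is simply the normalised covariance of the (centred) predictors with y. The data below are synthetic, with more variables than observations to mimic the regime described above:

```python
import numpy as np

def pls1_first_component(X, y):
    """First PLS component (NIPALS, single y): the unit weight vector w
    maximising the covariance between the scores t = Xc @ w and the
    centred response."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc                 # covariance of each predictor with y
    w /= np.linalg.norm(w)
    t = Xc @ w                    # component-1 scores
    return w, t

# Synthetic 'wide' data: 20 variables but only 10 observations, all
# driven by one latent factor -- a regime where plain OLS breaks down
rng = np.random.default_rng(3)
latent = rng.normal(size=10)
X = np.outer(latent, rng.normal(size=20)) + 0.3 * rng.normal(size=(10, 20))
y = latent + 0.1 * rng.normal(size=10)
w, t = pls1_first_component(X, y)
```

Even with twice as many predictors as samples, the single PLS score vector recovers the latent factor driving the response; further components are extracted from the deflated residual matrices in the full algorithm.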

A more recently developed MV analysis strategy is orthogonal projections to latent structures-discriminant analysis (OPLS-DA), which involves the resolution and separation of a continuous explanatory variable (x_{i}) dataset into two factors, one containing predictive information, the other containing uncorrelated, non-predictive information, when employed to determine the nature of a discrete variable such as a disease classification. This process gives rise to an enhanced level of diagnostics in metabolomics or biomarker research, along with a more easily understandable visualization system for these effects. However, this development only improves the interpretability, and not the predictivity, of PLS models. Like PLS-DA, this technique is also employed extensively in the diagnosis of human diseases from datasets encompassing the biomolecular compositions of biofluids and/or tissues, and also for their prognostic stratification.

#### Appendix B.1.4. ANOVA Simultaneous Component Analysis (ASCA)

The MANOVA analysis strategy is the generalized application of an ANOVA-based experimental design to the analysis of multiple-variable datasets. Although fully acceptable for investigations involving relatively small numbers of determined variables, such as the 3 VSC ones explored here, the MANOVA technique is not directly applicable to more complex metabolomics datasets, which may involve very large numbers of such variables, in view of complications arising from unfulfilled assumptions and covariance matrix singularities. Moreover, in PCA, individual PCs often fail to provide well-resolved information on the factors or effects involved in a univariate (UV) experimental design, or their interactions; indeed, the first 3 or so PCs isolated may not be successful in capturing effects arising from any of the factors featured in the experimental design.

ASCA is a technique which is again based on PCA, and which effectively partitions total variance and then enables interpretation of each of these partitioned variances by a simultaneous component analysis (SCA) strategy. Therefore, it serves as a MV augmentation of ANOVA, which splits variance into orthogonal model and independent (unassigned, residual) portions, the former including, for example, the possible effects ascribable to the age band and gender factors, and the age band × gender interaction contribution, investigated here. The technique involves (1) sequential decomposition of dataset variance in the context of the ANOVA-based experimental design involved; (2) application of PCA to the decomposed dataset; (3) application of corrections for data in unbalanced designs, where appropriate; and (4) selection of methods for testing the statistical significance of each effect investigated. ASCA protocols can readily cater for the experimental design structures of complex MV datasets, for example those arising from metabolomics investigations with perhaps >100 potential contributory variables. With more than one emergent, differential classification system or factor for consideration, ASCA estimates such effects so that they remain uncorrelated.
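Step (1) of the ASCA workflow, the ANOVA-style decomposition of the centred data matrix into a per-factor effect matrix plus a residual, can be sketched as follows for a single factor; the ‘age band’ levels, group sizes and injected effect below are purely illustrative assumptions:

```python
import numpy as np

def effect_matrix(X, factor):
    """ASCA decomposition step for one factor: after grand-mean centring,
    each row is replaced by its factor-level mean, yielding the effect
    matrix; what remains is the residual (unassigned) variation."""
    Xc = X - X.mean(axis=0)
    M = np.zeros_like(Xc)
    for level in np.unique(factor):
        idx = factor == level
        M[idx] = Xc[idx].mean(axis=0)
    return M

# Illustrative balanced design: 3 hypothetical 'age band' levels,
# 3 VSC-like variables, with an effect injected into level 2 only
rng = np.random.default_rng(4)
factor = np.repeat([0, 1, 2], 20)
X = rng.normal(size=(60, 3))
X[factor == 2] += np.array([1.0, 0.8, 0.0])
M = effect_matrix(X, factor)
residual = (X - X.mean(axis=0)) - M
# SCA step: PCA (via SVD) of the effect matrix summarises the factor
U, S, Vt = np.linalg.svd(M, full_matrices=False)
```

The effect matrix is exactly orthogonal to the residual, which is what makes the variance partition additive; PCA of each effect matrix then describes that factor’s contribution independently of the others.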

#### Appendix B.1.5. Agglomerative Hierarchical Clustering (AHC)

Hierarchical clustering is quite widely utilised in MV data mining and statistical analysis, and is a form of cluster analysis which constructs a cluster hierarchy. One strategy for hierarchical clustering is the agglomerative ‘bottom-up’ approach, whilst another is the divisive ‘top-down’ strategy. In AHC ‘bottom-up’ analysis, as employed in this study, observations are initially placed in their own distinct clusters, and cluster pairs are then cumulatively merged on traversing up the hierarchy. However, in divisive hierarchical clustering (DHC) analysis, all data points commence within the same large cluster, which then undergoes a successive series of splitting operations on moving down the hierarchy. For both types, clustering of either the samples analysed into disease or other classification groups, or the biomolecular explanatory variables themselves (as in this study for the VSCs), may be conducted.

For AHC, cluster combinations are determined by a measure of dissimilarity between sets of observations, and this is attained via the employment of a suitable metric, i.e., a measure of the distance between 2 observation points (the Euclidean distance for the analysis conducted in this study). Moreover, a linkage criterion, which defines the dissimilarity of sets as a function of the pairwise distances of the observation data points within them, is also required. Results arising from hierarchical clustering analyses are often presented as a dendrogram, such as that shown in Figure 8 of the current study.
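A minimal agglomerative (‘bottom-up’) clustering sketch with a Euclidean metric is given below; complete linkage is used here for brevity (the study’s own linkage choice is not assumed), and the four 1-dimensional points are illustrative:

```python
import numpy as np

def ahc(X):
    """Agglomerative clustering sketch: start with singleton clusters,
    then repeatedly merge the closest pair under complete linkage
    (maximum pairwise Euclidean distance), recording each merge."""
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete linkage: max pairwise Euclidean distance
                d = max(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        merges.append((merged, d))          # (members, merge height)
        del clusters[b]                     # b > a, so delete b first
        clusters[a] = merged
    return merges

# Four illustrative 1-D points forming two well-separated pairs
X = np.array([[0.0], [0.1], [5.0], [5.2]])
merges = ahc(X)
```

The recorded merge heights correspond to the vertical axis of a dendrogram: the two tight pairs merge at small heights, and the final merge of the two pairs occurs at a much greater one.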

#### Appendix B.1.6. Logistic Regression Analysis (logRA)

LogRA is a MV technique which was originally developed for dichotomous response variable outcomes, and is generally valuable for use in clinical trial and health science models involving categorically-defined dependent outcome variables, e.g., disease state (diseased versus healthy) and decision making (yes or no). In logRA, the logarithm of the odds of a positive outcome is modelled (where positive is represented by y = 1, and negative by y = 0), and an algebraic manipulation process then converts this into the probability of that outcome (in our use of this technique to categorise gender, scores of y = 0 and 1 were employed for females and males, respectively). However, multinomial logRA is applied for models with dependent variables featuring more than two categorical outcomes, as also employed in the current study for our attempted categorization of participant age bands from oral cavity VSC concentrations. As noted for other regression types, multinomial logRA may involve nominal and/or continuous independent variables, and relevant interactions between these explanatory variables may also be considered for improvements to the dependent variable prediction model constructed.
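The log-odds formulation described above can be sketched as a binary logistic regression fitted by simple gradient ascent on the log-likelihood; the single predictor and the generating coefficients below are synthetic assumptions, not values from this study:

```python
import numpy as np

def fit_logra(X, y, lr=0.1, n_iter=3000):
    """Binary logistic regression sketch: model the log-odds
    log(p/(1-p)) = b0 + X @ b, fitted by gradient ascent on the
    log-likelihood; the logistic function converts the log-odds
    back into an outcome probability."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend intercept
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))     # current probabilities
        beta += lr * Xb.T @ (y - p) / len(y)     # likelihood gradient step
    return beta

# Synthetic dichotomous outcome (e.g. a 0/1 'gender score') driven by
# one hypothetical predictor with assumed true coefficients (0.5, 2.0)
rng = np.random.default_rng(5)
x = rng.normal(size=(200, 1))
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x[:, 0])))
y = (rng.random(200) < p_true).astype(float)
beta = fit_logra(x, y)
prob = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x[:, 0])))
accuracy = ((prob > 0.5) == (y == 1)).mean()
```

In practice, statistical packages fit the same model by iteratively reweighted least squares and additionally report coefficient standard errors and CIs; the multinomial extension replaces the single log-odds with one per non-reference category.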

As with other MV techniques, although the introduction of more variables would be expected to generate a model with a better fit to the dataset, the use of an excessive number of these may improperly affect the estimated model coefficients, a process resulting in ‘overfitting’. For such binary outcome logRA models, one basic rule is that the number of the less common of the two possible outcomes divided by the number of explanatory independent variables should be at least 10. Obviously, the lower the number of events per variable, the less reliable are the regression coefficient estimates; the veracity of coefficient variances and CIs will also be unduly affected.

Interactions may be included as the product of two predictor variables, although their full consideration is usually determined by prior knowledge of the experiments and datasets involved. In this halitosis investigation, the logRA, logPCR and OLS models developed were all conducted with and without incorporation of the H_{2}S × CH_{3}SH, H_{2}S × (CH_{3})_{2}S and CH_{3}SH × (CH_{3})_{2}S first-order, two-factor interaction effects (although none of these were found to contribute towards the age/age band nor gender score/gender response outcomes, as also noted for these VSC variables when evaluated alone). For logRA models, multicollinearities of predictor variables, as observed for the strongly correlated CH_{3}SH and H_{2}S concentration variables involved, present major problems. However, in such cases, the logPCR method may be applied; as with PCR, this strategy applies the logRA strategy to an analysis of orthogonal components rather than the individual predictors themselves, in this case the first consisting of a linear combination of the correlated CH_{3}SH and H_{2}S, and the second of (CH_{3})_{2}S alone.