Abstract
Large datasets from omics studies need to be deeply investigated. The aim of this paper is to provide a new method (the LEM method) for the search of transcriptome-metabolome connections. The heuristic algorithm described here extends classical canonical correlation analysis (CCA) to a high number of variables (without regularization) and combines well-conditioning and fast computing in “R.” Reduced CCA models are summarized in PageRank matrices, whose product gives a stochastic matrix that summarizes the self-avoiding walk covered by the algorithm. A homogeneous Markov process applied to this stochastic matrix then makes the probabilities of interconnection between genes converge, providing a selection of disjoint subsets of genes. This is an alternative to regularized generalized CCA for the determination of blocks within the structure matrix. Each gene subset is thus linked to the whole metabolic or clinical dataset that represents the biological phenotype of interest. Moreover, this selection process meets the needs of biologists, who often require small sets of genes for further validation or extended phenotyping. The algorithm is shown to work efficiently on three published datasets, resulting in meaningfully broadened gene networks.
 canonical correlation analysis
 local models determination
 PageRank method
 Markov and non-Markov chains
 structure matrix
1 INTRODUCTION
With the increase in functional genomics studies aiming to identify key factors affected by particular physiological contexts or treatments, combining different omics techniques, such as transcriptomics and metabolomics, is becoming a popular way to better understand the biology of complex systems. The underlying purpose of these integrative approaches is to find links between heterogeneous datasets, to assess whether connecting them is a more powerful way to analyze such large quantities of data, and to give them a fine-tuned biological sense at a higher level of understanding than classical analyses allow.
Canonical correlation analysis (CCA) methods (13) have emerged as an efficient exploratory tool to correlate two datasets acquired on the same experimental unit. However, a key assumption in CCA is that the number of biological replicates must be higher than the number of variables to correlate in each set. This assumption is almost never satisfied in costly experiments. Furthermore, CCA has been extended to generalized canonical correlation analysis (GCCA; Refs. 4, 5), to regularized canonical correlation analysis (RCCA; Refs. 11, 18, 32), and to regularized generalized CCA (RGCCA; Ref. 29) to overcome the limitations of the method (replicate numbers, vector collinearity, more than two sets to correlate). Correlation analyses are particularly useful in nutrigenomics studies; RCCA, for example, was previously used to link metabolism and transcriptome in mouse (21) and bovine (31) studies.
However, none of these methods fully answers the mathematical and biological problems. For example, RGCCA suffers from convergence and matrix-conditioning issues because of an overly global approach, which makes the results difficult to interpret. Moreover, CCA and its extended methods isolate highly correlated variables. This approach would suit our purposes if some variables could be neglected in the two datasets. However, in nutrigenomics studies the analysis is based on a huge transcriptomic dataset and a more restricted metabolic dataset, in which parameters are chosen for their biological relevance in representing the final phenotype of interest. This implies that metabolic variables should not be individualized but taken together, since they describe a global physiological context that will be used to select a few genes explaining the observed phenotype.
The purpose of this paper is to provide a new heuristic method that gives the linear links between two sets of column vectors in which each row comes from measurements on a single individual. In our examples, the first dataset represents a set of genes; we hope that some of them can give a linear explanation of the behavior of the whole metabolic/clinical second set. The algorithm developed here is designed to accommodate the limited number of individuals through a multiple exploitation of reduced local CCA models, whereas the classical method would fail when applied to the entire set of genes.
The method consecutively selects restrictive subsets of vectors within the first dataset on the basis of their explicative levels, measuring the most efficient gene vectors to optimize models. By increasing a required significance level we force the process to explore the whole first dataset of gene vectors. The result is a number of nonredundant linear models. We show here that such reduced models can be summarized in PageRank matrices (23). We thus obtain a stochastic matrix, the product of the PageRank matrices, which is the result of the course of the algorithm.
We established as one of the main points of this paper that a Markov process applied to the previous stochastic matrix determines our gene selection method. Notably, this process is built from a self-avoiding walk, i.e., a non-Markov process. The sequence of powers of the Markov matrix gives local extended models (LEM) that take into account the persistence of some genes through several local models (connectivity). The principle of the gene selection method (or LEM method) is presented in Fig. 1.
In doing so, we also propose a consistent method for structure matrix determination, whereas GCCA and RGCCA are unable to choose a structural partition into sets of genes and the proper weight for each of these partitions. In fact, applying RGCCA, blockwise or not (29), to our problems involves estimating the coefficients of the linear combination of the whole first vector set, in the hope that some will be negligible because of values close to zero. RGCCA transforms our mathematical problem into a continuous optimization problem solved by Gauss-Seidel-like methods and does not take into account many interesting local models.
As a proof of principle, the current method has been applied to two nutrigenomics studies of similar complexities, one concerning the fertility of dairy cows (31) and the other the role of PPARα in the regulation of hepatic metabolism in the mouse (19). In addition, we extended the use of the method to a larger dataset dealing with the clinical effect of varying doses of acetaminophen on rats (3), evidencing in all cases reduced computing times and increased biological outputs. In this article we will 1) present the biological datasets used, 2) describe the mathematics supporting the algorithm, and 3) conclude on the biological advantages brought by this new method. We also provide an end-user “R” program (25).
2 BIOLOGICAL DATASETS
Datasets
The datasets analyzed with the gene selection algorithm are presented in Table 1. The two nutrigenomics datasets, referred to as NutriBov and NutriMous, were of similar complexities. They respectively described the effect of diets (two to five) on tissues (one to three) of dairy cows or genotyped mice (PPARα−/− vs. wild type), measuring blood metabolites or hepatic fatty acids while concomitantly evaluating gene expression changes in the dedicated tissues: oviduct, endometrium, corpus luteum (OVI, ENDO, CL) or liver (LIV). To extend these first datasets, in which a limited number of physiological measurements (6–21 metabolites) were correlated to hundreds of genes (120–293), we also tested a study (LiverTox) in which 10 clinical measurements were analyzed alongside many more genes (3,116). The NutriMous and LiverTox datasets are publicly available in the “mixOmics” package (15). For the NutriBov dataset see (31).
RCCA and nonregularized GCCA (sparse partial least squares) analysis
So far, these datasets have been analyzed by RCCA or sparse partial least squares (sPLS) (11, 12, 16, 31). In our case studies (see section 5), we also reanalyzed the data with RCCA or sPLS, evidencing gene/metabolite correlations for one half or one third of the physiological or clinical measurements (NutriBov, 3 out of 6 metabolites; NutriMous, 7/21; LiverTox, 7/14) and, at best, half of the interesting gene changes (NutriBov, 151/293; NutriMous, 27/120; LiverTox, 1,032/3,116). These results depended on the selected correlation thresholds. In the published reports these were as follows: NutriBov, r > 0.6; NutriMous, r > 0.5; and we chose r > 0.5 for LiverTox in our study.
RCCA and sPLS limitations
However, using those methods we observed several limitations. In practice, the results obtained by RCCA were highly sensitive to the quality of the determination of the regularization parameters. Those difficulties increased with the size of the datasets, to the point that convergence of the method was no longer ensured in a reasonable time on the LiverTox dataset (section 5.4). The sPLS method is derived from the concept of RGCCA (29) and inherits the convergence and matrix-conditioning issues mentioned earlier. To overcome those limitations we present a heuristic algorithm (the LEM method) that does not need regularization and computes quickly in the R software (25). This method also aims to maximize the number of metabolites/clinical measurements to correlate to gene expression, thus proposing a data-mining tool complementary to RCCA and sPLS and, biologically, new correlations and/or hypotheses.
3 GENE SELECTION ALGORITHM BASED ON CCA
3.1 Notations, Assumptions, and Objectives
Let us consider two matrices X_{1} (size n × r) and X_{2} (size n × q) respectively evaluating “gene expressions” and “metabolism patterns,” n being the number of individuals and r ≫ n > q. Thus, we have a huge number r of genes to test against q metabolic variables.
When the number of individuals is smaller than the number of variables (7) or in case of strong multicolinearity in X_{1} or X_{2}, nonregularized CCA sometimes fails. Then, the determination of shrinkage estimations and shrinkage constants is needed (17).
In section 3.3 we describe a heuristic method to extend CCA to r genes, as an alternative to regularization methods. We propose to test successive classical CCA models including p genes (p ≤ q < n ≪ r). The size of X_{1} is now n × p.
Let us consider the classical model of canonical analysis using the concatenated matrix [X_{1}, X_{2}].
The heuristic method allows the choice of a sufficiently small number p of variables for each performed CCA. Thus, we can reasonably expect X_{1} and X_{2} to be full-rank matrices [Rank(X_{1}) = p and Rank(X_{2}) = q]. We also assume that the columns of X_{1} and X_{2} are centered and normalized using a D-metric (here D = Id_{n×n}, the Euclidean metric).
We want to extract, among the (r choose p) potential CCA models, those giving a maximal explicative level (a notion defined hereafter).
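For orientation, the full-rank CCA performed on such a pair of blocks can be sketched in Python with NumPy (an illustrative sketch only; the authors' program is in R, and the function name here is ours). The canonical correlations are obtained as the singular values of Q_1^T Q_2, where Q_1 and Q_2 orthonormalize the centered blocks:

```python
import numpy as np

def cca(X1, X2):
    """Canonical correlations of two column-centered, full-rank blocks.

    Returns the canonical correlations (cosines of the angles between
    the canonical variates) together with the variates themselves.
    Assumes X1 is n x p and X2 is n x q with p <= q < n.
    """
    # Orthonormalize each block; QR plays the role of whitening.
    Q1, _ = np.linalg.qr(X1)
    Q2, _ = np.linalg.qr(X2)
    # The singular values of Q1.T @ Q2 are the canonical correlations.
    U, s, Vt = np.linalg.svd(Q1.T @ Q2)
    xi = Q1 @ U       # canonical variates for the first block
    eta = Q2 @ Vt.T   # canonical variates for the second block
    return s, xi, eta

# Toy data: n = 30 individuals, p = 4 genes, q = 6 metabolic variables.
rng = np.random.default_rng(0)
n, p, q = 30, 4, 6
X1 = rng.standard_normal((n, p))
X2 = rng.standard_normal((n, q))
X1 -= X1.mean(axis=0)
X2 -= X2.mean(axis=0)
rho, xi, eta = cca(X1, X2)
```

The eigenvalues discussed in the next section are the squares of these canonical correlations, λ_i = cos²(ξ_i, η_i).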
3.2 Explicative Level of a CCA Model, Definition
We know (27) that the number of nonzero eigenvalues in a canonical analysis is less than, or equal to, Min(p, q) = p (here p ≤ q). The multiplicity order of the eigenvalue 1 is Dim(W_{1} ∩ W_{2}), where W_{1} and W_{2} are the subspaces of ℝ^{n} spanned by the columns of X_{1} and X_{2}, respectively.
We thus obtain a relation between the eigenvalues associated with the initial variables (i.e., those to test) from the first set “gene expressions” (X_{1}) and those of the second set “metabolism patterns” (X_{2}).
The eigenvectors associated with an eigenvalue of 0 generate the D-orthogonal parts of W_{1} and W_{2}, so that the vectors figuring in one part or the other represent totally independent variables.
Canonical analysis gives the proximity to zero (Bartlett test) of the eigenvalues (27), a test that holds for multinormal samples (we will henceforth assume the multinormality of our samples).
Let us consider the whole set of eigenvalues λ_{1},…, λ_{k}, λ_{k+1},…, λ_{p} written with their multiplicity order, classified by decreasing values. We keep in mind:
If λ_{1},…, λ_{k} have a significance level ≤5% (or any acceptable value), those λ_{1},…, λ_{k} are judged significantly different from zero. Then, we can test λ_{k+1},…, λ_{p} for nullity.
Recall that the error of the first kind is P(reject H_{0} | H_{0} true) and that the significance level is the smallest value of the error of the first kind that we could have chosen while still rejecting H_{0}.
More explicitly, we have the quantity:
If the theoretical value of λ_{k+1},…, λ_{p} is zero, then for large enough n − 2k, the real random variable Z associated with this quantity is an estimate of the proximity to zero of λ_{k+1},…, λ_{p}, and it approximately follows a χ²_{(p−k)(q−k)} distribution.
If Z takes the value z_{k} then, assuming H_{0}, the probability of observing so large a value of Z is
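The displayed statistic is not reproduced above; for concreteness, one common textbook form of Bartlett's approximation can be sketched as follows (an assumption on our part; the exact correction factor used by the authors may differ):

```python
import math

def bartlett_statistic(lambdas, n, k, p, q):
    """Bartlett's approximate chi-square statistic for testing that the
    remaining canonical eigenvalues lambda_{k+1}, ..., lambda_p are zero.

    One common textbook form (the paper's exact correction factor is
    not reproduced in the text, so treat this as a sketch):
        Z = -(n - 1 - (p + q + 1)/2) * sum(log(1 - l)) over remaining l,
    approximately chi-square with (p - k)(q - k) degrees of freedom.
    """
    z = -(n - 1 - (p + q + 1) / 2.0) * sum(math.log(1.0 - l) for l in lambdas[k:])
    df = (p - k) * (q - k)
    return z, df

# Example: test the 2 smallest of p = 4 eigenvalues with n = 30, q = 6.
z, df = bartlett_statistic([0.9, 0.7, 0.2, 0.1], n=30, k=2, p=4, q=6)
```

A large value of z relative to the χ²_{df} distribution rejects the nullity of the remaining eigenvalues.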
Besides, if the columns x_{k} of X_{1} and y_{j} of X_{2} are D-normalized, then the canonical analysis gives pairs (ξ_{i}, η_{i}) of D-normalized eigenvectors, ξ_{i} and η_{i} being associated with the same eigenvalue λ_{i} = cos²(ξ_{i}, η_{i}). The canonical coefficient of correlation is
Now, we only consider pairs for which the significance level associated with the Bartlett test is ≤5% (or any acceptable value). The number of Bartlett-significant pairs is s, with s ≤ Min(p, q).
The canonical analysis then gives the coordinates of Pr(x_{k}) and Pr(y_{j}), where Pr represents the D-orthogonal projection from ℝ^{n} onto the vector space generated by the (ξ_{i}), i ∈ 〚1, s〛 or, alternatively, by the (η_{i}), i ∈ 〚1, s〛.
Then, the canonical analysis coefficients are calculated:
Since, by hypothesis, the vectors x_{k} and y_{j} are D-normalized, α_{k}^{i} = cos(x_{k}, ξ_{i}) represents the coefficient of correlation between the initial variable x_{k} and the canonical variable ξ_{i}, (k, i) ∈ 〚1, p〛 × 〚1, s〛. Similarly, β_{j}^{i} = cos(y_{j}, ξ_{i}) represents the coefficient of correlation between the initial variable y_{j} and the canonical variable ξ_{i}, (j, i) ∈ 〚1, q〛 × 〚1, s〛. The same applies to the alternative choice.
For a pair of chosen values (ξ_{i}, η_{i}), if cos(ξ_{i}, η_{i}) is significantly different from zero, we will consider that the initial variable x_{k} in the first group X_{1} is related to the initial variable y_{j} in the second group X_{2} when α_{k}^{i} ≥ 0.25 (or any chosen level in [0, 1]).
The initial variable x_{k} is then a vertex of degree deg_{k}^{i} (the number of initial variables y_{j} related to x_{k} through the pair i) of a multifunction bipartite graph. We thus obtain a directed bipartite graph with respect to the pairs of values (ξ_{i}, η_{i}), presented in Fig. 2.
Then, we assign to x_{k} its degrees deg_{k}^{i}, and we call the explicative level of x_{k} in a CCA model the integer N_{k} = Σ_{i=1}^{s} deg_{k}^{i}.
The integer N = Σ_{k=1}^{p} N_{k} is called the explicative level of the CCA model itself.
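Under the thresholding just described, the explicative levels can be computed from the correlation tables α and β of one CCA model. The sketch below assumes the same 0.25 threshold also applies on the metabolite side, which the truncated passage suggests but does not state explicitly; names are ours:

```python
import numpy as np

def explicative_levels(alpha, beta, level=0.25):
    """Explicative levels N_k of the genes in one CCA model (a sketch).

    alpha[k, i] is the correlation of gene variable x_k with the
    canonical variate xi_i; beta[j, i] is the correlation of
    metabolite y_j with xi_i. Only Bartlett-significant pairs i are
    assumed to be passed in. Gene x_k is linked to metabolite y_j
    through pair i when both correlations reach the chosen level
    (the 0.25 of the text); deg_k^i counts the metabolites reached.
    """
    p, s = alpha.shape
    N_k = np.zeros(p, dtype=int)
    for i in range(s):
        reached = int(np.sum(beta[:, i] >= level))  # metabolites linked via pair i
        for k in range(p):
            if alpha[k, i] >= level:
                N_k[k] += reached                   # deg_k^i
    return N_k, int(N_k.sum())                      # gene levels and model level N

# Toy example: p = 2 genes, q = 3 metabolites, s = 2 significant pairs.
alpha = np.array([[0.5, 0.1],
                  [0.2, 0.6]])
beta = np.array([[0.3, 0.4],
                 [0.1, 0.5],
                 [0.6, 0.05]])
N_k, N = explicative_levels(alpha, beta)
```

Here each gene reaches two metabolites through its significant pair, so N_k = [2, 2] and the model level is N = 4.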
3.3 Heuristic Strategy to Optimize Local CCA Models
The integers N_{k}, k ∈ 〚1, p〛, as well as the integer N allow, for a fixed p, a strategy for choosing the most relevant CCA models among the (r choose p) possible models. Our strategy consists in choosing the models that maximize the N_{k} and, in fine, N. Nevertheless, calculating the integers N_{k} and N for each of the (r choose p) possible models would be extremely time-consuming and would lead to too many results to analyze in the end.
One major originality of our work is that we consequently adopt a heuristic strategy to optimize the search for genes highly correlated with metabolism. Genes enter or exit a screen of size p used to generate different local CCA models until the whole set of r genes has been tested. The explicative levels of the genes are used to characterize those that should be retained in, or excluded from, the models.
This strategy is formulated as follows:
1) We rank the list of genes by increasing index number: (x_{1},…, x_{r}).
2) We choose the first p genes (x_{1},…,x_{p}) and we compute the list of pairs [(x_{1}, N_{1}),…,(x_{p}, N_{p})].
3) We then rank these pairs by decreasing explicative levels (maybe with some duplicated levels).
Through a reorganization of the indexes, N_{1} ≥ … ≥ N_{k} ≥ … ≥ N_{p} will be assumed. Then, it is also assumed that we are given a certain integer threshold level N_{0}.
4) We keep only (x_{1},…, x_{k0}) in the list of genes, where k_{0} is the greatest integer such that N_{k0} ≥ N_{0}. The next unexplored genes are then introduced to fill the list. We thus obtain a p-term list (with at most p incoming terms), and we reiterate the process by recomputing the list of pairs.
However, if we choose N_{0} = 0, the process does not progress and we just retain the very first model given by the first p genes. To increase the explicative level of the models, we propose to increment N_{0} from 0 upward until all x_{k}, k ∈ 〚1, r〛, included in the initial list have been submitted to canonical analysis.
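Steps 1)–4), together with the incrementation of N_0, can be sketched as a screening loop. The helper `levels_fn` below is a hypothetical stand-in for the local-CCA computation of the N_k:

```python
def screen_genes(levels_fn, r, p):
    """Sketch of the heuristic screen of section 3.3.

    levels_fn(window) returns the explicative levels N_k of the genes
    currently on the screen (in practice from a local CCA); here it is
    a stand-in. Genes whose level falls below the running threshold N0
    leave the window and are replaced by the next unexplored genes;
    N0 is incremented whenever no gene leaves, forcing progression.
    """
    window = list(range(p))   # indexes of the p genes on screen
    next_gene = p             # first not-yet-explored gene
    N0 = 0
    while next_gene < r:
        levels = levels_fn(window)
        kept = [g for g, N in zip(window, levels) if N >= N0]
        excluded = len(window) - len(kept)
        if excluded == 0:
            N0 += 1           # no progression: raise the bar
            continue
        incoming = list(range(next_gene, min(next_gene + excluded, r)))
        next_gene += len(incoming)
        window = kept + incoming
    return window, N0

# Toy run: r = 10 genes, screen of p = 4, levels given by index mod 3.
window, N0 = screen_genes(lambda w: [g % 3 for g in w], r=10, p=4)
```

With this toy levels function, the loop terminates once every gene has entered the screen, returning the final window and the threshold level reached.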
4 MATHEMATICAL STUDY OF THE ALGORITHM
4.1 Construction of PageRank Matrices
Let us consider a directed bipartite graph made of the set Z of r + q vertices:
The set E consisting of all arcs of the graph is built as follows:
After a CCA on p genes selected among r according to the rules of the algorithm outlined above, each gene x_{k}, k ∈ 〚1, r〛 is associated with a set of metabolites
Then, we build a PageRank type stochastic square matrix G(g_{kj}) with r rows and r columns (2, 23):
We choose ρ ∈ [0, 1[. A reminder:
Note that with this notation, if x_{k} is not among the p vectors tested in the CCA, then necessarily N_{k} = 0 for that k ∈ 〚1, r〛.
Then, let X_{m} denote the real random variable corresponding to the index of a gene in the model at step m: X_{m} = z_{m}, where z_{m} ∈ 〚1, r〛.
We set:
For k ∈ 〚1, r〛, index of a tested gene,
If k is not the index of a tested gene,
The probability given to gene j at step m + 1, knowing the probability given to gene k at step m, depends only on the ratio coming from the CCA at step m. This ratio is the explicative level of gene j divided by the explicative level of the overall model, to which gene k, but also the other tested genes, potentially contribute. This probability is not in general invariant under a transposition of the indexes j and k.
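Because the displayed formula for g_{kj} is not reproduced above, the sketch below encodes one plausible reading consistent with the surrounding remarks: rows of tested genes mix a uniform damping term with the ratio N_j/N (so a coefficient proportional to (1 − ρ)/N appears, as in section 4.1), and untested genes keep their mass. The exact convention in the paper may differ:

```python
import numpy as np

def pagerank_matrix(N_levels, tested, r, rho=0.15):
    """One plausible reading of the PageRank-type matrix G (a sketch).

    N_levels[j] is the explicative level of gene j (0 for untested
    genes, as noted in the text). For a tested gene k, the row mixes
    a uniform term over the tested block with (1 - rho) * N_j / N;
    untested genes get identity rows, keeping their probability mass.
    """
    N = sum(N_levels[j] for j in tested)
    p = len(tested)
    G = np.eye(r)
    for k in tested:
        G[k, :] = 0.0
        for j in tested:
            G[k, j] = rho / p + (1.0 - rho) * N_levels[j] / N
    return G

# Toy example: r = 5 genes, the first 3 tested, with levels 3, 1, 2.
r, tested = 5, [0, 1, 2]
N_levels = [3, 1, 2, 0, 0]
G = pagerank_matrix(N_levels, tested, r)
```

Each tested row sums to ρ + (1 − ρ) = 1, so G is stochastic by construction, matching the argument of section 4.1.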
G is stochastic because:
1) G is clearly nonnegative.
2) For k ∈ 〚1, r〛, index of a tested gene,
If k is not a tested gene, it is clear that
Now, we describe the matrices and tests used by the algorithm:
If, as suggested for the initialization step, we choose a priori the p first genes (x_{1},…, x_{p}) associated with the row probability vector V^{0} of coordinates v_{k}^{0} = P(X_{0} = k), where ∀k ∈ 〚p + 1, r〛, v_{k}^{0} = 0, we obtain for the first iteration:
1) G_{1}(g_{kj}^{1}) stochastic matrix r × r.
2) V^{1}(v_{k}^{1}) row vector with V^{1} = V^{0}G_{1}.
Thus, V^{1} is a probability vector, where:
The blockwise matrix G_{1} and the vectors V^{0} and V^{1} are represented in Fig. 3. The square blocks on the main diagonal of G_{1} have, from left to right, respective dimensions p and r − p.
The coefficient (1 − ρ_{1})/N_{1} being strictly positive, if N_{0}^{1} is a fixed strictly positive threshold level for this step, the equivalence
Let us consider for this step the matrix G_{1}^{0} obtained by replacing in G_{1} every N_{j}^{1} by N_{0}^{1}, where j describes the set of indexes of the p tested genes in the CCA. Then, the exclusion test for genes collects those l_{1} indexes that break the constraint V^{0}G_{1} ≥ V^{0}G_{1}^{0} (term to term inequality for row vectors) because for j index of the p tested genes, if (V)_{j} is the jth coordinate of the row vector V,
Now we suppose 0 < l_{1} ≤ p and p + l_{1} ≤ r. The next step is a CCA from which these l_{1} genes are removed; we then take the l_{1} following ones so that p genes are still tested. We choose ρ_{2} ∈ [0, 1[. The CCA is used to build a blockwise stochastic matrix G_{2}(g_{kj}^{2}) where, to simplify the notation without loss of generality, it is assumed here
We obtain the row vector V^{2}(v_{k}^{2}) using V^{2} = V^{1}G_{2}. Thus, V^{2} is a probability vector where
The blockwise matrix G_{2} and the vector V^{2} are represented in Fig. 4. The square blocks on the main diagonal of G_{2} have, from left to right, respective dimensions p − l_{1}, l_{1}, l_{1}, r − (p + l_{1}).
Let us consider for this step the matrix G_{2}^{0} obtained by replacing in G_{2} every N_{j}^{2} by N_{0}^{2} (fixed threshold level with N_{0}^{2} ≥ N_{0}^{1}), where j describes the set of indexes of the p tested genes in the CCA. Then, the gene exclusion test consists in collecting those l_{2} indexes, which break the constraint V^{1}G_{2} ≥ V^{1}G_{2}^{0} because for j index of the p tested genes,
We assume further that 0 < l_{2} ≤ p (the model progresses) and p + ∑_{i=1}^{2} l_{i} ≤ r (there remain genes to be tested).
4.2 Algorithm With the Previous Notations, in a Pseudolanguage
The algorithm as described in the previous section can be associated with a stochastic matrices sequence (G_{n}) of PageRank type according to the following iterative method:
Initialization.
Choice of V^{0}, a probability row vector such that ∀k ∈ 〚p + 1, r〛, v_{k}^{0} = 0, and J_{0} = 〚1, p〛.
Step 1:
1) Choice of ρ_{1} ∈ [0, 1[.
2) CCA for {x_{i}}_{i∈J0} and computation of G_{1}.
3) Computation of V^{1} = V^{0}G_{1}. ∀j ∈ 〚1, r〛, (V^{1})_{j} = P(X_{1} = j).
4) Choice of the threshold level N_{0}^{1}.
5) Exclusion test (V^{0}G_{1})_{j} < (V^{0}G_{1}^{0})_{j} for j describing J_{0}.
6) Computation of l_{1} (number of excluded genes).
○ If l_{1} = 0, increment N_{0}^{1} by 1 and return to step 5).
○ Else « stop testing »
● If p + l_{1} ≤ r, computation of J_{1}.
● Else End.
J_{1} is the set of the p incoming genes into the CCA for the next step. J_{1} is the union of the set of indexes corresponding to the p − l_{1} preserved genes belonging to J_{0} and the set 〚p + 1, p + l_{1}〛 of indexes corresponding to the l_{1} incoming genes.
Step n (n ≥ 2).
1) Choice of ρ_{n} ∈ [0, 1[.
2) CCA for {x_{i}}_{i∈Jn−1} and computation of G_{n}.
3) Computation of V^{n} = V^{n−1}G_{n} = V^{0}G_{1}…G_{n}. ∀j ∈ 〚1, r〛, (V^{n})_{j} = P(X_{n} = j).
4) Choice of the threshold level N_{0}^{n} ≥ N_{0}^{n−1}.
5) Exclusion test (V^{n−1}G_{n})_{j} < (V^{n−1}G_{n}^{0})_{j} for j describing J_{n−1}.
6) Computation of l_{n} (number of excluded genes).
○ If l_{n} = 0, increment N_{0}^{n} by 1 and return to step 5).
○ Else « stop testing »
● If p + ∑_{i=1}^{n} l_{i} ≤ r, computation of J_{n}.
● Else End.
J_{n} is the set of indexes of the p incoming genes into the CCA for the next step. J_{n} is the union of the set of indexes corresponding to the p − l_{n} preserved genes belonging to J_{n−1} and the set 〚p + ∑_{i=1}^{n−1} l_{i} + 1, p + ∑_{i=1}^{n} l_{i}〛 of indexes corresponding to the l_{n} incoming genes.
This algorithm is programmed in the R language and is available in Supplementary Data 1.^{1} As in our R program, after the last step one can force the introduction of any remaining genes (strictly fewer than p), which adds one more step.
4.3 Theoretical Results From the Method
For this section 4.3 only, we will retain the notation N to denote the number of steps processed by the algorithm.
4.3.1 Finiteness of the algorithm.
This algorithm ends after N steps with clearly N ≤ r − p + 1.
4.3.2 Stability result.
The stability of high explicative levels of persistent genes through several connected CCA models is ensured by a theorem we have proved (available on request).
4.3.3 Lemma.
As in section 4.1, we consider
Under the constraints N_{0}^{1} = 1 and for every n ≥ 1,
Proof.
First of all, it is not obvious that ψ is a map, because the determination of the stochastic matrices G_{1},…, G_{N} also depends on the gene exclusion method. As a result, this determination depends on the calculation of the explicative levels, which must be the same for any re-execution of the algorithm with the same initial vectors and constraints on the N_{0}^{n} sequence. For this purpose, it is sufficient that the CCA gives the same eigenvectors for the eigensubspaces associated with Bartlett-significant multiple eigenvalues. Among various iterative methods, one is chosen for the determination of eigenvectors by CCA. Such a method generally begins with randomly generated initial vectors (10, 24). For any re-execution of a CCA, the choice of the initial vectors must be the same. Then the number of steps and the exclusion of genes at each step are the same for any ρ_{1}, ρ_{2},… sequence. So, clearly, the definition set of the map ψ is [0, 1[^{N}.
Moreover, if
4.3.4 Selfavoiding walk.
The stochastic matrices (G_{n}) are the result of a nonMarkov process. In fact, the equality
4.3.5 PageRank matrices properties.
Each matrix G_{n}, n ∈ 〚1, N〛, is stochastic by construction and possibly reducible because, by a permutation of indexes, we can usually put it in the block form (A 0; B C), with A and C square matrices of nonzero dimension. Thus, if G_{n} is reducible, G_{n} is nonergodic. Then G = G_{1}…G_{N} is still stochastic, possibly reducible, and hence nonergodic in this case.
4.3.6 Markov chains, local extended models, and structure matrix.
Now we use the stochastic matrix G = G_{1}…G_{N}. If G is considered as the matrix associated with a transition kernel ν(.,.) that is constant at every step, we thus define a homogeneous Markov chain. It was shown in section 4.3.5 that G is generally reducible, therefore nonergodic. If we restrict the kernel to the set Ess of essential points, we obtain a partition of Ess into equivalence classes C_{1},…, C_{t} for the relation R, called the “communication relation.”
For (p,q) ∈ 〚1, r〛 × 〚1, r〛, R is defined by
We call an LEM the set of genes whose indexes belong to the same communication class. According to the Kolmogorov-Doeblin theorem (28), for a given initial distribution Π_{0} on 〚1, r〛, the set of gene indexes, each trajectory of the Markov chain reaches an essential point (because our kernel ν has only a finite number of states). The trajectory then stays in a communication class of this point. The transition kernels obtained by restricting the kernel to the classes C_{1},…, C_{t} are always irreducible. ν^{m} formally represents the iterated kernel with associated matrix G^{m}. If, for i ∈ 〚1, t〛, ν_{c_i} is periodic with period d_{i}, the relation S defined for (p, q) ∈ c_{i} × c_{i} by
Finally, for each i ∈ 〚1, t〛 we obtain an LEM composed of the set of genes whose indexes appear in c_{i}.
ν_{ci} is an irreducible kernel whose matrix thus admits an invariant probability vector Π_{i} giving the influence of each gene in this LEM. If c_{i} is aperiodic, Π_{i} is the limit distribution. The set of LEMs determines blocks of related genes that define a structure matrix, whereas other methods (GCCA, RGCCA) have difficulty providing a rationale for its determination.
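The determination of LEMs from G amounts to extracting the communication classes of a finite Markov chain and keeping the closed (essential) ones. A minimal sketch, with illustrative function names:

```python
import numpy as np

def communication_classes(G, tol=1e-12):
    """Communication classes of a finite stochastic matrix (a sketch).

    Two states communicate when each is reachable from the other;
    reachability is the transitive closure of the positive-entry
    pattern of G (Warshall's algorithm below).
    """
    r = G.shape[0]
    reach = (G > tol) | np.eye(r, dtype=bool)
    for m in range(r):  # transitive closure through intermediate state m
        reach |= reach[:, m:m + 1] & reach[m:m + 1, :]
    comm = reach & reach.T
    classes, seen = [], set()
    for i in range(r):
        if i not in seen:
            cls = sorted(np.flatnonzero(comm[i]).tolist())
            classes.append(cls)
            seen.update(cls)
    return classes

def essential_classes(G, tol=1e-12):
    """Classes that are closed (no arc leaves them): the LEM candidates."""
    out = []
    for cls in communication_classes(G, tol):
        targets = np.flatnonzero((G[cls, :] > tol).any(axis=0))
        if set(targets.tolist()) <= set(cls):
            out.append(cls)
    return out

# Toy chain: state 0 is transient; states 1 and 2 are absorbing.
G = np.array([[0.5, 0.25, 0.25],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
```

On this toy chain the transient state forms its own non-essential class, while each absorbing state yields an essential class, i.e., an LEM candidate.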
5 NUMERICAL RESULTS FROM CASE STUDIES
The algorithm was applied to the three datasets (NutriBov, NutriMous, and LiverTox). One is particularly detailed here: the NutriBov study.
5.1 Computing the G matrix and Gene Selection in the NutriBov Study: the Mathematical Interpretation
The algorithm was run on the NutriBov study with CCA models including six genes (p) and six metabolites (q) at each iteration step. We obtained a diagonal blockwise matrix G presenting 27 square blocks (submatrices), each defining an LEM as described in section 4.3.6.
In our example, it was sufficient to compute G^{10} for each column of each submatrix to converge to a nearly constant column vector (deviation from the mean value less than or equal to 10^{−3}). Each row of a submatrix converges exponentially to an invariant probability vector Π_{∞}. If p′ is the number of genes included in an LEM, ρ′ = 1/p′ will be called the threshold of PageRank noise.
Then, a gene is selected in an LEM if it presents a corresponding column mean P verifying P > ρ′. Here, the submatrices of G are ergodic because the corresponding submatrices of G^{10} have all their terms strictly greater than zero. If these submatrices are irreducible but possibly nonergodic, the previous selection criterion uses a Cesàro average convergence (term by term for a sequence of matrices), here applied to the sequence of powers of the matrix G (9).
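The selection criterion of this section can be sketched for a single LEM: take powers of its stochastic submatrix until the columns are nearly constant, then keep the genes whose column mean exceeds the PageRank-noise threshold ρ′ = 1/p′ (function names are ours):

```python
import numpy as np

def select_genes(G_block, n_powers=10):
    """Gene selection inside one LEM (sketch of section 5.1's criterion).

    G_block is the stochastic submatrix of one LEM. Its powers converge
    row-wise to the invariant probability vector; a gene is kept when
    its column mean in G_block**n_powers exceeds the PageRank-noise
    threshold rho' = 1/p', p' being the number of genes in the LEM.
    """
    p_prime = G_block.shape[0]
    M = np.linalg.matrix_power(G_block, n_powers)
    col_means = M.mean(axis=0)   # close to the invariant vector Pi
    noise = 1.0 / p_prime        # rho'
    return [j for j in range(p_prime) if col_means[j] > noise]

# Toy 2-gene LEM with invariant vector (1/3, 2/3): only gene 1
# clears the noise threshold rho' = 1/2.
G_block = np.array([[0.6, 0.4],
                    [0.2, 0.8]])
selected = select_genes(G_block)
```

Ten powers suffice here because the second eigenvalue of the toy submatrix is 0.4, so the rows are constant to within about 10^{−4}, mirroring the convergence check described above.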
We avoided local models presenting only six genes, i.e., models in which all the genes were excluded at the next step of the iteration process. A total of 93 genes out of 293 were selected in G^{10} within the 27 LEMs.
5.2 Algorithm Highlights Genes of Interest, Example With One LEM: from Mathematics to Biology
Among the 27 LEMs computed by the LEM method on the NutriBov study, we will comment in this section on the model including the highest number of genes.
Twenty-seven genes were associated in this single interesting LEM in the matrix G. We selected 11 genes when computing G^{10} because they were involved with probabilities P > 1/27 (≈0.037). The probabilities of gene association for this model are then summarized in a weighted Markovian graph (see Fig. 5) restricted to the selected genes. Note that this kind of graph could be computed for any other of the 27 LEMs to highlight the process of gene selection.
The functions highlighted in the genes selected by the LEM method were highly coherent in terms of the biological processes that gathered them: mainly cytoskeleton coherence, dynamics, and organization, and immunity. Moreover, the LEM method highlighted PLAC8 (placenta-specific 8) in the oviduct, a gene that regulates embryo-maternal interactions. Higher expression of PLAC8 in bovine blastocysts (day-7 postfertilization embryos, previously at the interface with the oviduct) was reported to be a good marker of pregnancy success (8). This result is of biological interest and is specific to the use of the LEM method, as it did not appear in previous analyses by RCCA (31).
5.3 Biological Results on the Other Studies: NutriMous and LiverTox
Applied to the NutriMous study, the algorithm highlighted correlations between all hepatic fatty acids and 29 genes (out of 120). Applied to the LiverTox dataset, the algorithm highlighted correlations between all clinical measurements and 1,015 genes (out of 3,116). As for the NutriBov study, the algorithm identified genes (correlated to physiological or clinical measurements) that were also identified by RCCA or sPLS analyses (Table 2). Moreover, the LEM method also identified new genes that were “algorithm-specific.” Reassuringly, the cellular distribution of all these genes (shared or specific) was fairly similar whatever the method (RCCA, sPLS, or the LEM method), suggesting that these computations highlighted different gene correlations inside common pathways. To validate this, we examined the LiverTox dataset using the Ingenuity Pathway Analysis software (Ingenuity Systems, http://www.ingenuity.com) and confirmed that genes specifically or commonly identified with sPLS or with the algorithm (LEM method) 1) belonged to similar gene networks such as “cell morphology” (see Fig. 6) and 2) helped draw extended gene networks (Fig. 7 and Fig. 8).
5.4 Global Overview of LEM Method Results Compared With RCCA or sPLS: Advantages and Limits
Benchmarking and time computations.
We performed a benchmark of the three methods on the different datasets using an Intel Core i7-2760QM central processing unit (CPU, 2.4 GHz, 4 cores, 8 logical processors), with R version 2.15.3 (2013-03-04, R Foundation for Statistical Computing), to assess their respective performances during realistic simulations on real biological datasets (Table 3). We principally focused on “user time” and “system time”: the user time is the CPU time charged for the execution of user instructions of the calling process, and the system time is the CPU time charged for execution by the system on behalf of the calling process. We observed a huge difference between regularized methods such as RCCA and nonregularized ones (e.g., the LEM method and sPLS). Indeed, both user time and system time were particularly low with LEM or sPLS on moderate-size datasets such as NutriBov or NutriMous (at most 4.626 s), whereas RCCA needed 26.06 or 23.12 min to complete the process.
Interestingly, the largest dataset, LiverTox, made the computation times diverge. The LEM method took only 2.16 h to complete the analysis on the 3,116 genes × 10 clinical measurements × 64 individuals. Comparatively, the search for regularization parameters for RCCA never converged under our conditions on those data, and we estimated the user time at ≫11.57 days. Finally, sPLS took only 1.549 s to estimate the correlations, supporting the fact that sPLS is a direct calculation method (converging to a classical CCA) but is still unable to estimate the coefficients to assign to the structure matrix.
Degree of gene selection.
The LEM method was quite selective with the recommended selection parameters (section 3.2) in the NutriBov study, as it highlighted 93 nonredundant genes (out of 293) vs. 151 genes in the classical RCCA model (r > 0.6). As the algorithm was built to provide a selection within the transcriptomic dataset, this exploitation of multiple local models is expected to yield a restricted set of genes highly related to the whole metabolic context. These results meet the needs of biologists who must find small sets of genes that can be further validated (RT-qPCR) or used as markers in candidate-gene approaches. The levels of gene selection were more similar on the LiverTox and NutriMous datasets (respectively, 1,015 with the LEM method vs. 1,032 with sPLS, and 29 with the LEM method vs. 27 with RCCA).
Mathematical advantages and limits.
The LEM method does not need a regularization procedure for high-dimensional datasets, aims to maximize the correlations between the genes and the metabolic/clinical measurements, proposes a method for the determination of structure matrices, and computes quickly in R (25). Moreover, the stability of the method was mathematically demonstrated. However, this method remains a heuristic, i.e., an exploratory method, although it proposes new concepts for the search of correlations between two biological datasets (see section 4.3). On the other hand, this method does not indicate which metabolites/clinical measurements contribute most to the correlations with the genes for each CCA model.
Biological advantages and limits.
As a result of their mathematical differences, these methods do not reveal exactly the same genes or networks as being correlated to metabolites. These methods can be used independently or complementarily. Combining complementary statistical tools for biological analysis is a suitable methodology to shed light on more aspects of gene regulation than any single tool would, as proposed in section 5.3.
Figure 9 summarizes the advantages and limits of the LEM method.
CONCLUSIONS
Here, we built a heuristic CCA-based algorithm able to perform a gene selection within transcriptomic datasets. Genes are selected through the metabolic parameters used to define a phenotype of interest. For each CCA model across the iteration process, we computed the explicative level of each gene: a parameter assessing the degree of correlation between a gene and the whole set of metabolic parameters. We optimized the search strategy for local models by retaining genes on the basis of their good explicative levels. This method is equivalent to a self-avoiding walk process that gives a block-diagonal stochastic matrix G representing the association probability between genes. We considered this association between genes to be fixed, yielding a homogeneous Markov process that establishes a partition of related genes, gathered in LEM. G was then raised to a certain power to converge the probabilities of selecting genes, eliminating nonsignificant or nonessential genes.
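The convergence step can be illustrated on a toy example. The sketch below (plain Python, with a hypothetical 4-gene matrix, not data from the study) raises a block-diagonal stochastic matrix G to a power: cross-block probabilities remain exactly zero, so the blocks (the LEM gene subsets) are preserved, while within-block probabilities converge to their stationary values.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(g, p):
    """Raise the square matrix g to the integer power p >= 1."""
    result = g
    for _ in range(p - 1):
        result = mat_mul(result, g)
    return result

# Hypothetical 4-gene stochastic matrix with two disjoint blocks:
# {gene1, gene2} and {gene3, gene4}.
G = [[0.5, 0.5, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.9, 0.1],
     [0.0, 0.0, 0.1, 0.9]]

Gk = mat_pow(G, 50)
# Rows remain stochastic, cross-block entries stay at zero, and
# within-block entries approach the uniform stationary value (0.5).
```

Under the homogeneous Markov assumption, reading off the nonzero pattern of the converged matrix recovers the partition of genes into disjoint subsets.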
The algorithm was simulated on three datasets of different complexities. The comparison with the established models (RCCA, sPLS) revealed many common genes, which makes us confident in our approach. Moreover, the LEM method helped in the search for new and/or complementary gene networks and was very competitive on high-dimensional datasets.
Finally, the metabolites that appear in the selected datasets correspond to blood metabolites, including hormones, or to hepatic fatty acids. However, they could virtually be of any origin, since the LEM method makes no assumption about the biological variables that are analyzed concomitantly with gene expression changes. The LEM method could thus apply to any measurable variable, in animals, plants, or bacteria, in in vivo or in vitro biological systems, at the cellular or subcellular level (mitochondria, for example), even at the single-cell level (22). On the basis of recent studies, when these "nongenic" variables happen to be metabolites, they could be those of interest for animal or plant physiology (6, 30), disease (14), or development, but also animal or plant stem cell fate (20, 26, 33).
GRANTS
This work received financial support from the National Research Agency of France and APISGENE (GENANIMAL Program). Damien Valour was a PhD student with a DGER-INRA grant, followed by a one-year grant from the Reproseq project (INRA/UNCEIA).
DISCLOSURES
No conflicts of interest, financial or otherwise, are declared by the author(s).
AUTHOR CONTRIBUTIONS
Author contributions: D.V., B.G., and B.V. conception and design of research; D.V. performed experiments; D.V., I.H., B.G., and B.V. analyzed data; D.V., I.H., B.G., and B.V. interpreted results of experiments; D.V., I.H., and B.V. prepared figures; D.V., I.H., B.G., and B.V. drafted manuscript; D.V., I.H., B.G., and B.V. edited and revised manuscript; D.V., I.H., B.G., and B.V. approved final version of manuscript.
ACKNOWLEDGMENTS
The authors are glad to thank Kim-Anh Lê Cao, Ignacio González, Sébastien Déjean, the Editor, and two referees for constructive comments that contributed to improving the quality of the paper. The authors also thank Sébastien Déjean for contributions to the programming.
Footnotes

1 The online version of this article has supplemental material.
 Copyright © 2013 the American Physiological Society