**Website of the project with examples and
tutorials:**

https://malucalle.github.io/coda4microbiome/

Understanding the role of the microbiome in human health and how it
can be modulated is becoming increasingly relevant for preventive
medicine and for the medical management of chronic diseases (Calle
2019). High-throughput sequencing technologies has boosted microbiome
research but the **compositional** nature of microbiome
data is a major challenge for their analysis.

Microbiome count data is **compositional** since their
total are constrained by the sequencing depth. Relative abundances
(proportions) are obviously constraint by a sum equal to one. This total
constraint induces strong dependencies among the observed abundances of
the different taxa. In fact, nor the absolute abundance (read counts)
nor the relative abundance (proportion) of one taxon alone are
informative of the real abundance of the taxon in the environment.
Instead, they provide information on the relative measure of abundance
when compared to the abundance of other taxa in the same sample.

We introduce a new package, **coda4microbiome**, that
aims to bridge the gap between microbiome research and compositional
data analysis (CoDA).

Our package provides a set of functions to explore and study
microbiome data within the CoDA framework, with a special focus on
identification of **microbial signatures** that can serve
as biomarkers of disease risk and prognostic. Their prediction accuracy
relies on the selection of the taxa that constitute the signature, which
is challenging given the sparsity, multivariate and
**compositional** inherent characteristics of microbiome
data (Susin et al. 2020).

**coda4microbiome** performs variable selection through
penalized regression in **cross-sectional studies**, with
both binary and continuous outcome. In addition, the package
incorporates a new approach for the analysis of **longitudinal
microbiome studies** with a binary outcome.

Penalized regression implementation relies on the function
`cv.glmnet()`

from the R package **glmnet**
(Friedman et al. 2010) adapted to CoDA by using all pairwise log-ratios
of the variables (Bates and Tibshirani, 2018). The results are expressed
as the (weighted) balance between two groups of taxa, those that
contribute positively to the microbial signature and those that
contribute negatively (Susin et al. 2020).

The interpretability of results is of major importance in this context. The package provides several graphical representations for a better interpretation of the analysis and the identified microbial signatures.

**coda_glmnet**: Identification of microbial signatures in cross-sectional studies.The algorithm performs variable selection through penalized regression on the set of all pairwise log-ratios for both, binary outcome (logistic regression) and continuous outcome (linear regression). The result is expressed as the (weighted) balance between two groups of taxa. It allows the use of non-compositional covariates.

A graphical representation of the taxa that constitutes the signature and their coefficients is provided. The output also includes a plot of the prediction or classification accuracy.

**coda_glmnet_longitudinal**: Identification of microbial signatures in longitudinal studies.Identification of a set of microbial taxa whose joint dynamics is associated with the phenotype of interest (binary). The algorithm performs variable selection through penalized regression over the summary of the log-ratio trajectories (AUC). The result is expressed as the (weighted) balance between two groups of taxa.

The output provides three plots: the taxa that constitutes the signature and their coefficients, the classification accuracy of the signature and the plot of the signature trajectories of the individuals.

Previously or independently of variable selection for microbial signature identification, one may be interested in the exploratory analysis of pairwise log-ratios.

The interpretation of results of log-ratio analysis is challenging because when one taxon A is highly associated with the outcome, any log-ratio involving taxon A is likely to be associated with Y, no matter which is the second taxon involved in the log-ratio. Here we summarize the importance of each taxon A by aggregating the prediction accuracy of all log-ratios that involve taxon A.

**explore_logratios**: Explores the association of each log-ratio with the outcome.**explore_lr_longitudinal**: Explores the association of a summary (integral) of each log-ratio trajectory with the outcome.Both functions summarize the importance of each taxon as the aggregation of the association measures of those log-ratios involving the taxon. The output includes a plot of the association of the log-ratio with the outcome where the taxa are ranked by importance

**explore_zeros**:Provides the proportion of zeros for a pair of variables (taxa) in table x and the proportion of samples with zero in both variables. A bar plot with this information is also provided. Results can be stratified by a categorical variable.

**impute_zeros**:Simple imputation: When the abundance table contains zeros, a positive value is added to all the values in the table. It adds 1 when the minimum of table is larger than 1 (i.e. tables of counts) or it adds half of the minimum value of the table, otherwise.

**logratios_matrix**:Computes the matrix with of all pairwise log-ratios between taxa

**plot_prediction**:Plot of the predictions of a fitted model (microbial signature): Multiple box-plot and density plots for binary outcomes and Regression plot for continuous outcome

**plot_signature**:Graphical representation of the variables selected and their coefficients

**plot_signature_curves**:Graphical representation of the signature trajectories

**coda_glmnet_null**: Performs a permutational test for the`coda_glmnet()`

algorithmIt provides the distribution of results under the null hypothesis by implementing the

`coda_glmnet()`

on different rearrangements of the response variable.**filter_longitudinal**: Filters those individuals and taxa with enough longitudinal information**coda_glmnet_longitudinal_null**: Performs a permutational test for the`coda_glmnet_longitudinal()`

algorithmIt provides the distribution of results under the null hypothesis by implementing the

`coda_glmnet_longitudinal()`

on different rearrangements of the response variable.**shannon**: Shannon information**shannon_effnum**: Shannon effective number of variables in a composition**shannon_sim**: Shannon similarity between two compositions

**References**

Bates S and Tibshirani R (2019) Log-ratio lasso: Scalable, sparse estimation for log-ratio models. Biometrics 75(2):613-624.

Calle ML (2019) Statistical Analysis of Metagenomics Data. Genomics & Informatics 17 (1)

Friedman J, Hastie T, Tibshirani R (2010). “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software, 33(1), 1–22.

Susin A., Wang Y, Lê Cao K-A, Calle M.L. (2020) Variable selection in microbiome compositional data analysis. NAR Genomics and Bioinformatics, 2 (2)