4 Scaling coverage data

In this example we will look at the two scaling approaches that can be utilized in the SIPmg package. The user can decide either

  1. A robust linear regression model
  2. An ordinary least squares linear regression model We will also briefly discuss data filtering methods when there are heavily influencing outliers, as a result of upstream methods.

4.1 Basic approach for scaling

Sequencing effort on each fraction and subsequently obtained coverage is typically used to estimate relative abundance for a feature of interest. However, as we discussed in the introductory vignette, this data is compositional and a quantitative estimate of concentrations can be highly informative. For this purpose, synthetic spike-in standards (sequins) with known concentrations are added before recovery of fractionated DNA, from the CsCl medium into TE buffer or water. For details on sequins please refer to Hardwick et al. (2018). These reference standards in each fraction, can now be used to estimate the concentration of the feature of interest in each fraction.

The approach used for scaling is detailed in the following steps:

  1. For each fraction, sequin coverages and known concentrations are obtained to make a standard curve.

  2. Limit of detection of sequins, i.e., the lowest concentration where at least one sequin has detectable coverage, is determined for each group of sequin concentrations.

  3. For each group of sequin concentrations, mean, standard deviation, and the coefficient of variation of coverage are determined. Groups of sequins with coefficient of variation greater than the set threshold will be flagged.

  4. Sequin groups with coverage values above the limit of detection and below the coefficient of variation threshold are filtered in preparation for regression.

  5. Linear regression or robust linear regression, based on the user choice, is performed on sequin coverage values as the independent variable and concentration of the sequin group as the dependent variable. The user can also decide on performing log scaling of coverage and concentration values for regression.

  6. The regression parameters are extracted and plots with regression parameters and best fit line are saved for inspection.

  7. Coverage values above the limit of detection and below the coefficient of variation threshold are filtered for further analysis. The rest of the values are flagged and reported.

  8. The filtered values are scaled based on regression parameters to estimate absolute concentrations in each library.

  9. Absolute concentrations of biological features are saved as a fresh dataset for the subsequent isotope incorporator identification and AFE estimation pipeline.

4.2 Choice of regression model

Ordinary least squares regression can be very sensitive to outliers, resulting a poor model performance when the data is contaminated with outliers. Although there are methods, like Cook’s distance based filtering to address outliers, it may not always be the best idea to remove these data points.

See these discussions and more for better insights into handling outliers.

Robust linear regression is an alternative to this situation. Robust linear regression assigns appropriate weights to the data points, minimizing the influence of outliers on the model performance, without deleting data. In this pipeline, a Huber loss function from the MASS R package was used for robust linear regression.

The user is free to choose between the two methods for regression.

4.3 Scaling the data

For illustrating the differences in the regression models, a different dataset, that was impacted by one or many upstream methods, will be used than the one in the introductory vignette.

This dataset can be found here

For both scaling functions the following data and parameters are required

  1. Coverage data. The output of checkm coverage command can be directly used.

  2. Sequin dilutions and metadata.

  3. Whether or not log10 scaling to be performed (a BOOLEAN value of TRUE or FALSE).

  4. Coefficient of variation. In this dataset, a higher coefficient of variation compared to the value used in the example workflow of the introduction vignette is used to allow data scaling to occur.

If the robust linear regression is chosen, the function is scale_features_rlm which is performed the function file sequin_scaling_rlm.R.

mag_tab_scaled_rlm <- SIPmg::scale_features_rlm(f_tibble, sequins, seq_dil, log_scale, coe_of_variation = coe_of_variation, save_plots = FALSE)

If a linear regression is chosen, the function is scale_features_lm which is performed by the function file sequin_scaling_lm.R.

If the user chooses linear regression, they can choose to perform outlier filtering based on the traditional Cook’s distance threshold of 4/n (n being the sample size). Only the outliers in the sequin coverage data are flagged as outliers with the Cook’s distance threshold and are filtered out. Later, the remaining data is subjected to linear regression to obtain model parameters and to scale coverage values of feature of interest.

mag_tab_scaled_lm <- SIPmg::scale_features_lm(f_tibble, sequins, seq_dil, log_scale, coe_of_variation = coe_of_variation, cook_filtering = TRUE, save_plots = FALSE)

4.4 Comparison of the fits

Robust linear regression provides a better model fit compared to the linear regression. Additionally, robust linear regression has less variability compared to the linear regression model.

As discussed previously, removing the outliers may not always be the best idea. However, the outliers can be visualized to assess how “far out” are the outliers and how influential the data points are, negatively influencing the linear regression method.

The mag_tab_scaled_lm function provides the plots for the visualization of Cook’s distance for the sequin data in each fraction. In the plot below, an outlier is visualized. For this data, the Cook’s distance threshold was 0.09, and the outlier had a Cook’s distance of 10.8. The highly influential data point could well be the reason the linear regression model had a poor fit.

#Cook's distance threshold of the data set
4/(length(mag_tab_scaled_lm$scale_fac$cooksd[[3]]))
## [1] 0.1142857
#Cook's distance of the outlier before filtering the data
max(mag_tab_scaled_lm$scale_fac$cooksd[[3]])
## [1] 0.3588489
#Before filtration
cooksd_example = knitr::include_graphics("cooksd-example.png")

Upon filtering the data to remove the outlier, the fit becomes much better for the linear regression model

#Linear regression plot with filtered outliers in sequin data
filtered_lm_example = knitr::include_graphics("filtered_lm-example.png")

4.5 What if there were no sequins added in the study

In our study, we realized that the sequin-scaled data provided the best correlation (Spearman correlation coefficient = 0.85), compared to methods without the use of sequins. However, not all studies would have access to sequins or could choose not to add sequins. We tested qSIP methods with absolute concentration of MAGs obtained as a product of fraction DNA concentration and either with MAG relative abundance (as performed by Greenlon et al.) or relative coverage (as performed by Starr et al.), or simply with relative coverage, i.e., without converting the MAG coverages to absolute concentration. We found that without sequins, simply using relative coverage provided better specificity (0.99) and good correlation (Spearman correlation coefficient = 0.76), compared to the other two methods. The user could choose any of the three approaches to obtain normalized coverages using the equation coverage_normalization() where the parameter approach would be chosen accordingly (either “greenlon”, “starr”, or “relative_coverage”). In this example, simple relative coverage will be estimated without converting to absolute concentration based on fraction DNA concentrations. The relative coverage table can then be converted to a phyloseq object to run incorporator identification and/or AFE estimation. For more details, please refer to the incorporator identification section

f_tibble <- readr::read_csv("mock_input_data/coverage_metadata.csv")
rel.cov = SIPmg::coverage_normalization(f_tibble = f_tibble, approach = "relative_coverage")
mag.table = phyloseq::otu_table(as.matrix(rel.cov %>% tibble::column_to_rownames(var = "Feature")), taxa_are_rows = TRUE) #Phyloseq OTU table

4.6 Decision on the method to scale coverages

It can be seen from the above plots that the regression parameters differ based on the method of regression and whether or not data filtering is used. These regression parameters directly influence the estimation of concentrations of features of interest in the microbiome. Thus, the method for regression must be a careful choice.

sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Canada.utf8  LC_CTYPE=English_Canada.utf8   
## [3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] SIPmg_1.4           ggpubr_0.5.0        HTSSIP_1.4.1       
##  [4] phyloseq_1.42.0     forcats_0.5.2       stringr_1.5.0      
##  [7] dplyr_1.0.10        purrr_0.3.5         readr_2.1.3        
## [10] tidyr_1.2.1         tibble_3.1.8        ggplot2_3.4.0      
## [13] tidyverse_1.3.2     BiocManager_1.30.19
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.4.1                backports_1.4.1            
##   [3] plyr_1.8.8                  igraph_1.3.5               
##   [5] lazyeval_0.2.2              splines_4.2.2              
##   [7] BiocParallel_1.32.4         GenomeInfoDb_1.34.4        
##   [9] digest_0.6.30               foreach_1.5.2              
##  [11] htmltools_0.5.4             fansi_1.0.3                
##  [13] magrittr_2.0.3              memoise_2.0.1              
##  [15] googlesheets4_1.0.1         cluster_2.1.4              
##  [17] tzdb_0.3.0                  Biostrings_2.66.0          
##  [19] annotate_1.76.0             modelr_0.1.10              
##  [21] matrixStats_0.63.0          vroom_1.6.0                
##  [23] timechange_0.1.1            colorspace_2.0-3           
##  [25] blob_1.2.3                  rvest_1.0.3                
##  [27] haven_2.5.1                 xfun_0.35                  
##  [29] crayon_1.5.2                RCurl_1.98-1.9             
##  [31] jsonlite_1.8.4              survival_3.4-0             
##  [33] iterators_1.0.14            ape_5.6-2                  
##  [35] glue_1.6.2                  gtable_0.3.1               
##  [37] gargle_1.2.1                zlibbioc_1.44.0            
##  [39] XVector_0.38.0              DelayedArray_0.24.0        
##  [41] car_3.1-1                   Rhdf5lib_1.20.0            
##  [43] BiocGenerics_0.44.0         abind_1.4-5                
##  [45] scales_1.2.1                DBI_1.1.3                  
##  [47] rstatix_0.7.1               Rcpp_1.0.9                 
##  [49] xtable_1.8-4                bit_4.0.5                  
##  [51] stats4_4.2.2                httr_1.4.4                 
##  [53] RColorBrewer_1.1-3          ellipsis_0.3.2             
##  [55] farver_2.1.1                pkgconfig_2.0.3            
##  [57] XML_3.99-0.13               sass_0.4.4                 
##  [59] dbplyr_2.2.1                locfit_1.5-9.6             
##  [61] utf8_1.2.2                  labeling_0.4.2             
##  [63] polynom_1.4-1               tidyselect_1.2.0           
##  [65] rlang_1.0.6                 reshape2_1.4.4             
##  [67] AnnotationDbi_1.60.0        munsell_0.5.0              
##  [69] cellranger_1.1.0            tools_4.2.2                
##  [71] cachem_1.0.6                cli_3.4.1                  
##  [73] generics_0.1.3              RSQLite_2.2.19             
##  [75] ade4_1.7-20                 broom_1.0.1                
##  [77] evaluate_0.19               biomformat_1.26.0          
##  [79] fastmap_1.1.0               yaml_2.3.6                 
##  [81] knitr_1.41                  bit64_4.0.5                
##  [83] fs_1.5.2                    KEGGREST_1.38.0            
##  [85] nlme_3.1-160                xml2_1.3.3                 
##  [87] compiler_4.2.2              rstudioapi_0.14            
##  [89] png_0.1-8                   ggsignif_0.6.4             
##  [91] reprex_2.0.2                geneplotter_1.76.0         
##  [93] bslib_0.4.2                 stringi_1.7.8              
##  [95] highr_0.10                  lattice_0.20-45            
##  [97] Matrix_1.5-3                vegan_2.6-4                
##  [99] permute_0.9-7               multtest_2.54.0            
## [101] vctrs_0.5.1                 pillar_1.8.1               
## [103] lifecycle_1.0.3             rhdf5filters_1.10.0        
## [105] jquerylib_0.1.4             data.table_1.14.6          
## [107] bitops_1.0-7                GenomicRanges_1.50.1       
## [109] R6_2.5.1                    bookdown_0.31              
## [111] IRanges_2.32.0              codetools_0.2-18           
## [113] MASS_7.3-58.1               assertthat_0.2.1           
## [115] rhdf5_2.42.0                SummarizedExperiment_1.28.0
## [117] DESeq2_1.38.1               withr_2.5.0                
## [119] S4Vectors_0.36.1            GenomeInfoDbData_1.2.9     
## [121] mgcv_1.8-41                 parallel_4.2.2             
## [123] hms_1.1.2                   grid_4.2.2                 
## [125] rmarkdown_2.19              MatrixGenerics_1.10.0      
## [127] carData_3.0-5               googledrive_2.0.0          
## [129] Biobase_2.58.0              lubridate_1.9.0