First head-to-head against HDF5Array #12

@LTLA

This is a condensed version of a real application involving PCA on sparse log-transformed expression values:

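library(SingleCellExperiment) # makes counts() available; attached per the session information below
sce <- scRNAseq::MacoskoRetinaData()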
y <- scuttle::normalizeCounts(counts(sce))
dim(y)
## [1] 24658 49300

library(BiocSingular)
library(HDF5Array)
system.time(hdf.mat <- writeHDF5Array(y, filepath="macosko.h5", name="logcounts"))
##    user  system elapsed 
## 144.265   3.220 147.627 
system.time(hdf.pcs <- runPCA(t(hdf.mat), 10, BSPARAM=RandomParam(deferred=TRUE)))
##    user  system elapsed 
## 861.133  57.775 918.967 

library(TileDBArray)
system.time(tdb.mat <- writeTileDBArray(y, path="macosko_tdb", attr="logcounts"))
##    user  system elapsed 
##  66.415   1.717  20.009 
system.time(tdb.pcs <- runPCA(t(tdb.mat), 10, BSPARAM=RandomParam(deferred=TRUE)))
##    user  system elapsed 
## 888.668 167.635 347.845 
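
For a quick read of the timings above, the elapsed times can be compared directly. This is only a sketch: the numbers are copied from the `system.time()` output, and the `elapsed` and `speedup` names are made up for illustration.

```r
# Elapsed seconds copied from the system.time() output above.
elapsed <- rbind(
    HDF5Array   = c(write = 147.627, pca = 918.967),
    TileDBArray = c(write = 20.009,  pca = 347.845)
)

# Ratio of elapsed times for each step (HDF5Array relative to TileDBArray).
speedup <- elapsed["HDF5Array", ] / elapsed["TileDBArray", ]
round(speedup, 1)
```

On this single run, that works out to roughly 7.4x for the write step and 2.6x for the PCA in elapsed time.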

Note that this is not quite a fair comparison:

Nonetheless, these results are encouraging, given that no effort has been made to optimize the TileDB calls either. For starters, I suspect the tile extents are too small (they default to 100 in each dimension).
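
To get a feel for what an extent of 100 means at this scale, here is a rough count of the tiles needed to cover the matrix. It is a back-of-the-envelope sketch only: `n_tiles()` is a hypothetical helper, the extent of 1000 is an arbitrary value for comparison rather than a recommendation, and it assumes the tiles simply partition the full domain of the array.

```r
# Dimensions reported by dim(y) above.
dims <- c(genes = 24658, cells = 49300)

# Hypothetical helper: number of tiles needed to cover the matrix
# when the same extent is used in each dimension.
n_tiles <- function(dims, extent) prod(ceiling(dims / extent))

n_tiles(dims, 100)   # default extent mentioned above
n_tiles(dims, 1000)  # arbitrary larger extent, for comparison only
```

This gives 121771 tiles at the default extent versus 1250 at an extent of 1000. Whether fewer, larger tiles actually help here depends on how TileDB stores and reads the array, which has not been tested.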

Session information
R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] TileDBArray_0.0.1           HDF5Array_1.17.2           
 [3] rhdf5_2.33.3                BiocSingular_1.5.0         
 [5] scRNAseq_2.3.6              SingleCellExperiment_1.11.5
 [7] SummarizedExperiment_1.19.5 DelayedArray_0.15.5        
 [9] matrixStats_0.56.0          Matrix_1.2-18              
[11] Biobase_2.49.0              GenomicRanges_1.41.5       
[13] GenomeInfoDb_1.25.2         IRanges_2.23.10            
[15] S4Vectors_0.27.12           BiocGenerics_0.35.4        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6                  rsvd_1.0.3                   
 [3] lattice_0.20-41               zoo_1.8-8                    
 [5] assertthat_0.2.1              digest_0.6.25                
 [7] mime_0.9                      BiocFileCache_1.13.0         
 [9] R6_2.4.1                      RSQLite_2.2.0                
[11] httr_1.4.1                    pillar_1.4.4                 
[13] zlibbioc_1.35.0               rlang_0.4.6                  
[15] curl_4.3                      irlba_2.3.3                  
[17] tiledb_0.7.0                  blob_1.2.1                   
[19] BiocParallel_1.23.0           RcppCCTZ_0.2.7               
[21] AnnotationHub_2.21.1          RCurl_1.98-1.2               
[23] bit_1.1-15.2                  shiny_1.4.0.2                
[25] compiler_4.0.0                httpuv_1.5.4                 
[27] base64enc_0.1-3               pkgconfig_2.0.3              
[29] htmltools_0.5.0               tidyselect_1.1.0             
[31] tibble_3.0.1                  GenomeInfoDbData_1.2.3       
[33] interactiveDisplayBase_1.27.5 crayon_1.3.4                 
[35] dplyr_1.0.0                   dbplyr_1.4.4                 
[37] later_1.1.0.1                 rhdf5filters_1.1.0           
[39] bitops_1.0-6                  rappdirs_0.3.1               
[41] grid_4.0.0                    xtable_1.8-4                 
[43] lifecycle_0.2.0               DBI_1.1.0                    
[45] magrittr_1.5                  scuttle_0.99.9               
[47] XVector_0.29.2                promises_1.1.1               
[49] DelayedMatrixStats_1.11.0     ellipsis_0.3.1               
[51] generics_0.0.2                vctrs_0.3.1                  
[53] Rhdf5lib_1.11.2               tools_4.0.0                  
[55] bit64_0.9-7                   nanotime_0.2.4               
[57] glue_1.4.1                    purrr_0.3.4                  
[59] BiocVersion_3.12.0            fastmap_1.0.1                
[61] yaml_2.2.1                    AnnotationDbi_1.51.0         
[63] BiocManager_1.30.10           ExperimentHub_1.15.0         
[65] memoise_1.1.0                
