This is a condensed version of a real application involving PCA on sparse log-transformed expression values:
library(SingleCellExperiment) # attaches counts(); scRNAseq:: alone only loads the namespace
sce <- scRNAseq::MacoskoRetinaData()
y <- scuttle::normalizeCounts(counts(sce))
dim(y)
## [1] 24658 49300
library(BiocSingular)
library(HDF5Array)
system.time(hdf.mat <- writeHDF5Array(y, filepath="macosko.h5", name="logcounts"))
## user system elapsed
## 144.265 3.220 147.627
system.time(hdf.pcs <- runPCA(t(hdf.mat), 10, BSPARAM=RandomParam(deferred=TRUE)))
## user system elapsed
## 861.133 57.775 918.967
library(TileDBArray)
system.time(tdb.mat <- writeTileDBArray(y, path="macosko_tdb", attr="logcounts"))
## user system elapsed
## 66.415 1.717 20.009
system.time(tdb.pcs <- runPCA(t(tdb.mat), 10, BSPARAM=RandomParam(deferred=TRUE)))
## user system elapsed
## 888.668 167.635 347.845
Note that this is not quite a fair comparison:
- HDF5 library reads and writes are single-threaded, while the TileDB library will happily use multiple cores.
- HDF5Array writes for sparse matrices are currently rather inefficient; see Bioconductor/HDF5Array#30 ("write_block fails for SparseArraySeed").
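The threading difference can be read off the timings themselves. The following is a minimal base-R sketch that copies the four `system.time()` results from above and computes CPU time per second of wall-clock time; a ratio near 1 suggests a single busy core, while a ratio well above 1 suggests several cores were in use. The data frame and column names (`timings`, `cpu_per_elapsed`, `elapsed_ratio`) are made up for illustration.

```r
# Timings (in seconds) copied from the system.time() output above.
timings <- data.frame(
    step    = c("HDF5 write", "HDF5 PCA", "TileDB write", "TileDB PCA"),
    user    = c(144.265, 861.133, 66.415, 888.668),
    system  = c(3.220, 57.775, 1.717, 167.635),
    elapsed = c(147.627, 918.967, 20.009, 347.845)
)

# CPU seconds consumed per wall-clock second: roughly the number of busy cores.
timings$cpu_per_elapsed <- (timings$user + timings$system) / timings$elapsed

# Wall-clock ratio of HDF5 to TileDB for each step.
elapsed_ratio <- c(
    write = timings$elapsed[1] / timings$elapsed[3],
    pca   = timings$elapsed[2] / timings$elapsed[4]
)

print(timings)
print(round(elapsed_ratio, 2))
```

On these numbers, both HDF5 steps sit at about 1 CPU-second per elapsed second, whereas both TileDB steps are above 3, which is consistent with the multi-core caveat above. The total CPU time for the PCA is actually higher with TileDB; the wall-clock gain comes from spreading that work across cores.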
Nonetheless, these results are encouraging given that no effort has been made to optimize the TileDB calls either. For starters, I suspect the tile extents are too small (they default to 100 in each dimension).
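To get a rough sense of why an extent of 100 might be too small, here is a back-of-the-envelope sketch in base R. It assumes tiles form a regular grid over the 24658 x 49300 matrix, which is a simplification (TileDB's handling of sparse arrays may differ), and `n_tiles` is a hypothetical helper, not part of TileDBArray. The extent of 1000 is only an example value, not a recommendation.

```r
# Hypothetical helper: number of grid tiles needed to cover a matrix,
# assuming the same tile extent is used in every dimension.
n_tiles <- function(dims, extent) {
    prod(ceiling(dims / extent))
}

dims <- c(24658, 49300) # from dim(y) above

tiles_default <- n_tiles(dims, 100)  # default extent of 100
tiles_larger  <- n_tiles(dims, 1000) # illustrative larger extent

# Tiles crossed when reading one full column of y (one row of t(y)).
tiles_per_column <- ceiling(dims[1] / 100)

print(c(default = tiles_default, larger = tiles_larger,
        per_column = tiles_per_column))
```

Under this assumption the default extent yields over a hundred thousand tiles, and every full-column read touches a couple of hundred of them, so per-tile overhead could plausibly add up. Whether larger extents actually help would need to be measured.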
Session information
R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS
Matrix products: default
BLAS: /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] TileDBArray_0.0.1 HDF5Array_1.17.2
[3] rhdf5_2.33.3 BiocSingular_1.5.0
[5] scRNAseq_2.3.6 SingleCellExperiment_1.11.5
[7] SummarizedExperiment_1.19.5 DelayedArray_0.15.5
[9] matrixStats_0.56.0 Matrix_1.2-18
[11] Biobase_2.49.0 GenomicRanges_1.41.5
[13] GenomeInfoDb_1.25.2 IRanges_2.23.10
[15] S4Vectors_0.27.12 BiocGenerics_0.35.4
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 rsvd_1.0.3
[3] lattice_0.20-41 zoo_1.8-8
[5] assertthat_0.2.1 digest_0.6.25
[7] mime_0.9 BiocFileCache_1.13.0
[9] R6_2.4.1 RSQLite_2.2.0
[11] httr_1.4.1 pillar_1.4.4
[13] zlibbioc_1.35.0 rlang_0.4.6
[15] curl_4.3 irlba_2.3.3
[17] tiledb_0.7.0 blob_1.2.1
[19] BiocParallel_1.23.0 RcppCCTZ_0.2.7
[21] AnnotationHub_2.21.1 RCurl_1.98-1.2
[23] bit_1.1-15.2 shiny_1.4.0.2
[25] compiler_4.0.0 httpuv_1.5.4
[27] base64enc_0.1-3 pkgconfig_2.0.3
[29] htmltools_0.5.0 tidyselect_1.1.0
[31] tibble_3.0.1 GenomeInfoDbData_1.2.3
[33] interactiveDisplayBase_1.27.5 crayon_1.3.4
[35] dplyr_1.0.0 dbplyr_1.4.4
[37] later_1.1.0.1 rhdf5filters_1.1.0
[39] bitops_1.0-6 rappdirs_0.3.1
[41] grid_4.0.0 xtable_1.8-4
[43] lifecycle_0.2.0 DBI_1.1.0
[45] magrittr_1.5 scuttle_0.99.9
[47] XVector_0.29.2 promises_1.1.1
[49] DelayedMatrixStats_1.11.0 ellipsis_0.3.1
[51] generics_0.0.2 vctrs_0.3.1
[53] Rhdf5lib_1.11.2 tools_4.0.0
[55] bit64_0.9-7 nanotime_0.2.4
[57] glue_1.4.1 purrr_0.3.4
[59] BiocVersion_3.12.0 fastmap_1.0.1
[61] yaml_2.2.1 AnnotationDbi_1.51.0
[63] BiocManager_1.30.10 ExperimentHub_1.15.0
[65] memoise_1.1.0