Introduction to clustermole

Alternative title: blindly digging for cell types in scRNA-seq clusters with clustermole

Overview

A typical computational pipeline to process single-cell RNA sequencing (scRNA-seq) data involves clustering of cells. Assigning cell type labels to those clusters is often a time-consuming process that involves manual inspection of the cluster marker genes, complemented by a detailed literature search. This is especially challenging if you are not familiar with all the captured subpopulations or have unexpected contaminants. clustermole is an R package that provides a comprehensive meta-collection of cell identity markers for thousands of human and mouse cell types, sourced from a variety of databases, as well as methods to query them.

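The clustermole package includes three primary features: overlap of a set of genes with known cell type markers, relative enrichment of cell type markers in an expression matrix, and retrieval of the cell type markers themselves.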

Usage

Install clustermole if it is not yet available on your system.

install.packages("clustermole")

Load clustermole.

library(clustermole)

Overlap a set of genes with cell type markers

If you have a set of genes (for example, cluster markers), you can perform overrepresentation analysis to see if they overlap any of the known cell type markers.
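
A minimal sketch of such a query follows. It assumes that `clustermole_overlaps()` accepts a character vector of gene symbols as `genes` and a species code as `species` (`"hs"` for human, the same code used with `clustermole_markers()` in this document). The gene vector `my_genes` is a made-up illustration rather than real cluster output, and the call is skipped if the package is not installed.

```r
# Hypothetical marker genes for one cluster (illustrative only).
my_genes <- c("CD2", "CD3D", "CD3E", "CD3G", "TRAC", "TRBC2", "LTB")

# Clean the input: drop missing values and duplicates, keep symbols as character.
my_genes <- unique(as.character(my_genes[!is.na(my_genes)]))

if (requireNamespace("clustermole", quietly = TRUE)) {
  # Assumed signature: clustermole_overlaps(genes, species).
  my_overlaps <- clustermole::clustermole_overlaps(genes = my_genes, species = "hs")
  # Inspect the first few rows of the result.
  print(head(my_overlaps))
}
```

In practice, `my_genes` would come from your own differential expression results for a single cluster.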

Determine relative enrichment of cell type markers in the input expression data

If you have a matrix of expression values (for example, average expression per cluster), you can perform cell type enrichment on it. The values should be log-transformed CPM/TPM/FPKM.
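
The input preparation can be sketched in base R as below. The toy counts, the gene and cluster names, and the choice of `log2` with a pseudocount of 1 are all assumptions made for illustration. The layout (genes as rows, clusters as columns) and the `clustermole_enrichment(expr_mat, species)` signature are also assumptions about the package interface, so the call itself is shown only as a comment; a three-gene matrix is far too small for a meaningful enrichment run.

```r
# Toy raw counts: genes as rows, clusters as columns (values are made up).
counts <- matrix(
  c(10,  0,  5,
    90, 50, 45,
     0, 50, 50),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("GENE_A", "GENE_B", "GENE_C"),
    c("cluster_1", "cluster_2", "cluster_3")
  )
)

# Counts per million: scale each column so that it sums to 1e6.
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Log-transform with a pseudocount of 1 to avoid log(0).
log_cpm <- log2(cpm + 1)

# With a real, genome-wide matrix the call would look like this
# (assumed signature: clustermole_enrichment(expr_mat, species)):
# my_enrichment <- clustermole::clustermole_enrichment(expr_mat = log_cpm, species = "hs")
```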

Retrieve cell type markers

You can retrieve a data frame of all cell type markers in the database.

Each row contains a gene and a cell type associated with it. The gene column is the gene symbol (human or mouse versions can be retrieved) and the celltype_full column contains the full cell type string, including the species and the original database.

If you need to convert the markers from a data frame to a list format for other applications, you can use gene as the values and celltype_full as the grouping variable.
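
One way to do this conversion is with base R `split()`. The sketch below uses a small stand-in data frame, `markers_toy`, with invented cell type strings so that it runs without the package; with the real data frame you would pass its `gene` and `celltype_full` columns in the same way.

```r
# Toy stand-in for the markers data frame (only the two relevant columns).
markers_toy <- data.frame(
  gene = c("CD3D", "CD3E", "MS4A1", "CD79A", "CD3D"),
  celltype_full = c(
    "T cell | Human | ExampleDB", "T cell | Human | ExampleDB",
    "B cell | Human | ExampleDB", "B cell | Human | ExampleDB",
    "T cell | Mouse | ExampleDB"
  ),
  stringsAsFactors = FALSE
)

# split() groups the gene values by the celltype_full string.
markers_list <- split(x = markers_toy$gene, f = markers_toy$celltype_full)

# Each list element is a character vector of genes, named by its cell type.
str(markers_list)
```

The resulting named list is the shape commonly expected by gene set tools that take one character vector per set.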

Collection details

We will use dplyr to help with summary statistics.

library(dplyr)

Retrieve a data frame of all cell type markers in the database.

markers <- clustermole_markers(species = "hs")
markers
#> # A tibble: 163,509 x 8
#>    db     species organ  celltype  celltype_full     n_genes gene_original gene 
#>    <chr>  <chr>   <chr>  <chr>     <chr>               <int> <chr>         <chr>
#>  1 CellM… Human   Embryo 1-cell s… 1-cell stage cel…      45 ACCSL         ACCSL
#>  2 CellM… Human   Embryo 1-cell s… 1-cell stage cel…      45 ACVR1B        ACVR…
#>  3 CellM… Human   Embryo 1-cell s… 1-cell stage cel…      45 ARHGEF16      ARHG…
#>  4 CellM… Human   Embryo 1-cell s… 1-cell stage cel…      45 ASF1B         ASF1B
#>  5 CellM… Human   Embryo 1-cell s… 1-cell stage cel…      45 BCL2L10       BCL2…
#>  6 CellM… Human   Embryo 1-cell s… 1-cell stage cel…      45 BLCAP         BLCAP
#>  7 CellM… Human   Embryo 1-cell s… 1-cell stage cel…      45 BNIP1         BNIP1
#>  8 CellM… Human   Embryo 1-cell s… 1-cell stage cel…      45 C1orf210      C1or…
#>  9 CellM… Human   Embryo 1-cell s… 1-cell stage cel…      45 C1orf226      C1or…
#> 10 CellM… Human   Embryo 1-cell s… 1-cell stage cel…      45 CASC3         CASC3
#> # … with 163,499 more rows

Check the number of available cell types.

markers %>% distinct(celltype_full) %>% nrow()
#> [1] 2563

Check the number of available cell types per species (the species is not annotated for every cell type, so some entries are blank).

markers %>% distinct(celltype_full, species) %>% count(species, sort = TRUE)
#> # A tibble: 3 x 2
#>   species     n
#>   <chr>   <int>
#> 1 Human    1618
#> 2 Mouse     730
#> 3 ""        215

Check the number of available cell types per organ (the organ is not annotated for every cell type, so some entries are blank).

markers %>% distinct(celltype_full, organ) %>% count(organ, sort = TRUE)
#> # A tibble: 117 x 2
#>    organ                      n
#>    <chr>                  <int>
#>  1 ""                      1282
#>  2 Brain                    127
#>  3 Central Nervous System    88
#>  4 Digestive System          63
#>  5 Kidney                    56
#>  6 Lung                      52
#>  7 Bone marrow               51
#>  8 Immune system             50
#>  9 Peripheral blood          46
#> 10 Hematopoietic system      44
#> # … with 107 more rows

Check package version.

packageVersion("clustermole")
#> [1] '1.0.0'