Below we demonstrate an end-to-end basic CITE-seq analysis starting from UMI count alignment output files from Cell Ranger. Standard output files from Cell Ranger are perfectly set up to use dsb. Our method is also compatible with any alignment tool; see: using other alignment tools. We load unfiltered UMI data containing cells and empty droplets, perform QC on cells and background droplets, normalize with dsb, and demonstrate protein-based clustering and multimodal RNA+Protein joint clustering using dsb normalized values with Seurat’s Weighted Nearest Neighbor method.

Please cite the dsb manuscript if you used our software or found the experimental and modeling derivation of ADT noise sources in our paper helpful:
dsb manuscript

Recent publications using the dsb method
Recent Publications

Table of Contents

  1. Background and motivation
  2. Installation and quick overview
  3. Download public 10X Genomics data
  4. Step 1 A note on alignment of ADTs
  5. Step 2 Load RNA and ADT data and define droplet quality control metadata
  6. Step 3 Quality control on cell-containing and background droplets
  7. Step 4 Normalize protein data with the DSBNormalizeProtein Function
  8. Integrating dsb with Seurat
  9. Clustering cells based on dsb normalized protein using Seurat
  10. dsb derived cluster interpretation
  11. Weighted Nearest Neighbor multimodal clustering using dsb normalized values with Seurat

Topics covered in other vignettes on CRAN: Integrating dsb with Bioconductor, integrating dsb with python/Scanpy, Using dsb with data lacking isotype controls, integrating dsb with sample multiplexing experiments, using dsb on data with multiple batches, advanced usage - using a different scale / standardization based on empty droplet levels, returning internal stats used by dsb, outlier clipping with the quantile.clipping argument, other FAQ.

Background and motivation

Protein data derived from sequencing antibody derived tags (ADTs) in CITE-seq and other related assays has substantial background noise. Our paper outlines experiments and analysis designed to dissect sources of noise in ADT data we used to developed our method. We found all experiments measuring ADTs capture protein-specific background noise because ADT reads in empty / background drops (outnumbering cell-containing droplets > 10-fold in all experiments) were highly concordant with ADT levels in unstained spike-in cells. We therefore utilize background droplets which capture the ambient component of protein background noise to correct values in cells. We also remove technical cell-to-cell variations by defining each cell’s dsb “technical component”, a conservative adjustment factor derived by combining isotype control levels with each cell’s specific background level fitted with a single cell model.

Installation and quick overview

The method is carried out in a single step with a call to the DSBNormalizeProtein() function.
cells_citeseq_mtx - a raw ADT count matrix empty_drop_citeseq_mtx - a raw ADT count matrix from non-cell containing empty / background droplets.
denoise.counts = TRUE - implement step II to define and remove the ‘technical component’ of each cell’s protein library.
use.isotype.control = TRUE - include isotype controls in the modeled dsb technical component.

Download public 10X Genomics data

Download BOTH the Feature / cell matrix (filtered) and Feature / cell matrix (raw) count matrices here: public 10X Genomics CITE-seq data.

Step 1 A note on alignment of ADTs

Download public 10X Genomics data

Download BOTH the Feature / cell matrix (filtered) and Feature / cell matrix (raw) count matrices here: public 10X Genomics CITE-seq data.

To use dsb, we will load ADT and RNA data from cells and empty droplet aligned by Cell Ranger. dsb is also compatible with other aligners see: how to use Cite-Seq-Count and Kallisto with dsb.

Step 1 A note on alignment of ADTs

The Cell Ranger raw_feature_bc_matrix includes every possible cell barcode (columns) x genes / ADT (rows); about 7 Million barcodes for the V3 assay. A very small subset of those columns in the raw_feature_bc_matrix contain cells–those are separately provided in the filtered_feature_bc_matrix file. Another subset of the raw_feature_bc_matrix contain empty droplets capturing ambient antibodies. Instead of discarding that valuable sequencing data, we extract empty droplets capturing ADT using the code below and use them with dsb. Also please see note on Cell Ranger filtered output.

Here is a visual showing for the workflow we will follow starting from alignment and proceeding through QC and normalization

Step 2 Load RNA and ADT data and define droplet quality control metadata

Here we use the convenience function from Seurat Read10X which will automatically detect multiple assays and create two element list Gene Expression and Antibody Capture.

Now calculate some standard meta data for cells that we will use for quality control using standard approaches used in scRNAseq analysis pipelines.

we now have 103,075 barcodes in md with RNA and protein data only about 7,000 contain cells. With dsb we use a majority of this data instead of discarding everything except the cells.

Step 3 Quality control cells and background droplets

To narrow in on the major background droplet population, we visualize the log library size (total reads) of the mRNA and protein. The first two plots from the left help us identify low quality cells (high MT gene content) in both background droplets and the cell droplets. The right two plots show the density of points; yellow areas have more droplets. The major background distribution is clear in the first plot from the left (the yellow area of high density). In the third plot from the left, background drops are colored MT gene proportion so we can see some may be low-quality apoptotic cells. We next set upper and lower cutoffs on the background matrix library size. Note that as shown in Supplementary Fig 7 of our paper, Normalized values are robust to background thresholds used, so long as one does not omit the major background population present in all datasets.

More than 70,000 empty droplets are included after QC.

We next quality control cells in a similar manner with additional quality control metrics calculated as in any standard scRNAseq analysis, e.g. see Luecken et. al. 2019 Mol Syst Biol.

Check: are the number of cells passing QC in line with the expected recovery from the experiment?

[1] 4096

Yes. After quality control above we have 4096 cells which is in line with the ~5000 cells loaded in this experiment.

Now subset the metadata ADT and RNA matrices.

Optional step; remove proteins without staining

Proteins without raw data in stained cells (below, the maximum UMI count for one protein was 4, an order of magnitude below even the isotype controls) can be removed from both matrices prior to normalization. In many cases, removing proteins is not necessary, but we recommend checking your raw data.

prot pmax
CD34_TotalSeqB 4
CD80_TotalSeqB 60
CD274_TotalSeqB 75
IgG2b_control_TotalSeqB 90

Step 4 Normalize protein data with the DSBNormalizeProtein Function

We are now ready to use dsb to normalize and denoise the ADTs. We normalize the raw ADT matrix before creating a Seurat object since we also use the empty droplets.

The method is carried out in a single step with a call to the DSBNormalizeProtein() function.
cells_citeseq_mtx - the raw ADT UMI count matrix containing cells
empty_drop_citeseq_mtx - the raw ADT UMI count matrix containing background droplets
denoise.counts - we set to TRUE (the recommended default) to define and remove technical cell to cell variations. - a vector of the isotype control antibodies from the rownames of the raw ADT matrix defined from rownames(cell.adt.raw) (see vignettes for data without isotype controls). For data without isotype controls, see the vignette section Using dsb with data lacking isotype controls.

The function returns a matrix of normalized protein values which can be integrated with any single cell analysis software. We provide an example with Seurat below.

Advanced users may want to examine internal stats used in dsb, in that case use return.stats = TRUE. If the range of values after normalization is large, this is often due to a few low or high magnitude outliers. The simplest way to address these outliers is to clip the maximum values by setting quantile.clipping = TRUE. Finally a different pseudocount can be used with define.pseudocount = TRUE and pseudocount.use. Please see the vignettes on CRAN to learn more about different parameters that can be used in dsb

Integrating dsb with Seurat

Create a Seurat object. Make sure to add the dsb normalized matrix cell.adt.dsb to the data slot, not the counts slot.

Clustering cells based on dsb normalized protein using Seurat

Now we cluster cells based on dsb normalized protein levels. Similar to workflow used in our paper Kotliarov et al. 2020 we don’t cluster based on principal components from ADT, instead directly using the normalized values.

To see if we recovered the expected cell populations, it is often more informative and interpretable to first look at the summarized (median or mean) protein expression in each cluster–we recommend doing this before trying to look at cells in 2-d visualization plots like umap.

dsb derived cluster interpretation

dsb values are interpretable as the number of standard deviations of each protein from the expected noise (if using the default settings) with additional correction for cell to cell technical variations. One can use this to set a threshold across all proteins for positivity, e.g. expression below 3 or 4 can be deemed unexpressed. If normalized with the same parameters from dsb, the same threshold can be applied across multiple datasets.

Now we can cell type based on median dsb normalized protein level per cluster.

Weighted Nearest Neighbor multimodal clustering using dsb normalized values with Seurat

Below we demonstrate using Seurat’s Weighted Nearest Neighbor multimodal clustering method with dsb normalized values as input for the protein assay. WNN is an excellent way to find fine-grained cell states, especially in larger datasets than this small example data selected for purposes of demonstration.

Below we demonstrate WNN using dsb normalized and denoised ADT data using the default Seurat implementation by first reducing dimensionality of the protein data with PCA.

Method 2 (experimental) – Seurat WNN with dsb normalized protein directly without PCA

Below we show a version of WNN where we directly use normalized protein values without PCA compression. We have found this procedure works well for smaller (e.g. less than 30) protein panels, whereas datasets with many cells generated with recently available pre-titrated panels consisting of more than 100 or 200 proteins may benefit more from dimensionality reduction with PCA.

Now we define marker genes for each cluster and create an aggregated dataframe for visualization

With the aggregated data above we can make a joint heatmap of protein expression and mRNA expression. Below we visualize protein levels on the dsb scale that we used to cluster the cells.