Step 7: Cluster annotation

In Step 7, cluster annotation is performed to define the cell types comprising the clusters identified in Step 6. ScRNAbox provides three tools to identify cell types comprising the clusters:

Tool 1: Cluster marker gene identification and gene set enrichment analysis (GSEA)
Seurat's FindAllMarkers function is used to identify differentially expressed marker genes (DEG) by the Wilcoxon rank-sum test (Macosko et al. 2015). DEGs in the positive direction (Log2 fold-change > 0.00) are then tested for enrichment across user-defined gene set libraries that define cell types using the EnrichR tool (Chen et al. 2013).

Tool 2: Expression profiling of cell type markers and module scores
Users can visualize the expression of individual genes and the aggregated expression of multiple genes. For each gene in a user-defined list, plots are produced to visualize its expression at the cluster or cell level. The aggregated expression of genes in a user-defined list are calculated using the Seurat AddModuleScore function (Tirosh et al. 2016.

Tool 3: Cell type predictions based on reference data Seurat's FindTransferAnchors and TransferData functions are used to leverage cell-type annotations from a reference Seurat object and generate annotation predictions for the query dataset (Butlet et al. 2019).

Additionally, users can add cluster annotations to the Seurat object.


The following parameters are adjustable for Step 7 (~/working_directory/job_info/parameters/step7_par.txt):

Annotation tool Parameter Default Description
General par_save_RNA Yes Whether or not to export an RNA expression matrix
General par_save_metadata Yes Whether or not to export a metadata dataframe
General par_seurat_object NULL If users already have a Seurat object, they may provide the path to the Seurat object to initiate the pipeline at Step 7
General par_level_cluster integrated_snn_res.0.75 The cluster resolution that you want to annotate. If you skipped integration in Step 5, use par_level_cluster='RNA_snn_res.0.7', if you want to proceed with a clustering resolution of 0.7.
Tool 1 par_run_find_marker Yes Whether or not to find marker genes for each cluster
Tool 1 par_run_enrichR No Whether or not to run gene set enrichment analysis (GSEA) on the marker genes for each cluster using the EnrichR tools. Note that the HPC must have access to the internet to run GSEA.
Tool 1 par_top_sel 5 Number of top markers to identify based on avg_log2FC
Tool 1 par_db Descartes_Cell_Types_and_Tissue_2021,
CellMarker_Augmented_2021,
Azimuth_Cell_Types_2021
Character vector of EnrichR databases that define cell types. The top marker genes for each cluster will be tested for enrichment across these databases.
Tool 2 par_run_module_score Yes Whether or not to compute module score for aggregated expression
Tool 2 par_run_visualize_markers Yes Whether or not to visualize the expression of individual genes
Tool 2 par_module_score NULL Path to the csv file containing the gene sets for the module score
Tool 2 par_select_features_list NULL List of genes whose expression will be visualized individually
Tool 2 par_select_features_csv NULL If you want to define multiple lists of features to visualize individually, you can do so with a csv file. The header should contain the list names and all features belonging to the same list should be in the same column.
Tool 3 par_reference NULL Path defining the location of the reference Seurat object
Tool 3 par_reference_name Reference An arbitrary name for the reference object. This will be used to name the metadata slot.
Tool 3 par_level_celltype NULL The name of the metadata column in the reference Seurat object that defines cell types
Tool 3 par_FindTransferAnchors_dim 50 Number of dimensions from linear dimensional reduction used to find transfer anchors between the reference and query Seurat objects
Tool 3 par_futureglobalsmaxSize 60000 * 1024^2 This will increase your RAM usage so set this number mindfully
Annotate par_annotate_resolution integrated_snn_res.0.75 Which clustering resolution you want to annotate
Annotate par_name_metadata Celltypes1 The name of the metadata slot that will contain the annotations
Annotate par_annotate_labels NULL A list of cluster labels. There must as many labels as clusters at the defined clustering resolution. Please refrain from using "_" when annotating.

Tool 1: Cluster marker GSEA

To run cluster marker GSEA, use the following command:

bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 7 \
--markergsea T

The resulting output files are deposited into ~/working_directory/step7/*/marker. For a description of the outputs see here.


Note: In order to test the cluster marker genes for enrichment across EnrichR libraries, the HPC must have access to the internet. If your HPC cannot access the internet, you must set par_run_enrichR = "no". The pipeline will still run differential gene expression and find the markers for each cluster. You can then take the pipeline output and run the enrichment step on your local machine directly in R. To do so, begin by downloading the ClusterMarkers.csv file obtained from running the above command with par_run_find_marker = "yes" to your computer:

scp username@beluga.computecanada.ca:~/working_directory/step7/info7/marker/ClusterMarkers.csv ~/Desktop/working_directory

Then run the following code in R:

#set up the environment
library(Seurat)
library(dplyr)
library(ggplot2)
library(enrichR)


## set up the parameters
PWD <-'/path/directory/where/outputs/will/be/deposited'
cluster_marker <- '/path/to/ClusterMarkers.csv'
db <- c('Descartes_Cell_Types_and_Tissue_2021','CellMarker_Augmented_2021','Azimuth_Cell_Types_2021')

## set up the function
annotation<-function(PWD,cluster_marker,db) {
  library(Seurat)
  library(dplyr)
  library(ggplot2)
  library(enrichR)
  setwd(PWD)

  cluster_marker <- read.delim(cluster_marker, header = T, sep = ",") 

  dir.create("annot_enrich") 
  for (i in unique(cluster_marker$cluster)) {
    dir.create(paste0(PWD,"/annot_enrich", "/clust",i))
  }

  for (i in unique(cluster_marker$cluster)) {
    for (j in 1:length(db)) { 
    setwd(PWD)
    setwd(paste0('./annot_enrich/clust',i))
    N1.c0 <- cluster_marker %>% filter(cluster == i & avg_log2FC > 0)
    genes <- N1.c0$gene
    N1.c0.Er <- enrichr(genes, databases = db[j])
    if(is.null(N1.c0.Er)) next
    plotEnrich(N1.c0.Er[[1]], showTerms = 20, numChar = 40, y = "Count", orderBy = "P.value") +
      ggtitle(paste0(db[j], " cluster ", i))
    ggsave(file = paste0("plotenrich_clust_",i, "_", j,".pdf"))
    N1.Er.genes.1 <- N1.c0.Er[[1]] %>% dplyr::select(Term, Genes, Combined.Score)
    write.csv(N1.Er.genes.1,paste0("Er_genes_clust_",i,"_",db[j],".csv"))
  }
}
}

## perform enrichment
annotation(PWD,cluster_marker,db)

Note: Users can define whichever libraries they want. For more information regarding the available libraries, see here.


Tool 2: Expression profiling of known marker genes

To profile the expression known marker genes, use the following command:

bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 7 \
--knownmarkers T

The resulting output files are deposited into ~/working_directory/step7/*/visualize_features for individual expression or ~/working_directory/step7/*/module_score for aggregated expression. For a description of the outputs see here.


Tool 3: Reference-based annotation

To perform reference-based annotation, use the following command:

bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 7 \
--referenceannotation T

The resulting output files are deposited into ~/working_directory/step7/*/reference_based_annotation. For a description of the outputs see here.


Adding annotations

To add cluster annotations to the Seurat object's metadata, use the following command:

bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 7 \
--annotate T

The resulting output files are deposited into ~/working_directory/step7/*/annotate. For a description of the outputs see here.

Note: The cluster annotations from each iteration of the step will be retained, allowing users to define multiple clustering levels.