Step 7: Cluster annotation

In Step 7, cluster annotation is performed to define the cell types comprising the clusters identified in Step 6. ScRNAbox provides three tools to identify cell types comprising the clusters:

Tool 1: Cluster marker gene identification and gene set enrichment analysis (GSEA)
Seurat's FindAllMarkers function is used to identify differentially expressed marker genes (DEG) by the Wilcoxon rank-sum test (Macosko et al. 2015). DEGs in the positive direction (Log2 fold-change > 0.00) are then tested for enrichment across user-defined gene set libraries that define cell types using the EnrichR tool (Chen et al. 2013).

Tool 2: Expression profiling of cell type markers and module scores
Users can visualize the expression of individual genes and the aggregated expression of multiple genes. For each gene in a user-defined list, plots are produced to visualize its expression at the cluster or cell level. The aggregated expression of genes in a user-defined list are calculated using the Seurat AddModuleScore function (Tirosh et al. 2016.

Tool 3: Cell type predictions based on reference data Seurat's FindTransferAnchors and TransferData functions are used to leverage cell-type annotations from a reference Seurat object and generate annotation predictions for the query dataset (Butlet et al. 2019).

Additionally, users can add cluster annotations to the Seurat object.

The following parameters are adjustable for Step 7 (~/working_directory/job_info/parameters/step7_par.txt):

Annotation tool	Parameter	Default	Description
General	par_save_RNA	Yes	Whether or not to export an RNA expression matrix
General	par_save_metadata	Yes	Whether or not to export a metadata dataframe
General	par_seurat_object	NULL	If users already have a Seurat object, they may provide the path to the Seurat object to initiate the pipeline at Step 7
General	par_level_cluster	integrated_snn_res.0.75	The cluster resolution that you want to annotate. If you skipped integration in Step 5, use par_level_cluster='RNA_snn_res.0.7', if you want to proceed with a clustering resolution of 0.7.
Tool 1	par_run_find_marker	Yes	Whether or not to find marker genes for each cluster
Tool 1	par_run_enrichR	No	Whether or not to run gene set enrichment analysis (GSEA) on the marker genes for each cluster using the EnrichR tools. Note that the HPC must have access to the internet to run GSEA.
Tool 1	par_top_sel	5	Number of top markers to identify based on avg_log2FC
Tool 1	par_db	Descartes_Cell_Types_and_Tissue_2021, CellMarker_Augmented_2021, Azimuth_Cell_Types_2021	Character vector of EnrichR databases that define cell types. The top marker genes for each cluster will be tested for enrichment across these databases.
Tool 2	par_run_module_score	Yes	Whether or not to compute module score for aggregated expression
Tool 2	par_run_visualize_markers	Yes	Whether or not to visualize the expression of individual genes
Tool 2	par_module_score	NULL	Path to the csv file containing the gene sets for the module score
Tool 2	par_select_features_list	NULL	List of genes whose expression will be visualized individually
Tool 2	par_select_features_csv	NULL	If you want to define multiple lists of features to visualize individually, you can do so with a csv file. The header should contain the list names and all features belonging to the same list should be in the same column.
Tool 3	par_reference	NULL	Path defining the location of the reference Seurat object
Tool 3	par_reference_name	Reference	An arbitrary name for the reference object. This will be used to name the metadata slot.
Tool 3	par_level_celltype	NULL	The name of the metadata column in the reference Seurat object that defines cell types
Tool 3	par_FindTransferAnchors_dim	50	Number of dimensions from linear dimensional reduction used to find transfer anchors between the reference and query Seurat objects
Tool 3	par_futureglobalsmaxSize	60000 * 1024^2	This will increase your RAM usage so set this number mindfully
Annotate	par_annotate_resolution	integrated_snn_res.0.75	Which clustering resolution you want to annotate
Annotate	par_name_metadata	Celltypes1	The name of the metadata slot that will contain the annotations
Annotate	par_annotate_labels	NULL	A list of cluster labels. There must as many labels as clusters at the defined clustering resolution. Please refrain from using "_" when annotating.

Tool 1: Cluster marker GSEA

To run cluster marker GSEA, use the following command:

bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 7 \
--markergsea T

The resulting output files are deposited into ~/working_directory/step7/*/marker. For a description of the outputs see here.

Note: In order to test the cluster marker genes for enrichment across EnrichR libraries, the HPC must have access to the internet. If your HPC cannot access the internet, you must set par_run_enrichR = "no". The pipeline will still run differential gene expression and find the markers for each cluster. You can then take the pipeline output and run the enrichment step on your local machine directly in R. To do so, begin by downloading the ClusterMarkers.csv file obtained from running the above command with par_run_find_marker = "yes" to your computer:

scp username@beluga.computecanada.ca:~/working_directory/step7/info7/marker/ClusterMarkers.csv ~/Desktop/working_directory

Then run the following code in R:

#set up the environment
library(Seurat)
library(dplyr)
library(ggplot2)
library(enrichR)


## set up the parameters
PWD <-'/path/directory/where/outputs/will/be/deposited'
cluster_marker <- '/path/to/ClusterMarkers.csv'
db <- c('Descartes_Cell_Types_and_Tissue_2021','CellMarker_Augmented_2021','Azimuth_Cell_Types_2021')

## set up the function
annotation<-function(PWD,cluster_marker,db) {
  library(Seurat)
  library(dplyr)
  library(ggplot2)
  library(enrichR)
  setwd(PWD)

  cluster_marker <- read.delim(cluster_marker, header = T, sep = ",") 

  dir.create("annot_enrich") 
  for (i in unique(cluster_marker$cluster)) {
    dir.create(paste0(PWD,"/annot_enrich", "/clust",i))
  }

  for (i in unique(cluster_marker$cluster)) {
    for (j in 1:length(db)) { 
    setwd(PWD)
    setwd(paste0('./annot_enrich/clust',i))
    N1.c0 <- cluster_marker %>% filter(cluster == i & avg_log2FC > 0)
    genes <- N1.c0$gene
    N1.c0.Er <- enrichr(genes, databases = db[j])
    if(is.null(N1.c0.Er)) next
    plotEnrich(N1.c0.Er[[1]], showTerms = 20, numChar = 40, y = "Count", orderBy = "P.value") +
      ggtitle(paste0(db[j], " cluster ", i))
    ggsave(file = paste0("plotenrich_clust_",i, "_", j,".pdf"))
    N1.Er.genes.1 <- N1.c0.Er[[1]] %>% dplyr::select(Term, Genes, Combined.Score)
    write.csv(N1.Er.genes.1,paste0("Er_genes_clust_",i,"_",db[j],".csv"))
  }
}
}

## perform enrichment
annotation(PWD,cluster_marker,db)

Note: Users can define whichever libraries they want. For more information regarding the available libraries, see here.

Tool 2: Expression profiling of known marker genes

To profile the expression known marker genes, use the following command:

bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 7 \
--knownmarkers T

The resulting output files are deposited into ~/working_directory/step7/*/visualize_features for individual expression or ~/working_directory/step7/*/module_score for aggregated expression. For a description of the outputs see here.

Tool 3: Reference-based annotation

To perform reference-based annotation, use the following command:

bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 7 \
--referenceannotation T

The resulting output files are deposited into ~/working_directory/step7/*/reference_based_annotation. For a description of the outputs see here.

Adding annotations

To add cluster annotations to the Seurat object's metadata, use the following command:

bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 7 \
--annotate T

The resulting output files are deposited into ~/working_directory/step7/*/annotate. For a description of the outputs see here.

Note: The cluster annotations from each iteration of the step will be retained, allowing users to define multiple clustering levels.