HTO analysis track: PBMC dataset

Contents


Introduction

This guide illustrates the steps taken for our analysis of the PBMC dataset in our pre-print manuscript. Here, we are using the HTO analysis track of scRNAbox to analyze a publicly available scRNAseq dataset produced by Stoeckius et al.. This data set describes peripheral blood mononuclear cells (PBMC) from eight human donors, which were tagged with sample-specific barcodes, pooled, and sequenced together in a single run.


Downloading the PBMC dataset

In you want to use the PBMC dataset to test the scRNAbox pipeline, please see here for detialed instructions on how to download the publicly available data.


Installation

scrnabox.slurm installation

To download the latest version of scrnabox.slurm (v0.1.52.50) run the following command:

wget https://github.com/neurobioinfo/scrnabox/releases/download/v0.1.52.5/scrnabox.slurm.zip
unzip scrnabox.slurm.zip

For a description of the options for running scrnabox.slurm run the following command:

bash /pathway/to/scrnabox.slurm/launch_scrnabox.sh -h 

If the scrnabox.slurm has been installed properly, the above command should return the folllowing:

scrnabox pipeline version 0.1.52.50
------------------- 
mandatory arguments:
                -d  (--dir)  = Working directory (where all the outputs will be printed) (give full path)
                --steps  =  Specify what steps, e.g., 2 to run step 2. 2-6, run steps 2 through 6

        optional arguments:
                -h  (--help)  = See helps regarding the pipeline arguments. 
                --method  = Select your preferred method: HTO and SCRNA for hashtag, and Standard scRNA, respectively. 
                --msd  = You can get the hashtag labels by running the following code (HTO Step 4). 
                --markergsea  = Identify marker genes for each cluster and run marker gene set enrichment analysis (GSEA) using EnrichR libraries (Step 7). 
                --knownmarkers  = Profile the individual or aggregated expression of known marker genes. 
                --referenceannotation  = Generate annotation predictions based on the annotations of a reference Seurat object (Step 7). 
                --annotate  = Add clustering annotations to Seurat object metadata (Step 7). 
                --addmeta  = Add metadata columns to the Seurat object (Step 8). 
                --rundge  = Perform differential gene expression contrasts (Step 8). 
                --seulist  = You can directly call the list of Seurat objects to the pipeline. 
                --rcheck  = You can identify which libraries are not installed.  

 ------------------- 
 For a comprehensive help, visit  https://neurobioinfo.github.io/scrnabox/site/ for documentation. 

CellRanger installation

For information regarding the installation of CellRanger, please visit the 10X Genomics documentation. If CellRanger is already installed on your HPC system, you may skip the CellRanger installation procedures.

For our analysis of the midbrain dataset we used the 10XGenomics GRCh38-3.0.0 reference genome and CellRanger v5.0.1. For more information regarding how to prepare reference genomes for the CellRanger counts pipeline, please see the 10X Genomics documentation.


R library preparation and R package installation

We must prepapre a common R library where we will load all of the required R packages. If the required R packages are already installed on your HPC system in a common R library, you may skip the following procedures.

We will first install R. The analyses presented in our pre-print manuscript were conducted using v4.2.1.

# install R
module load r/4.2.1

Then, we will run the installation code, which creates a directory where the R packages will be loaded and will install the required R packages:

# Folder for R packages 
R_PATH=~/path/to/R/library
mkdir -p $R_PATH

# Install package
Rscript ./scrnabox.slurm/soft/R/install_packages.R $R_PATH

scRNAbox pipeline

Step 0: Set up

Now that scrnabox.slurm, CellRanger, R, and the required R packages have been installed, we can proceed to our analysis with the scRNAbox pipeline. We will create a pipeline folder designated for the analysis and run Step 0, selecting the HTO analysis track (--method HTO), using the following code:

mkdir pipeline
cd pipeline

export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm
export SCRNABOX_PWD=~/pipeline

bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 0 \
--method HTO

Next, we will navigate to the scrnabox_config.ini file in ~/pipeline/job_info/configs to define the HPC account holder (ACCOUNT), the path to the environmental module (MODULEUSE), the path to CellRanger from the environmental module directory (CELLRANGER), CellRanger version (CELLRANGER_VERSION), R version (R_VERSION), and the path to the R library (R_LIB_PATH):

cd ~/pipeline/job_info/configs
nano scrnabox_config.ini

ACCOUNT=account-name
MODULEUSE=/path/to/environmental/module 
CELLRANGER=/path/to/cellranger/from/module/directory 
CELLRANGER_VERSION=5.0.1
R_VERSION=4.2.1
R_LIB_PATH=/path/to/R/library

Next, we can check to see if all of the required R packages have been properly installed using the following command:

bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 0 \
--rcheck 

Step 1: FASTQ to gene expression matrix

In Step 1, we will run the CellRanger counts pipeline to generate feature-barcode expression matrices from the FASTQ files. While it is possible to manually prepare the library.csv and feature_ref.csv files for the sequencing run prior to running Step 1, for this analysis we are going to opt for automated library preparation. For more information regarding the manual prepartion of library.csv and feature_ref.csv files, please see the the CellRanger library preparation tutorial.

For our analysis of the PBMC dataset we set the following execution parameters for Step 1 (~/pipeline/job_info/parameters/step1_par.txt):

Parameter Value
par_automated_library_prep Yes
par_fastq_directory /path/to/directory/contaning/fastqs
par_RNA_run_names run1GEX
par_HTO_run_names run1HTO
par_seq_run_names run1
par_paired_end_seq Yes
par_id Hash1, Hash2, Hash3, Hash4, Hash5, Hash6, Hash7, Hash8
par_name A_TotalSeqA, B_TotalSeqA, C_TotalSeqA, D_TotalSeqA, E_TotalSeqA, F_TotalSeqA, G_TotalSeqA, H_TotalSeqA
par_read R2
par_pattern 5P(BC)
par_sequence AGGACCATCCAA, ACATGTTACCGT, AGCTTACTATCC, TCGATAATGCGA, GAGGCTGAGCTA, GTGTGACGTATT, ACTGTCTAACGG, TATCACATCGGT
par_ref_dir_grch ~/genome/10xGenomics/refdata-cellranger-GRCh38-3.0.0
par_r1_length NULL (commented out)
par_r2_length NULL (commented out)
par_mempercode 30
par_include_introns NULL (commented out)
par_no_target_umi_filter NULL (commented out)
par_expect_cells NULL (commented out)
par_force_cells NULL (commented out)
par_no_bam NULL (commented out)

Note: The parameters file for each step is located in ~/pipeline/job_info/parameters. For a comprehensive description of the execution parameters for each step see here.

Given that CellRanger runs a user interface and is not submitted as a Job, it is recommended to run Step 1 in a 'screen' which will allow the the task to keep running if the connection is broken. To run Step 1, use the following command:

export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm
export SCRNABOX_PWD=~/pipeline

screen -S run_PBMC_application_case
bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 1

The outputs of the CellRanger counts pipeline are deposited into ~/pipeline/step1.


Step 2: Create Seurat object and remove ambient RNA

In Step 2, we are going to begin by correcting the RNA assay for ambient RNA removal using SoupX (Young et al. 2020). We will then use the the ambient RNA-corrected feature-barcode matrices to create a Seurat object.

For our analysis of the PBMC dataset we set the following execution parameters for Step 2 (~/pipeline/job_info/parameters/step2_par.txt):

Parameter Value
par_save_RNA Yes
par_save_metadata Yes
par_ambient_RNA Yes
par_normalization.method LogNormalize
par_scale.factor 10000
par_selection.method vst
par_nfeatures 2500

We can run Step 2 using the following code:

export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm
export SCRNABOX_PWD=~/pipeline

bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 2

Step 2 produces the following outputs:

~/pipeline
step2
├── figs2
│   ├── ambient_RNA_estimation_run1.pdf
│   ├── ambient_RNA_markers_run1.pdf
│   ├── cell_cyle_dim_plot_run1.pdf
│   ├── vioplot_run1.pdf
│   └── zoomed_in_vioplot_run1.pdf
├── info2
│   ├── estimated_ambient_RNA_run1.txt
│   ├── MetaData_1.txt
│   ├── meta_info_1.txt
│   ├── run1_ambient_rna_summary.rds
│   ├── sessionInfo.txt
│   ├── seu1_RNA.txt
│   └── summary_seu1.txt
├── objs2
│   └── run1.rds
└── step2_ambient
    └── run1
        ├── barcodes.tsv
        ├── genes.tsv
        └── matrix.mtxs 

Note: For a comprehensive description of the outputs for each analytical step, please see the Outputs section of the scRNAbox documentation.


Figure 1. Figures produced by Step 2 of the scRNAbox pipeline. A) Estimated ambient RNA contamination rate (Rho) by SoupX. Estimates of the RNA contamination rate using various estimators are visualized via a frequency distribution; the true contamination rate is assigned as the most frequent estimate (red line; 8.7%). B) Log10 ratios of observed counts to expected counts for marker genes from each cluster. Clusters are defined by the CellRanger counts pipeline. The red line displays the estimated RNA contamination rate if the estimation was based entirely on the corresponding gene. C) Principal component analysis (PCA) of Seurat S and G2M cell cycle reference genes. D) Violin plots showing the distribution of cells according to quality control metrics calculated in Step 2. E) Zoomed in violin plots, from the minimum to the mean, showing the distribution of cells according to quality control metrics calculated in Step 2.


Step 3: Quality control and filtering

In Step 3, we are going to perform quality control procedures and filter out low quality cells. We are going to filter out cells with < 50 unique RNA transcripts, > 6000 unique RNA transcripts, < 200 total RNA transcripts, > 7000 total RNA transcripts, and > 50% mitochondria.

For our analysis of the PBMC dataset we set the following execution parameters for Step 3 (~/pipeline/job_info/parameters/step2_par.txt):

Parameter Value
par_save_RNA Yes
par_save_metadata Yes
par_seurat_object NULL
par_nFeature_RNA_L 50
par_nFeature_RNA_U 6000
par_nCount_RNA_L 200
par_nCount_RNA_U 7000
par_mitochondria_percent_L 0
par_mitochondria_percent_U 50
par_ribosomal_percent_L 0
par_ribosomal_percent_U 100
par_remove_mitochondrial_genes No
par_remove_ribosomal_genes No
par_remove_genes NULL
par_regress_cell_cycle_genes Yes
par_normalization.method LogNormalize
par_scale.factor 10000
par_selection.method vst
par_nfeatures 2500
par_top 10
par_npcs_pca 30

We can run Step 3 using the following code:

export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm
export SCRNABOX_PWD=~/pipeline

bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 3

Step 3 produces the following outputs.

step3
├── figs3
│   ├── dimplot_pca_run1.pdf
│   ├── elbowplot_run1.pdf
│   ├── filtered_QC_vioplot_run1.pdf
│   └── VariableFeaturePlot_run1.pdf
├── info3
│   ├── MetaData_run1.txt
│   ├── meta_info_run1.txt
│   ├── most_variable_genes_run1.txt
│   ├── run1_RNA.txt
│   ├── sessionInfo.txt
│   └── summary_run1.txt
└── objs3
    └── run1.rds


Figure 2. Figures produced by Step 3 of the scRNAbox pipeline. A) Violin plots showing the distribution of cells according to quality control metrics after filtering by user-defined thresholds. B) Scatter plot showing the top 2500 most variable features; the top 10 most variable features are labelled. C) Principal component analysis (PCA) visualizing the first two principal component (PC). D) Elbow plot to visualize the percentage of variance explained by each PC.


Step 4: Demultiplexing and doublet detection

In Step 4, we are going to demultiplex the pooled samples and remove doublets (erroneous libraries produced by two or more cells) based on the expression of the sample-specific barcodes (antibody assay).

If the barcode labels used in the analysis are unknown, the first step is to retrieve them from the Seurat object. To do this, we do not need to modify the execution parameters and can go straight to running the following code:

export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm
export SCRNABOX_PWD=~/pipeline

bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 4 \
--msd T 

The above code produces the following file:

step4
├── figs4
├── info4
│   └── seu1.rds_old_antibody_label_MULTIseqDemuxHTOcounts.csv
└── objs4

Which contains the names of the barcode labels (i.e. A_TotalSeqA, B_TotalSeqA, C_TotalSeqA, D_TotalSeqA, E_TotalSeqA, F_TotalSeqA, G_TotalSeqA, H_TotalSeqA, Doublet, Negative).

Now that we know the barcode labels used in the PBMC dataset, we can perform demultiplexing and doublet detection.

For our analysis of the PBMC dataset we set the following execution parameters for Step 4 (~/pipeline/job_info/parameters/step4_par.txt):

Parameter Value
par_save_RNA Yes
par_save_metadata Yes
par_normalization.method CLR
par_scale.factor 10000
par_selection.method vst
par_nfeatures 2500
par_dimensionality_reduction Yes
par_npcs_pca 30
par_dims_umap 3
par_n.neighbor 65
par_dropDN Yes
par_label_dropDN Doublet, Negative
par_quantile 0.9
par_autoThresh TRUE
par_maxiter 5
par_RidgePlot_ncol 3
par_old_antibody_label A-TotalSeqA, B-TotalSeqA, C-TotalSeqA, D-TotalSeqA, E-TotalSeqA, F-TotalSeqA, G-TotalSeqA, H-TotalSeqA, Doublet
par_new_antibody_label sample-A, sample-B, sample-C, sample-D, sample-E, sample-F, sample-G, sample-H, Doublet

We can run Step 4 using the following code:

export SCRNABOX_HOME=~/scrnabox/scrnabox.slurm
export SCRNABOX_PWD=~/pipeline

bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 4

Step 4 produces the following outputs.

step4
├── figs4
│   ├── run1_DotPlot_HTO_MSD.pdf
│   ├── run1_Heatmap_HTO_MSD.pdf
│   ├── run1_HTO_dimplot_pca.pdf
│   ├── run1_HTO_dimplot_umap.pdf
│   ├── run1_nCounts_RNA_MSD.pdf
│   └── run1_Ridgeplot_HTO_MSD.pdf
├── info4
│   ├── run1_filtered_MULTIseqDemuxHTOcounts.csv
│   ├── run1_MetaData.txt
│   ├── run1_meta_info_.txt
│   ├── run1_MULTIseqDemuxHTOcounts.csv
│   ├── run1_RNA.txt
│   └── sessionInfo.txt
└── objs4
    └── run1.rds


Figure 3. Figures produced by Step 4 of the Cell Hashtag Analysis Track. A) Uniform Manifold Approximation and Projections (UMAP) plot, taking the first three pricipal components (PC) of the antibody assay as input. B) Principal component analysis (PCA) showing the first two PCs of the antibody assay. C) Ridgeplot visualizing the enrichment of barcode labels across sample assignments at the sample level. D) Dot plot visualizing the enrichment of barcode labels across sample assignments at the sample level. E) Heatmap visualizing the enrichment of barcode labels across sample assignments at the cel level. D) Violin plot visualizing the distribution of the number of total RNA transcripts identified per cell, startified by sample assignment.


Publication-ready figures

The code used to produce the publication-ready figures used in our pre-print manuscript is avaliable here here.


Job Configurations

The following job configurations were used for our analysis of the PBMC dataset. Job Configurations can be modified for each analytical step in the scrnabox_config.ini file in ~/pipeline/job_info/configs

Step THREADS_ARRAY MEM_ARRAY WALLTIME_ARRAY
Step2 4 16g 00-05:00
Step3 4 16g 00-05:00
Step4 4 16g 00-05:00