Ensemblex pipeline with prior genotype information


Introduction

This guide illustrates how to use the Ensemblex pipeline to demultiplexed pooled scRNAseq samples with prior genotype information. Here, we will leverage a pooled scRNAseq dataset produced by Jerber et al.. This pool contains induced pluripotent cell lines (iPSC) from 9 healthy controls that were differentiated towards a dopaminergic neuron state. The Ensemblex pipeline is illustrated in the diagram below:

NOTE: To download the necessary files for the tutorial please see the Downloading data section of the Ensemblex documentation.


Installation

[to be completed]

module load StdEnv/2023 module load apptainer/1.2.4


Step 1: Set up

In Step 1, we will set up the working directory for the Ensemblex pipeline and decide which version of the pipeline we want to use.

First, create a dedicated folder for the analysis (hereafter referred to as the working directory). Then, define the path to the working directory and the path to ensemblex.pip:

## Create and navigate to the working directory
cd ensemblex_tutorial
mkdir working_directory
cd ~/ensemblex_tutorial/working_directory

## Define the path to ensemblex.pip
ensemblex_HOME=~/ensemblex.pip

## Define the path to the working directory
ensemblex_PWD=~/ensemblex_tutorial/working_directory

Next, we can set up the working directory and choose the Ensemblex pipeline for demultiplexing with prior genotype information (--step init-GT) using the following code:

bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step init-GT

After running the above code, the working directory should have the following structure:

ensemblex_tutorial
└── working_directory
    ├── demuxalot
    ├── demuxlet
    ├── ensemblex_gt
    ├── input_files
    ├── job_info
    │   ├── configs
    │   │   └── ensemblex_config.ini
    │   ├── logs
    │   └── summary_report.txt
    ├── souporcell
    └── vireo_gt

Upon setting up the Ensemblex pipeline, we can proceed to Step 2 where we will prepare the input files for Ensemblex's constituent genetic demultiplexing tools.


Step 2: Preparation of input files

In Step 2, we will define the necessary files needed for ensemblex's constituent genetic demultiplexing tools and will place them within the working directory.

Note: For the tutorial we will be using the data downloaded in the Downloading data section of the Ensemblex documentation.

First, define all of the required files:

BAM=~/ensemblex_tutorial/CellRanger/outs/possorted_genome_bam.bam

BAM_INDEX=~/ensemblex_tutorial/CellRanger/outs/possorted_genome_bam.bam.bai

BARCODES=~/ensemblex_tutorial/CellRanger/outs/filtered_gene_bc_matrices/refdata-cellranger-GRCh37/barcodes.tsv

SAMPLE_VCF=~/ensemblex_tutorial/sample_genotype/sample_genotype_merge.vcf

REFERENCE_VCF=~/ensemblex_tutorial/reference_files/common_SNPs_only.recode.vcf

REFERENCE_FASTA=~/ensemblex_tutorial/reference_files/genome.fa

REFERENCE_FASTA_INDEX=~/ensemblex_tutorial/reference_files/genome.fa.fai

Next, we will sort the pooled samples and reference .vcf files according to the .bam file and place them within the working directory:

## Sort pooled samples .vcf file
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD/input_files/pooled_samples.vcf --step sort --vcf $SAMPLE_VCF --bam $ensemblex_PWD/input_files/pooled_bam.bam

## Sort reference .vcf file
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD/input_files/reference.vcf --step sort --vcf $SAMPLE_VCF --bam $ensemblex_PWD/input_files/pooled_bam.bam

NOTE: To sort the vcf files we use the pipeline produced by the authors of Demuxlet/Freemuxlet (Kang et al. ).

Next, we will place the remaining necessary files within the working directory:

cp $BAM $ensemblex_PWD/input_files/pooled_bam.bam
cp $BAM_INDEX $ensemblex_PWD/input_files/pooled_bam.bam.bai 
cp $BARCODES $ensemblex_PWD/input_files/pooled_barcodes.tsv
cp $REFERENCE_FASTA $ensemblex_PWD/input_files/reference.fa
cp $REFERENCE_FASTA_INDEX $ensemblex_PWD/input_files/reference.fa.fai

After running the above code, $ensemblex_PWD/input_files should contain the following files:

input_files
├── pooled_bam.bam
├── pooled_bam.bam.bai
├── pooled_barcodes.tsv
├── pooled_samples.vcf
├── reference.fa
├── reference.fa.fai
└── reference.vcf

NOTE: It is important that the file names match those listed above as they are necessary for the Ensemblex pipeline to recognize them.


Step 3: Genetic demultiplexing by constituent tools

In Step 3, we will demultiplex the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools:

First, we will navigate to the ensemblex_config.ini file to adjust the demultiplexing parameters for each of the constituent genetic demultiplexing tools:

## Navigate to the .ini file
cd $ensemblex_PWD/job_info/configs

## Open the .ini file and adjust parameters directly in the terminal
nano ensemblex_config.ini

For the tutorial, we set the following parameters for the constituent genetic demultiplexing tools:

Parameter Value
PAR_demuxalot_genotype_names 'HPSI0115i-hecn_6,HPSI0214i-pelm_3,HPSI0314i-sojd_3,HPSI0414i-sebn_3,HPSI0514i-uenn_3,HPSI0714i-pipw_4,HPSI0715i-meue_5,HPSI0914i-vaka_5,HPSI1014i-quls_2'
PAR_demuxalot_prior_strength 100
PAR_demuxalot_minimum_coverage 200
PAR_demuxalot_minimum_alternative_coverage 10
PAR_demuxalot_n_best_snps_per_donor 100
PAR_demuxalot_genotypes_prior_strength 1
PAR_demuxalot_doublet_prior 0.25
PAR_demuxlet_field GT
PAR_vireo_N 9
PAR_vireo_type GT
PAR_vireo_processes 20
PAR_vireo_minMAF 0.1
PAR_vireo_minCOUNT 20
PAR_vireo_forcelearnGT T
PAR_minimap2 '-ax splice -t 8 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no'
PAR_freebayes '-iXu -C 2 -q 20 -n 3 -E 1 -m 30 --min-coverage 6'
PAR_vartrix_umi TRUE
PAR_vartrix_mapq 30
PAR_vartrix_threads 8
PAR_souporcell_k 9
PAR_souporcell_t 8

Now that the parameters have been defined, we can demultiplex the pools with the constituent genetic demultiplexing tools.


Demuxalot

To run Demuxalot use the following code:

bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxalot

If Demuxalot completed successfully, the following files should be available in $ensemblex_PWD/demuxalot:

demuxalot
    ├── Demuxalot_result.csv
    └── new_snps_single_file.betas

Demuxlet

To run Demuxlet use the following code:

bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxlet

If Demuxlet completed successfully, the following files should be available in $ensemblex_PWD/demuxlet:

demuxlet
    ├── outs.best
    ├── pileup.cel.gz
    ├── pileup.plp.gz
    ├── pileup.umi.gz
    └── pileup.var.gz

Souporcell

To run Souporcell use the following code:

bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step souporcell

If Souporcell completed successfully, the following files should be available in $ensemblex_PWD/souporcell:

souporcell
    ├── alt.mtx
    ├── cluster_genotypes.vcf
    ├── clusters_tmp.tsv
    ├── clusters.tsv
    ├── fq.fq
    ├── minimap.sam
    ├── minitagged.bam
    ├── minitagged_sorted.bam
    ├── minitagged_sorted.bam.bai
    ├── Pool.vcf
    ├── ref.mtx
    └── soup.txt

Vireo

To run Vireo-GT use the following code:

bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step vireo

If Vireo-GT completed successfully, the following files should be available in $ensemblex_PWD/vireo_gt:

vireo_gt
    ├── cellSNP.base.vcf.gz
    ├── cellSNP.cells.vcf.gz
    ├── cellSNP.samples.tsv
    ├── cellSNP.tag.AD.mtx
    ├── cellSNP.tag.DP.mtx
    ├── cellSNP.tag.OTH.mtx
    ├── donor_ids.tsv
    ├── fig_GT_distance_estimated.pdf
    ├── fig_GT_distance_input.pdf
    ├── GT_donors.vireo.vcf.gz
    ├── _log.txt
    ├── prob_doublet.tsv.gz
    ├── prob_singlet.tsv.gz
    └── summary.tsv

Upon demultiplexing the pooled samples with each of Ensemblex's constituent genetic demultiplexing tools, we can proceed to Step 4 where we will process the output files of the consituent tools with the Ensemblex algorithm to generate the ensemble sample classifications

NOTE: To minimize computation time for the tutorial, we have provided the necessary outpu files from the constituent tools here. To access the files and place them in the working directory, use the following code:

## Demuxalot
cd $ensemblex_PWD/demuxalot
wget https://github.com/neurobioinfo/ensemblex/blob/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/Demuxalot_result.csv

## Demuxlet
cd $ensemblex_PWD/demuxlet
wget https://github.com/neurobioinfo/ensemblex/blob/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/outs.best

## Souporcell
cd $ensemblex_PWD/souporcell
wget https://github.com/neurobioinfo/ensemblex/blob/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/clusters.tsv

## Vireo
cd $ensemblex_PWD/vireo_gt
wget https://github.com/neurobioinfo/ensemblex/blob/caad8c250566bfa9a6d7a78b77d2cc338468a58e/tutorial/donor_ids.tsv


Step 4: Application of Ensemblex

In Step 4, we will process the output files of the four constituent genetic demultiplexing tools with the three-step Ensemblex algorithm:

  • Step 1: Probabilistic-weighted ensemble
  • Step 2: Graph-based doublet detection
  • Step 3: Step 3: Ensemble-independent doublet detection

First, we will navigate to the ensemblex_config.ini file to adjust the demultiplexing parameters for the Ensemblex algorithm:

## Navigate to the .ini file
cd $ensemblex_PWD/job_info/configs

## Open the .ini file and adjust parameters directly in the terminal
nano ensemblex_config.ini

For the tutorial, we set the following parameters for the Ensemblex algorithm:

Parameter Value
Pool parameters
PAR_ensemblex_sample_size 9
PAR_ensemblex_expected_doublet_rate 0.10
Set up parameters
PAR_ensemblex_merge_constituents Yes
Step 1 parameters: Probabilistic-weighted ensemble
PAR_ensemblex_probabilistic_weighted_ensemble Yes
Step 2 parameters: Graph-based doublet detection
PAR_ensemblex_preliminary_parameter_sweep No
PAR_ensemblex_nCD NULL
PAR_ensemblex_pT NULL
PAR_ensemblex_graph_based_doublet_detection Yes
Step 3 parameters: Ensemble-independent doublet detection
PAR_ensemblex_preliminary_ensemble_independent_doublet No
PAR_ensemblex_ensemble_independent_doublet Yes
PAR_ensemblex_doublet_Demuxalot_threshold Yes
PAR_ensemblex_doublet_Demuxalot_no_threshold No
PAR_ensemblex_doublet_Demuxlet_threshold No
PAR_ensemblex_doublet_Demuxlet_no_threshold No
PAR_ensemblex_doublet_Souporcell_threshold No
PAR_ensemblex_doublet_Souporcell_no_threshold No
PAR_ensemblex_doublet_Vireo_threshold Yes
PAR_ensemblex_doublet_Vireo_no_threshold No
Confidence score parameters
PAR_ensemblex_compute_singlet_confidence Yes

If Ensemblex completed successfully, the following files should be available in $ensemblex_PWD/ensemblex_gt:

ensemblex_gt
├── confidence
│   └── ensemblex_final_cell_assignment.csv
├── constituent_tool_merge.csv
├── step1
│   ├── ARI_demultiplexing_tools.pdf
│   ├── BA_demultiplexing_tools.pdf
│   ├── Balanced_accuracy_summary.csv
│   └── step1_cell_assignment.csv
├── step2
│   ├── optimal_nCD.pdf
│   ├── optimal_pT.pdf
│   ├── PC1_var_contrib.pdf
│   ├── PC2_var_contrib.pdf
│   ├── PCA1_graph_based_doublet_detection.pdf
│   ├── PCA2_graph_based_doublet_detection.pdf
│   ├── PCA3_graph_based_doublet_detection.pdf
│   ├── PCA_plot.pdf
│   ├── PCA_scree_plot.pdf
│   └── Step2_cell_assignment.csv
└── step3
    ├── Doublet_overlap_no_threshold.pdf
    ├── Doublet_overlap_threshold.pdf
    ├── Number_ensemblex_doublets_EID_no_threshold.pdf
    ├── Number_ensemblex_doublets_EID_threshold.pdf
    └── Step3_cell_assignment.csv

Ensemblex's final assignments are described in the ensemblex_final_cell_assignment.csv file.

Specifically, the ensemblex_assignment column describes Ensemblex's final assignments after application of the singlet confidence threshold (i.e., singlets that fail to meet a singlet confidence of 1.0 are labelled as unassigned); we recomment that users use this column to label their cells for downstream analyses. The ensemblex_best_assignment column describes Ensemblex's best assignments, independent of the singlets confidence threshold (i.e., singlets that fail to meet a singlet confidence of 1.0 are NOT labelled as unassigned).

The cell barcodes listed under the barcode column can be used to add the ensemblex_final_cell_assignment.csv information to the metadata of a Seurat object.


Resource requirements

The following table describes the computational resources used in this tutorial for genetic demultiplexing by the constituent tools and application of the Ensemblex algorithm.

Tool Time CPU Memory
Demuxalot 01:34:59 6 12.95 GB
Demuxlet 03:16:03 6 138.32 GB
Souporcell 2-14:49:21 1 21.83 GB
Vireo 2-01:30:24 6 29.42 GB
Ensemblex 02:05:27 1 5.67 GB