Step 4: Application of Ensemblex


Introduction

In Step 4, we will process the output files from the constituent genetic demultiplexing tools with the Ensemblex framework. Ensemblex processes the output files in a three-step pipeline to identify the most probable sample label for each cell based on the predictions of the constituent tools:

Step 1: Probabilistic-weighted ensemble
In Step 1, Ensemblex utilizes an unsupervised weighting model to identify the most probable sample label for each cell. Ensemblex weighs each constituent tool’s assignment probability distribution by its estimated balanced accuracy for the dataset. The weighted assignment probabilities across all four constituent tools are then used to inform the most probable sample label for each cell.

Step 2: Graph-based doublet detection
In Step 2, Ensemblex utilizes a graph-based approach to identify doublets that were incorrectly labeled as singlets in Step 1. Pooled cells are embedded into PCA space and the most confident doublets in the pool (nCD) are identified. Then, based on the Euclidean distance in PCA space, the pooled cells that surpass the percentile threshold (pT) of the nearest neighbour frequency to the confident doublets are labelled as doublets by Ensemblex. Ensemblex performs an automated parameter sweep to identify the optimal nCD and pT values; however, user can opt to manually define these parameters.

Step 3: Ensemble-independent doublet detection
In Step 3, Ensemblex utilizes an ensemble-independent approach to further improve doublet detection. Here, cells that are labelled as doublets by Demuxalot or Vireo are labelled as doublets by Ensemblex; however, users can nominate different tools to utilize for Step 3, depending on the desired doublet detection stringency.


Ensemblex parameters

Users can choose to run each step of the Ensemblex framework sequentially (Steps 1 to 3) or can opt to skip certain steps. While Step 1 is necessary to generate the ensemble sample labels, Steps 2 and 3 were implemented to improve Ensemblex's ability to identify doublets; thus, if users do not want to prioritize doublet detection, they may skip Steps 2 and/or 3. Nonetheless, we demonstrated in our pre-print manuscript that utilizing the entire Ensemblex framework is important for maximizing the demultiplexing accuracy. Users can define which steps of the Ensemblex framework they want to utilize in the adjustable parameters file.

The adjustable parameters file (ensemblex_config.ini) is located in ~/working_directory/job_info/configs/. For a comprehensive description of how to adjust the analytical parameters of the Ensemblex pipeline please see Execution parameters. The following parameters are adjustable when applying the Ensemblex algorithm:

Parameter Default Description
Pool parameters
PAR_ensemblex_sample_size NULL Number of samples multiplexed in the pool.
PAR_ensemblex_expected_doublet_rate NULL Expected doublet rate for the pool. If using 10X Genomics, the expected doublet rate can be estimated based on the number of recovered cells. For more information see 10X Genomics Documentation.
Set up parameters
PAR_ensemblex_merge_constituents Yes Whether or not to merge the output files of the constituent demultiplexing tools. If running Ensemblex on a pool for the first time, this parameter should be set to "Yes". Subsequent runs of ensemblex (e.g., parameter optimization) can have this parameter set to "No" as the pipeline will automatically detect the previously generated merged file.
Step 1 parameters: Probabilistic-weighted ensemble
PAR_ensemblex_probabilistic_weighted_ensemble Yes Whether or not to perform Step 1: Probabilistic-weighted ensemble. If running Ensemblex on a pool for the first time, this parameter should be set to "Yes". Subsequent runs of ensemblex (e.g., parameter optimization) can have this parameter set to "No" as the pipeline will automatically detect the previously generated Step 1 output file.
Step 2 parameters: Graph-based doublet detection
PAR_ensemblex_preliminary_parameter_sweep No Whether or not to perform a preliminary parameter sweep for Step 2: Graph-based doublet detection. Users should utilize the preliminary parameter sweep if they wish to manually define the number of confident doublets in the pool (nCD) and the percentile threshold of the nearest neighour frequency (pT), which can be defined in the following two parameters, respectively.
PAR_ensemblex_nCD NULL Manually defined number of confident doublets in the pool (nCD). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to "Yes".
PAR_ensemblex_pT NULL Manually defined percentile threshold of the nearest neighour frequency (pT). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to "Yes".
PAR_ensemblex_graph_based_doublet_detection Yes Whether or not to perform Step 2: Graph-based doublet detection. If PAR_ensemblex_nCD and PAR_ensemblex_pT are not defined by the user (NULL), Ensemblex will automatically determine the optimal parameter values using an unsupervised parameter sweep. If PAR_ensemblex_nCD and PAR_ensemblex_pT are defined by the user, graph-based doublet detection will be performed with the user-defined values.
Step 3 parameters: Ensemble-independent doublet detection
PAR_ensemblex_preliminary_ensemble_independent_doublet No Whether or not to perform a preliminary parameter sweep for Step 3: Ensemble-independent doublet detection. Users should utilize the preliminary parameter sweep if they wish to manually define which constituent tools to utilize for ensemble-independent doublet detection. Users can define which tools to utilize for ensemble-independent doublet detection in the following parameters.
PAR_ensemblex_ensemble_independent_doublet Yes Whether or not to perform Step 3: Ensemble-independent doublet detection.
PAR_ensemblex_doublet_Demuxalot_threshold Yes Whether or not to label doublets identified by Demuxalot as doublets. Only doublets with assignment probabilities exceeding Demuxalot's recommended probability threshold will be labeled as doublets by Ensemblex.
PAR_ensemblex_doublet_Demuxalot_no_threshold No Whether or not to label doublets identified by Demuxalot as doublets, regardless of the corresponding assignment probability.
PAR_ensemblex_doublet_Demuxlet_threshold No Whether or not to label doublets identified by Demuxlet as doublets. Only doublets with assignment probabilities exceeding Demuxlet's recommended probability threshold will be labeled as doublets by Ensemblex.
PAR_ensemblex_doublet_Demuxlet_no_threshold No Whether or not to label doublets identified by Demuxlet as doublets, regardless of the corresponding assignment probability.
PAR_ensemblex_doublet_Souporcell_threshold No Whether or not to label doublets identified by Souporcell as doublets. Only doublets with assignment probabilities exceeding Souporcell's recommended probability threshold will be labeled as doublets by Ensemblex.
PAR_ensemblex_doublet_Souporcell_no_threshold No Whether or not to label doublets identified by Souporcell as doublets, regardless of the corresponding assignment probability.
PAR_ensemblex_doublet_Vireo_threshold Yes Whether or not to label doublets identified by Vireo as doublets. Only doublets with assignment probabilities exceeding Vireo's recommended probability threshold will be labeled as doublets by Ensemblex.
PAR_ensemblex_doublet_Vireo_no_threshold No Whether or not to label doublets identified by Vireo as doublets, regardless of the corresponding assignment probability.
Confidence score parameters
PAR_ensemblex_compute_singlet_confidence Yes Whether or not to compute Ensemblex's singlet confidence score. This will define low confidence assignments which should be removed from downstream analyses.

Applying the Ensemblex algorithm

To apply the Ensemblex algorithm use the following code:

ensemblex_HOME=/path/to/ensemblex.pip
ensemblex_PWD=/path/to/working_directory

bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step ensemblexing

If the ensemblex algorithm completed successfully, the following files should be available in ~/working_directory/ensemblex

working_directory
└── ensemblex
    ├── confidence
    │   └── ensemblex_final_cell_assignment.csv
    ├── constituent_tool_merge.csv
    ├── step1
    │   ├── ARI_demultiplexing_tools.pdf
    │   ├── BA_demultiplexing_tools.pdf
    │   ├── Balanced_accuracy_summary.csv
    │   └── step1_cell_assignment.csv
    ├── step2
    │   ├── optimal_nCD.pdf
    │   ├── optimal_pT.pdf
    │   ├── PC1_var_contrib.pdf
    │   ├── PC2_var_contrib.pdf
    │   ├── PCA1_graph_based_doublet_detection.pdf
    │   ├── PCA2_graph_based_doublet_detection.pdf
    │   ├── PCA3_graph_based_doublet_detection.pdf
    │   ├── PCA_plot.pdf
    │   ├── PCA_scree_plot.pdf
    │   └── Step2_cell_assignment.csv
    └── step3
        ├── Doublet_overlap_no_threshold.pdf
        ├── Doublet_overlap_threshold.pdf
        ├── Number_Ensemblux_doublets_EID_no_threshold.pdf
        ├── Number_Ensemblux_doublets_EID_threshold.pdf
        └── Step3_cell_assignment.csv

For a comprehensive description of the Ensemblex algorithm output files, please see Ensemblex outputs.