Ensemblex algorithm outputs


Introduction

After applying the Ensemblex algorithm to the output files of the constituent genetic demultiplexing tools in Step 4, the ~/working_directory/ensemblex folder will have the following structure:

working_directory
└── ensemblex
    ├── constituent_tool_merge.csv
    ├── step1
    ├── step2
    ├── step3
    └── confidence
  • constituent_tool_merge.csv is the merged outputs from each constituent genetic demultiplexing tool.
  • step1/ contains the outputs from Step 1: probabilistic-weighted ensemble.
  • step2/ contains the outputs from Step 2: graph-based doublet detection.
  • step3/ contains the outputs from Step 3: ensemble-independent doublet detection.
  • confidence/ contains the final Ensemblex output file, whose sample labels have been annotate with the Ensemblex signlet confidence score.

Note: If users re-run a step of the Ensemblex workflow, the outputs from the previous run will automatically be overwritten. If you do not want to lose the outputs from a previous run, it is important to copy the materials to a separate directory.


Outputs

Merging constituent output files

Ensemblex begins by merging the output files of the constituent genetic demultiplexing tools by cell barcode, which produces the constituent_tool_merge.csv file. In this file, each constituent genetic demultiplexing tool has two columns corresponding to their sample labels:

  • demuxalot_assignment
  • demuxalot_best_assignment
  • demuxlet_assignment
  • demuxlet_best_assignment
  • souporcell_assignment
  • souporcell_best_assignment
  • vireo_assignment
  • vireo_best_assignment

Taking Vireo as an example, vireo_assignment shows Vireo's sample labels after applying its recommended probability threshold; thus, cells that do not meet Vireo's recommended probability threshold will be labeled as "unassigned". In turn, vireo_best_assignment shows Vireo's best guess assignments with out applying the recommended probability threshold; thus, cells that do not meet Vireo's recommended probability threshold will still show the best sample label and will not be labelled as "unassigned".

The constituent_tool_merge.csv file also contains a general_consensus column. This is not Ensemblex's sample labels. The general_consensus column simply shows the sample labels that result from a majority vote classifier; split decisions are labeled as unassigned.


Step 1: Accuracy-weighted probabilistic ensemble

After running Step 1 of the Ensemblex algorithm, the /PWE folder will contain the following files:

working_directory
└── ensemblex
    └── step1
        ├── ARI_demultiplexing_tools.pdf
        ├── BA_demultiplexing_tools.pdf
        ├── Balanced_accuracy_summary.csv
        └── Step1_cell_assignment.csv
Output type Name Description
Figure ARI_demultiplexing_tools.pdf Heatmap showing the Adjusted Rand Index (ARI) between the sample labels of the constituent genetic demultiplexing tools.
Figure BA_demultiplexing_tools.pdf Barplot showing the estimated balanced accuracy for each constituent genetic demultiplexing tool.
File Balanced_accuracy_summary.csv Summary file describing the estimated balanced accuracy computation for each constituent genetic demultiplexing tool.
File Step1_cell_assignment.csv Data file containing Ensemblex's sample labels after Step 1: accuracy-weighted probabilistic ensemble.

The Step1_cell_assignment.csv file contains the following important columns:

  • ensemblex_assignment: Ensemblex sample labels after performing accuracy-weighted probabilistic ensemble.
  • ensemblex_probability: Accuracy-weighted ensemble probability corresponding to Ensemblex's sample labels.

NOTE: Prior to using Ensemblex's sample labels for downstream analyses, we recommend computing the Ensemblex singlet confidence score to identify low confidence singlet assignments that should be removed from the dataset to mitigate the introduction of technical artificats.


Step 2: Graph-based doublet detection

After running Step 2 of the Ensemblex algorithm, the /GBD folder will contain the following files:

working_directory
└── ensemblex
    └── step2
        ├── optimal_nCD.pdf
        ├── optimal_pT.pdf
        ├── PC1_var_contrib.pdf
        ├── PC2_var_contrib.pdf
        ├── PCA1_graph_based_doublet_detection.pdf
        ├── PCA2_graph_based_doublet_detection.pdf
        ├── PCA3_graph_based_doublet_detection.pdf
        ├── PCA_plot.pdf
        ├── PCA_scree_plot.pdf
        └── Step2_cell_assignment.csv
Output type Name Description
Figure optimal_nCD.pdf Dot plot showing the optimal nCD value.
Figure optimal_pT.pdf Dot plot showing the optimal pT value.
Figure PC1_var_contrib.pdf Bar plot showing the contribution of each variable to the variation across the first principal component.
Figure PC2_var_contrib.pdf Bar plot showing the contribution of each variable to the variation across the second principal component.
Figure PCA1_graph_based_doublet_detection.pdf PCA showing Ensemblex sample labels (singlet or doublet) prior to performing graph-based doublet detection.
Figure PCA2_graph_based_doublet_detection.pdf PCA showing the cells identified as the n most confident doublets in the pool.
Figure PCA3_graph_based_doublet_detection.pdf PCA showing Ensemblex sample labels (singlet or doublet) after performing graph-based doublet detection.
Figure PCA_plot.pdf PCA of pooled cells.
Figure PCA_scree_plot.pdf Bar plot showing the variance explained by each principal component.
File Step2_cell_assignment.csv Data file containing Ensemblex's sample labels after Step 2: graph-based doublet detection.

The Step2_cell_assignment.csv file contains the following important column:

  • ensemblex_assignment: Ensemblex sample labels after performing graph-based doublet detection.

NOTE: Prior to using Ensemblex's sample labels for downstream analyses, we recommend computing the Ensemblex singlet confidence score to identify low confidence singlet assignments that should be removed from the dataset to mitigate the introduction of technical artificats.


Step 3: Ensemble-independent doublet detection

After running Step 3 of the Ensemblex algorithm, the /EID folder will contain the following files:

working_directory
└── ensemblex
    └── step3
        ├── Doublet_overlap_no_threshold.pdf
        ├── Doublet_overlap_threshold.pdf
        ├── Number_ensemblex_doublets_EID_no_threshold.pdf
        ├── Number_ensemblex_doublets_EID_threshold.pdf
        └── Step3_cell_assignment.csv

Output type Name Description
Figure Doublet_overlap_no_threshold.pdf Proportion of doublet calls overlapping between constituent genetic demultiplexing tools without applying assignment probability thresholds.
Figure Doublet_overlap_threshold.pdf Proportion of doublet calls overlapping between constituent genetic demultiplexing tools after applying assignment probability thresholds.
Figure Number_ensemblex_doublets_EID_no_threshold.pdf Number of cells that would be labelled as doublets by Ensemblex if a constituent tool was nominated for ensemble-independent doublet detection, without applying assignment probability thresholds.
Figure Number_ensemblex_doublets_EID_threshold.pdf Number of cells that would be labelled as doublets by Ensemblex if a constituent tool was nominated for ensemble-independent doublet detection, after applying assignment probability thresholds.
File Step3_cell_assignment.csv Data file containing Ensemblex's sample labels after Step 3: ensemble-independent doublet detection.

The Step3_cell_assignment.csv file contains the following important column:

  • ensemblex_assignment: Ensemblex sample labels after performing ensemble-independent doublet detection.

NOTE: Prior to using Ensemblex's sample labels for downstream analyses, we recommend computing the Ensemblex singlet confidence score to identify low confidence singlet assignments that should be removed from the dataset to mitigate the introduction of technical artificats.


Singlet confidence score

After computing the Ensemblex singlet confidence score, the /confidence folder will contain the following file:

working_directory
└── ensemblex
    └── confidence
        └── ensemblex_final_cell_assignment.csv


Output type Name Description
File ensemblex_final_cell_assignment.csv Data file containing Ensemblex's final sample labels after computing the singlet confidence score.

The ensemblex_final_cell_assignment.csv file contains the following important column:

  • ensemblex_assignment: Ensemblex sample labels after applying the recommended singlet confidence score threshold; singlets with a confidence score < 1 are labeled as "unassigned".
  • ensemblex_best_assignment: Ensemblex's best guess assignments with out applying the recommended confidence score threshold; singlets with a confidence score < 1 will still show the best sample label and will not be labelled as "unassigned".
  • ensemblex_singlet_confidence: Ensemblex singlet confidence score.

NOTE: We recommend using the sample labels from ensemblex_assignment for downstream analyses.