Adjustable execution parameters for the Ensemblex pipeline
- Introduction
- How to modify the parameter files
- Constituent genetic demultiplexing tools with prior genotype information
- Constituent genetic demultiplexing tools without prior genotype information
- Ensemblex algorithm
Introduction
Prior to running the Ensemblex pipeline, users should modify the execution parameters for the constituent genetic demultiplexing tools and the Ensemblex algorithm. Upon running Step 1: Set up, a /job_info
folder will be created in the wording directory. Within the /job_info
folder is a /configs
folder which contains the ensemblex_config.ini
; this .ini file contains all of the adjustable parameters for the Ensemblex pipeline.
working_directory
└── job_info
├── configs
│ └── ensemblex_config.ini
├── logs
└── summary_report.txt
To ensure replicability, the execution parameters are documented in ~/working_directory/job_info/summary_report.txt
.
How to modify the parameter files
The following section illustrates how to modify the ensemblex_config.ini
parameter file directly from the terminal. To begin, navigate to the /configs
folder and view its contents:
cd ~/working_directory/job_info/configs
ls
The following file will be available: ensemblex_config.ini
To modify the ensemblex_config.ini
parameter file directly in the terminal we will use Nano:
nano ensemblex_config.ini
This will open ensemblex_config.ini
in the terminal and allow users to modify the parameters. To save the modifications and exit the parameter file, type ctrl+o
followed by ctrl+x
.
Constituent genetic demultiplexing tools with prior genotype information
Demuxalot
The following parameters are adjustable for Demuxalot:
Parameter | Default | Description |
---|---|---|
PAR_demuxalot_genotype_names | NULL | List of Sample ID's in the sample VCF file (e.g., 'Sample_1,Sample_2,Sample_3'). |
PAR_demuxalot_minimum_coverage | 200 | Minimum read coverage. |
PAR_demuxalot_minimum_alternative_coverage | 10 | Minimum alternative read coverage. |
PAR_demuxalot_n_best_snps_per_donor | 100 | Number of best snps for each donor to use for demultiplexing. |
PAR_demuxalot_genotypes_prior_strength | 1 | Genotype prior strength. |
PAR_demuxalot_doublet_prior | 0.25 | Doublet prior strength. |
Demuxlet
The following parameters are adjustable for Demuxlet:
Parameter | Default | Description |
---|---|---|
PAR_demuxlet_field | GT | Field to extract the genotypes (GT), genotype likelihood (PL), or posterior probability (GP) from the sample .vcf file. |
NOTE: We are currently working on expanding the execution parameters for Demuxlet.
Vireo
The following parameters are adjustable for Vireo:
Parameter | Default | Description |
---|---|---|
PAR_vireo_N | NULL | Number of pooled samples. |
PAR_vireo_type | GT | Field to extract the genotypes (GT), genotype likelihood (PL), or posterior probability (GP) from the sample .vcf file. |
PAR_vireo_processes | 20 | Number of subprocesses for computing. |
PAR_vireo_minMAF | 0.1 | Minimum minor allele frequency. |
PAR_vireo_minCOUNT | 20 | Minimum aggregated count. |
PAR_vireo_forcelearnGT | T | Whether or not to treat donor GT as prior only. |
NOTE: We are currently working on expanding the execution parameters for Vireo.
Souporcell
The following parameters are adjustable for Souporcell:
Parameter | Default | Description |
---|---|---|
PAR_minimap2 | -ax splice -t 8 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no | For information regarding the minimap2 parameters, please see the documentation. |
PAR_freebayes | -iXu -C 2 -q 20 -n 3 -E 1 -m 30 --min-coverage 6 | For information regarding the freebayes parameters, please see the documentation. |
PAR_vartrix_umi | TRUE | Whether or no to consider UMI information when populating coverage matrices. |
PAR_vartrix_mapq | 30 | Minimum read mapping quality. |
PAR_vartrix_threads | 8 | Number of threads for computing. |
PAR_souporcell_k | NULL | Number of pooled samples. |
PAR_souporcell_t | 8 | Number of threads for computing. |
NOTE: We are currently working on expanding the execution parameters for Souporcell.
Constituent genetic demultiplexing tools without prior genotype information
Demuxalot
The following parameters are adjustable for Demuxalot:
Parameter | Default | Description |
---|---|---|
PAR_demuxalot_genotype_names | NULL | List of Sample ID's in the sample VCF file generated by Freemuxlet: outs.clust1.vcf (e.g., 'CLUST0,CLUST1,CLUST2'). |
PAR_demuxalot_minimum_coverage | 200 | Minimum read coverage. |
PAR_demuxalot_minimum_alternative_coverage | 10 | Minimum alternative read coverage. |
PAR_demuxalot_n_best_snps_per_donor | 100 | Number of best snps for each donor to use for demultiplexing. |
PAR_demuxalot_genotypes_prior_strength | 1 | Genotype prior strength. |
PAR_demuxalot_doublet_prior | 0.25 | Doublet prior strength. |
Freemuxlet
The following parameters are adjustable for Freemuxlet:
Parameter | Default | Description |
---|---|---|
PAR_freemuxlet_nsample | NULL | Number of pooled samples. |
NOTE: We are currently working on expanding the execution parameters for Freemuxlet.
Vireo
The following parameters are adjustable for Vireo:
Parameter | Default | Description |
---|---|---|
PAR_vireo_N | NULL | Number of pooled samples. |
PAR_vireo_processes | 20 | Number of subprocesses for computing. |
PAR_vireo_minMAF | 0.1 | Minimum minor allele frequency. |
PAR_vireo_minCOUNT | 20 | Minimum aggregated count. |
NOTE: We are currently working on expanding the execution parameters for Vireo.
Souporcell
The following parameters are adjustable for Souporcell:
Parameter | Default | Description |
---|---|---|
PAR_minimap2 | -ax splice -t 8 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no | For information regarding the minimap2 parameters, please see the documentation. |
PAR_freebayes | -iXu -C 2 -q 20 -n 3 -E 1 -m 30 --min-coverage 6 | For information regarding the freebayes parameters, please see the documentation. |
PAR_vartrix_umi | TRUE | Whether or no to consider UMI information when populating coverage matrices. |
PAR_vartrix_mapq | 30 | Minimum read mapping quality. |
PAR_vartrix_threads | 8 | Number of threads for computing. |
PAR_souporcell_k | NULL | Number of pooled samples. |
PAR_souporcell_t | 8 | Number of threads for computing. |
NOTE: We are currently working on expanding the execution parameters for Souporcell.
Ensemblex
The following parameters are adjustable for the Ensemblex algorithm:
Parameter | Default | Description |
---|---|---|
Pool parameters | ||
PAR_ensemblex_sample_size | NULL | Number of samples multiplexed in the pool. |
PAR_ensemblex_expected_doublet_rate | NULL | Expected doublet rate for the pool. If using 10X Genomics, the expected doublet rate can be estimated based on the number of recovered cells. For more information see 10X Genomics Documentation. |
Set up parameters | ||
PAR_ensemblex_merge_constituents | Yes | Whether or not to merge the output files of the constituent demultiplexing tools. If running Ensemblex on a pool for the first time, this parameter should be set to "Yes". Subsequent runs of Ensemblex (e.g., parameter optimization) can have this parameter set to "No" as the pipeline will automatically detect the previously generated merged file. |
Step 1 parameters: Probabilistic-weighted ensemble | ||
PAR_ensemblex_probabilistic_weighted_ensemble | Yes | Whether or not to perform Step 1: Probabilistic-weighted ensemble. If running Ensemblex on a pool for the first time, this parameter should be set to "Yes". Subsequent runs of Ensemblex (e.g., parameter optimization) can have this parameter set to "No" as the pipeline will automatically detect the previously generated Step 1 output file. |
Step 2 parameters: Graph-based doublet detection | ||
PAR_ensemblex_preliminary_parameter_sweep | No | Whether or not to perform a preliminary parameter sweep for Step 2: Graph-based doublet detection. Users should utilize the preliminary parameter sweep if they wish to manually define the number of confident doublets in the pool (nCD) and the percentile threshold of the nearest neighour frequency (pT), which can be defined in the following two parameters, respectively. |
PAR_ensemblex_nCD | NULL | Manually defined number of confident doublets in the pool (nCD). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to "Yes". |
PAR_ensemblex_pT | NULL | Manually defined percentile threshold of the nearest neighour frequency (pT). Value can be informed by the output files generated by setting PAR_ensemblex_preliminary_parameter_sweep to "Yes". |
PAR_ensemblex_graph_based_doublet_detection | Yes | Whether or not to perform Step 2: Graph-based doublet detection. If PAR_ensemblex_nCD and PAR_ensemblex_pT are not defined by the user (NULL), Ensemblex will automatically determine the optimal parameter values using an unsupervised parameter sweep. If PAR_ensemblex_nCD and PAR_ensemblex_pT are defined by the user, graph-based doublet detection will be performed with the user-defined values. |
Step 3 parameters: Ensemble-independent doublet detection | ||
PAR_ensemblex_preliminary_ensemble_independent_doublet | No | Whether or not to perform a preliminary parameter sweep for Step 3: Ensemble-independent doublet detection. Users should utilize the preliminary parameter sweep if they wish to manually define which constituent tools to utilize for ensemble-independent doublet detection. Users can define which tools to utilize for ensemble-independent doublet detection in the following parameters. |
PAR_ensemblex_ensemble_independent_doublet | Yes | Whether or not to perform Step 3: Ensemble-independent doublet detection. |
PAR_ensemblex_doublet_Demuxalot_threshold | Yes | Whether or not to label doublets identified by Demuxalot as doublets. Only doublets with assignment probabilities exceeding Demuxalot's recommended probability threshold will be labeled as doublets by Ensemblex. |
PAR_ensemblex_doublet_Demuxalot_no_threshold | No | Whether or not to label doublets identified by Demuxalot as doublets, regardless of the corresponding assignment probability. |
PAR_ensemblex_doublet_Demuxlet_threshold | No | Whether or not to label doublets identified by Demuxlet as doublets. Only doublets with assignment probabilities exceeding Demuxlet's recommended probability threshold will be labeled as doublets by Ensemblex. |
PAR_ensemblex_doublet_Demuxlet_no_threshold | No | Whether or not to label doublets identified by Demuxlet as doublets, regardless of the corresponding assignment probability. |
PAR_ensemblex_doublet_Souporcell_threshold | No | Whether or not to label doublets identified by Souporcell as doublets. Only doublets with assignment probabilities exceeding Souporcell's recommended probability threshold will be labeled as doublets by Ensemblex. |
PAR_ensemblex_doublet_Souporcell_no_threshold | No | Whether or not to label doublets identified by Souporcell as doublets, regardless of the corresponding assignment probability. |
PAR_ensemblex_doublet_Vireo_threshold | Yes | Whether or not to label doublets identified by Vireo as doublets. Only doublets with assignment probabilities exceeding Vireo's recommended probability threshold will be labeled as doublets by Ensemblex. |
PAR_ensemblex_doublet_Vireo_no_threshold | No | Whether or not to label doublets identified by Vireo as doublets, regardless of the corresponding assignment probability. |
Confidence score parameters | ||
PAR_ensemblex_compute_singlet_confidence | Yes | Whether or not to compute Ensemblex's singlet confidence score. This will define low confidence assignments which should be removed from downstream analyses. |