Step 1: FASTQ to gene expression matrix
In Step 1, gene expression matrices are generated from FASTQ files using the CellRanger count pipeline. Prior to running CellRanger, library.csv files must be prepared to define the FASTQ files for each sample. In addition, feature_ref.csv files must be prepared for the HTO track to define the HTOs used in the experiment. scRNAbox provides an option for automating library preparation, but the correct information must still be defined in the parameters. Alternatively, users may prepare the libraries manually. For a walkthrough of manual library preparation, please see the tutorial.
Library preparation
library.csv
The library.csv file defines the information required about the FASTQ files for the experiment, including the gene expression and antibody assays. The structure of the library.csv file should be:
fastqs,sample,library_type
path/to/fastqs/directory/,SampleNameGEX,Gene Expression
path/to/fastqs/directory/,SampleNameHTO,Antibody Capture
- The fastqs column defines the path to the directory that contains the FASTQ files for the experiment.
- The sample column defines the sample name of the corresponding FASTQ file. Please note that FASTQ files must be named according to standard CellRanger nomenclature (e.g. CTRL1_S1_L001_R1_001.fastq). For more information regarding the nomenclature, please visit CellRanger's documentation.
- The library_type column defines the assay type. For the standard analysis track, this will always be "Gene Expression". For the HTO analysis track, each sequencing run should have a "Gene Expression" and an "Antibody Capture" assay.
For more information regarding the preparation of library.csv, please see CellRanger's documentation.
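As a rough illustration of the layout described above, the following sketch writes a library.csv and checks a FASTQ file name against a simplified version of the CellRanger naming scheme. The helper names write_library_csv and fastq_name_ok are hypothetical and not part of scRNAbox or CellRanger, the regular expression is an assumption that covers only the common sample_S1_L001_R1_001.fastq form, and the paths and sample names are the placeholders from the example above.

```shell
# Hypothetical helper (not part of scRNAbox): write a library.csv.
# $1 = FASTQ directory, $2 = Gene Expression sample name,
# $3 = Antibody Capture sample name (pass "" for the standard track),
# $4 = output file.
write_library_csv() {
  {
    printf 'fastqs,sample,library_type\n'
    printf '%s,%s,Gene Expression\n' "$1" "$2"
    # The Antibody Capture row is only written for the HTO track.
    if [ -n "$3" ]; then
      printf '%s,%s,Antibody Capture\n' "$1" "$3"
    fi
  } > "$4"
}

# Hypothetical check: does a file name look like CTRL1_S1_L001_R1_001.fastq?
# The pattern is a simplification of the CellRanger naming scheme.
fastq_name_ok() {
  printf '%s\n' "$1" |
    grep -Eq '^[A-Za-z0-9_-]+_S[0-9]+_L[0-9]{3}_[RI][12]_001\.fastq(\.gz)?$'
}

# Demonstration in a temporary directory with the placeholder values.
lib_dir=$(mktemp -d)
write_library_csv "path/to/fastqs/directory/" "SampleNameGEX" "SampleNameHTO" "$lib_dir/library.csv"
cat "$lib_dir/library.csv"

if fastq_name_ok "CTRL1_S1_L001_R1_001.fastq"; then
  echo "FASTQ name follows the expected pattern"
fi
```

Passing an empty string as the third argument produces the single-row file expected for the standard track.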
feature_ref.csv
The feature_ref.csv file defines the information required to process the HTOs that will eventually be used to demultiplex the pooled samples. For example, if four samples are pooled together with four unique HTOs, the structure of the feature_ref.csv file should be:
id,name,read,pattern,sequence,feature_type
Hash1,B0251_TotalSeqB,R2,5PNNNNNNNNNN(BC),GTCAACTCTTTAGCG,Antibody Capture
Hash2,B0252_TotalSeqB,R2,5PNNNNNNNNNN(BC),TGATGGCCTATTGGG,Antibody Capture
Hash3,B0253_TotalSeqB,R2,5PNNNNNNNNNN(BC),TTCCGCCTCTCTTTG,Antibody Capture
Hash4,B0254_TotalSeqB,R2,5PNNNNNNNNNN(BC),AGTAAGTTCAGCGTA,Antibody Capture
- The id column defines the barcode ID, which will be used to track the feature counts.
- The name column defines an arbitrary name for the barcode identifier.
- The read column defines which RNA sequencing read contains the barcode sequence. This value will be either R1 or R2.
- The pattern column defines the pattern of the barcode identifiers.
- The sequence column defines the nucleotide sequence associated with the barcode identifier.
- The feature_type column defines the type of feature used for sample identification. Please ensure that the feature_type in the feature_ref.csv file matches a library_type in the library.csv file.
For more information regarding the preparation of feature_ref.csv, please see CellRanger's documentation.
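The requirement that each feature_type matches a library_type can be checked before launching the pipeline. The sketch below is one possible way to do so with awk. The function name check_feature_types is hypothetical and not part of scRNAbox, and the check assumes simple CSV files with no quoted commas, no trailing whitespace, and a header row in each file.

```shell
# Hypothetical check (not part of scRNAbox): every feature_type in
# feature_ref.csv must also appear as a library_type in library.csv.
# $1 = library.csv, $2 = feature_ref.csv.
# Assumes a non-empty library.csv and fields without quoted commas.
check_feature_types() {
  awk -F',' '
    # First file (library.csv): remember column 3, skipping the header.
    NR == FNR { if (FNR > 1) seen[$3] = 1; next }
    # Second file (feature_ref.csv): column 6 must have been seen.
    FNR > 1 && !($6 in seen) {
      printf "no library_type matches feature_type \"%s\" (id %s)\n", $6, $1
      bad = 1
    }
    END { exit bad }
  ' "$1" "$2"
}

# Demonstration with the example files from this section.
ref_dir=$(mktemp -d)
cat > "$ref_dir/library.csv" <<'EOF'
fastqs,sample,library_type
path/to/fastqs/directory/,SampleNameGEX,Gene Expression
path/to/fastqs/directory/,SampleNameHTO,Antibody Capture
EOF
cat > "$ref_dir/feature_ref.csv" <<'EOF'
id,name,read,pattern,sequence,feature_type
Hash1,B0251_TotalSeqB,R2,5PNNNNNNNNNN(BC),GTCAACTCTTTAGCG,Antibody Capture
Hash2,B0252_TotalSeqB,R2,5PNNNNNNNNNN(BC),TGATGGCCTATTGGG,Antibody Capture
EOF

if check_feature_types "$ref_dir/library.csv" "$ref_dir/feature_ref.csv"; then
  echo "feature types OK"
fi
```

The comparison is an exact string match, so a value such as "Antibody capture" (lowercase c) would be reported as a mismatch.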
Running Step 1
The following parameters are adjustable for Step 1 of the standard track (~/working_directory/job_info/parameters/step1_par.txt):
Parameter | Default | Description |
---|---|---|
par_automated_library_prep | No | Whether or not to perform automated library prep. Alternatively, you may set this parameter to "no" and manually prepare the libraries. |
par_fastq_directory | NULL | Path to directory containing the FASTQ files. This directory should only contain FASTQ files for the experiment. |
par_sample_names | NULL | The sample names used to name the FASTQ files according to CellRanger nomenclature |
par_rename_samples | Yes | Whether or not you want to rename your samples. These names will be used to identify cells in the Seurat objects |
par_new_sample_names | NULL | New sample names. Make sure they are defined in the same order as 'par_sample_names' |
par_paired_end_seq | Yes | Whether or not paired-end sequencing was performed |
par_ref_dir_grch | NULL | Path to reference genome for FASTQ alignment. 10X Genomics reference genomes are available for download. For more information see the 10X Genomics documentation. |
par_r1_length | NULL | Minimum number of bases to retain for R1 sequence of gene expression |
par_r2_length | NULL | Minimum number of bases to retain for R2 sequence of gene expression |
par_mempercode | 30 | For clusters whose job managers do not support memory requests, it is possible to request memory in the form of cores. This option will scale up the number of threads requested via the MRO_THREADS variable according to how much memory a stage requires when given to the ratio of memory on your nodes. |
par_include_introns | No | Whether or not to include intronic reads in the gene expression matrix |
par_no_target_umi_filter | No | Whether or not to turn off CellRanger's target UMI filtering subpipeline |
par_expect_cells | NULL | Expected number of cells. By default, CellRanger's auto-estimate algorithm will be used |
par_force_cells | NULL | Force the CellRanger count pipeline to use N cells. |
par_no_bam | No | Whether or not to skip the bam file generation in the CellRanger pipeline. |
The following parameters are adjustable for Step 1 of the HTO track (~/working_directory/job_info/parameters/step1_par.txt):
Parameter | Default | Description |
---|---|---|
par_automated_library_prep | Yes | Whether or not to perform automated library prep. Alternatively, you may set this parameter to "no" and manually prepare the libraries. |
par_fastq_directory | NULL | Path to directory containing the FASTQ files. This directory should only contain FASTQ files for the experiment. |
par_RNA_run_names | NULL | The names of the sequencing runs for the RNA assay |
par_HTO_run_names | NULL | The names of the sequencing runs for the HTO assay |
par_seq_run_names | NULL | The user-selected name for the sequencing run. These names will be used to identify cells in the Seurat objects |
par_paired_end_seq | Yes | Whether or not paired-end sequencing was performed |
id | NULL | Barcode ID which will be used to track the feature counts |
name | NULL | The user-selected name for the barcode identifier |
read | R2 | Which RNA sequencing read contains the barcode sequence. This value will be either R1 or R2. |
pattern | NULL | The pattern of the barcode identifiers |
sequence | NULL | The nucleotide sequence associated with the barcode identifier |
par_ref_dir_grch | NULL | Path to reference genome for FASTQ alignment. 10X Genomics reference genomes are available for download. For more information see the 10X Genomics documentation. |
par_r1_length | NULL | Minimum number of bases to retain for R1 sequence of gene expression |
par_r2_length | NULL | Minimum number of bases to retain for R2 sequence of gene expression |
par_mempercode | 30 | For clusters whose job managers do not support memory requests, it is possible to request memory in the form of cores. This option will scale up the number of threads requested via the MRO_THREADS variable according to how much memory a stage requires when given to the ratio of memory on your nodes. |
par_include_introns | No | Whether or not to include intronic reads in the gene expression matrix |
par_no_target_umi_filter | No | Whether or not to turn off CellRanger's target UMI filtering subpipeline |
par_expect_cells | NULL | Expected number of cells. By default, CellRanger's auto-estimate algorithm will be used |
par_force_cells | NULL | Force the CellRanger count pipeline to use N cells. |
par_no_bam | No | Whether or not to skip the bam file generation in the CellRanger pipeline. |
Given that CellRanger runs a user interface, it is recommended to run Step 1 in a 'screen' session, which will allow the task to keep running if the connection is broken. To run Step 1, use the following command:
screen -S run_scrnabox
bash $SCRNABOX_HOME/launch_scrnabox.sh \
-d ${SCRNABOX_PWD} \
--steps 1
The resulting output files are deposited into ~/working_directory/step1. The filtered expression matrix, features, and barcode files output by CellRanger are located in ~/working_directory/step1/sample/ouput_folder/outs/filtered_feature_bc_matrix.
Note: If CellRanger was successful, it will display Pipestance completed successfully! If this message is not displayed, you should check the error logs in ~/working_directory/step1/sample/ouput_folder.log and re-run Step 1. If CellRanger fails a second time, please contact the developers of scRNAbox. Contact information is available here.
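The success check described in the note can also be scripted. The following is a minimal sketch that searches a log file for the success message. The function name cellranger_succeeded is hypothetical and not part of scRNAbox, and the demonstration uses a stand-in log file, so the path must be replaced with the log for your own sample.

```shell
# Hypothetical helper (not part of scRNAbox): succeed only if the given
# log file contains the CellRanger success message.
cellranger_succeeded() {
  grep -qF 'Pipestance completed successfully!' "$1"
}

# Demonstration on a stand-in log file; replace with your own log path.
log_dir=$(mktemp -d)
printf 'Pipestance completed successfully!\n' > "$log_dir/example.log"

if cellranger_succeeded "$log_dir/example.log"; then
  echo "Step 1 finished"
else
  echo "Step 1 failed; inspect the log and re-run" >&2
fi
```

Because grep is run with -F, the message is matched as a literal string rather than as a regular expression.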
Note: If you do not have access to FASTQ files for your experiment, you may initiate the pipeline at whichever Analytical Step takes your data object as input. When FASTQ files are not available, users do not have to create the samples_info folder.