Segregation Analysis

[Segregation Analysis]
[Hail]
[Steps of running]

Segregation Analysis

This is a process to explore the distribution of genetic variants among families with <=2 phenotypes or disease states. This analysis breaks down counts of samples by genotype states (heterozygous reference, heterozygous, homozygous variant, no call), by disease status (affected vs nonaffected) and by family scope (in-family or global). Each output line represents one variant in one family. Additionally, since segregation analysis outputs are tab-separated, this program can be used to convert vcf to tabular format, including any user-defined annotations present in the input vcf file.

This program requires two inputs: 1) a pedigree file with six tab-separated columns in the order:

familyid
individualid
parentalid
maternalid
sex {1:male; 2:female, 0:unknown}
phenotype {1: unaffected, 2: affected, -9:missing}

as well as a single header line with the exact values above.

NOTE: if you wish to simply convert vcf to a tab-separated file, simply provide a pedigree file with only a single familyid and phenotype

2) Variant call data in vcf format (compressed with gzip/bgzip or uncompressed)

Hail

The “seganalysis” Python module is developed on top of Hail. Hail is a Python module built on the top of Apache Spark and serves as an open-source, scalable framework for genomic data manipulation and analysis. The seganalysis module was tested on Spark-3.1.2, Pre-build for Apache Hadoop3.2.

Steps of running pipeline

The following steps show how to run the segregation pipeline.

Step 1: Run Spark

First activate Spark on your system

export SPARK_HOME=$HOME/spark-3.1.2-bin-hadoop3.2
export SPARK_LOG_DIR=$HOME/temp
module load java/11.0.2
cd ${SPARK_HOME}; ./sbin/start-master.sh

Step 2: Generate annotated file

Run VEP to generate the annotated file.

Step 3: Create hail matrixTable

Next, initialize hail and import your vcf file and write it as a matrix table, a data structure optimized to efficiently store and manipulate genomic data. In the code snippet below, we import a vcf file and write it as a MatrixTable file, then read the newly-created file into a MatrixTable object. The test data used below is available here: [https://github.com/The-Neuro-Bioinformatics-Core/seganalysis/test].

import sys
import pandas as pd 
import hail as hl
hl.import_vcf('./test/data/testseg.VEP.vcf',force=True,reference_genome='GRCh38',array_elements_required=False).write('./test/output/testseg.mt', overwrite=True)
mt = hl.read_matrix_table('./test/output/testseg.mt')

Step 4: Run the seganalysis module

from seganalysis import seg
ped = pd.read_csv('./test/data/testseg.ped', sep='\t')
destfolder = './test/output'
vcffile = './test/data/testseg.VEP.vcf'
seg.segrun(mt, ped, destfolder, hl, vcffile)    

This will generate two files: header.txt and finalseg.csv, in the destfolder. header.txt includes the header of information in finalseg.csv. Columns of finalseg.csv are organized into the following sections: 1) locus and alleles, 2) CSQ, 3) Global- Non-Affected 4) Global-Affected, 5) Family, 6) Family-Affected 7) Family - Non-affected.

locus and alleles

locus: chromosome and position
alleles: REF allele and ALT allele(s)

CSQ

VEP places all annotations within the “CSQ” vcf INFO field. Running seg.segrun() will split CSQ into tab-separated columns.

Global - Non-Affected

glb_naf_wild: Global - Non-Affecteds, wildtype
glb_naf_ncl: Global - Non-Affecteds, no call

glb_naf_vrt: Global - Non-Affecteds, with variant
glb_naf_homv: Global - Non-Affecteds, homozygous for ALT allele
glb_naf_altaf: Global - Non-Affecteds, ALT allele frequency

Global - Affected

glb_aff_wild: Global - Affecteds, wildtype
glb_aff_ncl: Global - Affecteds, no call
glb_aff_vrt: Global - Affecteds, with variant
glb_aff_homv: Global - Affecteds, homozygous for ALT allele
glb_aff_altaf: Global - Affecteds, ALT allele frequency

Family

{famid}_wild: Family - all: wildtype
{famid}_ncl: Family - all: no call
{famid}_vrt: Family - all: with variant
{famid}_homv: Family - all: homozygous for ALT allele

Family - Affected

{famid}_wild_aff: Family - Affecteds: wildtype
{famid}_ncl_aff: Family - Affecteds: no call
{famid}_vrt_aff: Family - Affecteds: with variant
{famid}_homv_aff: Family - Affecteds: homozygous for ALT allele

Family - Nonaffected

{famid}_wild_naf: Family - Nonaffecteds: wildtype
{famid}_ncl_naf: Family - Nonaffecteds: no call
{famid}_vrt_naf: Family - Nonaffecteds: with variant
{famid}_homv_naf: Family - Nonaffecteds: homozygous for ALT allele

Step 5 (optional): Cleanup and parsing

finalseg.csv can be cleaned up into a more excel-friendly format, better-suited to use excel column filters. Additionally, unwanted output columns can be removed by subsetting the header.txt file into a new file containing only the desired output columns (header_need.txt in the code snippet below). To clean without removing columns, simply use the original header.txt file instead.

from seganalysis import parser
header_need = './test/data/header_need.txt'
parser.sub_seg(destfolder, header_need)  

The final output of this step is the file finalseg_modified.csv

Step 6 Shut down spark

Do not forget to deactivate environment and stop spark:

cd ${SPARK_HOME}; ./sbin/stop-master.sh

Bibliography:

Hail-Powered Science. An incomplete list of scientific work enabled by Hail.

The Neuro Bioinformatics Core

Segregation Analysis

Contents