baseqSNV: Variant Calling Commands

This package offers efficient, easy-to-use pipelines for variant discovery in high-throughput sequencing data. The pipeline is based on GATK and provides an easy-to-use wrapper around the raw commands.

Install

Install the baseqSNV package from PyPI with:

pip install baseqSNV

Tip

The package is updated often. To ensure that you have the latest version, run pip install -U baseqSNV again.

Configs

The following software should be installed:

  1. BWA (version >=0.7)
  2. samtools (version >=1.9)
  3. java (version >=1.8)
  4. GATK (version >=4.0.0)
  5. Picard (As a jar file, latest version)

The reference genome and its BWA index must also be provided. If you choose to use hg38:

  1. the hg38 genome file in FASTA format
  2. the BWA index of the hg38 reference

The following resources are needed for calling SNVs and can be downloaded from the official GATK website:

  1. dbSNP (version 138)
  2. SNPs from the 1000 Genomes Project
  3. INDELs from the 1000 Genomes Project

The paths of all dependencies should be written to a config file (named config.ini, for example):

[SNV]
temp_dir = ./tmp
samtools = /path/to/samtools
bwa = /path/to/bwa
java = /path/to/java
GATK = /path/to/gatk-4.0.3.0/gatk
picard = /path/to/picard.jar

[SNV_ref_hg38]
genome = /path/to/hg38.fa
hg38_bwa_index = /path/to/bwaindex/hg38.fa
dbSNP = /path/to/dbsnp_138.hg38.vcf.gz
SNP = /path/to/1000G_phase1.snps.high_confidence.hg38.vcf.gz
INDEL = /path/to/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
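Before starting a run, it can help to confirm that the config file is complete. The following is a minimal sketch, not part of baseqSNV itself: the function name check_config and the choice of which keys to test for existence are assumptions made for illustration. It reads the file with Python's standard configparser module and reports missing sections, unset keys, and paths that do not exist.

```python
import configparser
import os

# Sections and keys taken from the example config above.
REQUIRED = {
    "SNV": ["temp_dir", "samtools", "bwa", "java", "GATK", "picard"],
    "SNV_ref_hg38": ["genome", "hg38_bwa_index", "dbSNP", "SNP", "INDEL"],
}

# Assumption: temp_dir may be created later, and the BWA index entry is a
# file prefix rather than a single file, so neither is checked for existence.
SKIP_EXISTENCE_CHECK = {"temp_dir", "hg38_bwa_index"}


def check_config(path):
    """Return a list of problems found in the config file (empty if none)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case, e.g. "GATK" and "dbSNP"
    if not parser.read(path):
        return [f"cannot read {path}"]

    problems = []
    for section, keys in REQUIRED.items():
        if not parser.has_section(section):
            problems.append(f"missing section [{section}]")
            continue
        for key in keys:
            value = parser.get(section, key, fallback="").strip()
            if not value:
                problems.append(f"[{section}] {key} is not set")
            elif key not in SKIP_EXISTENCE_CHECK and not os.path.exists(value):
                problems.append(f"[{section}] {key}: {value} does not exist")
    return problems
```

Calling check_config("config.ini") returns an empty list when every section and key is present and the referenced files exist.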

Pipeline

GATK: The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic short variant calling, and to tackle copy number (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data, and bundles the popular Picard toolkit.

These tools were primarily designed to process exomes and whole genomes generated with Illumina sequencing technology, but they can be adapted to handle a variety of other technologies and experimental designs. And although it was originally developed for human genetics, the GATK has since evolved to handle genome data from any organism, with any level of ploidy. [1]

MuTect2: MuTect2 is a somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect (Cibulskis et al., 2013) with the assembly-based machinery of HaplotypeCaller. [2]

Quality control metrics

  • Sequencing depth;
  • Mapping ratio;
  • Coverage depth distribution;
  • Enrichment efficiency;
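As a rough illustration of two of the items above, the sketch below computes a mapping ratio and a coverage depth summary from numbers you have already collected (for example, read counts and per-base depths reported by samtools). The function names, the definition of mapping ratio as mapped reads divided by total reads, and the depth thresholds are assumptions for illustration, not baseqSNV's own definitions.

```python
def mapping_ratio(mapped_reads, total_reads):
    """Fraction of reads that aligned to the reference (assumed definition)."""
    return mapped_reads / total_reads if total_reads else 0.0


def depth_summary(depths, thresholds=(1, 10, 30)):
    """Summarise per-base depths over the target region.

    Returns the mean depth and, for each threshold, the fraction of
    positions covered at that depth or higher.
    """
    n = len(depths)
    if n == 0:
        return {"mean_depth": 0.0, "fraction_at_least": {t: 0.0 for t in thresholds}}
    return {
        "mean_depth": sum(depths) / n,
        "fraction_at_least": {
            t: sum(1 for d in depths if d >= t) / n for t in thresholds
        },
    }
```

The per-threshold fractions give a compact view of the coverage depth distribution: a panel with a high mean depth but a low fraction of positions at the highest threshold is covered unevenly.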

Usage

The interval file

This pipeline is intended for panel-enriched sequencing and whole-exome sequencing. For each sample, you should provide an interval file in BED format (named interval.bed, for example):

chr10   180010  180130
chr10   209848  209968
chr10   209968  210088
...
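To sanity-check an interval file before passing it to the pipeline, a short script can parse it and report the total target size. This is a minimal sketch under the usual BED convention (0-based start, end-exclusive); the function names are hypothetical and not part of baseqSNV.

```python
def read_intervals(lines):
    """Parse BED lines into (chrom, start, end) tuples.

    Header lines and lines with fewer than three fields are skipped.
    """
    intervals = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] in ("track", "browser") or line.startswith("#"):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start >= end:
            raise ValueError(f"empty or inverted interval: {line.strip()}")
        intervals.append((chrom, start, end))
    return intervals


def total_target_size(intervals):
    """Sum of interval lengths in bases (overlapping intervals count twice)."""
    return sum(end - start for _, start, end in intervals)


example = [
    "chr10   180010  180130",
    "chr10   209848  209968",
    "chr10   209968  210088",
]
print(total_target_size(read_intervals(example)))  # prints 360
```

For a real file, pass an open file handle instead of the list: read_intervals(open("interval.bed")).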

A typical run consists of the following commands:

#Alignment
baseqSNV align -1 test.1.fq.gz -2 test.2.fq.gz -n Sample -c config.ini -g hg38

#MarkDuplicates
baseqSNV markdup -b Sample.bam -m Sample.marked.bam -d ./tmp -c config.ini

#BQSR
baseqSNV bqsr -m Sample.marked.bam -g hg38 -q Sample.marked.bqsr.bam -i interval.bed -c config.ini

#Call Variants
baseqSNV callvar -q Sample.marked.bqsr.bam -r Sample.raw.indel.snp.vcf -g hg38 -c config.ini -i interval.bed

This produces the raw VCF file “Sample.raw.indel.snp.vcf”.
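When many samples need to be processed, the four steps can be chained from a script. The sketch below is one possible way to do this and is not shipped with baseqSNV: the helper names build_commands and run_pipeline are hypothetical, and it assumes, as the commands above imply, that the align step writes Sample.bam to the working directory. It uses only the sub-commands and flags shown above.

```python
import subprocess


def build_commands(sample, fq1, fq2, config="config.ini", genome="hg38",
                   interval="interval.bed", temp_dir="./tmp"):
    """Return the four baseqSNV commands for one sample as argument lists."""
    bam = f"{sample}.bam"                # assumed output name of the align step
    marked = f"{sample}.marked.bam"
    bqsr = f"{sample}.marked.bqsr.bam"
    vcf = f"{sample}.raw.indel.snp.vcf"
    return [
        ["baseqSNV", "align", "-1", fq1, "-2", fq2, "-n", sample,
         "-c", config, "-g", genome],
        ["baseqSNV", "markdup", "-b", bam, "-m", marked,
         "-d", temp_dir, "-c", config],
        ["baseqSNV", "bqsr", "-m", marked, "-g", genome, "-q", bqsr,
         "-i", interval, "-c", config],
        ["baseqSNV", "callvar", "-q", bqsr, "-r", vcf, "-g", genome,
         "-c", config, "-i", interval],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order.

    check=True stops the run at the first step that exits with an error.
    """
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Running run_pipeline(build_commands("Sample", "test.1.fq.gz", "test.2.fq.gz")) prints the commands without executing them; pass dry_run=False to run them.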