SNV

SNV analysis functions.

digraph GATK {
    rankdir=LR;
    node [shape=box,style=filled,fillcolor="#C0D0C0"];
    subgraph clusterClient {
       label="Align"; style=filled; bgcolor="#D0C0A0";
       "BWA";
    };
    subgraph clusterServer {
       label="GATK"; style=filled; bgcolor="#D0C0A0";
       "MarkDup" -> "BQSR" -> "Haplocaller" -> "SelectVarient";
    };
    "BWA" -> "MarkDup";
 }

Quality Control

quality_enrich(name, bampath, intervals) Check the coverage depth and enrichment quality.
quality_enrich_multiple([bamlist, name, …]) Check the coverage depth and enrichment quality for multiple bamfiles, export a Excel File.

GATK

run_gatkpipe(config, dir, fq1, fq2, bamfile, …) Run GATK, including bqsr/callvar/selectvar/annovar
run_multi_gatkpipe(config, genome, list, …) Run GATK pipeline for multiple fastq or bam file list.
alignment(fq1, fq2, sample, genome[, …]) Map FASTQ files to genome using BWA aligner.
run_markdup(bamfile, markedbam, memory, …) Run MarkDuplicates of Picard.
bqsr(markedbam, bqsrbam, genome, config, …) Run BQSR.
run_callvar(bqsrbam, rawvcf, genome[, …]) Call germline SNPs and indels via local re-assembly of haplotypes.
selectvar(rawvcf, selectvcf, filtervcf, genome) This function selects SNPs from a VCF file which is usually the output file of HaplotypeCaller.
mutect2(genome, normalname, normalbam, …) Mutect2 is aim to call somatic SNVs and indels via local assembly of haplotypes.
create_pon(genome, list, path, interval) Create_pon function helps you create PoN(panel of normals) file necessary for mutect2 calling.

Annotation

run_annovar(filtervcf, annovarfile, name, …) Annotation VCF files generated by HaplotypeCaller or Mutect2, VCF files are first converted to stardard input format of annovar.

Defination

baseqSNV.quality_enrich(name, bampath, intervals)[source]

Check the coverage depth and enrichment quality.

from baseqSNV.qc.enrich import quality
quality("sample01", "aligned.bam", "panel.bed")
Returns:Sample/Total/Mapped/Map_Ratio/Dup_ratio/PCT_10X/PCT_30X/…

Todo

  • For module TODOs
  • Median Coverage…
  • Estimate on distribution bias/ Gini index?
baseqSNV.quality_enrich_multiple(bamlist='', name='', bampath='', intervals='')[source]

Check the coverage depth and enrichment quality for multiple bamfiles, export a Excel File. Usage

#bamlist.txt:
sample01 sample01.bam
sample02 sample02.bam
...

from baseqSNV.qc.enrich import quality_multiple
quality_multiple(bamlist = "bamlist.txt", intervals = "panel.bed")

Result:

QC.xls
http://p8v379qr8.bkt.clouddn.com/SNV.quality.jpg
baseqSNV.run_gatkpipe(config, dir, fq1, fq2, bamfile, name, genome, memory, step='all')[source]

Run GATK, including bqsr/callvar/selectvar/annovar

The pipeline contains following steps:

  1. Alignment using BWA.
  2. MarkDuplicated Using Picard.
  3. BQSR.
  4. Haplocaller.
  5. SelectVarient.

A config file should be provided:

:param config: config file;
:param dir: outdir path, the result will be dir/name;
:param step: all/align/markdup/bqsr/callvar/annovar;

Examples:

#In commandline:


#As function...
from baseqSNV.gatk.pipe import run_gatkpipe
run_gatkpipe("./config.ini", )
baseqSNV.run_multi_gatkpipe(config, genome, list, dir, thread, memory_size, step='all')[source]

Run GATK pipeline for multiple fastq or bam file list.

baseqSNV.alignment(fq1, fq2, sample, genome, thread=8, config='')[source]

Map FASTQ files to genome using BWA aligner.

Add ReadGroup to bamfile using the input sample name. Outfile is in BAM format and indexed for the downstream analysis.

A bam file and its index will be genrated:

<sample>.bam
<sample>.bam.bai

Usage:

A config file should be provided:
[SNV]
bwa = /home/soft/bin/bwa
samtools = /home/soft/bin/samtools

[SNV_ref_hg19]
bwa_index = /home/ref/hg19.fa

#Run using the command line:
baseqSNV align -1 test.1.fq.gz -2 test.2.fq.gz -n Sample -c config.ini -g hg38

#Call as function:
from baseqSNV import alignment
alignment("sample.1.fq.gz", "sample.2.fq.gz", "Sample", "hg19", "config.ini")
baseqSNV.run_markdup(bamfile, markedbam, memory, tmpdir, config)[source]

Run MarkDuplicates of Picard. this function tags duplicate reads with “markduplicate” in BAM file. See also MarkDuplicates in gatk.

The config file should be provided:

[SNV]
java = /path/to/java
picard = /path/to/picard.jar

Usage:

baseqSNV markdup -b Sample.bam -m Sample.marked.bam -d ./tmp -c config.ini

Return: metrics file indicates the numbers of duplicates for both single- and paired-end reads

Sample.marked.bam
Sample.marked.bam.bai
Sample.marked.bam.metrics
baseqSNV.bqsr(markedbam, bqsrbam, genome, config, interval, disable_dup_filter=False)[source]

Run BQSR. This function performs the two-steps process called base quality score recalibration. the first ster generates a recalibration table based on various covariates which is recruited to the second step to correct the systematic bias in input BAM file. More details about BaseRecalibrator and ApplyBQSR .

Usage:

  • Default mode filters duplicate reads (reads with “markduplicate” tags) before applying BQSR

    baseqSNV bqsr -m Sample.marked.bam -g hg38 -q Sample.marked.bqsr.bam
    
  • Disable reads filter before analysis.

    baseqSNV bqsr -m Sample.marked.bam -g hg38 -q Sample.marked.bqsr.bam -f Yes
    

Return:

Sample.marked.bam.table
Sample.marked.bqsr.bai
Sample.marked.bqsr.bam
baseqSNV.run_callvar(bqsrbam, rawvcf, genome, disable_dup_filter=False)[source]

Call germline SNPs and indels via local re-assembly of haplotypes. “bqsrbam” file recalbrated by BQSR do recommand as input BAM file and this functin only run the single sample genotypeVCF calling. More details see also HaplotypeCaller

Usage:

baseq-SNV run_callvar -q Test.marked.bqsr.bam -r Test.raw.indel.snp.vcf -g hg38

Return:

Test.raw.indel.snp.vcf
baseqSNV.selectvar(rawvcf, selectvcf, filtervcf, genome, run=True)[source]

This function selects SNPs from a VCF file which is usually the output file of HaplotypeCaller. Then, all SNPs are filtered by certain criteria based on INFO and/or FORMAT annotations. Criteria used here is “QD < 2.0 || FS > 60.0 || MQ < 40.0”. More details about SelectVariants and VariantFiltration

Usage:

baseq-SNV run_selectvar -r Test.raw.indel.snp.vcf -s Test.raw.snp.vcf -f Test.filtered.snp.vcf -g hg38

Return:

Test.raw.snp.vcf
Test.filtered.snp.vcf
baseqSNV.mutect2(genome, normalname, normalbam, tumorname, tumorbam, vcffile, pon, germline)[source]

Mutect2 is aim to call somatic SNVs and indels via local assembly of haplotypes. This function requires both tumor BAM file and its matched normal BAM file. tumorname and normalname should be consistent with the ReadGroup(ID) of tumor BAM file and normal BAM file respectively. PoN is refer to panel of normals callset(more infomation about PoN and how to create it, please see PoN ). Germline resource, also in VCF format, is used to annotate variant alleles. Download germline resource.

Usage:

  • Simplified Mutect2 command line

    # single sample
    baseq-SNV run_mutect2 -g hg37 -n normal -N normal_marked_bqsr.bam \
                                  -t tumor -T tumor_marked_bqsr.bam -o ./
    # multiple samples
    baseq-SNV run_mutect2 -g hg37 -l sample_list.txt -o ./
    
  • Specify PoN(panels of normals) VCF file and germline VCF file. Default germline VCF file comes form gatk resource bundle and is recruited if germline isn’t designated.

    # single sample
    baseq-SNV run_mutect2 -g hg37 -n normal -N normal_marked_bqsr.bam \
                                  -t tumor -T tumor_marked_bqsr.bam -o ./ \
                                  -p pon.vcf.gz -G af-only-gnomad.raw.sites.b37.vcf.gz
    # multiple samples
    baseq-SNV run_mutect2 -g hg37 -l sample_list.txt -o ./ \
                                  -p pon.vcf.gz -G af-only-gnomad.raw.sites.b37.vcf.gz
    
baseqSNV.create_pon(genome, list, path, interval)[source]

Create_pon function helps you create PoN(panel of normals) file necessary for mutect2 calling. The PoN captures common artifactual and germline variant sites. Mutect2 then uses the PoN to filter variants at the site-level.

Example of samples list (tab delimited):

  • Content of columns are normal sample name, normal BAM file, tumor sample name, tumor BAM file(order cannot be distruped). Absoulte path of all BAM files should be added if directory of BAM files and analysis directory are different.

    N504    N504_marked_bqsr.bam   T504    T504_marked_bqsr.bam
    N505    N505_marked_bqsr.bam   T505    T505_marked_bqsr.bam
    N506    N506_marked_bqsr.bam   T506    T506_marked_bqsr.bam
    N509    N509_marked_bqsr.bam   T509    T509_marked_bqsr.bam
    N510    N510_marked_bqsr.bam   T510    T510_marked_bqsr.bam
    

Usage:

  • Interval list defines genomic regions where analysis is restricted. Introduction of interval list format and its function can be found this page.

    # designated a intervals.list
    baseq-SNV create_pon -g hg37 -l sample_list.txt -p ./ -L interval.list
    # Using the dafalut intervals.list
    baseq-SNV create_pon -g hg37 -l sample_list.txt -p ./
    
baseqSNV.run_annovar(filtervcf, annovarfile, name, genome, outdir, run=True)[source]

Annotation VCF files generated by HaplotypeCaller or Mutect2, VCF files are first converted to stardard input format of annovar. Annovar can annotate genetic variants using diverse gene definition database,such as 1000Genome Project, dbSNP annotations,

Usage:

baseq-SNV run_annovar -f hg38 -f sample01.vcf -n sample01 -d ./

Return:

|---sample01.avinput
|---sample01.hg38_multianno.csv