baseqCNV

A Python Package for Analysis of Copy Number Variation

baseqCNV is a toolkit for inferring and visualizing copy number from high-throughput DNA sequencing data. It is designed for whole-genome sequencing (WGS) data from both bulk and single-cell experiments.

Copy number is inferred from the read counts per genomic region. The regions are predefined so that low-complexity parts of the genome are excluded or discounted. For each sample, one FASTQ file of sequencing reads should be provided.

Pipeline Steps

The whole pipeline can be divided into five steps. The main workload (alignment, bin counting and normalization) can be performed on a local server. The resulting file (“bincounts_norm.txt”) can be uploaded to http://wgs.beiseq.cn/cnv/create for genome segmentation and visualization.

1:Align

Align reads using Bowtie2 (runs on the local server);

2:Bin Counting

Count the uniquely mapped reads in each bin (runs on the local server);
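The bin counting step can be pictured with a minimal sketch. It is not the baseqCNV implementation: the function name `count_reads_per_bin`, the `(chrom, start, end)` bin tuples and the `(chrom, position)` read tuples are assumptions made for illustration, and the real layout of the dynamic_bin file is not described here.

```python
from bisect import bisect_right
from collections import defaultdict


def count_reads_per_bin(bins, reads):
    """Count reads per bin (hypothetical helper, not part of baseqCNV).

    bins  : list of (chrom, start, end), half-open and non-overlapping
    reads : iterable of (chrom, position) for uniquely mapped reads
    Returns a list of counts in the same order as `bins`.
    """
    # Group bins by chromosome and sort them by start coordinate.
    by_chrom = defaultdict(list)
    for idx, (chrom, start, end) in enumerate(bins):
        by_chrom[chrom].append((start, end, idx))
    starts = {}
    for chrom, items in by_chrom.items():
        items.sort()
        starts[chrom] = [start for start, _, _ in items]

    counts = [0] * len(bins)
    for chrom, pos in reads:
        if chrom not in by_chrom:
            continue  # chromosome has no bins
        # Find the last bin whose start is <= pos.
        i = bisect_right(starts[chrom], pos) - 1
        if i < 0:
            continue
        start, end, idx = by_chrom[chrom][i]
        if start <= pos < end:
            counts[idx] += 1  # reads in gaps between bins are ignored
    return counts


# Toy data: the gap between 50000 and 60000 stands in for an excluded region.
bins = [("chr1", 0, 50000), ("chr1", 60000, 110000), ("chr2", 0, 50000)]
reads = [("chr1", 10), ("chr1", 49999), ("chr1", 55000),
         ("chr1", 60000), ("chr2", 123), ("chrM", 5)]
counts = count_reads_per_bin(bins, reads)
print(counts)
```

Reads that land between bins or on chromosomes without bins are simply dropped, which mirrors the idea of excluding low-complexity regions up front.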

3:Normalize

Normalize the bin counts by GC content (runs on the local server);
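The exact normalization method is not specified here. As a rough sketch of one common way to correct for GC content, each bin count can be divided by the median count of bins with similar GC fraction. The function `gc_normalize`, its `step` parameter and the toy numbers are assumptions for illustration, not the baseqCNV code.

```python
from collections import defaultdict
from statistics import median


def gc_normalize(counts, gc_fractions, step=0.01):
    """Divide each count by the median count of its GC stratum.

    Hypothetical helper: bins are grouped into strata of width `step`
    in GC fraction, and the stratum median serves as the expected count.
    """
    strata = defaultdict(list)
    for count, gc in zip(counts, gc_fractions):
        strata[round(gc / step)].append(count)
    stratum_median = {key: median(values) for key, values in strata.items()}

    normalized = []
    for count, gc in zip(counts, gc_fractions):
        expected = stratum_median[round(gc / step)]
        # Guard against empty strata medians of zero.
        normalized.append(count / expected if expected > 0 else 0.0)
    return normalized


# Toy data: bins at 50% GC receive about twice the reads of bins at 40% GC.
counts = [100, 120, 110, 200, 220, 440]
gc = [0.40, 0.40, 0.40, 0.50, 0.50, 0.50]
norm = gc_normalize(counts, gc)
print(norm)
```

After normalization a value near 1 means the bin matches what is typical for its GC content, so the last bin (value 2) stands out as a possible gain rather than a GC artifact.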

4:CBS

Circular Binary Segmentation (CBS) partitions the genome into segments of constant total copy number by grouping similar neighboring bins (runs on the web server). It is based on the R package DNAcopy (https://www.bioconductor.org/packages/release/bioc/html/DNAcopy.html).
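To give a feel for what segmentation does, here is a deliberately simplified sketch. It uses plain recursive binary segmentation with a fixed score threshold, which is a simpler relative of CBS and not the DNAcopy algorithm (DNAcopy tests circular segments and assesses significance by permutation). The names `best_split`, `segment` and the `threshold` value are assumptions for illustration.

```python
from statistics import mean


def best_split(values):
    """Return (index, score) of the split that best separates two means."""
    n = len(values)
    total = sum(values)
    best_idx, best_score = None, 0.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        left_mean = left_sum / i
        right_mean = (total - left_sum) / (n - i)
        # Reduction in the sum of squares obtained by splitting at i.
        score = i * (n - i) / n * (left_mean - right_mean) ** 2
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score


def segment(values, threshold, offset=0):
    """Recursively split `values`; return (start, end, mean) per segment."""
    idx, score = best_split(values)
    if idx is None or score < threshold:
        return [(offset, offset + len(values), mean(values))]
    return (segment(values[:idx], threshold, offset)
            + segment(values[idx:], threshold, offset + idx))


# Toy normalized profile: 4 neutral bins, 5 gained bins, 6 neutral bins.
profile = [1.0] * 4 + [2.0] * 5 + [1.0] * 6
segments = segment(profile, threshold=0.5)
print(segments)
```

Each returned tuple covers a half-open range of bin indices together with the segment mean, which is the kind of piecewise-constant profile that the visualization step draws.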

5:Visualization

Visualize the CNV distribution along the whole genome (runs on the web server);

Dependencies

First, Python 3 is required (version >=3.6).

Software:

  • Bowtie2: For alignment of raw sequencing reads;
  • Samtools: For transforming and manipulating the bam/sam files produced by the aligner (version >=1.9);

Resources:

  • bowtie2_index: The path to the bowtie2 indexed genome references;
  • dynamic_bin: Genome bins of ~50 kb, with duplicated and low-complexity regions excluded;

Download the dynamic_bin files:

Homo sapiens (hg19): http://wgs.beiseq.cn/resources/hg19.dynabin.txt

Mus Musculus (mm10): http://wgs.beiseq.cn/resources/mm10.dynabin.txt

Configuration

The paths of all the dependencies should be written to a config file (named config.ini, for example):

[CNV]
bowtie2 = /path/to/bowtie2
samtools = /path/to/samtools

[CNV_ref_hg19]
bowtie2_index = /path/to/bowtie2_index/hg19
dynamic_bin = /path/to/hg19.dynabin.txt

Tip

The bowtie2_index path is the prefix of a set of files. For example, if it is set as “/path/to/bowtie2_index/hg19”, there should be files like “hg19.fa/hg19.1.bt2/hg19.2.bt2/…” under the folder: “/path/to/bowtie2_index”.
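A small sketch can help check a config file before running the pipeline. It reads the example above with the standard `configparser` module and lists the files that share the bowtie2_index prefix. The helper names `load_settings` and `index_files` are hypothetical, and the `CNV_ref_<genome>` section naming is inferred from the example rather than taken from the baseqCNV source.

```python
import configparser
from pathlib import Path

# Same content as the example config.ini above.
CONFIG_TEXT = """
[CNV]
bowtie2 = /path/to/bowtie2
samtools = /path/to/samtools

[CNV_ref_hg19]
bowtie2_index = /path/to/bowtie2_index/hg19
dynamic_bin = /path/to/hg19.dynabin.txt
"""


def load_settings(config_text, genome):
    """Collect tool and reference paths for one genome (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    ref = parser["CNV_ref_" + genome]  # assumed section naming
    return {
        "bowtie2": parser["CNV"]["bowtie2"],
        "samtools": parser["CNV"]["samtools"],
        "bowtie2_index": ref["bowtie2_index"],
        "dynamic_bin": ref["dynamic_bin"],
    }


def index_files(prefix):
    """List files in the index folder whose names start with '<prefix name>.'."""
    prefix_path = Path(prefix)
    if not prefix_path.parent.is_dir():
        return []
    return sorted(prefix_path.parent.glob(prefix_path.name + ".*"))


settings = load_settings(CONFIG_TEXT, "hg19")
print(settings["bowtie2_index"])
# Prints an empty list unless the placeholder paths are replaced with real ones.
print(index_files(settings["bowtie2_index"]))
```

An empty list from `index_files` on a real path suggests that bowtie2_index points at the folder rather than at the file prefix.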

Install

To install baseqCNV, simply use pip:

pip install baseqCNV

Usage at local server

Three steps of the pipeline run on the local server.

#Alignment
#It needs one fastq file; for paired-end data, the read 1 file is sufficient.
#The path of the sequencing file should be specified after "-1".
#The path of the configuration file should be specified after "-c".
#The genome name or version should be specified after "-g".
baseqCNV align -1 Tn5_S1.fq.gz -c config.ini -g hg19

#BinCounting
#The aligned bam file should be specified after "-i".
#The path of the configuration file should be specified after "-c".
baseqCNV bincount -g hg19 -i ./baseqCNV.bowtie2.sort.bam -o bincounts.txt -c config.ini

#Normalize (The resulting file can be uploaded to the web server for visualization)
baseqCNV normalize -g hg19 -i ./bincounts.txt -o bincounts_norm.txt -c config.ini
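When many samples are processed, the three commands above can be assembled from Python. The sketch below only builds and prints the command lines shown above; `build_commands` and `run_pipeline` are hypothetical helpers, and the intermediate bam name is copied from the example on the assumption that the align step writes it to the working directory.

```python
import shlex
import subprocess


def build_commands(fastq, config="config.ini", genome="hg19"):
    """Return the three local-server commands as argument lists."""
    # File names are taken from the usage example above (assumption).
    bam = "./baseqCNV.bowtie2.sort.bam"
    return [
        ["baseqCNV", "align", "-1", fastq, "-c", config, "-g", genome],
        ["baseqCNV", "bincount", "-g", genome, "-i", bam,
         "-o", "bincounts.txt", "-c", config],
        ["baseqCNV", "normalize", "-g", genome, "-i", "./bincounts.txt",
         "-o", "bincounts_norm.txt", "-c", config],
    ]


def run_pipeline(fastq, config="config.ini", genome="hg19"):
    """Run the steps in order and stop at the first failure."""
    for cmd in build_commands(fastq, config, genome):
        subprocess.run(cmd, check=True)


commands = build_commands("Tn5_S1.fq.gz")
for cmd in commands:
    # Print only; call run_pipeline("Tn5_S1.fq.gz") to execute for real.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Passing argument lists to `subprocess.run` avoids shell quoting problems with unusual file names, and `check=True` stops the run as soon as one step fails.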

Web-based Visualization

The normalized bincount file can be uploaded to our webserver http://wgs.beiseq.cn/cnv/create for CBS and visualization.

Here is an example of a normalized bincount file that you can try: http://wgs.beiseq.cn/resources/bincounts_norm_example.txt