pibase | QuickStart Tutorial | example pibase_test.sh


The linux bash shell script below interrogates three different types of BAM-files: a SOLiD BFAST BAM file, an FLX SSAHA BAM file, and three ILLUMINA BWA BAM files.

Regarding the pibase_bamref examples, the shell script uses the same four incremental filter options for SOLiD as for ILLUMINA: minimal PHRED-like base quality 20, minimal mapped read length 34, at maximum one mismatch within the mapped 34-nucleotides-or-longer read length, minimal mapping quality MAPQV 20. The shell script uses 34 as the minimal mapped read length, because the example data SOLiD BAM file contains reads of only 35 nucleotides length; in real life, we recommend to use at least 49 for your own BAM files, because some loci - e.g. in human mtDNA HVS-II - are susceptible to mismapping of genomic 35bp-reads and this can be detected more easily when using 49 nucleotides or longer reads.

Regarding the pibase_consensus examples, the shell script demonstrates how to merge three pibase_bamref files (here: from three different sequencing platforms) into one pibase_consensus file. In real life, we expect this merging to be used for files from the same sequencing platform.

Regarding the pibase_fisherdiff examples, the shell script demonstrates how to find differences between two pibase_consensus files. To minimise noise from platform-artefacts or alignment software artefacts or protocol artefacts, we recommend that BAM-files are used from the same platform, the same alignment software (and software settings) and the same wetlab protocols.

#!/bin/bash

###################
# BAM files downloaded from:  ftp.1000genomes.ebi.ac.uk
###################
# The genomic coordinates of interest are taken
# from http://www.1000genomes.org/data
# from file CEU.exon.2010_03.genotypes.vcf
# for chromosome 22
# and transformed from hg18 to hg19 using
# the UCSC Genome Browser Liftover tool.
###################
# ftp-download instructions for rookies:
# linux command line: ftp ftp.1000genomes.ebi.ac.uk
# user: anonymous
# password: your email address
# cd /vol1/ftp/data/NA12878/alignment
# bin
# get NA12878.chrom22.SOLID.bfast.CEU.high_coverage.20100125.bam*
# quit 
###################

mkdir output

###################
# SOLID FR35 BAM file (Daughter)
###################
# STEP 1: parse BAM-file and reference sequence file.
# Options: allow max fragment length (including insertions) of 50,
# min base quality value: 20 (PHRED-like), 
# min "high-quality" mapping length: 34 (nucleotides)
# max "high-quality" mapping mismatches: 1 (nucleotide)
# min "high-quality" mapping Qval: 20 (analogous to PHRED-like)
pibase_bamref hg19_chr22_genomic_coords_of_interest.txt hg19.1000G.quick.fasta NA12878.chrom22.SOLID.bfast.CEU.high_coverage.20100125.bam output/NA12878_SOLID.txt 50 20 34 1 20
# STEP 2: infer "best genotype", "best quality", strand-support
pibase_consensus output/NA12878_SOLID.txt output/gen_NA12878_SOLID.txt

###################
# FLX BAM file (Daughter)
###################
# STEP 1. Same options as above, but lengths 1000 and 100.
pibase_bamref hg19_chr22_genomic_coords_of_interest.txt hg19.1000G.quick.fasta NA12878.chrom22.LS454.ssaha2.CEU.high_coverage.20100311.bam output/NA12878_FLX.txt 1000 20 100 1 20
# STEP 2: infer "best genotype", "best quality", strand-support
pibase_consensus output/NA12878_FLX.txt output/gen_NA12878_FLX.txt

###################
# ILLUMINA BAM files (Daughter, father, mother)
###################
# STEP 1. Same options as above, but lengths 100 and 34.
# Daughter: 
pibase_bamref hg19_chr22_genomic_coords_of_interest.txt hg19.1000G.quick.fasta NA12878.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100311.bam output/NA12878_ILLUMINA.txt 100 20 34 1 20
# her father:
pibase_bamref hg19_chr22_genomic_coords_of_interest.txt hg19.1000G.quick.fasta NA12891.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100517.bam output/NA12891_ILLUMINA.txt 100 20 34 1 20
# her mother:
pibase_bamref hg19_chr22_genomic_coords_of_interest.txt hg19.1000G.quick.fasta NA12892.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100517.bam output/NA12892_ILLUMINA.txt 100 20 34 1 20

# STEP 2: infer "best genotype", "best quality", strand-support
# Daughter: 
pibase_consensus output/NA12878_ILLUMINA.txt output/gen_NA12878_ILLUMINA.txt
# her father:
pibase_consensus output/NA12891_ILLUMINA.txt output/gen_NA12891_ILLUMINA.txt
# her mother:
pibase_consensus output/NA12892_ILLUMINA.txt output/gen_NA12892_ILLUMINA.txt

# STEP 3: compare two files (normally a control/case pair, e.g. normal/tumor tissue)

# compare daughter with her father using Fisher's exact test on the allelic counts 
# (a) Options: min coverage >= 20, p.median <= 0.05, factor <= 10
pibase_fisherdiff output/gen_NA12878_ILLUMINA.txt output/gen_NA12891_ILLUMINA.txt output/diff_gen_NA12878_NA12891_ILLUMINA.txt 20 0.05 10

# (b) Different option: min coverage >= 5  
# (because there are some lower-coverage differences... e.g. a Mendelian error):
pibase_fisherdiff output/gen_NA12878_ILLUMINA.txt output/gen_NA12891_ILLUMINA.txt output/diff_cov5_gen_NA12878_NA12891_ILLUMINA.txt 5 0.05 10

# (c) to import into Excel in Germany, France or some other countries, 
# the English floating point p-value must be turned into a floating-comma p-value.
# This is how it can be done:
for i in diff_gen_NA12878_NA12891_ILLUMINA.txt diff_cov5_gen_NA12878_NA12891_ILLUMINA.txt
do
    #  (c.1)  get  header lines
    grep '^#' output/${i} >> output/${i}.comma.txt
    #  (c.2)  get main table
    grep -v '^#' output/${i} > output/tmptext.txt
    #  (c.3)  replace dots with commas. The backslash masks the dot which is a wildcard character
    sed -i 's/\./,/g' output/tmptext.txt
    #  (c.4) merge the main table into the header lines
    cat output/tmptext.txt >> output/${i}.comma.txt
    rm -f output/tmptext.txt
done

###################
# Consensus genotypes for the whole-genome SOLID, FLX and ILLUMINA runs of sample NA12878:
###################
pibase_consensus output/NA12878_SOLID.txt output/NA12878_FLX.txt output/NA12878_ILLUMINA.txt output/gen_NA12878_SOLID_FLX_ILLUMINA.txt



^top