pibase tools for validational and comparative analysis of BAM files


pibase is an open-source package of linux command line tools for validating next-generation sequencing loci (SNPs and loci of interest where no SNPs are known) and for comparative analyses using Fisher's exact test.

Acknowledgement: The development of pibase was partly funded by: The German Ministry of Education and Research (BMBF); the National Genome Research Network (NGFN); the Deutsche Forschungsgemeinschaft (DFG) Cluster of Excellence 'Inflammation at Interfaces'; the EU Seventh Framework Programme [FP7/2007-2013, grant numbers 201418, READNA and 262055, ESGI].

Disclaimer: pibase is provided free of charge for non-commercial use but you are required to read our disclaimer and to cite us when publishing results.

Download:  pibase 1.4.7   example data (12GB)   example output only (130kb)


Overview


pibase Acronym for: get Position Information at BASE position of interest.

Interoperability Input and output file types.

Work flows Preparing BAM-files and using the pibase tools.

QuickStart Tutorial Prerequisites, installation, and pibase examples using BAM-files from the 1000 Genomes project.


Essentials:

pibase_bamref Extract information from a BAM-file and a reference sequence file and table this information into a tab-separated text file.
pibase_consensus Infer 'best' genotypes and their 'quality' classification, and optionally merge multiple pibase_bamref files (e.g. a control panel or several runs of the same patient) into a single file.
pibase_fisherdiff Compare two pibase_consensus files using Fisher's exact test on original data (aligned reads) rather than comparing processed data (SNP-calls or genotypes).


Annotate:

pibase_tosnpacts Works with our unpublished annotation pipeline. Contact us if you wish us to annotate your pibase files with our annotations.
pibase_annot Works with our unpublished annotation pipeline. Contact us if you wish us to annotate your pibase files with our annotations.
pibase_tag Add primer region tags to a pibase_bamref, pibase_consensus, pibase_fisherdiff, or pibase_annot file.



Phylogenetics:

pibase_to_rdf Generate rdf file (for phylogenetic network analysis) from set of pibase_fisherdiff files.
pibase_rdf_ref Generate a reference sample file from a pibase_consensus file, prior to generating the set of pibase_fisherdiff files required for pibase_to_rdf.
pibase_chrm_to_crs Extract Cambridge Reference Sequence (Anderson 1981) variants from reads mapped to chrM (hg18, hg19, or NCBI36).


Utilities:

pibase_to_vcf Convert a single-sample pibase file into VCFv4.1 format.

pibase_c_to_contig Convert pibase-contig-numbers into contig names (e.g. 25 -> chrM).

pibase_flagsnp Flag non-reference genotypes in a pibase_consensus file as potential mismatches ("SNPs" in NGS-parlance).

pibase_diff Compare two pibase_consensus files using (BestGen) genotypes and a (BestQual) quality threshold.

pibase_gen_from_snpacts Works with our unpublished annotation pipeline. Contact us if you wish us to merge SNPs from SNP-callers, SNP-chips, or Sanger sequences into your pibase files for a genotype-comparison.
pibase_ref Gets 500 flanking nucleotides of reference sequence around each coordinate (of the input list) and outputs a pseudo-FASTA file. For example for manual Sanger-sequencing primer design.



Interoperability


pibase reads genomic coordinates of interest from a VCF*, samtools pileup, SOLiD Bioscope gff3, or a tab-separated file.

pibase extracts data at the coordinates of interest from an indexed FASTA reference and from a BAM-file** generated by BFAST, BWA, SSAHA2, samtools, SOAP (after conversion using soap2sam.pl), and SOLiD Bioscope. To extract the most complete information (including homologous region information and low-coverage genotypes), please use the raw unfiltered BAM-file (which includes non-uniquely mapped reads and duplicate reads).

pibase outputs tab-separated text files which can then be used in popular spreadsheet software, or filtered from the linux command line using grep, awk, and cut. pibase can also output variants into VCF, rdf, and snpActs formats.

*VCF is the variant list format accepted by European Nucleotide Archive. (VCF files are e.g. generated by GATK and samtools.)
**BAM is the sequence format preferred by the European Nucleotide Archive.


Work flows


"Essentials" work flow "Annotate" work flow "Phylogenetics" work flow "Utilities" (Tools for special interests)

QuickStart Tutorial


Prerequisites


Installation

Download pibase.v1.4.7.tar.gz and unzip.
From the linux command line, navigate into the pibase directory and enter ./f.sh to compile the pibase_643 subprogram.
Finally, include the pibase directory in your linux PATH variable.

Test run

Test whether the installations were successful using our example data (12GB size!): Download the 12GB tar.gz file and unzip. From the linux command line, navigate into the example data folder and then run the example shell script by entering
./pibase_test.sh

(If problems occur, check the pre-requisites).
When pibase_test.sh runs successfully, it creates a directory called output, into which the results files are written.

Next, run the phylogenetic example shell script by entering
./pibase_to_rdf.sh

To validate the test runs, compare directories output and output_validated from the linux command line:
diff output output_validated

Windows users

Windows users (and impatient Linux users) can download just the small zipped output_validated folder (130kb) for a quick impression of the output files. Import these tab-separated text files into Excel for viewing, filtering, or sorting. Note that there are floating point and floating comma versions of the files, the latter of which are generated using simple linux commands in the example shell script pibase_test.sh.

Detailed examples

Please look into the shell script pibase_test.sh and pibase_to_rdf.sh web pages which give several examples how to start the pibase tools.


Essentials | pibase_bamref


pibase_bamref extracts information at genomic coordinates of interest from a BAM file and a reference sequence file and tables this information into a tab-spearated text file. Optionally, information can be displayed on the command line. A single genomic coordinate can be specified on the command line, or a list of coordinates can be specified in a file. Note that BAM-files must include MD-tags.

Usage:

pibase_bamref [-[v][d][p chr]] {table or poi} ref bam out LR QVmin [[[MLmin] [[MMmax] [RQVmin] [maxr]]]]

  []      optional

  -vdp    [v]erbose output into terminal,
          [d]etailed list of reads,
          at single [p]osition: chr poi

  chr:    chromosome name (or contig name)
  poi:    position of interest (chromosomal coordinate, 1-based)

  table:  file of poi's: vcf, gff3, pileup,
          or plain tab-file: chr TAB poi LINEBREAK
          The format is detected by pibase_bamref. If the plain tab-file is used,
          the lines must be sorted by chr-names, and the chr-names must be ordered
          in the sequence of the reference sequence file or BAM-file-header.

  ref:    reference sequence file
  bam:    sorted bam file, bam index file must also exist
  out:    name for (tab-separated) output file
  LR:     max length of reads (e.g. 50)

  QVmin:  optional min base quality (default: 20)
  MLmin:  optional min mapped read length (default: 49)
  MMmax:  optional max number of base space mismatches in read (default: 1)
  RQVmin: optional min read mapping quality (default: 20)
  maxr:   optional max number of reads considered at a poi (default: -1)
          (default: unlimited. For ultra-high coverage files:
          use an integer value of e.g. 1000 as the limit).

Examples:
pibase_bamref -vp chrM 16189 hg20.fasta na12752.bam na12752_chrM_16189.txt
pibase_bamref list.txt hg18.fasta na12752.bam na12752_list.txt
pibase_bamref pileupsnps.txt hg18.fasta na12752.bam na12752_pileupsnps.txt
pibase_bamref gatksnps.vcf hg18.fasta na12752.bam na12752_gatksnps.txt


Notes:

If you are generating your own input lists with positions of interest: The genomic coordinates must be 1-based (the UCSC browser, VCF SNPlists, samtools pileup SNPlists, and Bioscope SNPlists use 1-based coordinates). SNPlists from Bioscope 1.2.1 or samtools pileup can be used without checking - these are ok.

pibase_bamref speed depends on depth of coverage and read length, and on python/pysam version. For high coverage BAM-files (50x-100x) and 50bp read lengths, pibase_bamref v1.4.5 processes about 50,000-100,000 genotypes per hour on a single core using 2GB RAM . At least python v2.7 and at least pysam v0.5 are required for this speed, otherwise pibase_bamref is up to 100x slower. As there are critical but silent bugs in other pysam versions, please do not use other pysam versions unless you test that version thoroughly (e.g. by running our chr22 and exome example data sets and performing a diff)!

For the most complete information: give pibase the original BAM file (which should include the ambiguously mapped reads, and which can also include "duplicate" reads)


Essentials | pibase_consensus


pibase_consensus adds information to a pibase_bamref file, inferring 'best' genotypes and 'best qualities', and optionally merging read counts from multiple pibase_bamref files (e.g. a control panel or several runs of the same patient) into a single pibase_consensus file. A BestQual question mark ("?") indicates that the BestGen is instable with respect to filtering, and therefore should not be used without further validations.

Usage:

pibase_consensus [-t[h][v [ver]] t1 t2 t3 t4 t5 {h1 h2}] in1 [in2 ... inN] out

  []=optional    {}=only for "-th" option

  [-t[h][v] (optional) minimal thresholds for allele calling (A, C, G, or T),
            (optionally) followed by "h" to override h1 and h2 defaults,
            (optionally) followed by "v" to revert to an older version

    [ver]   (optional) older pibase_consensus version to revert to.
            To list all available older versions, specify "-tv" without ver
            (older version 1.4.3: arguments h1 and h2 not available)

    t1      min fraction of reads indicating allele (default: 2.2%, i.e. 0.022)
    t2      min number of reads indicating allele (default: 8)
    t3      min fraction of unique start points indicating allele (default: 0.04)
    t4      min number of unique start points indicating allele (default: 4)
    t5      both-stranded confirmation only if, for at least one filter level,
            min-strand-counts >= t5 * max-strand-counts (default t5: 0.2)

    {h1     min fraction of unique start points that need to be different between
            filter F2-F3 resp. F3-F4 in order to flag a hypervariable resp.
            homologous locus (default: 2.2%, i.e. 0.022)

     h2} ]  min number of unique start points difference between F2-F3 resp. F3-F4
            in order to flag a hypervariable resp. homologous locus (default: 2)

  inX       input file[s] = pibase_bamref output file[s]
            must have same reference sequence (not checked!)
            and same positions of interest

  out       name for (tab-separated) output file

Examples:

# Single file
pibase_consensus na12752_pileupsnps.txt genotypes_na12752_pileupsnps.txt

# Two files
pibase_consensus na12752_1.txt na12752_2.txt gen_na12752_1to2.txt

# Decreased sensitivity to sequencing or mapping errors or contamination
pibase_consensus -t 0.1 8 0.15 8 0.01 na12752_pileupsnps.txt genotypes_na12752_pileupsnps.txt


Notes:

What pibase_consensus does:
Best Quality tag:
?1 : Slightly poor genotype quality. Mapping stringency versus reference sequence context class looks ok. Not all 10 genotyping filter stages lead to the same genotype. However, for the high mapping stringency filter stages, at least t4 (default: 4) unique start points and at least t2 (default: 8) reads support this genotype.
?2 : Somewhat poor genotype quality. Mapping stringency versus reference sequence context class looks ok. This genotype is supported by less than 5 filter stages, but by at least 2 filter stages, of which one stage is in the unique start points category, and the other stage is in the coverage category.
?3: Really poor quality. Reference sequence context looks difficult (homopolymeric run>4, or STRs) and mapping stringency was low. But at least one stringent filter supports this genotype.
?4: Very poor quality. Reference sequence context looks difficult (homopolymeric run>4, or STRs) and mapping stringency was low. But at least one of the unique-start-point filters supports this genotype.
?5: Highly problematic quality. The best unique-start-point derived genotype is in conflict to the best coverage-derived genotype.
?6: Highly problematic quality. The best unique-start-point derived genotype is in conflict to the best coverage-derived genotype, and the best coverage-derived genotype is "senior" to the best usp-derived genotype.
?7: Low-coverage guess. The coverage is below t2 (default: 8) (can be as low as 1).
?8: Low-coverage guess. The coverage is below t2 (default: 8) (can be as low as 1). Reference sequence context looks difficult (homopolymeric run>4, or STRs), and there are no stringently mappable reads.

Poor quality genotypes and single-strand genotypes can be filtered using linux commands as follows:

# copy header lines into new file:
grep '^#' pibase_consensus_file > validated_pibase_consensus_file
# append filtered table lines (indel-reads < 3, tags A+- and B+- and BestQual) to the new file:
awk 'BEGIN {FS="\t"};($7<3)&&($10=="+-")&&($11=="+-")' pibase_consensus_file | grep -v '?' >> validated_pibase_consensus_file


Potentially problematic regions can be filtered or counted using linux commands as follows:

# count number of genotypes in homologous loci:
grep -v '^#' pibase_consensus_file | grep 'H' | wc -l
# count number of genotypes in hypervariable loci:
grep -v '^#' pibase_consensus_file | grep 'V' | wc -l
# count number of genotypes near indels:
awk 'BEGIN {FS="\t"};($7>0)' pibase_consensus_file | wc -l



Essentials | pibase_fisherdiff


pibase_fisherdiff merges two pibase_consensus files and tags differences between these two files (e.g. control sample and case sample) based on Fisher's exact test for five different filter levels. (See example.)

Usage:

pibase_fisherdiff control case out [[cov] [[p] [[fac]]]

  control: pibase_consensus file of control (healthy) sample
  case:    pibase_consensus file of case (diseased) sample
  out:     output file
  []       optional
  [cov]:   min coverage threshold (optional. Default: 100)
  [p]:     max median p-value threshold (optional. Default: 0.01)
  [fac]:   max factor (optional. Default: 3.5).

Example:
# min coverage 50, p <= 0.01, fac <= 10
pibase_fisherdiff normal.txt tumor.txt diff_out.txt 50 0.01 10


Notes:

Eight leading columns are added to the pibase_consensus output, and the two samples are merged into one file which can then be filtered using grep or awk or python/perl/java/etc scripts, or imported into Excel by Windows users and viewed/filtered further in Excel: For a posteriori more stringent filtering of differences than in the original pibase_fisherdiff run, it may be of interest to use the linux awk command. Single-strand genotypes and more stringent p-values can be filtered using linux commands as follows:

# copy header lines (i.e. starting with '#') into new file:
grep '^#' pibase_fisherdiff_file > filtered_file
# append filtered table lines to the new file (p-value <= 0.005, and tags A+- and B-):
awk 'BEGIN {FS="\t"};($3<=0.005)&&($18=="+-")&&($19=="+-")' pibase_fisherdiff_file >> filtered_file
# count differences:
grep '^+' filtered_file | wc -l



Annotate | pibase_tosnpacts


pibase_tosnpacts reads the "BestGen" genotypes from a pibase_consensus file or a pibase_fisherdiff file and creates a file (or file pair, if pibase_fisherdiff) for import into snpacts. After annotation within snpacts, the snpacts-annotations are exported and added to the pibase_consensus or pibase_fisherdiff file using pibase_annot.

Usage:

pibase_tospnacts in out1 [out2]

  in:   file from pibase_consensus or pibase_fisherdiff
  out1: output file (for input into snpacts, in 'own' format)
  out2: 2nd output file (case sample) if pibase_fisherdiff

Examples:
# pibase_consensus file:
pibase_tosnpacts geno_na12752.txt for_snpacts_na12752.txt
# pibase_fisherdiff file:
pibase_tosnpacts diff.txt for_snpacts_control.txt for_snpsacts_case.txt


Notes:

How to import the pibase_tosnpacts file into snpacts:

snpact -ft own -sp 1 -pchr 0 -psnp 1 -prb 2 -pbb 3 -psb 4 -pmt 5 -all -hg hg18 -t tablename filename

Remove problematic EOLs from BLOBs in annotated table:

tableupdate -eol tablename

Export the annotated, de-BLOBBed table to csv:

outputact -t tablename -based 1 -csv -worst -hg hg18 inputfilename outputfilename

where inputfilename is the file from pibase_tosnpacts, and outputfilename is the file for pibase_annot


Annotate | pibase_annot


pibase_annot annotates a pibase_consensus or pibase_fisherdiff file with snp information from an exported snpacts table (see above section on pibase_tosnpacts).

Usage:

pibase_annot type an1 [an2] in out

  type: 1=pibase_consensus; 2=pibase_fisherdiff
  an1:  snpacts-annotations file 1 (control sample, created by outputact)
  an2:  snpacts-annotations file 2 (case sample, created by outputact)
        annot2 is not specified for pibase_consensus (type=1)
  in:   tab-separated file from pibase_consensus or pibase_fisherdiff
  out:  results file with merging of annotations and infile

Examples:
# pibase_consensus file:
pibase_annot 1 annot.txt geno_na12752.txt anno_geno_na12752.txt
# pibase_fisherdiff file:
pibase_annot 2 anno_control.txt anno_tumor.txt diff.txt anno_diff.txt



Annotate | pibase_tag


pibase_tag adds primer region tags to a pibase_consensus, pibase_fisherdiff or pibase_annot file. The pibase file is annotated using information from a primer region file.

Usage:

pibase_tag typ pr in chr pos out [tag]

  typ: type of input file (always enter 0)
  pr:  primer region file, tab-separated: chr pos start end strand (int, 1-based)
  in:  tab-separated file, e.g. from pibase_diff or pibase_annot
  chr: column number (1-based) in in-file with chromosome number
  pos: column number (1-based) in in-file with position
  out: output file ( = tagged input file)
  tag: (optional, default=column 2) column number, at which the 4 primer tag columns are inserted.

Example:
# pibase_annot file:
pibase_tag 0 primer_regions.txt pibase_annot.txt 4 5 pibase_tag_annot.txt 2


Notes:

The primer region file must have a special table header (see ## in example below) and is tab-separated:
chromosome, start, end, strand (integer, 1-based).
#dummy test file for pibase_tag
##chr	start	end	strand
1	152644520	152644530	1
1	152644740	152644750	-1


Phylogenetics | pibase_to_rdf


pibase_to_rdf generates an rdf file (for phylogenetic network analysis) from a set of pibase_fisherdiff files.

Usage:

pibase_to_rdf list rdf [[pmax] [both] [elim]]

  list:   text file which lists the set of pibase_fisherdiff files:
          file_name TAB sample_name EOL
  rdf:    name of output file (for analysis using Network)
  []:     optional arguments:
  [pmax]: max p-value to consider (default=0.01)
  [both]: bothstranded confirmation (y/n, default=y)
  [elim]: eliminate "N"-columns and invariable columns (y/n, default=n)


Example: See pibase_to_rdf.sh

Note:

The example above refers to our pibase paper. Please consult this paper if you are unfamiliar with phylogenetic network analyses.




Phylogenetics | pibase_rdf_ref


pibase_rdf_ref generates a reference-base pibase_consensus file which may be used as the reference sample for pibase_fisherdiff comparisons in a group of samples, e.g. for a pibase_to_rdf run and subsequent phylogenetic network analysis.

Usage:

pibase_rdf_ref in out [cov]

  in:   pibase_consensus file (any sample from the group of samples)
  out:  pibase_consensus file with genotypes taken from Ref column of in-file
  cov:  coverage (optional. default=50. Divisible by two.)


Example: See pibase_to_rdf.sh

Note:

The example above refers to our pibase paper. Please consult this paper if you are unfamiliar with phylogenetic network analyses.


Phylogenetics | pibase_chrm_to_crs


pibase_chrm_to_crs extracts Cambridge Reference Sequence (CRS)* variants from chrM** alignments. The CRS numbering is identical to the numbering of the revised CRS (rCRS)***. The reference sequence bases within the mtDNA control region are identical for the CRS and the rCRS.

(In hg18, hg19, and NCBI36, chrM is used as the mitochondrial reference genome; in NCBI build 37 rCRS is used and in hg20, rCRS will be used; most mtDNA databases are based on the CRS/rCRS genomic coordinates.)

Usage:

pibase_chrm_to_crs in out own pro sam [[t] [min]]

  in:    pibase_consensus genotype file (MUST be chrM ONLY!)
  out:   base name of output files. The following files are output:
   out_f4maj.txt: major variants for highest-stringency filter (f4)
   out_f4min.txt: minor variants for highest-stringency filter (f4)
   out_f0maj.txt: major variants for lowest-stringency filter (f0)
   out_f0maj.txt: minor variants for lowest-stringency filter (f0)
   out_all.txt:   complete pibase_consensus file in CRS notation
   out_sum.txt:   summary file in CRS notation
  own:   abbreviation of owner/PrincipalInvestigator of the project
  pro:   for summary: project abbreviation
  sam:   for summary: sample abbreviation
  []:    optional arguments:
  [t]:   threshold for minor allele cut-off
         (default: 0.015, i.e. 1.5% of total read counts at a given coordinate)
  [min]: min number of read counts required for an allele-call (default: 2)

Example:
pibase_chrm_to_crs pat123_chrM.txt pat123_chrs.txt FRA fro Pat123


Note:

The pibase_consensus file must contain only chrM lines. If there are other lines in the file, the header lines and the chrM lines must be extracted before using pibase_chrm_to_crs:

# copy header lines into new file:
grep '^#' pibase_consensus_file > chrM_pibase_consensus_file
# append lines with "chromosome 25" (i.e. chrM) to the new file:
grep '^25' pibase_consensus_file >> chrM_pibase_consensus_file


References:

* Anderson, S., Bankier, A.T., Barrell, B.G., de Bruijn, M.H., Coulson, A.R., Drouin, J., Eperon, I.C., Nierlich, D.P., Roe, B.A., Sanger, F., Schreier, P.H., Smith, A.J., Staden, R., Young, I.G. (1981) Sequence and organization of the human mitochondrial genome. Nature, 290, 457-465.

** Ingman, M., Kaessmann, H., Pääbo, S., Gyllensten, U. (2000) Mitochondrial genome variation and the origin of modern humans. Nature, 7, 708-713.

*** Andrews, R.M., Kubacka, I., Chinnery, P.F., Lightowlers, R.N., Turnbull, D.M., Howell, N. (1999) Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat Genet., 23, 147.



Utilities | pibase_to_vcf


pibase_to_vcf converts a single-sample pibase file into VCFv4.1 format.

Usage:

pibase_to_vcf [-h] [-v] [-d] [-u s s s s s s s s s s] in out nam bam

  in:  tab-separated file (pibase_consensus)
  out: VCF file
  nam: name of sample in VCF file (column header)
  bam: name of bam file from which in file was generated

Example:
pibase_c_to_contig in.txt out.txt mysample mysample.bam




Utilities | pibase_c_to_contig


pibase_c_to_contig converts pibase-contig-numbers into contig names (e.g. 25 -> chrM).

Usage:

pibase_c_to_contig in [in2] out

  in:  file to be converted (from pibase_consensus, pibase_fisherdiff, ...)
       Note: pibase_fisherdiff has a shortened file head, so a second input file
       with the conversion table must be given.

  in2: file with contig numbering/name table in header
       (required, if the first input file is a pibase_fisherdiff file)

  out: output file

Examples:
pibase_c_to_contig gen.txt gen_contigs.txt
pibase_c_to_contig diff.txt table.txt diff_contigs.txt




Utilities | pibase_flagsnp


pibase_flagsnp flags a BestGen genotype as being an "NGS SNP" (i.e. a mismatch) if it is different to the reference base. Three columns are inserted before the "BestGen" column: "IsSnp" (1/0 flag), "SnpReads" (non-reference-allele coverage at filter level 0), and "CovAll" (coverage at filter level 0, consisting of all reads indicating A, C, G, or T but not N)

To sort the pibase_consensus "BestGen" genotypes into Non-Reference (i.e. SNP) and Reference.

Usage:

pibase_flagsnp in out

  in:  tab-separated file (pibase_consensus, pibase_fisherdiff, or pibase_diff)
  out: results file with merging of annotations and infile

Example:
pibase_flagsnp gen_id100.txt snp_gen_id100.txt



Note: Outside the NGS world, "SNPs" include reference genotypes which are polymorphisms occurring in 1% or more of a population - this is often a matter of confusion between NGS and non-NGS specialists.


Utilities | pibase_diff


pibase_diff compares two pibase_consensus files using (BestGen) genotypes and a (BestQual) quality threshold. (Accuracy can be much lower than pibase_fisherdiff, i.e. up 10x or more false positives and false negatives, but compute time is fast.)

Usage:

pibase_diff control case out worst

  control: pibase_consensus file of control (healthy) sample
  case:    pibase_consensus file of case (diseased) sample
  out:     output file
  worst:   worst BestQual level to consider (0-8)

Examples:
# Consider BestGens up to a worst BestQual of ?1 (i.e. ignore ?2 to ?8):
pibase_diff normal.txt tumor.txt diff_out.txt 1


Notes:

One leading column is added to the pibase_consensus output, and the two samples are merged into one file which can then be filtered using grep or python/perl/java/etc scripts, or imported into Excel by Windows users and viewed/filtered further in Excel: Differences and single-strand genotypes can be filtered using linux commands as follows:

# copy header lines into new file:
grep '^#' pibase_diff_file > filtered_diff_file
# append filtered lines to the new file (filtering by first column and tags A+- and B+-):
awk 'BEGIN {FS="\t"};($11=="+-")&&($12=="+-")' pibase_diff_file | grep -v '^=' >> filtered_diff_file
# count differences:
grep '^+' filtered_diff_file | wc -l



Utilities | pibase_gen_from_snpacts


pibase_gen_from_snpacts inserts snpacts genotypes into a pibase file before the "BestGen" column. Multiple runs of this script can be performed to insert genotypes from several different sources (e.g. samtools, Bioscope, GATK). Only chromosomes chr1-chr22, chrX, chrY are recognised in snpacts files.

To compare genotypes from different sources with signals and genotypes from pibase_consensus.

Usage:

pibase_gen_from_snpacts typ s1 [s2] in out hdr

  typ: type of input file (1=pibase_consensus or pibase_gen_from_snpacts)
  s1:  snpacts file1, created by outputact
  s2:  snpacts file2, (not specified for typ=1)
  in:  tab-separated file, from pibase_consensus or pibase_gen_from_snpacts
  out: results file with merging of annotations and infile
  hdr: header of inserted snpacts-column (e.g. samtools, Bioscope, GATK)

Example:
# add samtools SNPs into pibase_consensus file:
pibase_gen_from_snpacts 1 id100_samtools.txt gen_id100.txt sam_gen_id100.txt samtools
# add Sanger SNPs into samtools-pibase_consensus file:
pibase_gen_from_snpacts 1 id100_sanger.txt sam_gen_id100.txt san_sam_gen_id100.txt sanger




Utilities | pibase_ref


pibase_ref gets 500 flanking nucleotides of reference sequence around each coordinate (of the input list) and outputs a pseudo-FASTA file.

Primary use: manual Sanger-sequencing primer design.

Usage:

pibase_ref in ref out

  in:   file with list of genomic coordinates (e.g. a VCF file):
        chromosome_name TAB 1_based_coordinate_in_chromosome

  ref:  indexed reference sequence file (fai must exist)

  out:  FASTA-formatted output file

Example:
pibase_ref snplist.vcf hg19.fa flanked_snps.fasta




Citing pibase


Please cite this website www.ikmb.uni-kiel.de/pibase as well as our paper

Forster, M., Forster, P., Elsharawy, A., Hemmrich, G., Kreck, B., Wittig, M., Thomsen, I., Stade, B., Barann, M., Ellinghaus, D., Petersen, B.S., May, S., Melum, E., Schilhabel, M.B., Keller, A., Schreiber, S., Rosenstiel, P., Franke, A.

From next-generation sequencing alignments to accurate comparison and validation of single nucleotide variants: the pibase software.

Nucleic Acids Res., doi: 10.1093/nar/gks836


Disclaimer, terms, and conditions


The software, documentation and any help or support are provided "as-is" and without any warranty of any kind. Installation of the required python module pysam and its dependencies must be carried out by a competent linux user or administrator at their own risk (installation of modules or dependencies can render previously working software or even the operating system unusable).

By downloading or using the pibase tools, the Non-Commercial User accepts these unmodified terms and conditions.

The term "Software" in these terms and conditions refers to the pibase software components, including upgrades, additions, copies and documentation as supplied by the Institute of Clinical Molecular Biology.
The term "Non-Commercial User" in these terms and conditions refers to the person who uses the Software for non-commercial purposes. Commercial use is prohibited unless expressly permitted in writing by the Institute of Clinical Molecular Biology.
The term "person" includes the organisation or group if the Software is used within the activities of an organisation or group.
The Institute of Clinical Molecular Biology grants to the Non-Commercial User a non-exclusive license to use the Software free of charge. The Institute of Clinical Molecular Biology grants to the Non-Commercial User a non-exclusive license to use, store, publish and exploit free of charge and without any time limit any results which the Non-Commercial User has generated with the help of the Software, provided that the Non-Commercial User cites the Software and version, the web site www.ikmb.uni-kiel/pibase and the paper on pibase.

The Non-Commercial User acknowledges that the Software is a tool to be used by the Non-Commercial User in his or her work and that the Non-Commercial User shall remain solely responsible for the content and quality of any results produced by the Non-Commercial User from the use or with the aid of the Software. The Non-Commercial User undertakes to indemnify and hold the Institute of Clinical Molecular Biology harmless in the event of any claim against the Institute of Clinical Molecular Biology arising from any incorrect or misleading information that the Non-Commercial User has obtained with the aid or by the use of the Software.

The Institute of Clinical Molecular Biology reserves all intellectual property rights of the source code and prohibits the copying, re-writing, or re-use of the code or parts thereof except for the purposes expressly granted in this agreement. These terms and conditions shall be governed by and construed and interpreted in accordance with German Law and submitted to the sole jurisdiction of the court of Kiel (Germany).



Contact


Michael Forster or Prof. Dr. rer. nat. Andre Franke.

m [dott] forster [att] ikmb [dott] uni - kiel [dott] de

a [dott] franke [att] mucosa [dott] de


Institute of Clinical Molecular Biology
Christian-Albrechts-University of Kiel
Schittenhelmstrasse 12
D-24105 Kiel
Germany


^top