VCFtools

As of July 2015, the VCFtools project has moved to GitHub. Please visit the new website: vcftools.github.io/perl_module.html

The Perl modules and scripts

VCFtools contains a Perl API (Vcf.pm) and a number of Perl scripts for common tasks with VCF files, such as validation, merging, intersection and complements. The Perl tools support all versions of the VCF specification (3.2, 3.3, 4.0, 4.1 and 4.2); nevertheless, users are encouraged to use the latest versions, VCFv4.1 or VCFv4.2. VCFtools has been used mainly with diploid data, but the Perl tools aim to support polyploid data as well. Run any of the Perl scripts with the --help switch for more information.

Many of the Perl scripts require that the VCF files are compressed by bgzip and indexed by tabix (both tools are part of the tabix package). The VCF files can be compressed and indexed using the following commands:

bgzip my_file.vcf
tabix -p vcf my_file.vcf.gz

The tools


fill-an-ac

Fill or recalculate AN and AC INFO fields.

zcat file.vcf.gz | fill-an-ac | bgzip -c > out.vcf.gz
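To make the two tags concrete: in the VCF specification, AN is the total number of alleles in called genotypes and AC is the number of times each ALT allele appears in them. The following is a minimal awk sketch of that counting, not the way fill-an-ac itself is implemented. The function name count_an_ac is hypothetical, and the sketch assumes an uncompressed VCF on standard input with GT as the first FORMAT subfield.

```shell
# Hypothetical helper: print CHROM, POS, AN and AC for each record read from stdin.
# Assumes GT is the first FORMAT subfield and sample columns start at column 10.
count_an_ac() {
  awk -F'\t' '
    /^#/ { next }                            # skip header lines
    {
      n_alt = split($5, alts, ",")           # number of ALT alleles
      an = 0
      for (i = 1; i <= n_alt; i++) ac[i] = 0
      for (s = 10; s <= NF; s++) {
        split($s, fields, ":")               # GT is the first subfield
        n = split(fields[1], alleles, "[/|]")
        for (j = 1; j <= n; j++) {
          if (alleles[j] == ".") continue    # missing alleles are not counted
          an++
          if (alleles[j] > 0) ac[alleles[j]]++
        }
      }
      out = ac[1]
      for (i = 2; i <= n_alt; i++) out = out "," ac[i]
      printf "%s\t%s\tAN=%d;AC=%s\n", $1, $2, an, out
    }'
}
```

For example, `zcat file.vcf.gz | count_an_ac` would print one line per site, which can be compared against the AN and AC values written by fill-an-ac.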


fill-fs

Annotates the VCF file with flanking sequence (INFO/FS tag) masking known variants with N's. Useful for designing primers.

fill-fs -r /path/to/refseq.fa | vcf-query '%CHROM\t%POS\t%INFO/FS\n' > out.tab


fill-ref-md5

Fill missing reference info and sequence MD5s into VCF header.

fill-ref-md5 -i "SP:Homo\ Sapiens" -r ref.fasta in.vcf.gz -d ref.dict out.vcf.gz


fill-rsIDs

Fill missing rsIDs. This script has been discontinued, please use vcf-annotate instead.

vcf-annotate

The script adds or removes filters and custom annotations in VCF files. To add custom annotations, create a TAB-delimited file with annotations such as:

#CHR  FROM   TO     ANNOTATION
1     12345  22345  gene1
1     67890  77890  gene2

Compress the file (using bgzip annotations), index it (using tabix -s 1 -b 2 -e 3 annotations.gz), and run:

cat in.vcf | vcf-annotate -a annotations.gz \
   -d key=INFO,ID=ANN,Number=1,Type=String,Description='My custom annotation' \
   -c CHROM,FROM,TO,INFO/ANN > out.vcf
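Because the annotation file must contain real TAB characters, one convenient way to create it is with printf. The sketch below writes the example table to a file named annotations (the name used in the text above) and checks that every line has exactly four TAB-separated fields. The variable name bad_lines is only illustrative, and the bgzip and tabix steps are left as comments because they require the tabix package.

```shell
# Write the example annotation table with real TAB separators.
printf '#CHR\tFROM\tTO\tANNOTATION\n'  > annotations
printf '1\t12345\t22345\tgene1\n'     >> annotations
printf '1\t67890\t77890\tgene2\n'     >> annotations

# Sanity check: collect the numbers of lines that do not have exactly four fields.
bad_lines=$(awk -F'\t' 'NF != 4 { print NR }' annotations)
if [ -n "$bad_lines" ]; then
  echo "malformed lines: $bad_lines" >&2
fi

# Next steps, as described in the text (they require the tabix package):
#   bgzip annotations
#   tabix -s 1 -b 2 -e 3 annotations.gz
```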

The script is also routinely used to apply filters. There are a number of predefined filters, and custom filters can easily be added; see vcf-annotate -h for examples. Some of the predefined filters take advantage of tags added by bcftools. Descriptions of the ones most frequently asked about follow:

Strand Bias .. Tests whether variant bases tend to come from one strand. Uses Fisher's exact test on a 2x2 contingency table in which the row variable is whether the base is the reference allele and the column variable is the strand. The two-tailed P-value is used.
End Distance Bias .. Tests whether variant bases tend to occur at a fixed distance from the end of reads, which is usually an indication of misalignment (T-test).
Base Quality Bias .. Tests whether variant bases tend to occur with a quality bias (T-test). This filter is effectively disabled by default because its threshold is set to 0.
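As an illustration of the strand bias test, the sketch below computes a two-tailed Fisher's exact P-value for a 2x2 table of read counts using awk. It follows the textbook definition (summing the probabilities of all tables with the same margins that are no more likely than the observed one), which is an assumption here: the exact convention used by bcftools and vcf-annotate is not stated in this document. The function name fisher_two_tailed and the argument order are hypothetical.

```shell
# Hypothetical helper: two-tailed Fisher's exact test for the 2x2 table
#               forward  reverse
#   reference      a        b
#   non-reference  c        d
# Usage: fisher_two_tailed a b c d
fisher_two_tailed() {
  awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" '
    # log(n!) by summing logs; adequate for read-depth sized counts
    function lfact(n,   i, s) { s = 0; for (i = 2; i <= n; i++) s += log(i); return s }
    # log-probability of the table whose top-left cell is x, with margins fixed
    function lprob(x,   num, den) {
      num = lfact(r1) + lfact(r2) + lfact(c1) + lfact(c2)
      den = lfact(total) + lfact(x) + lfact(r1 - x) + lfact(c1 - x) + lfact(r2 - c1 + x)
      return num - den
    }
    BEGIN {
      r1 = a + b; r2 = c + d; c1 = a + c; c2 = b + d; total = r1 + r2
      lo = (c1 - r2 > 0) ? c1 - r2 : 0        # smallest feasible top-left cell
      hi = (r1 < c1) ? r1 : c1                # largest feasible top-left cell
      obs = lprob(a)
      p = 0
      for (x = lo; x <= hi; x++) {
        lp = lprob(x)
        if (lp <= obs + 1e-9) p += exp(lp)    # tables no more likely than the observed one
      }
      if (p > 1) p = 1
      printf "%.4f\n", p
    }'
}
```

A small P-value from `fisher_two_tailed a b c d` suggests that reference and non-reference bases are distributed differently across the two strands.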

Note: A fast htslib C version of this tool is now available (see bcftools annotate).


vcf-compare

Compares positions in two or more VCF files and outputs the number of positions contained in one file but not the others, in two files but not the others, and so on, which comes in handy when generating Venn diagrams. The script also computes figures such as non-reference discordance rates (including multiallelic sites) and can compare the actual sequence (useful when comparing indels).

vcf-compare -H A.vcf.gz B.vcf.gz C.vcf.gz

Note: A fast htslib C version of this tool is now available (see bcftools stats).


vcf-concat

Concatenates VCF files (for example, files split by chromosome). The input and output VCFs have the same number of columns; the script does not merge VCFs by position (see also vcf-merge).

In the basic mode it does nothing beyond a sanity check that all files have the same columns. When run with the -s option, it performs a partial merge sort, keeping only a limited number of files open simultaneously.

vcf-concat A.vcf.gz B.vcf.gz C.vcf.gz | gzip -c > out.vcf.gz


vcf-consensus

Apply VCF variants to a fasta file to create consensus sequence.

cat ref.fa | vcf-consensus file.vcf.gz > out.fa


vcf-convert

Convert between VCF versions, currently from VCFv3.3 to VCFv4.0.

zcat file.vcf.gz | vcf-convert -r reference.fa > out.vcf


vcf-contrast

A tool for finding differences between groups of samples, useful in trio analyses, cancer genomes, etc.

In the example below, only variants with an average mapping quality of at least 30 (-f MinMQ=30) and a minimum depth of 10 (-d 10) are considered. Only novel alleles are reported (-n). vcf-query is then used to extract the INFO/NOVEL* annotations into a table. Finally, the sites are sorted by the confidence that the site differs in the child, which is the third whitespace-separated column of the table, INFO/NOVELTY (sort -k3,3nr).

vcf-annotate -f MinMQ=30 file.vcf | vcf-contrast -n +Child -Mother,Father -d 10 -f | vcf-query -f '%CHROM %POS\t%INFO/NOVELTY\t%INFO/NOVELAL\t%INFO/NOVELGT[\t%SAMPLE %GTR %PL]\n' | sort -k3,3nr | head


vcf-filter

Please take a look at vcf-annotate and bcftools view, which do what you are looking for. Apologies for the non-intuitive naming.
Note: A fast HTSlib C version of a filtering tool is now available (see bcftools filter and bcftools view).

vcf-fix-newlines

Fixes newline characters: reads a VCF file with any commonly used newline representation and writes it with the current system's newline representation.


vcf-fix-ploidy

Fixes diploid vs haploid genotypes on sex chromosomes, including the pseudoautosomal regions.


vcf-indel-stats

Calculates the in-frame ratio of indels.

Note: A fast htslib C version of this tool is now available (see bcftools stats).


vcf-isec

Creates intersections and complements of two or more VCF files. Given multiple VCF files, it can output the list of positions which are shared by at least N files, at most N files, exactly N files, etc. The first example below outputs positions shared by at least two files, and the second outputs positions present in file A but absent from files B and C.

vcf-isec -n +2 A.vcf.gz B.vcf.gz | bgzip -c > out.vcf.gz
vcf-isec -c A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz
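The "shared by at least N files" rule can be sketched with standard tools. The hypothetical helper below (positions_in_at_least is not part of VCFtools) compares only CHROM and POS in uncompressed VCF files and ignores alleles, so it is a simplified illustration of the idea rather than a replacement for vcf-isec.

```shell
# Hypothetical helper: list CHROM<TAB>POS pairs present in at least N of the given
# uncompressed VCF files. Only positions are compared, not alleles.
# Usage: positions_in_at_least N file1.vcf file2.vcf ...
positions_in_at_least() {
  min_files=$1
  shift
  awk -F'\t' -v min="$min_files" '
    /^#/ { next }                            # skip header lines
    {
      key = $1 "\t" $2
      if (!((FILENAME, key) in seen)) {      # count each position once per file
        seen[FILENAME, key] = 1
        count[key]++
      }
    }
    END { for (key in count) if (count[key] >= min) print key }
  ' "$@" | LC_ALL=C sort -k1,1 -k2,2n
}
```

For instance, `positions_in_at_least 2 A.vcf B.vcf C.vcf` corresponds loosely to the `-n +2` case, applied to positions only.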

Note: A fast htslib C version of this tool is now available (see bcftools isec).


vcf-merge

Merges two or more VCF files into one so that, for example, two source files with one sample column each produce an output file with two sample columns. See also vcf-concat for concatenating VCFs split by chromosome.

vcf-merge A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz

Note that this script is not intended for concatenating VCF files. For this, use vcf-concat instead.
Note: A fast htslib C version of this tool is now available (see bcftools merge).


vcf-phased-join

Concatenates multiple overlapping VCFs preserving phasing.


vcf-query

A powerful tool for converting VCF files into a user-defined format. Supports retrieval of subsets of positions, columns and fields.

vcf-query file.vcf.gz 1:10327-10330
vcf-query file.vcf -f '%CHROM:%POS %REF %ALT [ %DP]\n'

Note: A fast htslib C version of this tool is now available (see bcftools query).


vcf-shuffle-cols

Reorders columns to match the order in a template VCF.

vcf-shuffle-cols -t template.vcf.gz file.vcf.gz > out.vcf


vcf-sort

Sort a VCF file.

vcf-sort file.vcf.gz
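For readers who want to see what sorting a VCF involves, the sketch below keeps the header lines on top and sorts the data lines by chromosome name and then numerically by position. It is a generic shell idiom, not the vcf-sort implementation. The function name sort_vcf is hypothetical, the input is assumed to be an uncompressed file, and chromosome names are ordered byte-wise (so "10" sorts before "2").

```shell
# Hypothetical helper: print a sorted copy of an uncompressed VCF file.
# Header lines are kept first; data lines are ordered by CHROM (byte-wise)
# and then numerically by POS.
# Usage: sort_vcf file.vcf
sort_vcf() {
  tab=$(printf '\t')
  awk '/^#/' "$1"                                        # header lines, unchanged
  awk '!/^#/' "$1" | LC_ALL=C sort -t "$tab" -k1,1 -k2,2n  # data lines, sorted
}
```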


vcf-stats

Outputs some basic statistics: the number of SNPs, indels, etc.

vcf-stats file.vcf.gz

Note: A fast htslib C version of this tool is now available (see bcftools stats).


vcf-subset

Remove some columns from the VCF file.

vcf-subset -c NA0001,NA0002 file.vcf.gz | bgzip -c > out.vcf.gz

Note: A fast HTSlib C version of this tool is now available (see bcftools view).


vcf-tstv

A lightweight script for quick calculation of Ts/Tv ratio.

cat file.vcf | vcf-tstv
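The Ts/Tv ratio is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other single-base substitutions). The awk sketch below shows the calculation for biallelic SNPs only. It is an illustration under that simplifying assumption, not the vcf-tstv code, and the function name tstv_ratio is hypothetical.

```shell
# Hypothetical helper: Ts/Tv ratio over biallelic SNPs in an uncompressed VCF on stdin.
# Indels, multiallelic sites and sites without an ALT allele are skipped.
tstv_ratio() {
  awk -F'\t' '
    /^#/ { next }                            # skip header lines
    length($4) == 1 && length($5) == 1 && $4 ~ /[ACGT]/ && $5 ~ /[ACGT]/ {
      pair = $4 $5
      if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ts++
      else if ($4 != $5) tv++
    }
    END {
      if (tv > 0) printf "ts=%d tv=%d ts/tv=%.2f\n", ts, tv, ts / tv
      else        printf "ts=%d tv=0 ts/tv=NA\n", ts
    }'
}
```

It can be run in the same way as the command above, for example `cat file.vcf | tstv_ratio`.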

Note: A fast htslib C version of this tool is now available (see bcftools stats).


vcf-to-tab

A simple script which converts the VCF file into a tab-delimited text file listing the actual variants instead of ALT indexes.

zcat file.vcf.gz | vcf-to-tab > out.tab
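"Actual variants instead of ALT indexes" means that a genotype such as 0/1 at a site with REF=A and ALT=G is written as A/G. The sketch below shows that substitution in awk. The function name gt_to_alleles and the exact output columns are assumptions made for illustration and may differ from what vcf-to-tab prints; GT is assumed to be the first FORMAT subfield.

```shell
# Hypothetical helper: print CHROM, POS, REF and one column per sample, with
# genotype indexes replaced by allele strings. Reads an uncompressed VCF on stdin.
gt_to_alleles() {
  awk -F'\t' '
    /^##/ { next }                           # skip meta-information lines
    /^#CHROM/ {                              # column header: keep sample names
      printf "#CHROM\tPOS\tREF"
      for (s = 10; s <= NF; s++) printf "\t%s", $s
      printf "\n"
      next
    }
    {
      allele[0] = $4                         # index 0 is the REF allele
      n_alt = split($5, alts, ",")
      for (i = 1; i <= n_alt; i++) allele[i] = alts[i]
      printf "%s\t%s\t%s", $1, $2, $4
      for (s = 10; s <= NF; s++) {
        split($s, parts, ":")                # GT is the first subfield
        gt = parts[1]
        sep = (index(gt, "|") > 0) ? "|" : "/"   # keep the phasing separator
        n = split(gt, idx, "[/|]")
        out = ""
        for (j = 1; j <= n; j++) {
          a = (idx[j] == ".") ? "." : allele[idx[j]]
          out = (j == 1) ? a : out sep a
        }
        printf "\t%s", out
      }
      printf "\n"
    }'
}
```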


vcf-validator

Validates a VCF file.

vcf-validator file.vcf.gz


Vcf.pm

For examples of how to use the Perl API, it is best to look at some of the simpler scripts, for example vcf-to-tab. The detailed documentation can be obtained by running

perldoc Vcf.pm