The Perl modules and scripts

VCFtools contains a Perl API (Vcf.pm) and a number of Perl scripts that can be used to perform common tasks with VCF files such as file validation, file merging, intersecting, complements, etc. The Perl tools support all versions of the VCF specification (3.2, 3.3, 4.0 and 4.1), nevertheless, the users are encouraged to use the latest versions VCFv4.0 or VCFv4.1. The VCFtools in general have been used mainly with diploid data, but the Perl tools aim to support polyploid data as well.

Run any of the Perl scripts with the --help switch to obtain more help. Note that the PERL5LIB environment variable must contain the path to your VCFtools installation in order for the scripts to work.

export PERL5LIB=/path/to/your/vcftools-directory/

Many of the Perl scripts require that the VCF files are compressed by bgzip and indexed by tabix (both tools are part of the tabix package, available for download here). The VCF files can be compressed and indexed using the following commands

bgzip my_file.vcf
tabix -p vcf my_file.vcf.gz

The tools

fill-aa

Fill in ancestral alleles.

zcat file.vcf.gz | fill-aa -a ancestral-alleles.fa.gz | bgzip -c > out.vcf.gz

(Read more)
About: This script fills ancestral alleles into INFO column of VCF files. It depends on samtools,
   therefore the fasta sequence must be gzipped (not bgzipped!) and indexed by samtools faidx.
   The AA files can be downloaded from
       ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/ancestral_alignments
   and processed as shown in the example below. This is because the sequences in the original files
   are named as 'ANCESTOR_for_chromosome:NCBI36:1:1:247249719', but the underlying FaSplice.pm
   requires names as 'chr1' or '1'.
Usage: fill-aa [OPTIONS] < in.vcf >out.vcf
Options:
   -a, --ancestral-allele <prefix>     Prefix to ancestral allele chromosome files.
   -t, --type <list>                   Variant types to process: all,indel,ref,snp. [all]
   -h, -?, --help                      This help message.
Example:
   # Get the files ready: compress by gzip and index by samtools faidx. Either repeat the
   # following command for each file manually
   bzcat human_ancestor_1.fa.bz2 | sed 's,^>.*,>1,' | gzip -c > human_ancestor_1.fa.gz
   samtools faidx human_ancestor_1.fa.gz

   # .. or use this loop (tested in bash shell)
   ls human_ancestor_*.fa.bz2 | while read IN; do
       OUT=`echo $IN | sed 's,bz2$,gz,'`
       CHR=`echo $IN | sed 's,human_ancestor_,, ; s,.fa.bz2,,'`
       bzcat $IN | sed "s,^>.*,>$CHR," | gzip -c > $OUT
       samtools faidx $OUT
   done

   # After this has been done, the following command should return 'TACGTGGcTGCTCTCACACAT'
   samtools faidx human_ancestor_1.fa.gz 1:1000000-1000020

   # Now the files are ready to use with fill-aa. Note that the VCF file
   # should be sorted (see vcf-sort), otherwise the performance would be seriously
   # affected.
   cat file.vcf | fill-aa -a human_ancestor_ 2>test.err | gzip -c >out.vcf.gz
fill-an-ac

Fill or recalculate AN and AC INFO fields.

zcat file.vcf.gz | fill-an-ac | bgzip -c > out.vcf.gz

(Read more)
Usage: fill-an-ac [OPTIONS] < in.vcf >out.vcf
Options:
   -h, -?, --help                      This help message.
fill-fs

Annotates the VCF file with flanking sequence (INFO/FS tag) masking known variants with N's. Useful for designing primers.

fill-fs -r /path/to/refseq.fa | vcf-query '%CHROM\t%POS\t%INFO/FS\n' > out.tab

(Read more)
About: Annotate VCF with flanking sequence (INFO/FS tag)
Usage: fill-fs [OPTIONS] file.vcf
Options:
   -b, --bed-mask <file>           Regions to mask (tabix indexed), multiple files can be given
   -c, --cluster <int>             Do self-masking of clustered variants within this range.
   -l, --length <int>              Flanking sequence length [100]
   -m, --mask-char <char|lc>       The character to use or "lc" for lowercase. This option must preceed
                                       -b, -v or -c in order to take effect. With multiple files works
                                        as a switch on the command line, see the example below [N]
   -r, --refseq <file>             The reference sequence.
   -v, --vcf-mask <file>           Mask known variants in the flanking sequence, multiple files can be given (tabix indexed)
   -h, -?, --help                  This help message.
Example:
   # Mask variants from the VCF file with N's and use lowercase for the bed file regions
   fill-fs file.vcf -v mask.vcf -m lc -b mask.bed
fill-ref-md5

Fill missing reference info and sequence MD5s into VCF header.

fill-ref-md5 -i "SP:Homo\ Sapiens" -r ref.fasta in.vcf.gz -d ref.dict out.vcf.gz

(Read more)
About: The script computes MD5 sum of the reference sequence and inserts
   'reference' and 'contig' tags into header as recommended by VCFv4.1.
   The VCF file must be compressed and tabix indexed, as it takes advantage
   of the lightning fast tabix reheader functionality.
Usage: fill-ref-md5 [OPTIONS] in.vcf.gz out.vcf.gz
Options:
   -d, --dictionary <file>             Where to read/write computed MD5s. Opened in append mode, existing records are not touched.
   -i, --info <AS:xx,SP:xx,TX:xx>      Optional info on reference assembly (AS), species (SP), taxonomy (TX)
   -r, --refseq <file>                 The reference sequence in fasta format indexed by samtools faidx
   -h, -?, --help                      This help message.
Examples:
   fill-ref-md5 -i AS:NCBIM37,SP:"Mus\ Musculus" -r NCBIM37_um.fa  -d NCBIM37_um.fa.dict in.vcf.gz out.vcf.gz
fill-rsIDs

Fill missing rsIDs. This script has been discontinued, please use vcf-annotate instead.

vcf-annotate

The script adds or removes filters and custom annotations to VCF files. To add custom annotations to VCF files, create TAB delimited file with annotations such as

#CHR FROM TO ANNOTATION 1 12345 22345 gene1 1 67890 77890 gene2

Compress the file (using bgzip annotations), index (using tabix -s 1 -b 2 -e 3 annotations.gz) and run

cat in.vcf | vcf-annotate -a annotations.gz \
   -d key=INFO,ID=ANN,Number=1,Type=Integer,Description='My custom annotation' \
   -c CHROM,FROM,TO,INFO/ANN > out.vcf

The script is also routinely used to apply filters. There are a number of predefined filters and custom filters can be easily added, see vcf-annotate -h for examples. Some of the predefined filters take advantage of tags added by bcftools, the descriptions of the most frequently asked ones follow:

Strand Bias .. Tests if variant bases tend to come from one strand. Fisher's exact test for 2x2 contingency table where the row variable is being the reference allele or not and the column variable is strand. Two-tail P-value is used.
End Distance Bias .. Tests if variant bases tend to occur at a fixed distance from the end of reads, which is usually an indication of misalignment. (T-test)
Base Quality Bias .. Tests if variant bases tend to occur with a quality bias (T-test). This filter is by default effectively disabled as it is set to 0.
(Read more)
About: Annotates VCF file, adding filters or custom annotations. Requires tabix indexed file with annotations.
   Currently it can annotate ID, QUAL, FILTER and INFO columns, but will be extended on popular demand.
   For examples of user-defined filters see online documentation or examples/filters.txt in vcftools distribution.
Usage: cat in.vcf | vcf-annotate [OPTIONS] > out.vcf
Options:
   -a, --annotations <file.gz>         The tabix indexed file with the annotations: CHR\tFROM[\tTO][\tVALUE]+.
   -c, --columns <list>                The list of columns in the annotation file, e.g. CHROM,FROM,TO,-,QUAL,INFO/STR,INFO/GN. The dash
                                           in this example indicates that the third column should be ignored. If TO is not
                                           present, it is assumed that TO equals to FROM. When REF and ALT columns are present, only
                                           matching lines are annotated.
   -d, --description <file|string>     Header annotation, e.g. key=INFO,ID=HM2,Number=0,Type=Flag,Description='HapMap2 membership'.
                                           The descriptions can be read from a file, one annotation per line.
       --fill-AC-AN                    (Re)Calculate AC and AN tags
       --fill-HWE                      (Re)Calculate HWE, AC and AN tags
       --fill-ICF                      (Re)Calculate Inbreeding Coefficient F, HWE, AC and AN
       --fill-type                     Annotate INFO/TYPE with snp,del,ins,mnp,complex
   -f, --filter <file|list>            Apply filters, list is in the format flt1=value/flt2/flt3=value/etc. If argument to -f is a file,
                                           user-defined filters be applied. See User Defined Filters below.
   -H, --hard-filter                   Remove lines with FILTER anything else than PASS or "."
   -n, --normalize-alleles             Make REF and ALT alleles more compact if possible (e.g. TA,TAA -> T,TA).
   -r, --remove <list>                 Comma-separated list of tags to be removed (e.g. ID,INFO/DP,FORMAT/DP,FILTER).
   -h, -?, --help                      This help message.
Filters:
	+                           		Apply all filters with default values (can be overriden, see the example below).
	-X                          		Exclude the filter X
	1, StrandBias  FLOAT        		Min P-value for strand bias (INFO/PV4) [0.0001]
	2, BaseQualBias  FLOAT      		Min P-value for baseQ bias (INFO/PV4) [0]
	3, MapQualBias  FLOAT       		Min P-value for mapQ bias (INFO/PV4) [0]
	4, EndDistBias  FLOAT       		Min P-value for end distance bias (INFO/PV4) [0.0001]
	a, MinAB  INT               		Minimum number of alternate bases (INFO/DP4) [2]
	c, SnpCluster  INT1,INT2    		Filters clusters of 'INT1' or more SNPs within a run of 'INT2' bases []
	D, MaxDP  INT               		Maximum read depth (INFO/DP or INFO/DP4) [10000000]
	d, MinDP  INT               		Minimum read depth (INFO/DP or INFO/DP4) [2]
	H, HWE  FLOAT               		Minimum P-value for HWE and F<0 (invokes --fill-HWE) []
	H2, HWE2  FLOAT              		Minimum P-value for HWE (plus F<0) (INFO/AC and INFO/AN or --fill-AC-AN) []
	HG, HWE_G3  FLOAT            		Minimum P-value for HWE and F<0 (INFO/HWE and INFO/G3) []
	q, MinMQ  INT               		Minimum RMS mapping quality for SNPs (INFO/MQ) [10]
	Q, Qual  INT                		Minimum value of the QUAL field [10]
	r, RefN                     		Reference base is N []
	v, VDB  FLOAT               		Minimum Variant Distance Bias (INFO/VDB) [0]
	W, GapWin  INT              		Window size for filtering adjacent gaps [3]
	w, SnpGap  INT              		SNP within INT bp around a gap to be filtered [10]
Examples:
   zcat in.vcf.gz | vcf-annotate -a annotations.gz -d descriptions.txt -c FROM,TO,CHROM,ID,INFO/DP | bgzip -c >out.vcf.gz
   zcat in.vcf.gz | vcf-annotate -f +/-a/c=3,10/q=3/d=5/-D -a annotations.gz -d key=INFO,ID=GN,Number=1,Type=String,Description='Gene Name' | bgzip -c >out.vcf.gz
   zcat in.vcf.gz | vcf-annotate -a dbSNPv132.tab.gz -c CHROM,POS,REF,ALT,ID,-,-,- | bgzip -c >out.vcf.gz
   zcat in.vcf.gz | vcf-annotate -r FILTER/MinDP | bgzip -c >out.vcf.gz
Where descriptions.txt contains:
   key=INFO,ID=GN,Number=1,Type=String,Description='Gene Name'
   key=INFO,ID=STR,Number=1,Type=Integer,Description='Strand'
The file dbSNPv132.tab.gz with dbSNP IDs can be downloaded from
   ftp://ftp.sanger.ac.uk/pub/1000genomes/pd3/dbSNP/
(Read even more)
# Examples of user-defined filters. Edit and run with -f filters.txt.
# The examples below are self-explanatory. Notice the use of the predefined
#  variables ($PASS, $FAIL, $MATCH, $RECORD) and methods (error).


# In this example, a minimum value of AF1=0.1 is required
{
    tag  => 'INFO/AF1',                       # The VCF tag to apply this filter on
        name => 'MinAF',                          # The filter ID
        desc => 'Minimum AF1 [0.01]',             # Description for the VCF header
        test => sub { return $MATCH < 0.01 ? $FAIL : $PASS },
},


# Filter all indels (presence of INDEL tag is tested)
{
    tag      => 'INFO/INDEL',
    apply_to => 'indels',         # Can be one of SNPs, indels, all. Default: [All]
    name     => 'Indel',
    desc     => 'INDEL tag present',
    test     => sub { return $FAIL },
},


# Only loci with enough reads supporting the variant will pass the filter
{
    tag      => 'INFO/DP4',
    name     => 'FewAlts',
    desc     => 'Too few reads supporting the variant',
    apply_to => 'SNPs',
    test     => sub {
        if ( !($MATCH =~ /^([^,]+),([^,]+),([^,]+),(.+)$/) )
        {
            error("Could not parse INFO/DP4: $CHROM:$POS [$MATCH]");
        }
        if ( 0.1*($1+$2) > $3+$4  ) { return $PASS; }
        return $FAIL;
    },
},

# Example of filtering based on genotype columns and the QUAL column
{
    tag      => 'FORMAT/PL',
    name     => 'NoHets',
    desc     => 'Inbred homozygous mouse, no hets expected',
    apply_to => 'SNPs',
    test     => sub {
            for my $pl (@$MATCH)
            {
                my @pls = split(/,/,$pl);
                if ( $pls[1]<$pls[0] && $pls[1]<$pls[2] ) { return $FAIL; }
            }
        return $PASS;
    },
},


# This example splits the four PV4 values into four tags names PV0, PV1, PV2 and PV3.
#   Note the use of the 'header' key, and the $RECORD and $VCF variables.
{
    header   => [
                    qq[key=INFO,ID=PV0,Number=1,Type=Float,Description="P-value for strand bias"],
                    qq[key=INFO,ID=PV1,Number=1,Type=Float,Description="P-value for baseQ bias"],
                    qq[key=INFO,ID=PV2,Number=1,Type=Float,Description="P-value for mapQ bias"],
                    qq[key=INFO,ID=PV3,Number=1,Type=Float,Description="P-value for tail distance bias"]
                ],
    tag      => 'INFO/PV4',
    name     => 'SplitPV4',
    desc     => 'Split PV4',
    apply_to => 'all',
    test     => sub {
        my @vals = split(/,/,$MATCH);
        $$RECORD[7] = $VCF->add_info_field($$RECORD[7],'PV0'=>$vals[0],'PV1'=>$vals[1],'PV2'=>$vals[2],'PV3'=>$vals[3]);
        return $PASS;
    },
},

# Do whatever you want with every record and edit it according to your needs. This silly
#   example removes the tag SILLY in records where ID is set and depth is bigger than 5.
{
    tag      => 'Dummy',
    test     => sub {
        if ( $$RECORD[2] eq '.' ) { return $PASS; } # Modify only lines with ID
        my $dp = $vcf->get_info_field($$RECORD[7],'DP');
        if ( $dp>5 ) { $$RECORD[7] = $VCF->add_info_field($$RECORD[7],'SILLY'=>undef); }
        return $PASS;
    },
}


# Filter records with the value XY absent or not equal to 42
{
    tag      => 'Dummy',
    header   => [
        qq[key=FILTER,ID=XY,Description="XY not OK"],
    ],
    test     => sub {
        my $xy      = $VCF->get_info_field($$RECORD[7],'XY');
        my $is_bad  = ( !defined $xy or $xy!=42 ) ? 1 : 0;
        $$RECORD[6] = $VCF->add_filter($$RECORD[6],'XY'=>$is_bad);
        return $PASS;
    },
},


# Annotate INFO field with SINGLETON flag when one and only one sample is different from the reference
{
    header   => [
        qq[key=INFO,ID=SINGLETON,Number=0,Type=Flag,Description="Only one non-ref sample"],
    ],
    tag      => 'FORMAT/GT',
    name     => 'Dummy',
    desc     => 'Dummy',
    test     => sub {
        my $nalt = 0;
        for my $gt (@$MATCH)
        {
            my @gt = $VCF->split_gt($gt);
            for my $allele (@gt)
            {
                if ( $allele ne 0 && $allele ne '.' ) { $nalt++; last; }
            }
            if ( $nalt>1 ) { last; }
        }
        if ( $nalt==1 ) { $$RECORD[7] = $VCF->add_info_field($$RECORD[7],'SINGLETON'=>''); }
        return $PASS;
    },
},


# Set genotypes to unknown ("." or "./." depending on ploidy) when coverage is low (by Shane McCarthy).
{
    tag      => 'FORMAT/DP',
    name     => 'MinSampleDP',
    desc     => 'Genotypes set to . for samples with DP < 2',
    apply_to => 'all',
    test     => sub {
        my $i = 8;
        for my $dp (@$MATCH)
        {
            $i++;
            next unless ($dp<2);
            my @format = split(/:/,$$RECORD[$i]);
            $format[0] = $format[0] =~ /\// ? "./." : ".";
            $$RECORD[$i] = join(":",@format);
        }
        return $PASS;
    },
},
vcf-compare

Compares positions in two or more VCF files and outputs the numbers of positions contained in one but not the other files; two but not the other files, etc, which comes handy when generating Venn diagrams. The script also computes numbers such as nonreference discordance rates (including multiallelic sites), compares actual sequence (useful when comparing indels), etc.

vcf-compare -H A.vcf.gz B.vcf.gz C.vcf.gz


Note: a fast htslib C version of this tool is now available (see vcf check).
(Read more)
About: Compare bgzipped and tabix indexed VCF files. (E.g. bgzip file.vcf; tabix -p vcf file.vcf.gz)
Usage: vcf-compare [OPTIONS] file1.vcf file2.vcf ...
       vcf-compare -p plots chr1.cmp chr2.cmp ...
Options:
   -a, --apply-filters                 Ignore lines where FILTER column is anything else than PASS or '.'
   -c, --chromosomes <list|file>       Same as -r, left for backward compatibility. Please do not use as it will be dropped in the future.
   -d, --debug                         Debugging information. Giving the option multiple times increases verbosity
   -g, --cmp-genotypes                 Compare genotypes, not only positions
       --ignore-indels                 Exclude sites containing indels from genotype comparison
   -m, --name-mapping <list|file>      Use with -g when comparing files with differing column names. The argument to this options is a
                                           comma-separated list or one mapping per line in a file. The names are colon separated and must
                                           appear in the same order as the files on the command line.
       --INFO <string> [<int>]         Calculate genotype errors by INFO. Use zero based indecies if field has more than one value. Can be
                                           given multiple times.
   -p, --plot <prefix>                 Create plots. Multiple files (e.g. per-chromosome outputs from vcf-compare) can be given.
   -R, --refseq <file>                 Compare the actual sequence, not just positions. Use with -w to compare indels.
   -r, --regions <list|file>           Process the given regions (comma-separated list or one region per line in a file).
   -s, --samples <list|file>           Process only the listed samples. Excluding unwanted samples may increase performance considerably.
   -t, --title <string>                Title for graphs (see also -p)
   -w, --win <int>                     In repetitive sequences, the same indel can be called at different positions. Consider
                                           records this far apart as matching (be it a SNP or an indel).
   -h, -?, --help                      This help message.
vcf-concat

Concatenates VCF files (for example split by chromosome). Note that the input and output VCFs will have the same number of columns, the script does not merge VCFs by position (see also vcf-merge).

In the basic mode it does not do anything fancy except for a sanity check that all files have the same columns. When run with the -s option, it will perform a partial merge sort, looking at limited number of open files simultaneously.

vcf-concat A.vcf.gz B.vcf.gz C.vcf.gz | gzip -c > out.vcf.gz

(Read more)
About: Convenience tool for concatenating VCF files (e.g. VCFs split by chromosome).
   In the basic mode it does not do anything fancy except for a sanity check that all
   files have the same columns.  When run with the -s option, it will perform a partial
   merge sort, looking at limited number of open files simultaneously.
Usage: vcf-concat [OPTIONS] A.vcf.gz B.vcf.gz C.vcf.gz > out.vcf
Options:
   -c, --check-columns              Do not concatenate, only check if the columns agree.
   -f, --files <file>               Read the list of files from a file.
   -p, --pad-missing                Write '.' in place of missing columns. Useful for joining chrY with the rest.
   -s, --merge-sort <int>           Allow small overlaps in N consecutive files.
   -h, -?, --help                   This help message.
vcf-consensus

Apply VCF variants to a fasta file to create consensus sequence.

cat ref.fa | vcf-consensus file.vcf.gz > out.fa

(Read more)
Usage: cat ref.fa | vcf-consensus [OPTIONS] in.vcf.gz > out.txt
Options:
   -h, -?, --help                   This help message.
   -s, --sample <name>
vcf-convert

Convert between VCF versions, currently from VCFv3.3 to VCFv4.0.

zcat file.vcf.gz | vcf-convert -r reference.fa > out.vcf

(Read more)
About: Convert between VCF versions.
Usage: cat in.vcf | vcf-convert [OPTIONS] > out.vcf
Options:
   -r, --refseq <file>              The reference sequence in samtools faindexed fasta file. (Not required with SNPs only.)
   -v, --version <string>           4.0, 4.1
   -h, -?, --help                   This help message.
vcf-contrast

A tool for finding differences between groups of samples, useful in trio analysises, cancer genomes etc.

In the example below variants with average mapping quality of 30 (-f MinMQ=30) and minimum depth of 10 (-d 10) are considered. Only novel alleles are reported (-n). Then vcf-query is used to extract the INFO/NOVEL* annotations into a table. Finally the sites are sorted by confidence of the site being different in the child (-k5,5nr).

vcf-annotate -f MinMQ=30 file.vcf | vcf-contrast -n +Child -Mother,Father -d 10 -f | vcf-query -f '%CHROM %POS\t%INFO/NOVELTY\t%INFO/NOVELAL\t%INFO/NOVELGT[\t%SAMPLE %GTR %PL]\n' | sort -k3,3nr | head

(Read more)
About: Finds differences amongst samples adding NOVELGT, NOVELAL and NOVELTY annotations to INFO field.
       Note that haploid genotypes are internally treated as homozygous diploid genotypes, therefore
       "0/1" and "1" are considered different genotypes.
Usage: vcf-contrast +<list> -<list> [OPTIONS] file.vcf.gz
Options:
   +<list>                             List of samples where unique variant is expected
   -<list>                             List of background samples
   -d, --min-DP <int>                  Minimum depth across all -<list> samples
   -f, --apply-filters                 Skip sites with FILTER column different from PASS or "."
   -n, --novel-sites                   Print only records with novel genotypes
   -h, -?, --help                      This help message.
Example:
   # Test if any of the samples A,B is different from all C,D,E
   vcf-contrast +A,B -C,D,E -m file.vcf.gz

   # Same as above but printing only sites with novel variants and table output
   vcf-contrast -n +A,B -C,D,E -m file.vcf.gz | vcf-query -f '%CHROM %POS\t%INFO/NOVELTY\t%INFO/NOVELAL\t%INFO/NOVELGT[\t%SAMPLE %GTR %PL]\n'

   # Similar to above but require minimum mapping quality of 20
   vcf-annotate -f MinMQ=20 file.vcf.gz | vcf-contrast +A,B,C -D,E,F -f
vcf-filter

Please take a look at vcf-annotate which does what you are looking for. Apologies for the non-intuitive naming. Note: a fast htslib C version of a filtering tool with somewhat different capabilities is now available (see vcf filter).

vcf-fix-ploidy

Fixes diploid vs haploid genotypes on sex chromosomes, including the pseudoautosomal regions.

(Read more)
Usage: cat broken.vcf | vcf-fix-ploidy [OPTIONS] > fixed.vcf
Options:
   -a, --assumed-sex <sex>         M or F, required if the list is not complete in -s
   -l, --fix-likelihoods           Add or remove het likelihoods (not the default behaviour)
   -p, --ploidy <file>             Ploidy definition. The default is shown below.
   -s, --samples <file>            List of sample sexes (sample_name [MF]).
   -h, -?, --help                  This help message.
Default ploidy definition:
   ploidy =>
   {
       X =>
       [
           # The pseudoautosomal regions 60,001-2,699,520 and 154,931,044-155,270,560 with the ploidy 2
           { from=>1, to=>60_000, M=>1 },
           { from=>2_699_521, to=>154_931_043, M=>1 },
       ],
       Y =>
       [
           # No chrY in females and one copy in males
           { from=>1, to=>59_373_566, M=>1, F=>0 },
       ],
       MT =>
       [
           # Haploid MT in males and females
           { from=>1, to => 16_569, M=>1, F=>1 },
       ],
   }
vcf-indel-stats

Calculate in-frame ratio.

(Read more)
About: Currently calculates in-frame ratio.
Usage: vcf-indel-stats [OPTIONS] < in.vcf > out.txt
Options:
   -h, -?, --help                   This help message.
   -e, --exons <file>               Tab-separated file with exons (chr,from,to; 1-based, inclusive)
   -v, --verbose
vcf-isec

Creates intersections and complements of two or more VCF files. Given multiple VCF files, it can output the list of positions which are shared by at least N files, at most N files, exactly N files, etc. The first example below outputs positions shared by at least two files and the second outputs positions present in the files A but absent from files B and C.

vcf-isec -n +2 A.vcf.gz B.vcf.gz | bgzip -c > out.vcf.gz
vcf-isec -c A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz


Note: a fast htslib C version of this tool is now available.
(Read more)
About: Create intersections, unions, complements on bgzipped and tabix indexed VCF or tab-delimited files.
   Note that lines from all files can be intermixed together on the output, which can yield
   unexpected results.
Usage: vcf-isec [OPTIONS] file1.vcf file2.vcf ...
Options:
   -a, --apply-filters                 Ignore lines where FILTER column is anything else than PASS or '.'
   -c, --complement                    Output positions present in the first file but missing from the other files.
   -d, --debug                         Debugging information
   -f, --force                         Continue even if the script complains about differing columns, VCF versions, etc.
   -o, --one-file-only                 Print only entries from the left-most file. Without -o, all unique positions will be printed.
   -n, --nfiles [+-=]<int>             Output positions present in this many (=), this many or more (+), or this many or fewer (-) files.
   -p, --prefix <path>                 If present, multiple files will be created with all possible isec combinations. (Suitable for Venn Diagram analysis.)
   -r, --regions <list|file>           Do only the given regions (comma-separated list or one region per line in a file).
   -t, --tab <chr:pos:file>            Tab-delimited file with indexes of chromosome and position columns. (1-based indexes)
   -w, --win <int>                     In repetitive sequences, the same indel can be called at different positions. Consider
                                           records this far apart as matching (be it a SNP or an indel).
   -h, -?, --help                      This help message.
Examples:
   bgzip file.vcf; tabix -p vcf file.vcf.gz
   bgzip file.tab; tabix -s 1 -b 2 -e 2 file.tab.gz
vcf-merge

Merges two or more VCF files into one so that, for example, if two source files had one column each, on output will be printed a file with two columns. See also vcf-concat for concatenating VCFs split by chromosome.

vcf-merge A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz

Note that this script is not intended for concatenating VCF files. For this, use vcf-concat instead.
Note: a fast htslib C version of this tool is now available.

(Read more)
About: Merges VCF files by position, creating multi-sample VCFs from fewer-sample VCFs.
   The tool requires bgzipped and tabix indexed VCF files on input. (E.g. bgzip file.vcf; tabix -p vcf file.vcf.gz)
   If you need to concatenate VCFs (e.g. files split by chromosome), look at vcf-concat instead.
Usage: vcf-merge [OPTIONS] file1.vcf file2.vcf.gz ... > out.vcf
Options:
   -c, --collapse <snps|indels|both|any|none>      treat as identical sites with differing alleles [any]
   -d, --remove-duplicates                         If there should be two consecutive rows with the same chr:pos, print only the first one.
   -H, --vcf-header <file>                         Use the VCF header
   -h, -?, --help                                  This help message.
   -r, --regions <list|file>                       Do only the given regions (comma-separated list or one region per line in a file).
   -R, --ref-for-missing <string>                  Use the REF allele instead of the default missing genotype. Because it is not obvious
                                                       what ploidy should be used, a user-defined string is used instead (e.g. 0/0).
   -s, --silent                                    Try to be a bit more silent, no warnings about duplicate lines.
   -t, --trim-ALTs                                 If set, redundant ALTs will be removed
vcf-phased-join

Concatenates multiple overlapping VCFs preserving phasing.

(Read more)
About: The script takes multiple overlapping pre-phased chunks and concatenates them into one VCF
   using heterozygous calls from the overlaps to determine correct phase.
Usage: vcf-phased-join [OPTIONS] A.vcf B.vcf C.vcf
Options:
   -j, --min-join-quality <num>    Quality threshold for gluing the pre-phased blocks together [10]
   -l, --list <file>               List of VCFs to join.
   -o, --output <file>             Output file name. When "-" is supplied, STDOUT and STDERR will be used
   -q, --min-PQ <num>              Break pre-phased segments if PQ value is lower in input VCFs [0.6]
   -h, -?, --help                  This help message
vcf-query

Powerful tool for converting VCF files into format defined by the user. Supports retrieval of subsets of positions, columns and fields.

vcf-query file.vcf.gz 1:10327-10330
vcf-query file.vcf -f '%CHROM:%POS %REF %ALT [ %DP]\n'

(Read more)
Usage: vcf-query [OPTIONS] file.vcf.gz
Options:
   -c, --columns <file|list>           List of comma-separated column names or one column name per line in a file.
   -f, --format <string>               The default is '%CHROM:%POS\t%REF[\t%SAMPLE=%GT]\n'
   -l, --list-columns                  List columns.
   -r, --region chr:from-to            Retrieve the region. (Runs tabix.)
       --use-old-method                Use old version of API, which is slow but more robust.
   -h, -?, --help                      This help message.
Expressions:
   %CHROM          The CHROM column (similarly also other columns)
   %GT             Translated genotype (e.g. C/A)
   %GTR            Raw genotype (e.g. 0/1)
   %INFO/TAG       Any tag in the INFO column
   %LINE           Prints the whole line
   %SAMPLE         Sample name
   []              The brackets loop over all samples
   %*<A><B>        All format fields printed as KEY<A>VALUE<B>
Examples:
   vcf-query file.vcf.gz 1:1000-2000 -c NA001,NA002,NA003
   vcf-query file.vcf.gz -r 1:1000-2000 -f '%CHROM:%POS\t%REF\t%ALT[\t%SAMPLE:%*=,]\n'
   vcf-query file.vcf.gz -f '[%GT\t]%LINE\n'
   vcf-query file.vcf.gz -f '[%GT\ ]%LINE\n'
   vcf-query file.vcf.gz -f '%CHROM\_%POS\t%INFO/DP\t%FILTER\n'
Note: a fast htslib C version of this tool is now available (see vcf query).
vcf-shuffle-cols

Reorder columns

vcf-shuffle-cols -t template.vcf.gz file.vcf.gz > out.vcf

(Read more)
About: Reorder columns to match the order in the template VCF.
Usage: vcf-shuffle-cols [OPTIONS] -t template.vcf.gz file.vcf.gz > out.vcf
Options:
   -t, --template <file>            The file with the correct order of the columns.
   -h, -?, --help                   This help message.
vcf-sort

Sort a VCF file.

vcf-sort file.vcf.gz

(Read more)
Usage: vcf-sort > out.vcf
       cat file.vcf | vcf-sort > out.vcf
Options:
   -c, --chromosomal-order         Use natural ordering (1,2,10,MT,X) rather then the default (1,10,2,MT,X). This requires
                                       new version of the unix "sort" command which supports the --version-sort option.
   -t, --temporary-directory       Use a directory other than /tmp as the temporary directory for sorting.
   -h, -?, --help                  This help message.
vcf-stats

Outputs some basic statistics: the number of SNPs, indels, etc.

vcf-stats file.vcf.gz

(Read more)
Usage: vcf-stats [OPTIONS] file.vcf.gz
Options:
   -d, --dump <file>                           Take an existing dump file and recreate the files (works with -p)
   -f, --filters <filter1,filter2>             List of filters such as column/field (any value), column/field=bin:max (cluster in bins),column/field=value (exact value)
   -p, --prefix <dir/string>                   Prefix of output files. If slashes are present, directories will be created.
   -s, --samples <list>                        Process only the listed samples, - for none. Excluding unwanted samples may increase performance considerably.
   -h, -?, --help                              This help message.

Examples:
   # Calculate stats separately for the filter field, quality and non-indels
   vcf-stats file.vcf.gz -f FILTER,QUAL=10:200,INFO/INDEL=False -p out/

   # Calculate stats for all samples
   vcf-stats file.vcf.gz -f FORMAT/DP=10:200 -p out/

   # Calculate stats only for the sample NA00001
   vcf-stats file.vcf.gz -f SAMPLE/NA00001/DP=1:200 -p out/

   vcf-stats file.vcf.gz > perl.dump
vcf-subset

Remove some columns from the VCF file.

vcf-subset -c NA0001,NA0002 file.vcf.gz | bgzip -c > out.vcf.gz

(Read more)
Usage: vcf-subset [OPTIONS] in.vcf.gz > out.vcf
Options:
   -a, --trim-alt-alleles           Remove alternate alleles if not found in the subset
   -c, --columns <string>           File or comma-separated list of columns to keep in the vcf file. If file, one column per row
   -e, --exclude-ref                Exclude rows not containing variants.
   -f, --force                      Proceed anyway even if VCF does not contain some of the samples.
   -p, --private                    Print only rows where only the subset columns carry an alternate allele.
   -r, --replace-with-ref           Replace the excluded types with reference allele instead of dot.
   -t, --type <list>                Comma-separated list of variant types to include: ref,SNPs,indels,MNPs,other.
   -u, --keep-uncalled              Do not exclude rows without calls.
   -h, -?, --help                   This help message.
Examples:
   cat in.vcf | vcf-subset -r -t indels -e -c SAMPLE1 > out.vcf
vcf-tstv

A lightweight script for quick calculation of Ts/Tv ratio.

cat file.vcf | vcf-tstv

(Read more)
Usage: cat file.vcf | vcf-tstv
Options:
   -h, -?, --help                  This help message.
vcf-to-tab

A simple script which converts the VCF file into a tab-delimited text file listing the actual variants instead of ALT indexes.

zcat file.vcf.gz | vcf-to-tab > out.tab

(Read more)
Usage: vcf-to-tab [OPTIONS] < in.vcf > out.tab
Options:
   -h, -?, --help                   This help message.
   -i, --iupac                      Use one-letter IUPAC codes
vcf-validator

vcf-validator file.vcf.gz

(Read more)
Usage: vcf-validator [OPTIONS] file.vcf.gz
Options:
   -d, --duplicates                 Warn about duplicate positions.
   -u, --unique-messages            Output all messages only once.
   -h, -?, --help                   This help message.
Vcf.pm

For examples how to use the Perl API, it is best to look at some of the simpler scripts, for example vcf-to-tab. The detailed documentation can be obtained by running

perldoc Vcf.pm

(Read more)





NAME

Vcf.pm. Module for validation, parsing and creating VCF files. Supported versions: 3.2, 3.3, 4.0, 4.1

SYNOPSIS

From the command line: perl -MVcf -e validate example.vcf perl -I/path/to/the/module/ -MVcf -e validate_v32 example.vcf

From a script: use Vcf;

    my $vcf = Vcf->new(file=>'example.vcf.gz',region=>'1:1000-2000');
    $vcf->parse_header();

    # Do some simple parsing. Most thorough but slowest way how to get the data.
    while (my $x=$vcf->next_data_hash())
    {
        for my $gt (keys %{$$x{gtypes}})
        {
            my ($al1,$sep,$al2) = $vcf->parse_alleles($x,$gt);
            print "\t$gt: $al1$sep$al2\n";
        }
        print "\n";
    }

    # This will split the fields and print a list of CHR:POS
    while (my $x=$vcf->next_data_array())
    {
        print "$$x[0]:$$x[1]\n";
    }

    # This will return the lines as they were read, including the newline at the end
    while (my $x=$vcf->next_line())
    {
        print $x;
    }

    # Only the columns NA00001, NA00002 and NA00003 will be printed.
    my @columns = qw(NA00001 NA00002 NA00003);
    print $vcf->format_header(\@columns);
    while (my $x=$vcf->next_data_array())
    {
        # this will recalculate AC and AN counts, unless $vcf->recalc_ac_an was set to 0
        print $vcf->format_line($x,\@columns);
    }

    $vcf->close();

validate

    About   : Validates the VCF file.
    Usage   : perl -MVcf -e validate example.vcf.gz     # (from the command line)
              validate('example.vcf.gz');               # (from a script)
              validate(\*STDIN);
    Args    : File name or file handle. When no argument given, the first command line
              argument is interpreted as the file name.

validate_v32

    About   : Same as validate, but assumes v3.2 VCF version.
    Usage   : perl -MVcf -e validate_v32 example.vcf.gz     # (from the command line)
    Args    : File name or file handle. When no argument given, the first command line
              argument is interpreted as the file name.

new

    About   : Creates new VCF reader/writer.
    Usage   : my $vcf = Vcf->new(file=>'my.vcf', version=>'3.2');
    Args    :
                fh      .. Open file handle. If neither file nor fh is given, open in write mode.
                file    .. The file name. If neither file nor fh is given, open in write mode.
                region  .. Optional region to parse (requires tabix indexed VCF file)
                silent  .. Unless set to 0, warning messages may be printed.
                strict  .. Unless set to 0, the reader will die when the file violates the specification.
                version .. If not given, '4.0' is assumed. The header information overrides this setting.

open

    About   : (Re)Open file. No need to call this explicitly unless reading from a different
              region is requested.
    Usage   : $vcf->open(); # Read from the start
              $vcf->open(region=>'1:12345-92345');
    Args    : region       .. Supported only for tabix indexed files

close

    About   : Close the filehandle
    Usage   : $vcf->close();
    Args    : none
        Returns : close exit status

next_line

    About   : Reads next VCF line.
    Usage   : my $vcf = Vcf->new();
              my $x   = $vcf->next_line();
    Args    : none

next_data_array

    About   : Reads next VCF line and splits it into an array. The last element is chomped.
    Usage   : my $vcf = Vcf->new();
              $vcf->parse_header();
              my $x = $vcf->next_data_array();
    Args    : Optional line to parse

set_samples

    About   : Parsing big VCF files with many sample columns is slow, not parsing unwanted samples may speed things a bit.
    Usage   : my $vcf = Vcf->new();
              $vcf->set_samples(include=>['NA0001']);   # Exclude all but this sample. When the array is empty, all samples will be excluded.
              $vcf->set_samples(exclude=>['NA0003']);   # Include only this sample. When the array is empty, all samples will be included.
              my $x = $vcf->next_data_hash();
    Args    : Optional line to parse

next_data_hash

    About   : Reads next VCF line and splits it into a hash. This is the slowest way to obtain the data.
    Usage   : my $vcf = Vcf->new();
              $vcf->parse_header();
              my $x = $vcf->next_data_hash();

              # Or having a VCF data line $line
              my $x = $vcf->next_data_hash($line);

    Args    : Optional line to parse.

parse_header

    About   : Reads (and stores) the VCF header.
    Usage   : my $vcf = Vcf->new(); $vcf->parse_header();
    Args    : silent .. do not warn about duplicate header lines

_next_header_line

    About   : Stores the header lines and meta information, such as fields types, etc.
    Args    : silent .. do not warn about duplicate column names

get_header_line

    Usage   : $vcf->get_header_line(key=>'INFO', ID=>'AC')
              $vcf->get_header_line(key=>'FILTER', ID=>'q10')
              $vcf->get_header_line(key=>'reference')
              $vcf->get_header_line(key=>'contig',ID=>'20')
    Args    : Header line filter as in the example above
    Returns : List ref of header line hashes matching the filter

add_header_line

    Usage   : $vcf->add_header_line({key=>'INFO', ID=>'AC',Number=>-1,Type=>'Integer',Description=>'Allele count in genotypes'})
              $vcf->add_header_line({key=>'reference',value=>'1000GenomesPilot-NCBI36'})
    Args    : Header line hash as in the example above
              Hash with additional parameters [optional]
                silent .. do not warn about existing header keys
                append .. append timestamp to the name of the new one
    Returns : 

remove_header_line

    Usage   : $vcf->remove_header_line(key=>'INFO', ID=>'AC')
    Args    :
    Returns : 

parse_header_line

    Usage   : $vcf->parse_header_line(q[##reference=1000GenomesPilot-NCBI36])
              $vcf->parse_header_line(q[##INFO=NS,1,Integer,"Number of Samples With Data"])
    Args    :
    Returns : 

_read_column_names

    About   : Stores the column names as array $$self{columns} and hash $$self{has_column}{COL_NAME}=index.
              The indexes go from 1.
    Usage   : $vcf->_read_column_names();
    Args    : none

_fake_column_names

    About   : When no header is present, fake column names as the default mandatory ones + numbers
    Args    : The number of genotype columns; 0 if no genotypes but FORMAT present; <0 if FORMAT and genotypes not present

format_header

    About   : Returns the header.
    Usage   : print $vcf->format_header();
    Args    : The columns to include on output [optional]

format_line

    About   : Returns the header.
    Usage   : $x = $vcf->next_data_hash(); print $vcf->format_line($x);
              $x = $vcf->next_data_array(); print $vcf->format_line($x);
    Args 1  : The columns or hash in the format returned by next_data_hash or next_data_array.
         2  : The columns to include [optional]

recalc_ac_an

    About   : Control if the AC and AN values should be updated.
    Usage   : $vcf->recalc_ac_an(1); $x = $vcf->next_data_hash(); print $vcf->format_line($x);
    Args 1  : 0 .. never recalculate
              1 .. recalculate if present
              2 .. recalculate if present and add if missing

get_tag_index

    Usage   : my $idx = $vcf->get_tag_index('GT:PL:DP:SP:GQ','PL',':');
    Arg 1   : Field
        2   : The tag to find
        3   : Tag separator
    Returns : Index of the tag or -1 when not found

remove_field

    Usage   : my $field = $vcf->remove_field('GT:PL:DP:SP:GQ',1,':');    # returns 'GT:DP:SP:GQ'
    Arg 1   : Field
        2   : The index of the field to remove
        3   : Field separator
    Returns : Modified string

replace_field

    Usage   : my $col = $vcf->replace_field('GT:PL:DP:SP:GQ','XX',1,':');    # returns 'GT:XX:DP:SP:GQ'
    Arg 1   : Field
        2   : Replacement
        3   : 0-based index of the field to replace
        4   : Field separator
    Returns : Modified string

get_info_field

    Usage   : my $line  = $vcf->next_line;
              my @items = split(/\t/,$line);
              $af = $vcf->get_info_field('DP=14;AF=0.5;DB','AF');    # returns 0.5
              $af = $vcf->get_info_field('DP=14;AF=0.5;DB','DB');    # returns 1
              $af = $vcf->get_info_field('DP=14;AF=0.5;DB','XY');    # returns undef
    Arg 1   : The VCF line broken into an array
        2   : The tag to retrieve
    Returns : undef when tag is not present, the tag value if present, or 1 if flag is present

get_field

    Usage   : my $line  = $vcf->next_line;
              my @items = split(/\t/,$line);
              my $idx = $vcf->get_tag_index($$line[8],'PL',':');
              my $pl  = $vcf->get_field($$line[9],$idx) unless $idx==-1;
    Arg 1   : The VCF line broken into an array
        2   : The index of the field to retrieve
        3   : The delimiter [Default is ':']
    Returns : The tag value

get_sample_field

    Usage   : my $line  = $vcf->next_line;
              my @items = split(/\t/,$line);
              my $idx = $vcf->get_tag_index($$line[8],'PL',':');
              my $pls = $vcf->get_sample_field(\@items,$idx) unless $idx==-1;
    Arg 1   : The VCF line broken into an array
        2   : The index of the field to retrieve
    Returns : Array of values

split_mandatory

    About   : Faster alternative to regexs, extract the mandatory columns
    Usage   : my $line=$vcf->next_line; my @cols = $vcf->split_mandatory($line);
    Arg     :
    Returns : Pointer to the array of values

split_gt

    About   : Faster alternative to regexs
    Usage   : my ($a1,$a2,$a3) = $vcf->split_gt('0/0/1'); # returns (0,0,1)
    Arg     : Diploid genotype to split into alleles
    Returns : Array of values

split_by

    About   : Generalization of split_gt
    Usage   : my ($a1,$a2,$a3) = $vcf->split_gt('0/0|1',qw(| /)); # returns (0,0,1)
    Arg     : Diploid genotype to split into alleles
    Returns : Array of values

decode_genotype

    About   : Faster alternative to regexs
    Usage   : my $gt = $vcf->decode_genotype('G',['A','C'],'0/0'); # returns 'G/G'
    Arg   1 : Ref allele
          2 : Alt alleles
          3 : The genotype to decode
    Returns : Decoded GT string

validate_alt_field

    Usage   : my $x = $vcf->next_data_hash(); $vcf->validate_alt_field($$x{ALT});
    Args    : The ALT arrayref
    Returns : Error message in case of an error.

event_type

    Usage   :   my $x = $vcf->next_data_hash();
                my ($alleles,$seps,$is_phased,$is_empty) = $vcf->parse_haplotype($x,'NA00001');
                for my $allele (@$alleles)
                {
                    my ($type,$len,$ht) = $vcf->event_type($x,$allele);
                }
              or
                my ($type,$len,$ht) = $vcf->event_type($ref,$al);
    Args    : VCF data line parsed by next_data_hash or the reference allele
            : Allele
    Returns :   's' for SNP and number of SNPs in the record
                'i' for indel and a positive (resp. negative) number for the length of insertion (resp. deletion)
                'r' identical to the reference, length 0
                'o' for other (complex events) and the number of affected bases
                'b' breakend
                'u' unknown

has_AGtags

    About   : Checks the header for the presence of tags with variable number of fields (Number=A or Number=G, such as GL)
    Usage   : $vcf->parse_header(); my $agtags = $vcf->has_AGtags();
    Args    : None
    Returns : Hash {fmtA=>[tags],fmtG=>[tags],infoA=>[tags],infoG=>[tags]} or undef if none is present

parse_AGtags

    About   : Breaks tags with variable number of fields (that is where Number is set to 'A' or 'G', such as GL) into hashes
    Usage   : my $x = $vcf->next_data_hash(); my $values = $vcf->parse_AGtags($x);
    Args    : VCF data line parsed by next_data_hash
            : Mapping between ALT representations based on different REFs [optional]
            : New REF [optional]
    Returns : Hash {Allele=>Value}

format_AGtag

    About   : Format tag with variable number of fields (that is where Number is set to 'A' or 'G', such as GL)
    Usage   :
    Args    :
            :
            :
    Returns : 

parse_alleles

    About   : Deprecated, use parse_haplotype instead.
    Usage   : my $x = $vcf->next_data_hash(); my ($al1,$sep,$al2) = $vcf->parse_alleles($x,'NA00001');
    Args    : VCF data line parsed by next_data_hash
            : The genotype column name
    Returns : Alleles and the separator. If only one allele is present, $sep and $al2 will be an empty string.

parse_haplotype

    About   : Similar to parse_alleles, supports also multiploid VCFs.
    Usage   : my $x = $vcf->next_data_hash(); my ($alleles,$seps,$is_phased,$is_empty) = $vcf->parse_haplotype($x,'NA00001');
    Args    : VCF data line parsed by next_data_hash
            : The genotype column name
    Returns : Two array refs and two boolean flags: List of alleles, list of separators, and is_phased/empty flags. The values
                can be cashed and must be therefore considered read only!

format_haplotype

    Usage   : my ($alleles,$seps,$is_phased,$is_empty) = $vcf->parse_haplotype($x,'NA00001'); print $vcf->format_haplotype($alleles,$seps);

format_genotype_strings

    Usage   : my $x = { REF=>'A', gtypes=>{'NA00001'=>{'GT'=>'A/C'}}, FORMAT=>['GT'], CHROM=>1, POS=>1, FILTER=>['.'], QUAL=>-1 };
              $vcf->format_genotype_strings($x);
              print $vcf->format_line($x);
    Args 1  : VCF data line in the format as if parsed by next_data_hash with alleles written as letters.
         2  : Optionally, a subset of columns can be supplied. See also format_line.
    Returns : Modifies the ALT array and the genotypes so that ref alleles become 0 and non-ref alleles
                numbers starting from 1. If the key $$vcf{trim_redundant_ALTs} is set, ALT alleles not appearing
                in any of the sample column will be removed.

format_header_line

    Usage   : $vcf->format_header_line({key=>'INFO', ID=>'AC',Number=>-1,Type=>'Integer',Description=>'Allele count in genotypes'})
    Args    :
    Returns : 

remove_columns

    Usage   : my $rec=$vcf->next_data_hash(); $vcf->remove_columns($rec,remove=>['NA001','NA0002']);
    Args    : VCF hash pointer
            : list of columns to remove or a lookup hash with column names to keep (remove=>[] or keep=>{})
    Returns : 

add_columns

    Usage   : $vcf->add_columns('NA001','NA0002');
    Args    :
    Returns : 

add_format_field

    Usage   : $x=$vcf->next_data_hash(); $vcf->add_format_field($x,'FOO'); $$x{gtypes}{NA0001}{FOO}='Bar'; print $vcf->format_line($x);
    Args    : The record obtained by next_data_hash
            : The field name
    Returns : 

remove_format_field

    Usage   : $x=$vcf->next_data_hash(); $vcf->remove_format_field($x,'FOO'); print $vcf->format_line($x);
    Args    : The record obtained by next_data_hash
            : The field name
    Returns : 

add_info_field

    Usage   : $x=$vcf->next_data_array(); $$x[7]=$vcf->add_info_field($$x[7],'FOO'=>'value','BAR'=>undef,'BAZ'=>''); print join("\t",@$x)."\n";
    Args    : The record obtained by next_data_array
            : The INFO field name and value pairs. If value is undef and the key is present in $$x[7],
                it will be removed. To add fields without a value, use empty string ''.
    Returns : The formatted INFO.

add_filter

    Usage   : $x=$vcf->next_data_array(); $$x[6]=$vcf->add_filter($$x[6],'SnpCluster'=>1,'q10'=>0); print join("\t",@$x)."\n";
    Args    : The record obtained by next_data_array or next_data_hash
            : The key-value pairs for filter to be added. If value is 1, the filter will be added. If 0, the filter will be removed.
    Returns : The formatted filter field.

validate_filter_field

    Usage   : my $x = $vcf->next_data_hash(); $vcf->validate_filter_field($$x{FILTER});
    Args    : The FILTER arrayref
    Returns : Error message in case of an error.

validate_header

    About   : Version specific header validation code.
    Usage   : my $vcf = Vcf->new(); $vcf->parse_header(); $vcf->validate_header();
    Args    :

validate_line

    About   : Version specific line validation code.
    Usage   : my $vcf = Vcf->new(); $vcf->parse_header(); $x = $vcf->next_data_hash; $vcf->validate_line($x);
    Args    :

validate_info_field

    Usage   : my $x = $vcf->next_data_hash(); $vcf->validate_info_field($$x{INFO},$$x{ALT});
    Args    : The INFO hashref
    Returns : Error message in case of an error.

validate_gtype_field

    Usage   : my $x = $vcf->next_data_hash(); $vcf->validate_gtype_field($$x{gtypes}{NA00001},$$x{ALT},$$x{FORMAT});
    Args    : The genotype data hashref
              The ALT arrayref
    Returns : Error message in case of an error.

run_validation

    About   : Validates the VCF file.
    Usage   : my $vcf = Vcf->new(file=>'file.vcf'); $vcf->run_validation('example.vcf.gz');
    Args    : File name or file handle.

get_chromosomes

    About   : Get list of chromosomes from the VCF file. Must be bgzipped and tabix indexed.
    Usage   : my $vcf = Vcf->new(); $vcf->get_chromosomes();
    Args    : none

get_samples

    About   : Get list of samples.
    Usage   : my $vcf = Vcf->new(); $vcf->parse_header(); my (@samples) = $vcf->get_samples();
    Args    : none

get_column

    About   : Convenient way to get data for a sample
    Usage   : my $rec = $vcf->next_data_array(); my $sample_col = $vcf->get_column($rec, 'NA0001');
    Args 1  : Array pointer returned by next_data_array
         2  : Column/Sample name

get_column_name

    About   : Mapping between zero-based VCF column and its name
    Usage   : my $vcf = Vcf->new(); $vcf->parse_header(); my $name = $vcf->get_column_name(1); # returns POS
    Args    : Index of the column (0-based)

get_column_index

    About   : Mapping between VCF column name and its zero-based index
    Usage   : my $vcf = Vcf->new(); $vcf->parse_header(); my $name = $vcf->get_column_index('POS'); # returns 1
    Args    : Name of the column

VCFv4.0

VCFv4.0 specific functions

parse_header_line

    Usage   : $vcf->parse_header_line(q[##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">])
              $vcf->parse_header_line(q[reference=1000GenomesPilot-NCBI36])
    Args    :
    Returns : 

fill_ref_alt_mapping

    About   : A tool for merging VCFv4.0 records. The subroutine unifies the REFs and creates a mapping
                from the original haplotypes to the haplotypes based on the new REF. Consider the following
                example:
                    REF ALT
                    G    GA
                    GT   G
                    GT   GA
                    GT   GAA
                    GTC  G
                    G    <DEL>
                my $map={G=>{GA=>1},GT=>{G=>1,GA=>1,GAA=>1},GTC=>{G=>1},G=>{'<DEL>'=>1}};
                my $new_ref=$vcf->fill_ref_alt_mapping($map);

              The call returns GTC and $map is now
                    G    GA     ->      GTC  GATC
                    GT   G      ->      GTC  GC
                    GT   GA     ->      GTC  GAC
                    GT   GAA    ->      GTC  GAAC
                    GTC  G      ->      GTC  G
                    G    <DEL>  ->      GTC  <DEL>
    Args    :
    Returns : New REF string and fills the hash with appropriate ALT.

normalize_alleles

    About   : Makes REF and ALT alleles more compact if possible (e.g. TA,TAA -> T,TA)
    Usage   : my $line = $vcf->next_data_array();
              ($ref,@alts) = $vcf->normalize_alleles($$line[3],$$line[4]);

VCFv4.1

VCFv4.1 specific functions