As of July 2015, the VCFtools project has been moved to github! Please visit the new website here: vcftools.github.io/documentation.html
The C++ executable module examples
This page provides usage examples for the executable module. Extended documentation for all of the options can be found on the manual page.
- Running the program
- Getting basic file statistics
- Applying a filter
- Writing to a new VCF file
- Writing out to screen
- Converting a VCF file to BCF
- Comparing two VCF files
- Getting allele frequency
- Getting sequencing depth information
- Getting linkage disequilibrium statistics
- Getting Fst population statistics
- Converting VCF files to PLINK format
By default, the executable can be found in the bin/ subdirectory. To run the program, type:
./vcftools
The program will return information regarding the version number.
The executable can be run with only an input VCF file and no other options, in which case it returns basic information about the contents of the file. To specify an input file, you must use one of the input options ( --vcf, --gzvcf, or --bcf ), depending on the type of file. For example, for a VCF file called input_data.vcf the following command could be run:
./vcftools --vcf input_data.vcf
It will return information about the file such as the number of variants and the number of individuals in the file.
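As a rough sanity check on those counts, the same two numbers can be derived with standard Unix tools. The sketch below does not call VCFtools at all. It builds a tiny VCF (the file name tiny.vcf and its contents are made up for illustration) and assumes a well-formed, tab-delimited file with genotype data, so that the first sample appears after the nine fixed columns.

```shell
# Build a tiny, made-up VCF in a temporary directory (illustrative data only).
workdir=$(mktemp -d)
vcf="$workdir/tiny.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2\n'
  printf '1\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n'
  printf '1\t2000\t.\tC\tT\t50\tPASS\t.\tGT\t0/0\t0/1\n'
  printf '2\t3000\t.\tG\tA\t50\tPASS\t.\tGT\t0/1\t0/0\n'
} > "$vcf"

# Variants: every line that is not a header line (header lines start with '#').
n_variants=$(grep -vc '^#' "$vcf")

# Individuals: columns on the #CHROM line beyond the nine fixed VCF columns.
n_individuals=$(grep -m1 '^#CHROM' "$vcf" | awk -F'\t' '{ print NF - 9 }')

echo "variants: $n_variants, individuals: $n_individuals"
```

If the numbers reported by VCFtools differ from a count like this, the input file is probably not formatted the way the sketch assumes (for example, space-delimited columns).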
Beginning with VCFtools v0.1.12, the program can also read input from standard input (stdin). To do this, use any of the normal file type input options followed by the dash character (-).
zcat input_data.vcf.gz | ./vcftools --vcf -
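On some systems zcat does not accept .gz files directly, so gzip -dc can be a more portable way to produce the same stream. The following is a minimal sketch under that assumption: the file tiny.vcf.gz and its contents are invented for illustration, and the VCFtools step only runs if a vcftools executable is found on the PATH.

```shell
# Create a small gzip-compressed VCF (made-up data) in a temporary directory.
workdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\n'
  printf '1\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\n'
} | gzip -c > "$workdir/tiny.vcf.gz"

# gzip -dc writes the decompressed VCF to stdout, like zcat does.
n_lines=$(gzip -dc "$workdir/tiny.vcf.gz" | wc -l | tr -d ' ')
echo "decompressed lines: $n_lines"

# Only call VCFtools if it is actually installed and on the PATH.
if command -v vcftools >/dev/null 2>&1; then
  gzip -dc "$workdir/tiny.vcf.gz" | vcftools --vcf -
else
  echo "vcftools not found on PATH; skipping the VCFtools step"
fi
```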
You can use VCFtools to filter out variants or individuals based on the values within the file. For example, to filter the sites within a file based upon their location in the genome, use the options --chr, --from-bp, and --to-bp to specify the region.
./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000
After running this line, the program will report the number of sites in the file that fall within the chromosomal region chr1:1000000-2000000. The options can be adjusted to select any desired region.
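To cross-check a reported site count, the same region test can be written with awk. This is only a sketch and does not use VCFtools: the input file is made up, and the code assumes that both bounds are inclusive and that the chromosome is named exactly 1 in the file.

```shell
# Made-up VCF with sites inside and outside the region of interest.
workdir=$(mktemp -d)
vcf="$workdir/tiny.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t999999\t.\tA\tG\t50\tPASS\t.\n'
  printf '1\t1000000\t.\tC\tT\t50\tPASS\t.\n'
  printf '1\t1500000\t.\tG\tA\t50\tPASS\t.\n'
  printf '1\t2000001\t.\tT\tC\t50\tPASS\t.\n'
  printf '2\t1500000\t.\tA\tC\t50\tPASS\t.\n'
} > "$vcf"

# Count data lines on the chosen chromosome whose position lies in [from, to].
n_in_region=$(awk -F'\t' -v chr=1 -v from=1000000 -v to=2000000 \
  '!/^#/ && $1 == chr && $2 >= from && $2 <= to { n++ } END { print n + 0 }' "$vcf")

echo "sites in region: $n_in_region"
```

Only the sites at positions 1000000 and 1500000 on chromosome 1 satisfy all three conditions, so the count here is 2.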
VCFtools can perform analyses on the variants that pass through the filters or simply write those variants out to a new file. This function is helpful for creating subsets of VCF files or just removing unwanted variants from VCF files. To write out the variants that pass through filters use the --recode option. In addition, use --recode-INFO-all to include all data from the INFO fields in the output. By default INFO fields are not written because many filters will alter the variants in a file, rendering the INFO values incorrect.
./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode --recode-INFO-all
In this example, VCFtools will create a new VCF file containing only variants within the specified chromosomal region while keeping all INFO fields included in the original file.
Any files written out by VCFtools will be placed in the current working directory and named ./out.SUFFIX by default. To change this, use the option --out followed by the desired path. The program will add a suffix to that path based on the chosen output function.
./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode --out subset
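When scripting around VCFtools it helps to know which file name to expect for a given --out prefix. The helper below, expected_output, is hypothetical and not part of VCFtools; it simply encodes a few commonly seen suffixes, which should be checked against the manual page for the installed version.

```shell
# Hypothetical helper (not part of VCFtools): print the file name expected for
# a given --out prefix and output option, based on commonly seen suffixes.
expected_output() {
  prefix=$1
  option=$2
  case "$option" in
    --recode)     echo "$prefix.recode.vcf" ;;
    --recode-bcf) echo "$prefix.recode.bcf" ;;
    --freq)       echo "$prefix.frq" ;;
    --depth)      echo "$prefix.idepth" ;;
    --site-depth) echo "$prefix.ldepth" ;;
    --hap-r2)     echo "$prefix.hap.ld" ;;
    *)            echo "unknown output option: $option" >&2; return 1 ;;
  esac
}

# For the command above, the recoded VCF would be expected here:
recode_path=$(expected_output subset --recode)
echo "$recode_path"
```

A script can then test for that file after the run, for example with [ -s "$recode_path" ], before passing it to the next step.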
Beginning with VCFtools v0.1.12, the program can also write to standard output (stdout) instead of a file. Either of the options --stdout or -c will redirect all output to standard output. The output can then be piped into other programs or written to a specified file name.
./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode --stdout | more
The above example will page the resulting output to the screen for quick inspection of the results.
./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode -c > /home/usr/data/subset.vcf
The above example will redirect the output and write it to the specified file name.
./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode -c | gzip -c > /home/usr/data/subset.vcf.gz
The above example will redirect the output into gzip (assuming it is installed) for compression, and then gzip will write the file to the specified destination.
Beginning with VCFtools v0.1.11, the program has the ability to read and write BCF files. This means that the program can also convert files between the two formats. This is accomplished in a similar way to the examples above, but using the --recode-bcf option instead. All output BCF files are automatically compressed using BGZF.
./vcftools --vcf input_data.vcf --recode-bcf --recode-INFO-all --out converted_output
Using VCFtools, two VCF files can be compared to determine which sites and individuals are shared between them. The first file is declared using the input file options just like any other output function. The second file must be specified using --diff, --gzdiff, or --diff-bcf. There are also advanced options to determine additional discordance between the two files.
./vcftools --vcf input_data.vcf --diff other_data.vcf --out compare
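For a quick, independent look at which sites two files share, the positions can also be compared with sort and comm. The sketch below is not how VCFtools performs the comparison: it matches on chromosome and position only, ignores alleles and individuals, and uses two made-up files.

```shell
# Two made-up VCF files that share exactly one site (chromosome 1, position 2000).
workdir=$(mktemp -d)
header='#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
{
  printf '##fileformat=VCFv4.2\n'
  printf "$header"
  printf '1\t1000\t.\tA\tG\t50\tPASS\t.\n'
  printf '1\t2000\t.\tC\tT\t50\tPASS\t.\n'
} > "$workdir/first.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf "$header"
  printf '1\t2000\t.\tC\tT\t50\tPASS\t.\n'
  printf '1\t3000\t.\tG\tA\t50\tPASS\t.\n'
} > "$workdir/second.vcf"

# Extract sorted CHROM/POS pairs from the data lines of a VCF file.
sites() { grep -v '^#' "$1" | cut -f1,2 | sort; }

sites "$workdir/first.vcf"  > "$workdir/first.sites"
sites "$workdir/second.vcf" > "$workdir/second.sites"

# comm -12 keeps only the lines present in both sorted inputs.
n_shared=$(comm -12 "$workdir/first.sites" "$workdir/second.sites" | wc -l | tr -d ' ')
# comm -23 keeps the lines found only in the first input.
n_only_first=$(comm -23 "$workdir/first.sites" "$workdir/second.sites" | wc -l | tr -d ' ')

echo "shared: $n_shared, only in first: $n_only_first"
```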
To determine the frequency of each allele over all individuals in a VCF file, the --freq argument is used.
./vcftools --vcf input_data.vcf --freq --out output
The output file will be written to output.frq.
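To make the idea of a per-site allele frequency concrete, the sketch below counts alleles directly from the genotype fields with awk. It is an illustration rather than a replacement for --freq: it assumes biallelic sites, assumes GT is the first entry of each sample field, skips missing alleles, and prints its own columns (CHROM, POS, number of called alleles, ALT frequency), not the .frq layout. The input data are made up.

```shell
# Made-up VCF: two samples, one site with a missing genotype.
workdir=$(mktemp -d)
vcf="$workdir/tiny.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2\n'
  printf '1\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n'
  printf '1\t2000\t.\tC\tT\t50\tPASS\t.\tGT\t0|0\t./.\n'
} > "$vcf"

freqs=$(awk -F'\t' '
  /^#/ { next }                                # skip header lines
  {
    ref = 0; alt = 0
    for (i = 10; i <= NF; i++) {               # sample columns start at 10
      split($i, fields, ":")                   # GT is assumed to come first
      n = split(fields[1], alleles, "[/|]")    # unphased (/) or phased (|)
      for (j = 1; j <= n; j++) {
        if (alleles[j] == "0") ref++
        else if (alleles[j] == "1") alt++      # "." (missing) is not counted
      }
    }
    total = ref + alt
    if (total > 0) printf "%s\t%s\t%d\t%.3f\n", $1, $2, total, alt / total
  }' "$vcf")

echo "$freqs"
```

At the first site three of the four called alleles are ALT (0.750); at the second site only two alleles are called and neither is ALT (0.000).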
Another useful output function summarizes sequencing depth for each individual or for each site. It follows the same basic model as the allele frequency example above.
./vcftools --vcf input_data.vcf --depth -c > depth_summary.txt
With VCFtools, you can use many combinations of filters and an output function. For example, to write out site-wise sequence depths only at sites that have no missing data, include the --max-missing argument.
./vcftools --vcf input_data.vcf --site-depth --max-missing 1.0 --out site_depth_summary
Linkage disequilibrium between sites can be determined as well. This is accomplished using the --hap-r2, --geno-r2, or --geno-chisq arguments. Since the program must do pairwise site comparisons, this analysis can be time-consuming, so it is recommended to filter the sites first or use one of the other options (--ld-window, --ld-window-bp or --min-r2) to reduce the number of comparisons. In this example, VCFtools will only compare sites within 50,000 base pairs of one another.
./vcftools --vcf input_data.vcf --hap-r2 --ld-window-bp 50000 --out ld_window_50000
VCFtools can also calculate Fst statistics between individuals of different populations. The estimate is calculated in accordance with Weir and Cockerham’s 1984 paper. The user must supply text files that contain lists of individuals (one per line) that are members of each population. The function will work with multiple populations if multiple --weir-fst-pop arguments are used. The following example shows how to calculate per-site Fst with two populations. Other arguments can be used in conjunction with this function, such as --fst-window-size and --fst-window-step.
./vcftools --vcf input_data.vcf --weir-fst-pop population_1.txt --weir-fst-pop population_2.txt --out pop1_vs_pop2
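The population files are plain lists of sample IDs, and one way to start them is from the #CHROM header line of the VCF. The sketch below uses a made-up file and an arbitrary split (first two samples versus the rest) purely for illustration; real population assignments have to come from the study design. The Fst step only runs if a vcftools executable is found on the PATH.

```shell
# Made-up VCF with four samples.
workdir=$(mktemp -d)
vcf="$workdir/tiny.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tindA\tindB\tindC\tindD\n'
  printf '1\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\t0/0\t0/1\n'
} > "$vcf"

# Sample IDs are the columns after the nine fixed ones on the #CHROM line.
grep -m1 '^#CHROM' "$vcf" | cut -f10- | tr '\t' '\n' > "$workdir/all_samples.txt"

# Arbitrary split for illustration only: first two samples vs. the rest.
head -n 2 "$workdir/all_samples.txt"  > "$workdir/population_1.txt"
tail -n +3 "$workdir/all_samples.txt" > "$workdir/population_2.txt"

cat "$workdir/population_1.txt"

# Only call VCFtools if it is actually installed and on the PATH.
if command -v vcftools >/dev/null 2>&1; then
  vcftools --vcf "$vcf" \
    --weir-fst-pop "$workdir/population_1.txt" \
    --weir-fst-pop "$workdir/population_2.txt" \
    --out "$workdir/pop1_vs_pop2"
else
  echo "vcftools not found on PATH; skipping the Fst step"
fi
```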
VCFtools can convert VCF files into formats convenient for use in other programs. One such example is the ability to convert into PLINK format. The following command will write the variants to .ped and .map files.
./vcftools --vcf input_data.vcf --plink --chr 1 --out output_in_plink