gravatar for nina.maryn

2 hours ago by


This is my first post and my first big bioinformatics project. I'm new to command line and the tools for analyzing vcf files, so please be kind!

I have two VCF files, each representing GATK filtered Indels for RNA-seq in 6 tissues from two divergent species, A & B. They are aligned to their closest domesticated relative, C, whose genome is sequenced. We want to ultimately create "pseudo-reference" genomes for A and B using their RNA-seq data to be able to sequence more closely related species to A & B and align them more accurately in the future. At the current step, we want to filter out ALT indels that have frequency differences <= 0.6 between A & B, so that we only keep differences that are fixed between the species.

How do I compare AF site by site between two VCF files? I want to (1) keep unique indels with an ALT AF >=0.6 in each species and (2) keep shared indels where the ALT AF difference between A & B is >= 0.6. I ultimately need to end up with two vcf files again, for A and B.

I first tried importing the vcf files into R using vcfR, creating data frames, and filtering with dyplr. This worked, but I couldn't figure out an easy way to recapitulate my vcf files for further processing. (vcf2tidy is a one way street apparently and vcfR won't create a new vcf object from a data frame.) Is there a way to create a new vcf file or use my filtered CHROM & POS information to filter a vcf file? Or is there some easier way (maybe using bcftools) to compare and filter my vcfs?


Source link