I have been struggling with PopGenome for a while and have run out of ideas. I have data in VCF format with SNPs from several fragments of a few genes, a reference FASTA with the same sequence names as in the VCF, and a GTF file like this:
1 VCF CDS 68 124 . + 2 gene_id "Gene.100265";
2 VCF CDS 126 405 . + 1 gene_id "Gene.100265";
3 VCF CDS 447 820 . + 1 gene_id "Gene.100265";
4 VCF CDS 864 1078 . + 1 gene_id "Gene.100265";
The genes are sequenced in a few hundred individuals. Each gene was sequenced in a few fragments and the fragments are the same for all individuals.
I do not know how to make PopGenome include information on the coding sequence. I used:
PGfile <- PopGenome::readData("./vcf/",gffpath = "./gtf/", format = "VCF", include.unknown = TRUE)
PGfile <- set.synnonsyn(PGfile, ref.chr=paste0("./FASTA/references_uncoded_246_.fasta"))
The values in PGfile@region.data@CodingSNPS are all TRUE, but those in PGfile@region.data@ExonSNPS are all FALSE. For some reason I get only a fraction of the expected information in PGfile@region.data@codons, and PGfile@region.data@n.nucleotides is NULL anyway.
I suspect the problem is the format of the GTF file, but I found no information on how it should be formatted for PopGenome. I tried "Gene", "Exon" and "CDS" (as above) in the third column, but nothing changed.
Does anyone have an idea what I did wrong?
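To rule out basic layout problems in the GTF, a check along these lines could be run first. This is only a minimal sketch in base R: `check_gtf` is a hypothetical helper, not a PopGenome function, and it tests only the generic GTF layout (nine tab-separated fields, integer start <= end, strand one of +, - or ., frame one of 0, 1, 2 or .). It says nothing about what PopGenome may require beyond that.

```r
# Hypothetical helper (NOT part of PopGenome): generic GTF layout checks.
check_gtf <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # Keep the original line numbers, then drop blank and comment lines
  keep <- nzchar(trimws(lines)) & !startsWith(lines, "#")
  line_no <- which(keep)
  fields <- strsplit(lines[keep], "\t", fixed = TRUE)

  # Field i of every record, NA when the record has fewer than i fields
  pick <- function(i) {
    vapply(fields,
           function(f) if (length(f) >= i) f[i] else NA_character_,
           character(1))
  }

  start <- suppressWarnings(as.integer(pick(4)))
  end   <- suppressWarnings(as.integer(pick(5)))

  data.frame(
    line         = line_no,
    feature      = pick(3),
    nine_columns = lengths(fields) == 9,
    valid_coords = !is.na(start) & !is.na(end) & start <= end,
    valid_strand = pick(7) %in% c("+", "-", "."),
    valid_frame  = pick(8) %in% c("0", "1", "2", "."),
    stringsAsFactors = FALSE
  )
}

# Usage on a throwaway file: the first record is tab-separated,
# the second is space-separated and should therefore be flagged.
tmp <- tempfile(fileext = ".gtf")
writeLines(c(
  paste("1", "VCF", "CDS", "68", "124", ".", "+", "2",
        'gene_id "Gene.100265";', sep = "\t"),
  '2 VCF CDS 126 405 . + 1 gene_id "Gene.100265";'
), tmp)
res <- check_gtf(tmp)
print(res)
```

Rows where `nine_columns` is FALSE usually mean the file is delimited by spaces rather than tabs, which can happen when a file is copied from a web page or edited in a text editor.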
The session info is as follows (all necessary packages are loaded):
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS