Hello,
I am trying to partition my plastid alignment using the Arabidopsis.gff file as a reference. As a first step, I want to build a position file that assigns a key and a value to every position in the AT reference sequence. Here are a few lines from the GFF file:
NC_000932.1 RefSeq exon 527 551 . + . ID=exon-ArthCt098-1;Parent=rna-ArthCt098;Dbxref=GeneID:1466263;gbkey=tRNA;gene=trnT
NC_000932.1 RefSeq CDS 554 570 . + 0 ID=cds-NP_051054.1;Parent=gene-ArthCp017;Dbxref=Genbank:NP_051054.1,GeneID:844775;Name=NP_051054.1;gbkey=CDS;gene=psbD
NC_000932.1 RefSeq CDS 610 612 . + 0 ID=cds-NP_051055.1;Parent=gene-ArthCp018;Dbxref=Genbank:NP_051055.1,GeneID:844773;Name=NP_051055.1;Note=CP43;gbkey=CDS;gene=psbC
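As a rough illustration of how lines like these could be read, here is a minimal sketch assuming Python. The function name `parse_gff_line` is hypothetical. It splits on any whitespace, so it should work whether the file is tab-separated (as GFF normally is) or space-separated as pasted here.

```python
def parse_gff_line(line):
    """Return (feature_type, start, end) for one GFF line.

    Returns None for blank lines, comment lines, and lines with
    fewer than five columns.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Split on any whitespace; maxsplit=8 keeps the ninth
    # (attributes) column in one piece.
    fields = line.split(None, 8)
    if len(fields) < 5:
        return None
    # Column 3 is the feature type, columns 4 and 5 are start and end.
    return fields[2], int(fields[3]), int(fields[4])


example = (
    "NC_000932.1 RefSeq exon 527 551 . + . "
    "ID=exon-ArthCt098-1;Parent=rna-ArthCt098;"
    "Dbxref=GeneID:1466263;gbkey=tRNA;gene=trnT"
)
print(parse_gff_line(example))  # ('exon', 527, 551)
```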
I want to define a range of numbers (or count with a while loop) and give each position in the sequence a value according to the reference. The script should read the intervals given in the 4th and 5th columns, use every number within an interval as a key, and use the 3rd column as the corresponding value. Positions that fall outside every interval should get "na".
My desired output is below:
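The idea described above could be sketched roughly as follows, assuming Python and a dictionary keyed by position. The function name `build_position_map` and the sequence length of 620 are made up for illustration; the real length would come from the reference sequence. The intervals are the three shown above, treated as 1-based and inclusive as GFF coordinates are. Labels come from the third column, so 554 to 570 is labelled CDS here.

```python
def build_position_map(intervals, length, default="na"):
    """Map every position 1..length to a feature type or to `default`."""
    # Start with every position labelled as the default.
    positions = {pos: default for pos in range(1, length + 1)}
    for feature_type, start, end in intervals:
        # GFF coordinates are 1-based and inclusive, hence the +1.
        # min() guards against intervals running past the sequence end.
        for pos in range(start, min(end, length) + 1):
            # A later interval overwrites an earlier overlapping one.
            positions[pos] = feature_type
    return positions


# (feature type, start, end) taken from columns 3, 4 and 5.
intervals = [("exon", 527, 551), ("CDS", 554, 570), ("CDS", 610, 612)]
position_map = build_position_map(intervals, length=620)  # 620 is hypothetical

for pos in (526, 527, 551, 552, 554, 570, 571, 610, 612, 613):
    print(f'"{pos}": "{position_map[pos]}"')
```

A `for` loop over `range` avoids the manual counter entirely, so there is no stop condition to get wrong.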
"1": "na"
"2": "na"
"3": "na"
.
.
"526": "na"
"527": "exon"
"528": "exon"
"529": "exon"
.
.
.
"550": "exon"
"551": "exon"
"552": "na"
"553": "na"
"554": "CDS"
"555": "CDS"
.
.
.
"570": "CDS"
"571": "na"
.
.
.
"610": "CDS"
"611": "CDS"
"612": "CDS"
.
.
.
until the range ends.
I believe a while loop with a counter is the right approach, but I fail to terminate it correctly and it runs endlessly. I tried assigning the values directly as strings, but I realize a dictionary is what I need because the data is too big to handle that way. I would be glad if you could show me a way to solve this.
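For reference, a while loop with a counter should terminate as long as the counter is incremented on every pass and compared against a fixed upper bound. The sketch below assumes Python; the name `label_positions` and the length of 615 are hypothetical, and the intervals are the three from the GFF lines above.

```python
def label_positions(intervals, length, default="na"):
    """Return one '"pos": "label"' line per position, using a while loop."""
    lines = []
    pos = 1
    while pos <= length:  # fixed upper bound: the loop stops after `length`
        label = default
        for feature_type, start, end in intervals:
            # Inclusive on both ends, as in GFF.
            if start <= pos <= end:
                label = feature_type
        lines.append(f'"{pos}": "{label}"')
        pos += 1  # without this increment the loop would never end
    return lines


intervals = [("exon", 527, 551), ("CDS", 554, 570), ("CDS", 610, 612)]
lines = label_positions(intervals, length=615)  # 615 is hypothetical
print("\n".join(lines[525:528]))  # positions 526, 527 and 528
```

This checks every interval for every position, which is fine for a few features but slow for a whole annotation; filling a dictionary interval by interval does less work.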