# How to find the fasta file with maximum number of amino acids?

Following up on your comment that you now have the data in fasta format: you can use either R with Biostrings (which you seem to be learning now) or `awk`.

Example data:

``````
cat test.fa
>chr1
ATGCTAGCTAGCATCG
>chr2
TAGC
>chr3
GATCGATCGATCG
>chr4
TGACTGATCGACTAGCTAGCTACGTACGTACGATGCA
>chr5
GATCGATCGTACGATCG
``````

1) Solution in R:

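``````
library(Biostrings)

#/ read the fasta (use readAAStringSet() instead for protein sequences)
fa <- readDNAStringSet("test.fa")

#/ get shortest and longest via width()
w <- width(fa)
fa_final <- fa[c(which(w==min(w)), which(w==max(w)))]

#/ save back to disk:
writeXStringSet(fa_final, "test2.fa")
``````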

2) Solution with `awk` (people much better at awk than me can surely squeeze this into a single command):

``````
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < test.fa \
| awk 'BEGIN {OFS="\t"} {print $1, $2, length($2) | "sort -k3,3n"}' \
| awk '{ if(NR ==1){print $1"\n"$2 }}END {print $1"\n"$2}'
>chr2
TAGC
>chr4
TGACTGATCGACTAGCTAGCTACGTACGTACGATGCA
``````

First linearize the fasta (two tab-separated columns: header and sequence), then append a third column with the sequence length, sort by length so that the shortest entry comes first and the longest last, then select the first and last entries and write them back in fasta format.
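For what it's worth, the same result can probably be had in one pass without sorting, by keeping track of the shortest and longest record while reading. The following is only a rough sketch: the function name `fasta_extremes` is made up for illustration, it assumes the whole sequence of a record fits comfortably in memory, and on ties it keeps the first record it saw.

```shell
# Hypothetical helper (name is an assumption): print the shortest and the
# longest record of a fasta file in a single awk pass.
# Sequences wrapped over several lines are joined before measuring.
fasta_extremes() {
  awk '
    # Compare the record collected so far against the current extremes.
    function update() {
      if (hdr == "") return
      len = length(seq)
      if (n == 0 || len < minlen) { minlen = len; minhdr = hdr; minseq = seq }
      if (n == 0 || len > maxlen) { maxlen = len; maxhdr = hdr; maxseq = seq }
      n++
    }
    /^>/ { update(); hdr = $0; seq = ""; next }  # header starts a new record
         { seq = seq $0 }                        # sequence line: append
    END  {
      update()                                   # do not forget the last record
      if (n > 0) { print minhdr; print minseq; print maxhdr; print maxseq }
    }
  ' "$1"
}

# usage: fasta_extremes test.fa
```

Called on the `test.fa` above, this should print the `chr2` and `chr4` records, the same as the three-step pipeline. If you only need the longest entry, drop the `min*` lines.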