In response to your comment that you now have the data in FASTA format: you can use either R with Biostrings (which you seem to be learning at the moment) or awk.

Example data:

cat test.fa
>chr1
ATGCTAGCTAGCATCG
>chr2
TAGC
>chr3
GATCGATCGATCG
>chr4
TGACTGATCGACTAGCTAGCTACGTACGTACGATGCA
>chr5
GATCGATCGTACGATCG

1) Solution in R:

library(Biostrings)

#/ read fasta (for amino acid sequences use readAAStringSet() instead):
fa <- readDNAStringSet("test.fa")

#/ get shortest and longest via width()
w <- width(fa)
fa_final <- fa[c(which(w==min(w)), which(w==max(w)))]

#/ save back to disk:
writeXStringSet(fa_final, "test2.fa")

2) Solution with awk (people much better at awk than me can for sure squeeze this into a single command):

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < test.fa \
| awk 'OFS="\t" {print $1, $2, length($2) | "sort -k3,3n"}' \
| awk '{ if(NR ==1){print $1"\n"$2 }}END {print $1"\n"$2}'
>chr2
TAGC
>chr4
TGACTGATCGACTAGCTAGCTACGTACGTACGATGCA

First, linearize the FASTA into two tab-separated columns (header and sequence). Then append a third column with the sequence length and sort by it, so the shortest entry comes first and the longest last. Finally, select the first and last entries and print them back in FASTA format.
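
For what it's worth, the same result can probably be had in a single awk pass that just remembers the shortest and longest record while reading, with no linearizing or sorting. This is only a sketch: the function name `shortest_longest` is made up for illustration, and unlike the sort-based pipeline it keeps the first record it sees when several share the minimum or maximum length. Sequences wrapped over several lines are joined before their length is measured.

```shell
# Hypothetical helper (name is illustrative, not from any library):
# print the shortest and the longest record of a FASTA file in one awk pass.
shortest_longest() {
  awk '
    # Compare the record collected so far against the current extremes.
    function update(   len) {
      if (hdr == "") return                 # nothing collected yet
      len = length(seq)
      if (min_hdr == "" || len < min_len) { min_len = len; min_hdr = hdr; min_seq = seq }
      if (max_hdr == "" || len > max_len) { max_len = len; max_hdr = hdr; max_seq = seq }
    }
    /^>/ { update(); hdr = $0; seq = ""; next }   # new header: close the previous record
         { seq = seq $0 }                         # sequence line: append (handles wrapping)
    END {
      update()                                    # close the last record
      if (min_hdr != "") {
        print min_hdr; print min_seq
        print max_hdr; print max_seq
      }
    }
  ' "$1"
}

# Usage (assumes test.fa from the example above is in the working directory):
#   shortest_longest test.fa > test2.fa
```

If there is only one record in the file, it is printed twice, which matches what the three-step pipeline does.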


