Simple metric to remove outliers from MSA


I did an MSA for a group of proteins and some sequences are really wrong. The average sequence length is about 76 aa, but a few sequences have 226 aa and others 31 or 54; that is, they are outliers. I want to avoid adding another tool to the pipeline just to remove those sequences. Is there a simple metric I can use to cut these sequences? Probably something that cuts sequences that deviate from the average length. But I need something more rigorous than this to justify the choice. Could someone help with this?





I did an MSA for a group of proteins and some sequences are really wrong.

In the immortal words of The Big Lebowski: "Well, that's just, like, your opinion, man." I don't think sequences are right or wrong because their length is not what we expect. They can be incomplete (truncated) or have an additional domain that makes them longer. The way I think about it, the aligned sequences are either related or not. If they are related, the length is irrelevant, because even the fragmentary sequences may carry some evolutionary signal that is worth preserving.

On the other hand, for alignment visualization purposes it may be desirable to remove sequences that are too long or too short. I don't think you need to justify that other than to say that sequences longer than X residues or shorter than Y residues were removed for the purpose of cleaner visualization.
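If you do go the cutoff route, it takes only a few lines and no extra tool. Below is a minimal sketch in plain Python, assuming the alignment is already loaded as (name, aligned sequence) pairs with `-` or `.` as gap characters. The function names and the example cutoffs of 60 and 100 are made up for illustration. The `mad_bounds` helper is not something proposed above; it is just one common way (median plus or minus k times the median absolute deviation, with k = 3 picked arbitrarily here) to derive X and Y from the data if you prefer not to choose them by eye.

```python
from statistics import median


def ungapped_length(aligned_seq):
    """Count residues in an aligned sequence, ignoring gap characters."""
    return sum(1 for c in aligned_seq if c not in "-.")


def filter_by_length(records, min_len, max_len):
    """Keep (name, aligned_seq) pairs whose ungapped length is within
    [min_len, max_len], both ends inclusive."""
    return [
        (name, seq)
        for name, seq in records
        if min_len <= ungapped_length(seq) <= max_len
    ]


def mad_bounds(lengths, k=3.0):
    """Hypothetical helper: return (median - k*MAD, median + k*MAD).

    MAD is the median absolute deviation from the median. The value of k
    is an assumption; report whatever you use in the methods.
    """
    lengths = list(lengths)
    med = median(lengths)
    mad = median([abs(x - med) for x in lengths])
    return med - k * mad, med + k * mad


# Toy alignment with made-up sequences; lengths mimic those in the question.
records = [
    ("typical_a", "A" * 76 + "-" * 150),
    ("typical_b", "A" * 75 + "-" * 151),
    ("typical_c", "A" * 77 + "-" * 149),
    ("too_long", "A" * 226),
    ("too_short", "A" * 31 + "-" * 195),
]

# Explicit cutoffs (Y = 60, X = 100), chosen by eye for this toy data.
kept = filter_by_length(records, 60, 100)
kept_names = [name for name, _ in kept]

# Data-driven cutoffs from the ungapped lengths.
low, high = mad_bounds([ungapped_length(seq) for _, seq in records])
kept_mad = filter_by_length(records, low, high)
```

Note that after dropping sequences some alignment columns may consist only of gaps, so you may want to strip those columns or realign. Either way, state the cutoffs you used.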
