Testing independence between 5' and 3' terminal sequences in a DNA database


I have a small database of ~1000 biologically validated DNA sequences, ranging in length from 5 to 20 kb. From each of them, I have extracted the 1000 bp at the 5' end and, separately, the 1000 bp at the 3' end. How can I test whether these 5' and 3' terminal sequences are independent?

If these were numbers instead of sequences, I would perform a chi-square test, right? But what is the equivalent test for independence of DNA sequences? In other words, does the presence of a certain 5' terminal sequence correlate (directly or inversely) with the presence of a certain 3' terminal sequence, and vice versa?

I cannot think of a way to perform this independence test directly on the DNA sequences, so I am asking BioStars for help.

Currently, I am thinking of this pipeline:

  1. Cluster 5' terminal sequences at varying identities, record the cluster memberships.
  2. Repeat step 1 in exactly the same way, for 3' sequences.
  3. For each of my 1000 DNA sequences, report the cluster memberships separately for both ends, at each clustering identity threshold (%).
  4. From the table generated in step 3, determine if, and at what % identity, there appears to be any correlation. Would this step be based on multinomial logistic regression? (note to self: wiki link)
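
To make step 4 concrete, here is a minimal sketch of one possible test, assuming the clustering has already been done and each sequence has one 5' cluster label and one 3' cluster label at a given identity threshold. It computes the Pearson chi-square statistic on the 5' x 3' contingency table and gets a p-value by shuffling the 3' labels, since with many small clusters the expected cell counts are often too low for the usual chi-square approximation. The function names, the toy labels, and the choice of a permutation test are assumptions made for illustration, not part of the pipeline above.

```python
import random
from collections import Counter


def chi_square_stat(labels_5p, labels_3p):
    """Pearson chi-square statistic for the 5' x 3' cluster contingency table."""
    n = len(labels_5p)
    row_totals = Counter(labels_5p)                 # sequences per 5' cluster
    col_totals = Counter(labels_3p)                 # sequences per 3' cluster
    observed = Counter(zip(labels_5p, labels_3p))   # sequences per (5', 3') pair
    stat = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n        # count expected under independence
            stat += (observed.get((r, c), 0) - expected) ** 2 / expected
    return stat


def permutation_p_value(labels_5p, labels_3p, n_perm=2000, seed=0):
    """Shuffle the 3' labels to build a null distribution of the statistic."""
    rng = random.Random(seed)
    observed = chi_square_stat(labels_5p, labels_3p)
    shuffled = list(labels_3p)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)                       # breaks any 5'/3' pairing
        # small tolerance guards against floating-point summation-order noise
        if chi_square_stat(labels_5p, shuffled) >= observed - 1e-9:
            hits += 1
    # add-one correction so the estimated p-value is never exactly zero
    return observed, (hits + 1) / (n_perm + 1)


# Toy cluster labels (made up): one entry per original sequence, same order for both ends.
five_prime = ["A", "A", "B", "B", "C", "C", "A", "B", "C", "A"]
three_prime = ["x", "x", "y", "y", "z", "z", "x", "y", "z", "y"]
stat, p = permutation_p_value(five_prime, three_prime)
print(f"chi-square = {stat:.2f}, permutation p = {p:.4f}")
```

The same function could be run once per identity threshold from step 3; because that means several tests on the same data, the resulting p-values would need a multiple-testing correction.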

Please suggest any changes in approach and / or implementation. Thanks in advance!






