Asked 4 hours ago by biohacker_tobe

I have a dataframe that lists which gene clusters occur in which genomes. It is stored as a well-formatted tab-delimited file and looks like the example below:

Dataframe Example

    Gene Cluster     Genome
    GCF3372      Streptomyces_hygroscopicus
    GCF3450      Streptomyces_sp_Hm1069
    GCF3371      Streptomyces_sp_MBT13
    GCF3371      Streptomyces_xiamenensis

Based on this, I want to measure how often each GCF occurs per genome, as well as how often a given GCF co-occurs with other GCFs across genomes. To run a statistical analysis, I thought I could turn this dataframe into a presence/absence table (a contingency table) with values of 0 and 1, where 1 means a particular gene cluster is present in a genome and 0 means it is absent.
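One possible way to build such a table, as a minimal sketch rather than a definitive solution: it assumes pandas is available and uses the four example rows from the table above as toy data. The variable names (`presence`, `occurrence`, `cooccurrence`) and the file name mentioned in the comment are hypothetical.

```python
import io

import pandas as pd

# Toy data copied from the example table. For a real file, something like
# pd.read_csv("clusters.tsv", sep="\t") would be used instead
# ("clusters.tsv" is a hypothetical file name).
raw = io.StringIO(
    "Gene Cluster\tGenome\n"
    "GCF3372\tStreptomyces_hygroscopicus\n"
    "GCF3450\tStreptomyces_sp_Hm1069\n"
    "GCF3371\tStreptomyces_sp_MBT13\n"
    "GCF3371\tStreptomyces_xiamenensis\n"
)
df = pd.read_csv(raw, sep="\t")

# Count each genome/GCF pair, then clip to 0/1 so that a GCF listed
# more than once for a genome still reads as simply "present".
counts = pd.crosstab(df["Genome"], df["Gene Cluster"])
presence = counts.clip(upper=1)

# Occurrence: number of genomes in which each GCF is present.
occurrence = presence.sum(axis=0)

# Co-occurrence: entry (i, j) is the number of genomes containing both
# GCF i and GCF j; the diagonal repeats the occurrence counts.
cooccurrence = presence.T.dot(presence)

print(presence)
print(cooccurrence)
```

Rows of `presence` are genomes and columns are GCFs, so transposing it gives the GCF-by-genome view if that suits the downstream statistics better.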

What types of visualization would anyone suggest for this kind of data? The best option I have come up with so far is a heatmap.
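For illustration, a heatmap of a 0/1 presence/absence matrix could be sketched as follows. This assumes pandas and matplotlib are installed; the small matrix is typed in by hand from the example table, and the output file name is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed

import matplotlib.pyplot as plt
import pandas as pd

# Presence/absence matrix for the four example rows (genomes x GCFs).
presence = pd.DataFrame(
    [[0, 1, 0], [0, 0, 1], [1, 0, 0], [1, 0, 0]],
    index=[
        "Streptomyces_hygroscopicus",
        "Streptomyces_sp_Hm1069",
        "Streptomyces_sp_MBT13",
        "Streptomyces_xiamenensis",
    ],
    columns=["GCF3371", "GCF3372", "GCF3450"],
)

fig, ax = plt.subplots(figsize=(5, 4))

# Draw the matrix as an image: white cells are absent, black cells present.
image = ax.imshow(presence.to_numpy(), cmap="Greys", aspect="auto", vmin=0, vmax=1)

# Label the axes with the GCF and genome names.
ax.set_xticks(range(presence.shape[1]))
ax.set_xticklabels(presence.columns, rotation=45, ha="right")
ax.set_yticks(range(presence.shape[0]))
ax.set_yticklabels(presence.index)
ax.set_xlabel("Gene cluster")
ax.set_ylabel("Genome")

fig.colorbar(image, ax=ax, label="absent (0) / present (1)")
fig.tight_layout()
fig.savefig("gcf_presence_heatmap.png", dpi=150)  # hypothetical output name
```

With many genomes and GCFs, reordering rows and columns by hierarchical clustering before plotting usually makes co-occurring clusters easier to spot.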

To be honest, I am not sure where to go from here and would highly appreciate some pointers. Thanks in advance 🙂
