I currently have a data frame that lists which gene clusters occur in which genomes. It is stored as a well-formatted tab-delimited file and looks like the example below:
Gene Cluster    Genome
------------------------------------------
GCF3372         Streptomyces_hygroscopicus
GCF3450         Streptomyces_sp_Hm1069
GCF3371         Streptomyces_sp_MBT13
GCF3371         Streptomyces_xiamenensis
Based on this, I want to measure the occurrence of each GCF per genome, as well as the co-occurrence of a given GCF with other GCFs across genomes. To do this, I would like to build a presence/absence table (contingency table) from this data frame, with a value of 1 if a particular gene cluster is present in a genome and 0 if it is absent, so that I can run a statistical analysis on the resulting matrix.
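To make the goal concrete, here is a minimal sketch of the table I have in mind, assuming Python with pandas. The column names gene_cluster and genome are placeholders I made up, the data are just the four example rows, and I am not sure this is the best way to do it:

```python
import io

import pandas as pd

# The four example rows, as tab-delimited text.
# The header names are placeholders, not the real file's header.
raw = (
    "gene_cluster\tgenome\n"
    "GCF3372\tStreptomyces_hygroscopicus\n"
    "GCF3450\tStreptomyces_sp_Hm1069\n"
    "GCF3371\tStreptomyces_sp_MBT13\n"
    "GCF3371\tStreptomyces_xiamenensis\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t")

# Presence/absence table: rows are genomes, columns are GCFs.
# crosstab counts occurrences; clip turns any count above 1 into 1.
presence = pd.crosstab(df["genome"], df["gene_cluster"]).clip(upper=1)

# GCF x GCF co-occurrence: number of genomes containing both GCFs.
# The diagonal holds the number of genomes containing each GCF.
cooccurrence = presence.T.dot(presence)

print(presence)
print(cooccurrence)
```

With the real file, the idea would be to replace the io.StringIO(raw) part with the path to the tab-delimited file.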
Also, what types of visualization would anyone suggest for this kind of data? My best idea so far is a heatmap.
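As a rough illustration of the heatmap idea, this is a small sketch assuming pandas and matplotlib. The column names, the output file name, and the colour map are my own arbitrary choices, and the data are only the example rows:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import matplotlib.pyplot as plt
import pandas as pd

# Example rows only; column names are placeholders.
df = pd.DataFrame(
    {
        "gene_cluster": ["GCF3372", "GCF3450", "GCF3371", "GCF3371"],
        "genome": [
            "Streptomyces_hygroscopicus",
            "Streptomyces_sp_Hm1069",
            "Streptomyces_sp_MBT13",
            "Streptomyces_xiamenensis",
        ],
    }
)

# 0/1 presence table: rows are genomes, columns are GCFs.
presence = pd.crosstab(df["genome"], df["gene_cluster"]).clip(upper=1)

# Draw the 0/1 matrix as a two-colour heatmap.
fig, ax = plt.subplots(figsize=(6, 4))
image = ax.imshow(presence.to_numpy(), cmap="Greys", aspect="auto", vmin=0, vmax=1)
ax.set_xticks(range(presence.shape[1]))
ax.set_xticklabels(presence.columns, rotation=90)
ax.set_yticks(range(presence.shape[0]))
ax.set_yticklabels(presence.index)
ax.set_xlabel("GCF")
ax.set_ylabel("Genome")
fig.colorbar(image, ax=ax, label="absent (0) / present (1)")
fig.tight_layout()
fig.savefig("gcf_presence_heatmap.png", dpi=150)  # hypothetical output name
```

With many genomes and GCFs, a clustered version (for example seaborn's clustermap) might group similar rows and columns together, but I have not tried that on this data.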
To be honest, I am not sure where to go from here and would highly appreciate some pointers. Thanks in advance 🙂