Hi all, I need some help with Python and pandas.

I have a dataframe in which the column seq1_id holds all the sequence IDs of species 1 (sp1) and the second column, seq2_id, holds the sequence IDs of species 2 (sp2).

I applied a filter to those sequences and got two more dataframes: one with all the sp1 sequences that passed the filter, and one with all the sp2 sequences that passed the filter.

So I now have 3 dataframes.

Because within a pair one sequence can pass the filter while the other does not, it is important to keep only the pairs in which both genes passed the filtering. So what I actually need to do is parse my first df, which looks like this:

Seq1.id      Seq2.id
seq1_01     seq5_02
seq2_01     Seq6_02
seq3_01     Seq7_02
seq4_01     Seq8_02

and check row by row whether (e.g. for the first row) seq1_01 is present in df2 and seq5_02 is also present in df3; if so, keep this row of df1 and add it to a new df4.

Here is an example with the wanted output:

first df: 

 Seq1.id     Seq2.id
seq1_01     seq5_02
seq2_01     Seq6_02
seq3_01     Seq7_02
seq4_01     Seq8_02

df2 (sp1) (seq3_01 is absent)

    Seq_1.id   
    seq1_01     
    seq2_01        
    seq4_01 


df3 (sp2) (Seq8_02 is absent)

   Seq_2.id
   seq5_02
   Seq6_02
   Seq7_02

Then, because Seq8_02 and seq3_01 are not present, df4 (the output) would be:

    Seq1.id   Seq2.id
    seq1_01     seq5_02
    seq2_01    Seq6_02
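
For the toy example above, one possible vectorized approach is a boolean mask built with `Series.isin`. This is only a minimal sketch: the three dataframes are rebuilt by hand from the tables above, and the column names (`Seq1.id`, `Seq_1.id`, ...) are copied from the example, so they would need adapting to the real files.

```python
import pandas as pd

# Toy dataframes rebuilt by hand from the example tables above.
df1 = pd.DataFrame({
    "Seq1.id": ["seq1_01", "seq2_01", "seq3_01", "seq4_01"],
    "Seq2.id": ["seq5_02", "Seq6_02", "Seq7_02", "Seq8_02"],
})
df2 = pd.DataFrame({"Seq_1.id": ["seq1_01", "seq2_01", "seq4_01"]})  # sp1, filtered
df3 = pd.DataFrame({"Seq_2.id": ["seq5_02", "Seq6_02", "Seq7_02"]})  # sp2, filtered

# A row is kept only if BOTH members of the pair passed their filter.
mask = df1["Seq1.id"].isin(df2["Seq_1.id"]) & df1["Seq2.id"].isin(df3["Seq_2.id"])
df4 = df1[mask].reset_index(drop=True)
print(df4)
```

On this toy data the mask keeps the first two rows only, which matches the wanted df4.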

I tried:

HGT_candidats_0035=candidates_0035
HGT_candidats_0042=candidates_0042

#convert gene names into a list
gene_name_0035=[]
for i in HGT_candidats_0035["gene"]:
    gene_name_0035.append(i)

gene_name_0042=[]
for i in HGT_candidats_0042["gene"]:
    gene_name_0042.append(i)

#Keep only paired sequences
seq1_id=[]
for i in dN_dS["seq1_id"]:
    seq1_id.append(i)

seq2_id=[]
for i in dN_dS["seq2_id"]:
    seq2_id.append(i)

newdf = pd.DataFrame(columns=("seq1_id","seq2_id"))

for a, b in zip(seq1_id,seq2_id):
    if a in gene_name_0035 and b in gene_name_0042:
        newdf=newdf.append({"seq1_id":a,"seq2_id":b}, ignore_index=True)

But I think it takes too long to run.

Here is your code, with my data:
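
Two things probably make the loop above slow: `a in gene_name_0035` scans a whole list for every row, and `DataFrame.append` copies the growing dataframe on every call (it was also removed in pandas 2.0). A sketch that keeps the same loop logic but uses sets and builds the dataframe once could look like this. The small tables here are invented stand-ins for the real files, and `genes_0035`, `genes_0042` and `kept` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-ins for the real tables (values invented for illustration).
candidates_0035 = pd.DataFrame({"gene": ["g1", "g2", "g4"]})
candidates_0042 = pd.DataFrame({"gene": ["h1", "h2", "h3"]})
dN_dS = pd.DataFrame({
    "seq1_id": ["g1", "g2", "g3", "g4"],
    "seq2_id": ["h1", "h2", "h3", "h4"],
})

# Sets give constant-time membership tests instead of scanning a list.
genes_0035 = set(candidates_0035["gene"])
genes_0042 = set(candidates_0042["gene"])

# Collect the surviving pairs first, then build the dataframe in one go.
kept = [
    {"seq1_id": a, "seq2_id": b}
    for a, b in zip(dN_dS["seq1_id"], dN_dS["seq2_id"])
    if a in genes_0035 and b in genes_0042
]
newdf = pd.DataFrame(kept, columns=["seq1_id", "seq2_id"])
print(newdf)
```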

candidates_0035=pd.read_csv("candidates_genes_filtering_0035",sep='\t')
candidates_0042=pd.read_csv("candidates_genes_filtering_0042",sep='\t')
dN_dS=pd.read_csv("dn_ds.out_sorted",sep='\t')

df4 = pd.DataFrame(columns=dN_dS.columns)
print(df4)
for index, row in dN_dS.iterrows():
    if row['seq1_id'] in candidates_0042['gene'] and row['seq2_id'] in candidates_0035['gene']:
        df4 = df4.append(row, ignore_index=True)

df4.to_csv("new_df",sep='\t')

and here is the empty output of df4:

Unnamed: 0  Unnamed: 0.1    seq1_id seq2_id dN  dS  Dist_third_pos  Dist_brute  Length_seq_1    Length_seq_2    GC_content_seq1 GC_content_seq2 GC  Mean_length
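
One likely reason df4 comes out empty: on a pandas Series, the `in` operator tests the index labels, not the values, so `row['seq1_id'] in candidates_0042['gene']` is False for every gene name when the index is the default 0, 1, 2, ... The sketch below illustrates this with invented gene names (`geneA`, `geneB` are hypothetical). It is also worth double-checking that seq1_id is compared against the right table, since this snippet tests it against candidates_0042 while the earlier loop tested it against candidates_0035.

```python
import pandas as pd

# Hypothetical gene names, standing in for the "gene" column.
genes = pd.Series(["geneA", "geneB"], name="gene")

in_index = "geneA" in genes          # tests index labels (0, 1), not the values
in_index_label = 0 in genes          # 0 is an index label
in_values = "geneA" in genes.values  # tests the underlying array of values
in_set = "geneA" in set(genes)       # a set is the faster choice inside a loop

print(in_index, in_index_label, in_values, in_set)
```

Comparing against `.values`, a set, or using `Series.isin` on the whole column avoids the index lookup.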

Here are the data:
drive.google.com/file/d/1FR9MUk4x0NoM-r3F4oe6dt5HgDMaUlKv/view?usp=sharing
drive.google.com/file/d/1MWRJwqRAA2B7eAXG1hcnIAqeQyjtx7pT/view?usp=sharing
drive.google.com/file/d/10ZP-Awx_qevKoT-AfMjDpd8KKaUcsEog/view?usp=sharing


