gravatar for venura

8 hours ago by

Hi,

I have a document (its 100 pages long and only two instances were displayed below) with the following output came from NSite (softberry);

    QUERY: STBZIP38
     Length of Query Sequence:       2000 bp     | Nucleotide Frequencies:  A -  0.34   G -  0.16   T -  0.35   C -  0.15


     TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1
     Motifs on "+" Strand: Mean Exp. Number   0.00391     Up.Conf.Int.  1     Found   1
         421  tCCACGTGGC      430 (Mism.= 1)

     Motifs on "-" Strand: Mean Exp. Number   0.00391     Up.Conf.Int.  1     Found   1
         430  GCCACGTGGa      421 (Mism.= 1)

     TFBS AC: RSP00153//OS: Parsley, Petroselinum crispum /GENE: CHS/TFBS: Box II /BF: CPRF-1; CPRF-2; CPRF-3;
     Motifs on "+" Strand: Mean Exp. Number   0.00358     Up.Conf.Int.  1     Found   1
         422  CCACGTGGCa      431 (Mism.= 1)

     TFBS AC: RSP00154//OS: parsley (Petroselinum crispum) /GENE: CHS/TFBS: ACE (CHS) /BF: bZIP factors CPRF1, CPRF4
     Motifs on "+" Strand: Mean Exp. Number   0.00358     Up.Conf.Int.  1     Found   1
         422  CCACGTGGCa      431 (Mism.= 1)
Totally      50 motifs of    43 different TFBSs have been found
____________________________________________________________

 QUERY: STBZIP17
 Length of Query Sequence:       2000 bp     | Nucleotide Frequencies:  A -  0.37   G -  0.13   T -  0.39   C -  0.11


 TFBS AC: RSP00577//OS: tomato (Lycopersicon esculentum), Lycopersicon esculentum /GENE: rbcS3A/TFBS: AT-rich FF2 /BF: unknown nuclear factor
 Motifs on "-" Strand: Mean Exp. Number   0.00187     Up.Conf.Int.  1     Found   1
     206  AATAATTAaAcATTAATTAA      187 (Mism.= 2)

 TFBS AC: RSP00797//OS: potato (Solanum tuberosum) /GENE: patatin 21/TFBS: SURE-1 /BF: SURF
 Motifs on "-" Strand: Mean Exp. Number   0.00440     Up.Conf.Int.  1     Found   1
    1027  TAAAGAATAaAAAAAaaAA     1009 (Mism.= 3)

 TFBS AC: RSP00864//OS: arabidopsis (Arabidopsis thaliana) /GENE: STK/TFBS: GA-5 /BF: BPC1
 Motifs on "-" Strand: Mean Exp. Number   0.00260     Up.Conf.Int.  1     Found   1
    1966  AGAGAGAGA     1958 (Mism.= 0)

The output I want is as follows;

STBZIP38    RSP00073//OS
STBZIP38    RSP00153//OS
STBZIP38    RSP00154//OS
STBZIP17    RSP00577//OS
STBZIP17    RSP00797//OS
STBZIP17    RSP00864//OS

First I tried with Regex (in python) with help from folks at StackOverflow. But to achieve the overall task it will need a very very long code (esp given the fact that the number of TFBS is different from one query to another).

pat1='STB.*d*'
pat2 = 'RSP.*OS'

m = re.findall(pat1,s)
n = re.findall(pat2, s)

#print(m, n)

print(m[0],  n[0])
print(m[0],  n[1])
print(m[0],  n[2])
print(m[1], n[3]) 
print(m[1],  n[4])
print(m[1],  n[5])

I really appreciate it if someone can help me to come up with either python or bash script. Thanks in advance.



Source link