How can I find matches between two csv files according to multiple columns using awk/bash?

1

*Reposted question with no images, removed useless information.

I want to compare File 1 and 2 in two ways:

  • First, match "subject id" in File 2 with "seqnames" in File 1
  • Then, see if File 2's s.start-s.end range overlaps or falls within File 1's start and end range.

If such matches occur, I would like to print out columns 2-23 from File 1, and then every column but #2 (subject id) from File 2 to the right of that.

File 1:

strand  seqnames      start  end    width  [18moreColumns]
1       Ha412HOChr01  1000   10000  9000    ...
2       Ha412HOChr01  19000  21000  2000    ...
3       Ha412HOChr08  2000   11000  9000    ...
4       Ha412HOChr17  92500  94400  1900    ...

File 2:

query id  subject id    % identity  alignment  mismatch  gap open  s.start  s.end  [3 more columns]
XP_02299  Ha412HOChr01  100         335        0         0         2800     3200   ...
XP_02356  Ha412HOChr01  99.86       189        2         1         18900    19800  ...
XP_04271  Ha412HOChr03  98.21       431        4         0         2100     2432   ...
XP_05788  Ha412HOChr08  93.29       38         0         0         7600     8210   ...
XP_06354  Ha412HOChr08  97.89       98         0         0         32100    34980  ...

Output:

seqnames      start  end    width  [18moreColumns]  query id  % identity  alignment  mismatch  gap open  s.start  s.end  [3 more columns]
Ha412HOChr01  1000   10000  9000    ...             XP_02299  100         335        0         0         2800     3200   ...
Ha412HOChr01  19000  21000  2000    ...             XP_02356  99.86       189        2         1         18900    19800  ...
Ha412HOChr08  2000   11000  9000    ...             XP_05788  93.29       38         0         0         7600     8210   ...


awk


genomics


bash

• 58 views



Source link