You could use the BEDOPS kit to solve this.

$ awk -v FS="t" -v OFS="t" '{ n=split($2, a, ":"); print a[1], a[2]-1, a[2]; }' original_table.txt 
   | sort-bed - 
   > positions.bed

Get annotations:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/refGene.txt.gz 
  | gunzip -c - 
  | awk -v FS="t" -v OFS="t" '{ if (!match($13, /.*-[0-9]+/)) { print $3, $5, $6, $13, ".", $4; } }' - 
  | sort-bed - 
  > refGene.hg38.bed

Then map:

$ bedmap --echo --echo-map-id-uniq --delim 't' --unmapped-val "." positions.bed refGene.hg38.bed > mapped.positions.bed

To put your mapped data into the desired format, you'll need to associate positions with your original file:

#!/usr/bin/env python

import sys

original_table_fn = sys.argv[1]
mapped_positions_fn = sys.argv[2]

''' build position table '''
t = {}
with open(original_table_fn, "r") as ifh:
    for line in ifh:
        elems = line.rstrip().split('t')
        t[elems[1]] = elems 

''' map genes to position table '''
with open(mapped_positions_fn, "r") as ifh:
    for line in ifh:
        (chrom, start, stop, genes) = line.rstrip().split('t')
        k = "{}:{}".format(chrom, stop)
        try:
            v = t[k]
            v[1] = genes # swap position key for mapped gene(s)
            sys.stdout.write('{}n'.format('t'.join(v)))
        except KeyError:
            sys.stderr.write('Warning: Position {} was not foundn'.format(k))

Usage:

$ ./reverse_map.py original_table.txt mapped.positions.bed > answer.txt



Source link