To get genes into BED format:

$ wget -qO- ftp://ftp.ensembl.org//pub/release-103/gtf/homo_sapiens/Homo_sapiens.GRCh38.103.gtf.gz | gunzip -c | awk '{ print $0" transcript_id "";"}' | gtf2bed - | awk '($8=="gene")' > Homo_sapiens.GRCh38.103.gene.bed

It is necessary to add a blank transcript_id. Unfortunately, Ensembl GTF files do not follow specification, but the awk statement fixes that.

To prep reads:

$ sort-bed reads.bed > reads.sorted.bed

You may want to check that the chromosome field in reads.sorted.bed use the Ensembl chromosome naming scheme, i.e., 1, 2, etc. If your BED file uses the UCSC scheme (chr1, chr2, etc.) then the chr string needs to be removed, e.g. with awk or similar. Chromosome name schemes should match, or mapping will not work.

To map reads to each gene:

$ bedmap --echo --echo-map-id-uniq --skip-unmapped Homo_sapiens.GRCh38.103.gene.bed reads.sorted.bed > reads_mapped_to_genes.bed

To map genes to each read:

$ bedmap --echo --echo-map-id-uniq --skip-unmapped reads.sorted.bed Homo_sapiens.GRCh38.103.gene.bed > genes_mapped_to_reads.bed

Reference:



Source link