vkkodali (United States), 1 hour ago

The file you are looking for is feature_table.txt.gz, located in the same FTP directory as the genome FASTA, assembly report, GPFF, etc. For example, this is the FTP path for the Salmonella assembly: ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2 and you can find GCF_000006945.2_ASM694v2_feature_table.txt.gz in that directory. It is a tab-delimited file with information about all of the features annotated on the genome. Specifically, you can get the range of genes using something along the lines of:

zcat GCF_000006945.2_ASM694v2_feature_table.txt.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($1~/^#/ || $1=="gene"){print $7,$8,$9,$10,$15,$16,$17}'
genomic_accession  start  end    strand  symbol  GeneID   locus_tag
NC_003197.2        190    255    +       thrL    1251519  STM0001
NC_003197.2        325    2799   +       thrA    1251520  STM0002
NC_003197.2        2789   3730   +       thrB    1251521  STM0003
NC_003197.2        3722   5020   +       thrC    1251522  STM0004

Note that the coordinates in this table are 1-based. If you want to use bedtools for any downstream steps, subtract 1 from the start position and leave the end position unchanged, because BED intervals are 0-based and half-open.
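If it helps, here is a rough sketch of that conversion. The function name feature_table_to_bed is made up for illustration, and the choice of locus_tag as the BED name and 0 as a placeholder score are my assumptions, not anything the feature table prescribes. It reads the uncompressed table on stdin and writes six-column BED:

```shell
# Hypothetical helper (name is illustrative): convert gene rows of an
# NCBI feature table read from stdin into BED6 on stdout.
feature_table_to_bed() {
  awk 'BEGIN { FS = "\t"; OFS = "\t" }
       $1 == "gene" {
         # Feature table coordinates are 1-based and inclusive.
         # BED is 0-based, half-open: start - 1, end unchanged.
         # Columns: chrom, start, end, name (locus_tag, an assumption),
         # score (placeholder 0), strand.
         print $7, $8 - 1, $9, $17, 0, $10
       }'
}
```

You would then run something like `zcat GCF_000006945.2_ASM694v2_feature_table.txt.gz | feature_table_to_bed > genes.bed` and pass genes.bed to bedtools. The header line and non-gene rows (CDS, etc.) are skipped because only rows whose first column is exactly "gene" are printed.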


