Hello there,

The top-level FASTA file includes chromosomes, regions not assembled into chromosomes, and N-padded haplotype/patch regions. See more here: ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna/README. If you are only looking for the chromosome-level sequences of the reference genome assembly, use the primary_assembly.fa file.

The files in the dna_index directory are genomic sequence files which are bgzipped and tabix indexed (for more details on what this means see: www.htslib.org/doc/tabix.html). These are downloaded by the Variant Effect Predictor (VEP) installer to allow quicker VEP'ing. The fasta file without the .fai or .gzi suffix, although stated to be a different size, is identical to the fasta file in the fasta/mus_musculus/dna/ folder so you can download either and you'd get the same data.
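If you would rather confirm this yourself than rely on the listed file sizes, one option is to compare checksums of the decompressed data. The following is a minimal sketch, not something the Ensembl team provides: the function names are made up for illustration, and it assumes both files are already downloaded locally. It relies on bgzipped files being readable as ordinary multi-block gzip files, which Python's gzip module handles.

```python
import gzip
import hashlib


def sha256_of_decompressed(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a gzip/bgzip file's decompressed bytes."""
    digest = hashlib.sha256()
    # gzip.open reads concatenated gzip members, so bgzip output works too.
    with gzip.open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_sequence_data(path_a, path_b):
    """True if both compressed files decompress to exactly the same bytes."""
    return sha256_of_decompressed(path_a) == sha256_of_decompressed(path_b)
```

Calling same_sequence_data() on the copy from fasta/mus_musculus/dna/ and the copy from dna_index should return True if the two really hold the same sequence, whatever their compressed sizes are.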

We'll update the README files, or 'hide' the dna_index folder to avoid confusion between these files in the two folders. Thanks for bringing it to our attention!

From the Ensembl README (emphasis mine)

TOPLEVEL

These files contain all sequence regions flagged as toplevel in an Ensembl schema. This includes chromosomes, regions not
assembled into chromosomes and N padded haplotype/patch regions.

From the STAR manual (emphasis mine)

2.2.1 Which chromosomes/scaffolds/patches to include?

It is strongly recommended to include major chromosomes (e.g., for human
chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized
scaffolds. Typically, un-placed/un-localized scaffolds add just a few
MegaBases to the genome length, however, a substantial number of reads
may map to ribosomal RNA (rRNA) repeats on these scaffolds. These
reads would be reported as unmapped if the scaffolds are not included
in the genome, or, even worse, may be aligned to wrong loci on the
chromosomes. Generally, patches and alternative haplotypes should not
be included in the genome.
Examples of acceptable genome sequence files:
• ENSEMBL: files marked with .dna.primary.assembly, such as:

ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

So I'd say no, you don't have the right reference. Use "primary assembly" as recommended.
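If you are unsure which kind of file you have on disk, a quick look at the FASTA headers usually settles it. Below is a rough sketch under one assumption: that headers follow the usual Ensembl layout, where the second whitespace-separated field gives the sequence type (for example "dna:chromosome" or "dna:scaffold"). The helper names are hypothetical and this is not part of STAR or Ensembl tooling.

```python
import gzip
from collections import Counter


def open_fasta(path):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def summarize_headers(lines):
    """Count FASTA records by the second whitespace-separated header field.

    Assumes Ensembl-style headers such as
    '>1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF'.
    Headers with no second field are counted as 'unknown'.
    """
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            kind = fields[1] if len(fields) > 1 else "unknown"
            counts[kind] += 1
    return counts
```

For example, with open_fasta(path) as handle: print(summarize_headers(handle)) gives the number of records of each type. Printing the sequence names as well is worthwhile, since in my experience patch and haplotype records in top-level files tend to have names starting with "CHR_"; treat that naming as an assumption and check it against your own file.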

If you want to analyse haplotypes, the FASTA file you have is the right one.

The GTF you have is the right one.

Be careful: the chromosome names are not "standard" and some aligners may struggle with them. In your file chr1 is named 1, so you may have to rename the chromosomes to chr1, chr2, etc.
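If renaming does turn out to be necessary, it can be done on the header lines alone. This is only a sketch with hypothetical helper names. It assumes that numbered chromosomes, X and Y just need a "chr" prefix, that the mitochondrial sequence is called MT in your file and should become chrM, and that scaffolds keep their names; check those assumptions against whatever naming your downstream tools expect.

```python
def add_chr_prefix(name, mito_names=("MT",)):
    """Map an Ensembl-style name such as '1' or 'X' to 'chr1' or 'chrX'."""
    if name in mito_names:
        return "chrM"  # assumption: UCSC-style name for the mitochondrion
    if name.isdigit() or name in ("X", "Y"):
        return "chr" + name
    return name  # scaffolds and anything else are left untouched


def rename_fasta_header(line):
    """Rename the sequence in a FASTA header line; pass other lines through."""
    if not line.startswith(">"):
        return line
    name, sep, rest = line[1:].rstrip("\n").partition(" ")
    return ">" + add_chr_prefix(name) + sep + rest + "\n"
```

Applying rename_fasta_header() to every line of the FASTA and writing the result to a new file gives the renamed genome. The first column of the GTF has to be renamed in exactly the same way (add_chr_prefix() can be reused for that), otherwise the annotation will no longer match the sequence names.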

