Hello.

I would like to learn how to analyze data from DNA sequencing experiments. Once such an experiment is done, we obtain a file in a format called FASTA. It is a text file that contains the reads and their sequences.

If I understood correctly, a read is a piece of DNA from the genome (I am talking about sequencing the entire genome).

My question is: is a read a piece of DNA located at a random position in the genome, or does it represent the sequence of a specific gene in the genome?

If I download a FASTA file from NCBI, there is a description line like this:

>lcl|CP003685.1_cds_AFN02977.1_1 [locus_tag=PFC_00005] [protein=hypothetical protein] [protein_id=AFN02977.1] [location=43..2532] [gbkey=CDS]

And then the nucleotide sequence, like this:

ATGAGGAAAAAACTTGTTGGAATATTGACAATATTGGTTGCTTTGGGCATGTTAGTAAGCCCACTTCTAA
AGCCAGTAGCAGCAGAGGATCAGAAGGTTCTTAAGATAGCAATGTACTCAGCAACTGGTTCTCTATTTAT
GGGTGCATGGAACCCAAGTTCAGCAGGTTTCAGAGATGTGTATTCAACTAGAGCTGCAGGGTTGGCCCAG
GATGAGGGAGCATACGTTTGGGGTATTGAGGGTGACTACCACCCATACAGATGTACCTTAGTTGAAGGTA
AGGAAAATGTAAAGGTACCAGAAACTGCTTTAGTCTTCAATACAACCACCAAGAAGTGGCAACCTGATCA
TGCTGGAGAAGTTGCTCCAACCGCGGCTACCTTCAAGTGCCAAAAGATCTACTTCCACGATGGCCACAAG
CTCACAGTTGCTGATGTAATGTACGGCTACTACTGGTCATGGGAGTGGTCAAGCCAAGATGGAGACCAAG
ATCCATACTTCGATGCAAACGAGGCTGACTGGAGCGCAGAAGCAATGCAAAAGCTCCTCGGTATTGAGGT
TAAGGAAGAAGACGATAATTACTTTGTAGTAACCATCTACCACACCTACACATTCCCACCCTACAAGAAG
TATCAATACTGGTACTTCACGCCCTACGCAAGCTATCCATGGCAACTCATTTATGCCATGAGCGAACTTG
TTGCCGAGAGCAACAGGGCTAGGTTTGCCAACCAGACTGAAGGTGTAGAATTGTTCTCATTCAGTGAATC
TACTGAAGACATTCAACAGATTGATATGCTAACACCTTCTCACGCTAAGAAGGTTGCTGAAATGCTTGAG
AAGTTGAAGAATGAGAAGCCAATACCTGACGTTATTAAGGACTTCATCTATGACGAGCAGGACGAGATTA
AGGAATATGACTCCATTATCAACTTTATAAACACTCACAACCACATGTTCATTTCAACTGGGCCATATCT
AATTGATGTCTACAAGCCTGAGAACCTCTATCTAAGGTATGTTAAGTTTGACAAGTGGGTCAAGCCAGAG
TTTGCTGAGGACATGTACAACTTTGAGCCATACTTCGATGTTGTAGAGCTTTATGGTATCCAGAACGAGA
ACACGATAATTCTTGGTGTAGCAAGTGGAGAGTACGATGTTTCATGGTACTCATTCCCATCATTCACGTT
CTCTGGACTTAGTGATGAGCAGAAGAGCAACATTGACATGTACGTTAACATTGGTGGATTCTGGGACATG
GTCTGGAACCCAGTGCACGACAAGGATAATCCATATGTGATTACAGTTGGTGACAAGAAGTACTTCAACC
CATTCGCAATTAGAGAGATAAGATTTGCAATGGAATACCTCATCAACAGAAACTACATCATCCAGAACAT
CCTCCAGGGTTCAGGTGGACCAATGTACACTCCATGGACAAGTGGTGATACGGTTGCAATCGAGAAGCTA
CAGCCAGTTGTCGATGCCTTTGGTATCGATGCACAGGGTGACGAAGAGTATGCTCTCCAGCTAATTGAAC
AGGCGATGCAAAAGGCCGCTAGAGAGTTAGCTAACATGGGATATGAGCTCAAGAAGGTTAACGGAAAGTG
GTACTTCAACGGAGAGCCAGTTAAGATCGTTGGAATTGGAAGACAAGAAGATGAGAGAAAGGATGAGGCT
TACTACATTGCAGAAATCCTTAGAAAGGCTGGATTTGAGGTTGAAGTTAAGATAGTTGACAGAAGAACTG
CCAACCAGATAGTATACCTCTCAGACCCAGCTAACTATGAATGGGGTTATTACACTGAAGGATGGGTAGC
AAGTGGAAGCGTTCTCTTCTCAATTAGCAGAATCCTACAGTACTACACCACAGCATGGTTTGGTCCAGGA
TTCGTAGGTTGGAAGTTCACACCAGAGAACACATACAGAGCAACAGTAGAAGAAGTCCTCAAGTATCTTG
GAAATGGTGACATTCAGGCAGCTATTGACATGCTTGAACTTGAGTACTACACCACTCCAGACAAGCTTGA
ACCAATACTTGACTGGACAGCAGATGATATCGGATGGCTTATCTACACAAGCAATTACAAGAACCAGACA
CTAGACTCTGAAGCTAAGTACTGGGACCTAACTAAGATTGGTGCTGCTATTGGTATCTACGAGAGCTTCA
GAGTCTTTACAGCAGAAACCTGGGAGTTCTTCCCAGTCAACAAGAGAATTAAATTCAGAGTTATGGATCC
AGCAGTTGGTCTAGGAAACAGCATCGTTATGAAGAGCGCCTACCTTGCTGAGGCTCCAGAGACACCAACC
CAGACTGAGACTACTACCACCCAGACCACTACAACTCAAACAACCACCACAACCCCATCACCAACCCAGA
CTCAACCGACTACTACACAATCTCCAACTGAGACTGGAGGAATCTGTGGACCAGCGATACTTGTTGGTCT
CGCAGTAGTTCCACTCCTCCTGAGAAGGTTTAAGAAGTAG
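
To make the description line easier to read, here is a minimal Python sketch that splits it into the sequence ID and its `[key=value]` fields. The function name `parse_fasta_header` is my own (hypothetical), and the code assumes the location has the simple `start..end` form shown above; other records may use more complex location strings, which this sketch does not handle.

```python
import re

def parse_fasta_header(header):
    """Split an NCBI-style description line into its ID and [key=value] fields."""
    header = header.lstrip(">").strip()
    seq_id = header.split()[0]  # everything before the first space
    # Each annotation is written as [key=value]; collect them into a dict.
    fields = dict(re.findall(r"\[([^=\]]+)=([^\]]*)\]", header))
    return seq_id, fields

header = (">lcl|CP003685.1_cds_AFN02977.1_1 [locus_tag=PFC_00005] "
          "[protein=hypothetical protein] [protein_id=AFN02977.1] "
          "[location=43..2532] [gbkey=CDS]")

seq_id, fields = parse_fasta_header(header)

# Assumption: the location is a plain "start..end" range (1-based, inclusive).
start, end = (int(x) for x in fields["location"].split(".."))
cds_length = end - start + 1

print(seq_id, "|", fields["protein"], "|", fields["gbkey"], "|", cds_length)
# prints: lcl|CP003685.1_cds_AFN02977.1_1 | hypothetical protein | CDS | 2490
```

The `location` field gives a span of 2490 bases, which matches the length of the sequence above (35 lines of 70 bases plus a final line of 40).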

Intuitively, I interpret this as the sequence of the gene that encodes a hypothetical protein.

Is this interpretation completely wrong?
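
A rough way to sanity-check this interpretation would be to test whether the sequence looks like a coding sequence: length divisible by 3, starting with ATG, and ending with a stop codon. The sketch below is only a heuristic under stated assumptions: the helper names `read_fasta` and `looks_like_cds` are hypothetical, it assumes the standard stop codons TAA, TAG and TGA, and it ignores alternative start codons. The toy record is made up for illustration and is not taken from the NCBI file.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def looks_like_cds(seq):
    """Heuristic only: whole codons, ATG start, stop codon at the end."""
    return (len(seq) % 3 == 0
            and seq.startswith("ATG")
            and seq[-3:] in STOP_CODONS)

# Toy record for illustration (hypothetical data).
toy_lines = [">toy_cds [gbkey=CDS]", "ATGAGGAAA", "AAGTAG"]

for header, seq in read_fasta(toy_lines):
    print(header, len(seq), looks_like_cds(seq))
# prints: toy_cds [gbkey=CDS] 15 True
```

The sequence I pasted above starts with ATG, ends with TAG, and has 2490 bases (830 codons), so it would pass this check. Passing the check does not prove that a sequence is a gene; it is only consistent with it.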

Maybe my uncertainties would be partly resolved by studying more closely how the experiment is performed, but I am really curious to know whether this interpretation is correct before spending hours and hours studying DNA sequencing experiments.

Thank you in advance.


