Randomly sampling long-read data for genome assembly


I've been assembling bacterial genomes from long reads (PacBio) of evolved Bacillus subtilis strains, and depending on the program I use, I hit errors related to the hardware I have access to. For context, I'm using a SLURM-based compute node provided by my university, on which my lab rents ~5 CPUs and ~35 GB of RAM. Until now all my work has used short reads, and I've had no problems running those assemblies. With long reads, however, I either run out of memory or don't have enough threads to run a job. For example, Flye runs out of memory even with 30 GB of RAM, and with NextDenovo I don't have enough CPUs to dedicate to each of the separate tasks required to run the whole pipeline.

When I looked into Flye's memory requirements, I saw that they can be very high, but I also think my input is much larger than it needs to be for a good assembly. According to this benchmarking study, the vast majority of prokaryotic genomes could be assembled with less than 30 GB of RAM (see 'G' in figure).

Each of the .fastq files for the strains I need to assemble is 42-46 GB zipped. From what I can gather, that is far more data than I need for a good assembly, so I'm wondering whether I could randomly subsample a small fraction of the reads in each .fastq file and assemble from that. Does anyone know if this would work? If not, is there another approach I could take to finish these assemblies with the limited resources I have?
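
To make the idea concrete, here is a minimal Python sketch of the kind of random subsampling described above. It is not a tested pipeline. It assumes a genome size of roughly 4.2 Mb for B. subtilis and a target depth of about 100x (both are assumptions to adjust), and it assumes the gzipped FASTQ uses plain 4-line records. The function names are made up for illustration. It streams the file, so memory use stays small regardless of input size. Dedicated tools such as seqtk, rasusa, or filtlong may well be a better choice in practice.

```python
import gzip
import random


def fraction_for_coverage(total_bases, genome_size, target_coverage):
    """Fraction of reads to keep so the expected depth is about target_coverage."""
    wanted_bases = genome_size * target_coverage
    return min(1.0, wanted_bases / total_bases)


def count_bases(fastq_gz):
    """Sum sequence lengths in a gzipped FASTQ (assumes 4-line records)."""
    total = 0
    with gzip.open(fastq_gz, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each record is the sequence
                total += len(line.strip())
    return total


def subsample_fastq(in_gz, out_gz, fraction, seed=42):
    """Keep each read independently with probability `fraction`.

    Returns the number of reads written. The seed makes the sample repeatable.
    """
    rng = random.Random(seed)
    kept = 0
    with gzip.open(in_gz, "rt") as src, gzip.open(out_gz, "wt") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[3]:
                # end of file, or an incomplete trailing record
                break
            if rng.random() < fraction:
                dst.writelines(record)
                kept += 1
    return kept
```

With hypothetical file names, usage would look like `frac = fraction_for_coverage(count_bases("strain1.fastq.gz"), 4_200_000, 100)` followed by `subsample_fastq("strain1.fastq.gz", "strain1.sub.fastq.gz", frac)`. Because each read is kept independently, the resulting depth is only approximately the target, and the first pass over a 40+ GB file just to count bases will take a while.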




