Posted 2 hours ago by stuart archer

Hi all,

With many read-trimming tools (and soft-clipping aligners), I've found it hard to assess whether the trimming, filtering or clipping is justified. How do you know they are doing what you think they are doing? It's hard to tell just by eyeballing the resulting fastq files, and harder still to cross-reference them with downstream alignment results to see where the trimmed reads ultimately aligned. Even when I don't trim, the aligner soft-clips a lot of read sequence. How and where the aligner clips can itself be informative, e.g. pointing to fusions or adapter contamination.

To address this I developed Trimviz, a lightweight tool for comparing pre-trimmed and post-trimmed fastq files from any trimmer/clipper.

It runs in 2 modes:
FQ mode is for QC of fastq trimming: it compares fastq files from before and after trimming. Stats for trimmed (or filtered) reads are reported and visualized, along with a random selection of trimmed (and untrimmed) reads. In a more complex pipeline, you can also supply a .bam alignment of the trimmed fastq, to see where the trimmed reads ended up in the subsequent alignment and whether trimming really helped them align.
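To make the idea behind FQ mode concrete, here is a minimal Python sketch of the comparison. It is not Trimviz's actual code, just an illustration: it assumes uncompressed 4-line fastq records, that read IDs match between the two files, and that the trimmer only removes bases from the ends. The names `read_fastq` and `trim_summary` are made up for this example.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record fastq handle."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of file (or a trailing blank line)
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        # Read ID = text after '@' up to the first whitespace
        yield header[1:].split()[0], seq, qual

def trim_summary(raw_handle, trimmed_handle):
    """Map each raw read ID to (bases cut from 5' end, bases cut from 3' end),
    or to 'filtered' if the read is absent from the trimmed file."""
    trimmed = {rid: seq for rid, seq, _ in read_fastq(trimmed_handle)}
    summary = {}
    for rid, seq, _ in read_fastq(raw_handle):
        kept = trimmed.get(rid)
        if kept is None:
            summary[rid] = "filtered"
            continue
        start = seq.find(kept)  # first position of the kept part in the raw read
        if start < 0:
            summary[rid] = "modified"  # trimmer changed bases, not just the ends
            continue
        summary[rid] = (start, len(seq) - start - len(kept))
    return summary

# Toy data: r1 loses 10 bases at the 3' end, r2 loses 4 at the 5' end, r3 is dropped.
raw = io.StringIO(
    "@r1 extra\nACGTACGTAGATCGGAAG\n+\nIIIIIIIIIIIIIIIIII\n"
    "@r2\nTTTTGGGGCCCC\n+\nIIIIIIIIIIII\n"
    "@r3\nACGTACGT\n+\nIIIIIIII\n"
)
trimmed = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nGGGGCCCC\n+\nIIIIIIII\n"
)
summary = trim_summary(raw, trimmed)
print(summary)
```

Holding the trimmed file in a dictionary is fine for a quick look at a subsample, but for full-size fastq files you would want to sample reads first.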

SC mode is for QC of soft-clipping by aligners. Basically, you can treat BAM files from a soft-clipping aligner as a trimmed read file and quickly check whether there are recurring patterns in the clipped sequence (low quality, adapter sequence, etc.). Optionally, you can include the reference sequence to look at the reference sequence flanking the soft-clipping points; in diff mode, only the differences between the reference and the clipped sequences are highlighted, so you can see whether clipping / trimming was really justified.
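For anyone unfamiliar with how soft-clipping is recorded, the rough Python sketch below pulls the clipped ends out of a read using its CIGAR string and then marks mismatches against a reference stretch. Again, this is only an illustration and not how Trimviz is implemented; `soft_clips` and `diff_highlight` are hypothetical helper names, and the dot-for-match display is my assumption about what a diff-style view could look like.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clips(cigar, seq):
    """Return (left_clip, right_clip): the soft-clipped bases at each end of SEQ.

    SEQ is taken as stored in the SAM/BAM record, so 'left' and 'right'
    are in reference orientation, not necessarily the original read orientation.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard-clipped bases (H) are not present in SEQ, so drop those operations.
    ops = [(n, op) for n, op in ops if op != "H"]
    left = right = ""
    if ops and ops[0][1] == "S":
        left = seq[:ops[0][0]]
    if len(ops) > 1 and ops[-1][1] == "S":
        right = seq[len(seq) - ops[-1][0]:]
    return left, right

def diff_highlight(reference, clipped):
    """Show '.' where the clipped base matches the reference, else the clipped base."""
    return "".join("." if r == c else c for r, c in zip(reference, clipped))

read_seq = "AAAAA" + "C" * 20 + "GGG"
left, right = soft_clips("5S20M3S", read_seq)
print(left, right)                       # clipped sequence at each end
print(diff_highlight("AAATA", left))     # reference bases flanking the clip point
```

If the clipped bases mostly match the flanking reference, the clipping was probably unnecessary; if they look like adapter or low-complexity sequence, it was probably justified.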

I've uncovered quite a few issues with sequencing, alignment, and even my own reference assemblies using this, so I now use it as a key QC step in all my pipelines.

I'd recommend installing with Docker, although the dependencies are few (a few R libraries and seqtk). A quick-start guide, example reports and some more in-depth examples are up on GitHub.

I'm happy to help with any questions / issues / suggestions.


