Hi Everyone,
I have some FASTQ data downloaded from the Illumina Sequence Hub. The sequences from each sample were split into four .gz files (I don't know why Illumina does that). All the files together are about 18 GB. I first tried to zcat each set of four files into one, but the total size inflated significantly, from 18 GB to 78 GB:
for i in $(ls *.fastq.gz | rev | cut -c 22- | rev | uniq); do
    zcat ${i}_L001_R1_001.fastq.gz ${i}_L002_R1_001.fastq.gz \
         ${i}_L003_R1_001.fastq.gz ${i}_L004_R1_001.fastq.gz \
         > ./zcat_fastq/${i}.fastq.gz
done
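While writing this up I started to suspect that zcat writes plain uncompressed text to stdout, so the .gz extension on the output may be misleading. A quick way I could check (sample01 below is just a made-up prefix standing in for one of my samples):

# Reports 'gzip compressed data' for a real .gz file, 'ASCII text' otherwise.
file ./zcat_fastq/sample01.fastq.gz

# gzip -t tests integrity and fails if the file is not valid gzip data.
gzip -t ./zcat_fastq/sample01.fastq.gz || echo "not actually gzipped"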
I then did it the dumb way: gunzip all the files, cat them together, and gzip the result. Now the total size is about 15 GB:
gunzip *.gz

for i in $(ls *.fastq | rev | cut -c 19- | rev | uniq); do
    cat ${i}_L001_R1_001.fastq ${i}_L002_R1_001.fastq \
        ${i}_L003_R1_001.fastq ${i}_L004_R1_001.fastq \
        > ./cat_fastq/${i}.fastq
done

gzip ./cat_fastq/*.fastq
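I have also read that the gzip format allows multiple compressed members to be concatenated back to back, so maybe the .gz files could be joined with plain cat and no recompression at all? A sketch of what I mean (untested; cat_gz is just a new output folder, and the naming trick is the same as in my loops above):

mkdir -p cat_gz
# Concatenated gzip streams form a valid gzip file; zcat/gunzip
# decompress all members in sequence, so nothing gets re-encoded.
for i in $(ls *.fastq.gz | rev | cut -c 22- | rev | uniq); do
    cat ${i}_L001_R1_001.fastq.gz ${i}_L002_R1_001.fastq.gz \
        ${i}_L003_R1_001.fastq.gz ${i}_L004_R1_001.fastq.gz \
        > ./cat_gz/${i}.fastq.gz
done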
Can anyone please advise on what is going on here and which method is correct? Thank you!
Best,
Wenhan