Illumina read filtering - HPC guide
In this tutorial we will process raw Illumina reads for downstream mapping/assembly. Download and extract the following read 1 and read 2 data files.
This is paired end data. Reads are 150 bp long from a size selection of ~270-400 bp (tapestation result).
Create a project directory, sub-folders and import data
‘mkdir project’
‘cd project’
‘mkdir 00-raw 01-fastqc 02-merged_reads 03-remove_contaminants 04-remove_short_reads 05-demultiplex’
Move the raw Illumina reads (.fastq format) to the 00-raw directory within the new project
I use the FTP client FileZilla to transfer between PC/HPC.
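If you prefer the command line to FileZilla, the transfer and the move into 00-raw can be scripted. A minimal sketch (the host name and the stand-in file names below are placeholders, not part of this tutorial's data):

```shell
# Push raw reads from your PC to the HPC (placeholder host/path):
#   scp *_001.fastq user@hpc.example.edu:~/project/00-raw/
# On the HPC, moving files that landed in your working directory into
# 00-raw looks like this (stand-in files created here for illustration):
mkdir -p project/00-raw
touch IDX01_S1_L001_R1_001.fastq IDX01_S1_L001_R2_001.fastq
mv ./*_001.fastq project/00-raw/
ls -1 project/00-raw
```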
1. Quality checking using FastQC
Navigate to the 01-fastqc folder
‘cd 01-fastqc’
This script will run FastQC. The email notification is optional; remove the last two #SBATCH lines if you don't want it.
‘nano run_fastqc’
Paste in the following text:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G
#SBATCH --time=00-02
#SBATCH --mail-user=name@place.com
#SBATCH --mail-type=ALL
module load fastqc
fastqc ../00-raw/*fastq -o ./
‘Ctrl x’ to exit and ‘y’ to save
Now we can submit the job by typing ‘sbatch run_fastqc’
Once fastqc is complete type ‘ls -1’ and you will see a .html and a .zip file for each raw file
[user@hpc 01-fastqc]$ ls -1
IDX01_S1_L001_R1_001_fastqc.html
IDX01_S1_L001_R1_001_fastqc.zip
Transfer the .html file to your PC and open it in any internet browser
The quality of this data is good, although it is typical of ‘low diversity’ RADseq data
2. Merge overlapping reads
Now we can merge our read 1 and read 2 read files using PEAR. Once this is completed, my standard approach is to analyse the merged reads and non-merged read 1 reads (i.e. discard only non-merged read 2 reads).
Navigate to the merged reads folder ‘cd ../02-merged_reads/’.
We will set up a script that will merge our reads into 160-260 bp long merged reads and require a minimum of 20 matching bases. This can easily be changed by altering the ‘min_overlap=20’, ‘min_length=160’ and ‘max_length=260’ fields in the below script.
In this script we will create a directory called ‘IDX01_merged’ within the ‘02-merged_reads’ folder. We will also combine our merged reads with the non-merged read 1 reads using the ‘cat’ command. The script also runs a fastqc report on all of our outputs.
‘nano run_pear’
Paste in the following text:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=00-08
echo 'Starting at:' `date`
min_overlap=20
min_length=160
max_length=260
module load pear-gcc/0.9.1 fastqc
mkdir ./IDX01_merged
pear -f ../00-raw/IDX01_S1_L001_R1_001.fastq \
-r ../00-raw/IDX01_S1_L001_R2_001.fastq \
-o ./IDX01_merged/IDX01_merged -v $min_overlap -n $min_length -m $max_length \
-b 33 -j 4 -y 8G
cat ./IDX01_merged/IDX01_merged.assembled.fastq ./IDX01_merged/IDX01_merged.unassembled.forward.fastq > ./IDX01_merged/IDX01_merged_non-mergedR1.fastq
fastqc ./IDX01_merged/*fastq -o ./IDX01_merged/
echo 'Finished at:' `date`
‘Ctrl x’ to exit and ‘y’ to save
Now we can submit the job by typing ‘sbatch run_pear’
Navigate into the output directory ‘cd ./IDX01_merged/’. We now have five .fastq files:
Our concatenated file containing all merged reads and non-merged read 1 reads
IDX01_merged_non-mergedR1.fastq
A file containing only the successfully merged reads
IDX01_merged.assembled.fastq
Two files containing non-merged read 1 and 2 reads
IDX01_merged.unassembled.forward.fastq
IDX01_merged.unassembled.reverse.fastq
A file containing reads discarded by PEAR’s internal filters (note: the ‘-b 33’ flag in the script sets the Phred+33 quality score encoding, not a quality cut-off)
IDX01_merged.discarded.fastq
The slurm file acts as a log file when running PEAR. We can look at it by typing ‘more slurm-*.out’. The important stats to look at are the number of assembled (merged) and not assembled (non-merged) reads.
Assembled reads ...................: 153,858 / 269,092 (57.177%)
Discarded reads ...................: 672 / 269,092 (0.250%)
Not assembled reads ...............: 114,562 / 269,092 (42.574%)
Finally, transfer the fastqc .html files to your PC and open them in any internet browser. The one we are most interested in is the report for our ‘IDX01_merged_non-mergedR1.fastq’ file. We can see that the size distribution makes sense and that the quality is great up until the final few bases (we can trim these later).
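A quick way to confirm the ‘cat’ step worked is to count reads: a FASTQ record is exactly four lines, so reads = lines / 4, and the concatenated file should equal the assembled count plus the unassembled read 1 count. A sketch using tiny mock files (on the HPC, point the commands at the real files in ./IDX01_merged/ instead):

```shell
# Mock FASTQ files standing in for the PEAR outputs:
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' > assembled.fastq
printf '@r3\nACGT\n+\nIIII\n' > unassembled_R1.fastq
cat assembled.fastq unassembled_R1.fastq > combined.fastq

# A FASTQ record is exactly 4 lines, so reads = lines / 4
reads() { echo $(( $(wc -l < "$1") / 4 )); }

expected=$(( $(reads assembled.fastq) + $(reads unassembled_R1.fastq) ))
echo "combined: $(reads combined.fastq) reads, expected: $expected"
```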
3. Remove contaminants - i.e. bacteria, archaea, and virus reads
Now we will filter out microbial contamination using Kraken. This requires a database to be installed on your HPC; contact your administrator for assistance with installing software and databases. We are using Kraken’s ‘standard database’. This database is large and requires a lot of memory, so in the script below we will select a partition with access to enough memory and allocate more RAM and CPUs.
Navigate to the filtering folder ‘cd ../../03-remove_contaminants’.
‘nano release_kraken’
Paste in the following text:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --partition bigmem
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=00-10
#SBATCH --mail-user=name@place.com
#SBATCH --mail-type=ALL
echo 'Starting at:' `date`
module load Kraken
database=standard
standard_db=/data/projects/KRAKEN/standarddb
for i in IDX01_merged_non-mergedR1
do
kraken ../02-merged_reads/IDX01_merged/"$i".fastq \
--unclassified-out "$i"_unc.fq \
--classified-out "$i"_classified_$database.fq \
--fastq-input --output "$i"_output_$database \
--db $standard_db/ --threads 16
done
echo 'Finished at:' `date`
You will need to change the ‘standard_db=’ path to your Kraken database directory.
‘Ctrl x’ to exit and ‘y’ to save
Now we can submit the job by typing ‘sbatch release_kraken’
The resulting files are labelled as follows:
Unclassified reads - used for downstream analyses
IDX01_merged_non-mergedR1_unc.fq
Classified reads - discarded from downstream analyses
IDX01_merged_non-mergedR1_classified_standard.fq
Log file
slurm-*.out
The slurm file acts as a log file when running Kraken. We can look at it by typing ‘more slurm-*.out’
Starting at: Wed Jan 16 14:05:00 AEDT 2019
268420 sequences (43.08 Mbp) processed in 4.745s (3394.2 Kseq/m, 544.79 Mbp/m).
3433 sequences classified (1.28%)
264987 sequences unclassified (98.72%)
Finished at: Wed Jan 16 14:05:08 AEDT 2019
These stats show us that we have retained most of our reads: <2% of reads per file matched the microbial database.
4. Remove short reads - optional
This data was obtained from a 150 base pair sequencing run, yet we can see there are some reads as short as 30 bp. You may wish to keep these reads; if not, the following script will remove them. Looking at the quality and sequence length distribution graphs from our merged file (at the end of step 2), we can see that sequence quality deteriorates after 220 bp, but luckily there are very few reads over 220 bp in length. We will remove those here too, and run a fastqc report.
Navigate to the trimming directory ‘cd ../04-remove_short_reads/’
‘nano remove_short_reads’
Note: the only parts of this script you will need to change for future use are the file name(s) and/or the ‘min_read_length=80’ and ‘max_read_length=220’ values.
Paste in the following text:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00-01
module load fastqc
min_read_length=80
max_read_length=220
for i in IDX01_merged_non-mergedR1_unc
do
awk -v min="$min_read_length" -v max="$max_read_length" 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= min && length(seq) <= max) {print header, seq, qheader, qseq}}' < ../03-remove_contaminants/"$i".fq > ./"$i"_long.fq
done
fastqc ./*fq -o ./
‘Ctrl x’ to exit and ‘y’ to save
Now we can submit the job by typing ‘sbatch remove_short_reads’
We can check that the script worked by typing ‘ll -Sh’ (or ‘ls -lSh’ if the ‘ll’ alias is not available). We are expecting the long read files to be present and to contain data (check the file sizes - see below).
-rw-r--r-- 1 93M Jan 16 15:04 IDX01_merged_non-mergedR1_unc_long.fq
-rw-r--r-- 1 421K Jan 16 15:04 IDX01_merged_non-mergedR1_unc_long_fastqc.zip
-rw-r--r-- 1 356K Jan 16 15:04 IDX01_merged_non-mergedR1_unc_long_fastqc.html
-rw-r--r-- 1 1.3K Jan 16 15:04 slurm-7242520.out
Transfer the ‘IDX01_merged_non-mergedR1_unc_long_fastqc.html’ file to your PC and open it in any internet browser. You can see that the read length distribution has changed and we have also removed the longest reads with quality issues.
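To confirm the awk filter kept only reads in the 80-220 bp window, you can scan the sequence lines (the 2nd line of every 4-line FASTQ record) for the minimum and maximum lengths. A sketch using a mock two-read file (on the HPC, run the awk command on ‘IDX01_merged_non-mergedR1_unc_long.fq’ instead):

```shell
# Build a mock FASTQ with one 80 bp and one 220 bp read:
printf '@a\n%s\n+\n%s\n' "$(printf 'A%.0s' $(seq 80))" "$(printf 'I%.0s' $(seq 80))" > demo.fq
printf '@b\n%s\n+\n%s\n' "$(printf 'A%.0s' $(seq 220))" "$(printf 'I%.0s' $(seq 220))" >> demo.fq

# NR % 4 == 2 selects the sequence line of each 4-line FASTQ record
range=$(awk 'NR % 4 == 2 {
               l = length($0)
               if (min == "" || l < min) min = l
               if (l > max) max = l
             }
             END { print "min=" min, "max=" max }' demo.fq)
echo "$range"
```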
5. Demultiplex - based on barcodes
Now we can use Stacks to allocate our filtered reads to individuals based on the barcoded adapters added during library preparation. I use the adapters from Peterson et al. (2012).
Navigate to the demultiplexing directory ‘cd ../05-demultiplex/’
You will need to add a file containing the barcodes used in your project to the ‘05-demultiplex’ directory. I have only used eight for this example data set, but I have provided a file with the complete list of barcodes here. You can use the complete set, or create a file containing a subset.
Note: process_radtags will name your files according to their sample name if you add a second column to the barcode file. Example:
aagga site1_sample1
ggttg site1_sample2
caacc site2_sample1
…
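It is worth sanity-checking the barcode file before running process_radtags: all barcodes should be the same length (for a uniform-length set like this one) and none should be duplicated. A sketch using a mock ‘barcodes’ file built from the example rows above:

```shell
# Mock barcode file matching the two-column format shown above:
printf 'aagga\tsite1_sample1\nggttg\tsite1_sample2\ncaacc\tsite2_sample1\n' > barcodes

# How many distinct barcode lengths? (expect 1 for a uniform set)
lengths=$(awk '{ print length($1) }' barcodes | sort -u | wc -l)
# Is any barcode listed twice? (expect none)
dups=$(cut -f1 barcodes | sort | uniq -d | wc -l)
echo "distinct lengths: $lengths, duplicated barcodes: $dups"
```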
‘nano run_process_radtags’
Paste in the following text:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G
#SBATCH --time=00-08
echo 'Starting at:' `date`
module load stacks-gcc/1.39
barcode_file=./barcodes
enzyme=ecoRI
for i in IDX01
do
mkdir "$i"_demultiplexed
process_radtags -f ../04-remove_short_reads/"$i"_merged_non-mergedR1_unc_long.fq \
-o ./"$i"_demultiplexed -q -i fastq \
-b $barcode_file -e $enzyme
done
echo 'Finished at:' `date`
‘Ctrl x’ to exit and ‘y’ to save
Now we can submit the job by typing ‘sbatch run_process_radtags’
Navigate to the output directory by typing ‘cd ./IDX01_demultiplexed’. We can check which barcodes/files had reads associated with them by typing ‘ll -S’. I have pasted the top thirteen lines/files below; note the file sizes. I used the top eight barcodes, so I will rename those files later.
-rw-r--r-- 1 15207034 Jan 16 15:46 sample_TCGAT.fq
-rw-r--r-- 1 11155608 Jan 16 15:46 sample_CAACC.fq
-rw-r--r-- 1 10657537 Jan 16 15:46 sample_ATTAC.fq
-rw-r--r-- 1 10079982 Jan 16 15:46 sample_GCTGA.fq
-rw-r--r-- 1 8905090 Jan 16 15:46 sample_CGATC.fq
-rw-r--r-- 1 8699635 Jan 16 15:46 sample_CGGTA.fq
-rw-r--r-- 1 7850974 Jan 16 15:46 sample_TGCAT.fq
-rw-r--r-- 1 7424000 Jan 16 15:46 sample_GCCGT.fq
-rw-r--r-- 1 11922 Jan 16 15:46 sample_GAGAT.fq
-rw-r--r-- 1 5374 Jan 16 15:46 process_radtags.log
-rw-r--r-- 1 287 Jan 16 15:46 sample_TACGT.fq
-rw-r--r-- 1 239 Jan 16 15:46 sample_CTGAT.fq
-rw-r--r-- 1 0 Jan 16 15:46 sample_AACCA.fq
We can check the number of reads retained per barcode by typing ‘more process_radtags.log’.
Barcode Total No RadTag Low Quality Retained
GCATG 0 0 0 0
AACCA 78 78 0 0
CGATC 25099 663 0 24436
TCGAT 45237 435 0 44802
TGCAT 24085 276 0 23809
CAACC 33523 224 1 33298
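The per-barcode table can be tallied with awk to get an overall retention figure. A sketch using an excerpt mirroring the rows above (on the HPC, run the awk command on process_radtags.log itself, skipping its header lines as needed):

```shell
# Excerpt of the per-barcode table (Barcode, Total, No RadTag, Low Quality, Retained):
cat > log_excerpt <<'EOF'
CGATC 25099 663 0 24436
TCGAT 45237 435 0 44802
TGCAT 24085 276 0 23809
CAACC 33523 224 1 33298
EOF

# Sum column 2 (total reads) and column 5 (retained reads):
summary=$(awk '{ total += $2; retained += $5 }
               END { printf "retained %d of %d reads\n", retained, total }' log_excerpt)
echo "$summary"
```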