Data processing
Setup
-
Login to the HPC via VS Code
-
Change directory
cd /storage/groups/shared/chromatin
- Create your folder (this we did in the morning)
mkdir user.name
cd user.name
- Start an interactive session (replace
cpusrv22
with anything bettween 18-28)
srun -p cpu_p -w cpusrv22 --qos cpu_long --reservation=chromatin_summer_school -c 8 --mem=32G --pty bash
- clone repository
git clone https://ascgitlab.helmholtz-munich.de/chromatin_summer_school_2024/workshops/data_processing.git
cd data_processing
- activate environment
mamba activate env_summerschool2024_processing
FASTQ Files
FASTQ is a text based file format for storing nucleotide sequences. Each sequence has 4 lines:
- sequence identifier
- nucleotide sequence
- "+" character
- base calling quality score
FASTQ files are usually gzip
compressed. We can look into the file using the zcat
command
zcat < fastq/22L007826_R1.fastq.gz | head -n 8
Since the data are paired-end there are 2 files, R1
and R2
. Note that the sequence identifiers match.
zcat < fastq/22L007826_R2.fastq.gz | head -n 8
Count the number of lines in the FASTQ files. R1
and R2
have the same number of lines.
zcat < fastq/22L007826_R1.fastq.gz | wc -l
zcat < fastq/22L007826_R2.fastq.gz | wc -l
Reformat the lines using the paste
command.
zcat < fastq/22L007826_R1.fastq.gz | paste - - - - | head
Questions
- What does the
- - - -
do in thepaste
command? - What does
@A00623
mean in the header? - What is the sequence
TCCAACGC+NAGTCCAA
in the header?
Exercises
- Count the number reads
- Count the number of
TACAACGC
sequences in the header - Which sequence occurs
611
times in the header?
(Hint: use commands e.g. grep
, sed
, sort
,uniq -c
)
Reading
FastQC Report
FastQC is a quality control tool that provides basic information about raw FASTQ files, such as base calling quality scores, sequence duplication levels and adapter content.
Run FastQC on on the example files
fastqc fastq/22L007826_R1.fastq.gz -o fastqc
fastqc fastq/22L007826_R2.fastq.gz -o fastqc
Open HTML
output as Preview in VS Code.
Basic Statistics
- file size
- number of reads
Per base sequence quality
- Phred quality score (Q): -10*log10 base-calling error probabilities
- Q = 30 indicates 99.9% accuracy (1 error in 1000)
Sequence Duplication Levels
- Duplicates: reads with the same sequence
- PCR: amplified from the same DNA fragment
- biological: orginated from different DNA molecules
- optical: sensor artifacts of the sequencer
Adapter Content
- Type of adapters:
- Illumina
- Nextera
- small RNA
Questions
- Is the base calling quality ok?
- What is the percentage of unique reads?
- Why is the adapter content higher towards the end?
- What assay could this be?
Reading
Adater Trimming
Trim Galore
Trim Galore is a wrapper script around Cutadapt to automatically detect and remove commonly used adapter sequences.
Parameters
-
-j
number of cores -
--quality
Phred score. Low quality read ends are removed -
--paired
paired end data -
2> logs/log_trimgalore.txt
console output is saved in logs
trim_galore -j 8 --quality 28 --paired -o trimmed fastq/22L007826_R1.fastq.gz fastq/22L007826_R2.fastq.gz 2> logs/log_trimgalore.txt
Open TXT
report from trimmed/22L007826_R1.fastq.gz_trimming_report.txt
Open TXT
log file from logs/log_trimgalore.txt
Run FastQC after trimming
fastqc trimmed/22L007826_R1_val_1.fq.gz -o fastqc
fastqc trimmed/22L007826_R2_val_2.fq.gz -o fastqc
Open HTML
output as Preview in VS Code.
Questions
- Did the adapter removal work?
- How does the sequence length distribution change after trimming?
- What is the disadvantage of changed read length?
- Why did the GC content change?
- Which QC parameters did no change (2 examples)?
Readings
FASTA File
FASTA is another text based file format for storing nucleotide sequences. Each sequence has a header which starts with the ">" symbol
Adapter sequences in FASTA format
cat fasta/adapters.fa
>Illumina
AGATCGGAAGAGC
>Small RNA
TGGAATTCTCGG
>Nextera
CTGTCTCTTATATCCGAGCCCACGAGAC
Genome FASTA file
The genome FASTA file is organized by chromosomes. The header indicates the chromosome name. Usually each chromosome starts with a large block of 'N' nucleotides.
Check the beginning of the example file
head -n 20 fasta/hg38_sub.fa
Search for the first line with a 'T'
cat fasta/hg38_sub.fa | grep "T" | head -n 1
Questions
- From which database might the example file come from?
- What are the sequences with lowercase letters?
Exercises
- How many chromosomes are present in the example file?
- How many characters are there in a sequence line?
- How many actual nucleotides are there in a sequence line?
- Count the total number of nucleotides in the example file (including N,n)
- Count the number of nucleotides in the example file (including G,A,T,C,g,a,t,c)
- Count the number of N/n nucleotides
Reading
Bowtie2 Alignment
Bowtie2 is a general-purpose short-read sequence aligner for mapping reads to the reference genome. The reference genome has to be indexed for bowtie2. Indexing makes the alignment memory efficient.
Prepare the example FASTA file
bowtie2-build --threads 8 fasta/hg38_sub.fa bowtie2_index/hg38_sub
Alignment using default settings. Arguments:
-
-x
path and base name of the index files -
--threads
number of parallel threads -
-1
and-2
read mate 1 and 2 trimmedFASTQ
files, respectively -
2<
save console output in logs -
| samtools view -Sbh
pipe output into samtools and save it in aBAM
file
bowtie2 -x bowtie2_index/hg38_sub --threads 8 -1 trimmed/22L007826_R1_val_1.fq.gz -2 trimmed/22L007826_R2_val_2.fq.gz 2> logs/log_bowtie2_default.txt | samtools view -Sbh -o bam/22L007826.default.bam
Alignment with additional parameters.
-
--end-to-end
the full read aligns without trimming (default) -
--very-sensitive
preset option for more sensitivity and accuracy -
--no-unal
unaligned reads are excluded -
--no-mixed
alignments for individual read mates are excluded -
--no-discordant
alignments that are not considered concordant pairs are excluded -
--dovetail
dovetailing allowed (read extend beyond the end of the other read) -
-I
minimum fragment length (default: 0) -
-X
maximum fragment length (default: 500)
params="--end-to-end --very-sensitive --no-unal --no-mixed --no-discordant --dovetail -I 10 -X 700"
bowtie2 -x bowtie2_index/hg38_sub --threads 8 $params -1 trimmed/22L007826_R1_val_1.fq.gz -2 trimmed/22L007826_R2_val_2.fq.gz 2> logs/log_bowtie2_endtoend.txt | samtools view -Sbh -o bam/22L007826.endtoend.bam
Local alignment.
-
--local
alignments are trimmed at the ends -
--very-sensitive-local
preset option for more sensitivity and accuracy
params="--local --very-sensitive-local --no-unal --no-mixed --no-discordant --dovetail -I 10 -X 700"
bowtie2 -x bowtie2_index/hg38_sub --threads 8 $params -1 trimmed/22L007826_R1_val_1.fq.gz -2 trimmed/22L007826_R2_val_2.fq.gz 2> logs/log_bowtie2_local.txt | samtools view -Sbh -o bam/22L007826.local.bam
Investigate the bowtie2
log files
logs/log_bowtie2_default.txt
logs/log_bowtie2_endtoend.txt
logs/log_bowtie2_local.txt
Questions
- Typical alignment rates are above 80%. Why is the overall alignment rate so low here?
- Which setting gives the most uniquely aligned reads?
- Which setting results in the highest overall alignment rate?
- Which reads aligned more likely in
local
mode compared to the others?
Exercises
-
samtools flagstat
can be used for further inspection of the alignments. - How many singleton reads are there in the default alignment file?
- How many reads have a mate mapped to a different chromosome (default settings)?
Reading
SAM/BAM files
SAM is a text based file format for storing alignments to the reference genome. BAM is a binary, comparessed version of SAM files.
Columns of SAM files:
- read name
- flag
- reference sequence name (chromosome)
- leftmost genomic coordincate
- MAPQ: mapping quality
- CIGAR: Concise Idiosyncratic Gapped Alignment Report
- reference name of next read
- position of next read
- TLEN: template length
- read sequence
- base call quality
- additional fields:
- AS: Alignment Score
- XS: alignment score score for next best alignment
- NM: edit distance
Print first 20 lines of the BAM
files
samtools view bam/22L007826.default.bam | head -n 20
samtools view bam/22L007826.endtoend.bam | head -n 20
samtools view bam/22L007826.local.bam | head -n 20
Print first 20 fragment length values (9th column)
samtools view bam/22L007826.endtoend.bam | cut -f 9 | head -n 20
Exercises
- What is the largest fragment length in the
default
bam file? - What is the flag of the alignment with largest fragment length in the
default
bam file? What does it mean? - What is the largest fragment length in the
endtoend
bam file?
Readings
Sort BAM files
It is recommeded to sort BAM
files as most of downstream processing tools require sorted alignments.
Sorting modes:
- genomic coordinates
- read names
Sort alignments using samtools
samtools sort --threads 8 -o bam/22L007826.default.sorted.bam bam/22L007826.default.bam
samtools sort --threads 8 -o bam/22L007826.endtoend.sorted.bam bam/22L007826.endtoend.bam
samtools sort --threads 8 -o bam/22L007826.local.sorted.bam bam/22L007826.local.bam
Print first 10 lines of the sorted BAM
files
samtools view bam/22L007826.default.sorted.bam | head -n 10
samtools view bam/22L007826.endtoend.sorted.bam | head -n 10
samtools view bam/22L007826.local.sorted.bam | head -n 10
Questions
- What ist the default mode of sorting by
samtools
?
Exercises
- What is the name of the first read when the
endtoend.bam
file is sorted by name?
Readings
Filter BAM files
Alignments are filtered by mapping quality (MAPQ
). MAPQ
is defined as the −10*log10 probability that the mapping position is wrong.
MAPQ
depends on the number of mismatches (MM
) and the base calling quality (Q
)
MAPQ >= X #MM Q40 #MM Q20 #MM Q0 Description
0 5 7 15 All mappable reads
1 3 5 10 True multi w/ "good" AS, maxi of MAPQ >= 1
2 3 5 10 No true multi, maxi of MAPQ >= 2
3 3 5 10 No true multi, maxi of MAPQ >= 3
8 2 4 8 No true multi, maxi of MAPQ >= 8
23 2 3 7 No true multi, maxi of MAPQ >= 23
30 1 2 4 No true multi, maxi of MAPQ >= 30
39 1 2 4 No true multi, maxi of MAPQ == 39*
40 1 2 4 No true multi, only true uni-reads
42 0 1 2 Only "perfect" true unireads
Table source: see Readings MAPQ explained.
Filter BAM
files using samtools view -q
samtools view -q 12 --threads 8 -o bam/22L007826.default.filtered.bam bam/22L007826.default.sorted.bam
samtools view -q 12 --threads 8 -o bam/22L007826.endtoend.filtered.bam bam/22L007826.endtoend.sorted.bam
samtools view -q 12 --threads 8 -o bam/22L007826.local.filtered.bam bam/22L007826.local.sorted.bam
Exercises
- How many reads are there before and after filtering in the
endtoend
files?
Readings
Remove duplicates
Duplicate reads can be marked and removed using Picard
.
Parameters:
-
REMOVE_DUPLICATES
if TRUE duplicates are removed, otherwise only marked
picard MarkDuplicates I=bam/22L007826.endtoend.filtered.bam O=bam/22L007826.endtoend.markdup.bam REMOVE_DUPLICATES=FALSE METRICS_FILE=logs/log_picard_markdup.txt
picard MarkDuplicates I=bam/22L007826.endtoend.filtered.bam O=bam/22L007826.endtoend.rmdup.bam REMOVE_DUPLICATES=TRUE METRICS_FILE=logs/log_picard_rmdup.txt
Inspect the metrics files:
logs/log_picard_markdup.txt
logs/log_picard_rmdup.txt
Exercises
- What is the percentage of duplicated reads?
- How many read
pairs
are there after removing duplicates in theendtoend
files? - What is the percentage of reads after all the step relative to the raw data?
Reading
Visualize Alignments in IGV
Several downstream tool require that the BAM
files are indexed. This enables fast retrieval of partical regions.
Index BAM
files using samtools index
samtools index bam/22L007826.endtoend.sorted.bam
samtools index bam/22L007826.endtoend.filtered.bam
samtools index bam/22L007826.endtoend.rmdup.bam
The index (BAI
) files are automatically named and generated in the same folder.
ls -l bam/*.bai
IGV instructions
- Download the
BAM
andBAI
to a local directory. - Load the
BAM
files into IGV browser. (use File / Load from File) - Zoom in to the region:
chr20:54,211,175-54,212,897
- Right lick on the file name at the left panel. Set View as pairs.
Questions
- What is the MAPQ and the edit distance of the white colored read (
chr20:54,211,915-54,211,987
) in thesorted.bam
file? - How many duplicates are there for the read
chr20:54,212,255-54,212,346
? - What is the insert size of the read
chr20:54,213,010-54,213,109
?