Skip to content
Snippets Groups Projects
tamas.schauer's avatar
tamas.schauer authored
56373a67

Data processing

Setup

  • Login to the HPC via VS Code

  • Change directory

cd /storage/groups/shared/chromatin
  • Create your folder (this we did in the morning)
mkdir user.name
cd user.name
  • Start an interactive session (replace cpusrv22 with anything bettween 18-28)
srun -p cpu_p -w cpusrv22 --qos cpu_long --reservation=chromatin_summer_school -c 8 --mem=32G --pty bash
  • clone repository
git clone https://ascgitlab.helmholtz-munich.de/chromatin_summer_school_2024/workshops/data_processing.git

cd data_processing
  • activate environment
mamba activate env_summerschool2024_processing

 

FASTQ Files

FASTQ is a text based file format for storing nucleotide sequences. Each sequence has 4 lines:

  • sequence identifier
  • nucleotide sequence
  • "+" character
  • base calling quality score

FASTQ files are usually gzip compressed. We can look into the file using the zcatcommand

zcat < fastq/22L007826_R1.fastq.gz | head -n 8

Since the data are paired-end there are 2 files, R1 and R2. Note that the sequence identifiers match.

zcat < fastq/22L007826_R2.fastq.gz | head -n 8

Count the number of lines in the FASTQ files. R1 and R2 have the same number of lines.

zcat < fastq/22L007826_R1.fastq.gz | wc -l
zcat < fastq/22L007826_R2.fastq.gz | wc -l

Reformat the lines using the paste command.

zcat < fastq/22L007826_R1.fastq.gz | paste - - - - | head 

Questions

  • What does the - - - - do in the paste command?
  • What does @A00623 mean in the header?
  • What is the sequence TCCAACGC+NAGTCCAA in the header?

Exercises

  • Count the number reads
  • Count the number of TACAACGC sequences in the header
  • Which sequence occurs 611 times in the header?

(Hint: use commands e.g. grep, sed, sort,uniq -c)

Reading

 

FastQC Report

FastQC is a quality control tool that provides basic information about raw FASTQ files, such as base calling quality scores, sequence duplication levels and adapter content.

Run FastQC on on the example files

fastqc fastq/22L007826_R1.fastq.gz -o fastqc

fastqc fastq/22L007826_R2.fastq.gz -o fastqc

Open HTML output as Preview in VS Code.

Basic Statistics

  • file size
  • number of reads

Per base sequence quality

  • Phred quality score (Q): -10*log10 base-calling error probabilities
  • Q = 30 indicates 99.9% accuracy (1 error in 1000)

Sequence Duplication Levels

  • Duplicates: reads with the same sequence
    • PCR: amplified from the same DNA fragment
    • biological: orginated from different DNA molecules
    • optical: sensor artifacts of the sequencer

Adapter Content

  • Type of adapters:
    • Illumina
    • Nextera
    • small RNA

Questions

  • Is the base calling quality ok?
  • What is the percentage of unique reads?
  • Why is the adapter content higher towards the end?
  • What assay could this be?

Reading

FastQC Documentation

 

Adater Trimming

Trim Galore

Trim Galore is a wrapper script around Cutadapt to automatically detect and remove commonly used adapter sequences.

Parameters

  • -j number of cores
  • --quality Phred score. Low quality read ends are removed
  • --paired paired end data
  • 2> logs/log_trimgalore.txt console output is saved in logs

 

trim_galore -j 8 --quality 28 --paired  -o trimmed fastq/22L007826_R1.fastq.gz fastq/22L007826_R2.fastq.gz 2> logs/log_trimgalore.txt

Open TXT report from trimmed/22L007826_R1.fastq.gz_trimming_report.txt Open TXT log file from logs/log_trimgalore.txt

Run FastQC after trimming

fastqc trimmed/22L007826_R1_val_1.fq.gz -o fastqc

fastqc trimmed/22L007826_R2_val_2.fq.gz -o fastqc

Open HTML output as Preview in VS Code.

Questions

  • Did the adapter removal work?
  • How does the sequence length distribution change after trimming?
  • What is the disadvantage of changed read length?
  • Why did the GC content change?
  • Which QC parameters did no change (2 examples)?

 

Readings

Trim Galore User Guide

Cutadapt Documentation

 

FASTA File

FASTA is another text based file format for storing nucleotide sequences. Each sequence has a header which starts with the ">" symbol

Adapter sequences in FASTA format

cat fasta/adapters.fa
>Illumina
AGATCGGAAGAGC
>Small RNA
TGGAATTCTCGG
>Nextera
CTGTCTCTTATATCCGAGCCCACGAGAC

Genome FASTA file

The genome FASTA file is organized by chromosomes. The header indicates the chromosome name. Usually each chromosome starts with a large block of 'N' nucleotides.

Check the beginning of the example file

head -n 20 fasta/hg38_sub.fa

Search for the first line with a 'T'

cat fasta/hg38_sub.fa | grep "T" | head -n 1

Questions

  • From which database might the example file come from?
  • What are the sequences with lowercase letters?

Exercises

  • How many chromosomes are present in the example file?
  • How many characters are there in a sequence line?
  • How many actual nucleotides are there in a sequence line?
  • Count the total number of nucleotides in the example file (including N,n)
  • Count the number of nucleotides in the example file (including G,A,T,C,g,a,t,c)
  • Count the number of N/n nucleotides

Reading

UCSC FASTA Ensembl FASTA

 

Bowtie2 Alignment

Bowtie2 is a general-purpose short-read sequence aligner for mapping reads to the reference genome. The reference genome has to be indexed for bowtie2. Indexing makes the alignment memory efficient.

Prepare the example FASTA file

bowtie2-build --threads 8 fasta/hg38_sub.fa bowtie2_index/hg38_sub

 

Alignment using default settings. Arguments:

  • -x path and base name of the index files
  • --threads number of parallel threads
  • -1 and -2 read mate 1 and 2 trimmed FASTQ files, respectively
  • 2< save console output in logs
  • | samtools view -Sbh pipe output into samtools and save it in a BAM file
bowtie2 -x bowtie2_index/hg38_sub --threads 8 -1 trimmed/22L007826_R1_val_1.fq.gz -2 trimmed/22L007826_R2_val_2.fq.gz 2> logs/log_bowtie2_default.txt | samtools view -Sbh -o bam/22L007826.default.bam

 

Alignment with additional parameters.

  • --end-to-end the full read aligns without trimming (default)
  • --very-sensitive preset option for more sensitivity and accuracy
  • --no-unal unaligned reads are excluded
  • --no-mixed alignments for individual read mates are excluded
  • --no-discordant alignments that are not considered concordant pairs are excluded
  • --dovetail dovetailing allowed (read extend beyond the end of the other read)
  • -I minimum fragment length (default: 0)
  • -X maximum fragment length (default: 500)
params="--end-to-end --very-sensitive --no-unal --no-mixed --no-discordant --dovetail -I 10 -X 700"

bowtie2 -x bowtie2_index/hg38_sub --threads 8 $params -1 trimmed/22L007826_R1_val_1.fq.gz -2 trimmed/22L007826_R2_val_2.fq.gz 2> logs/log_bowtie2_endtoend.txt | samtools view -Sbh -o bam/22L007826.endtoend.bam

 

Local alignment.

  • --local alignments are trimmed at the ends
  • --very-sensitive-local preset option for more sensitivity and accuracy
params="--local --very-sensitive-local --no-unal --no-mixed --no-discordant --dovetail -I 10 -X 700"

bowtie2 -x bowtie2_index/hg38_sub --threads 8 $params -1 trimmed/22L007826_R1_val_1.fq.gz -2 trimmed/22L007826_R2_val_2.fq.gz 2> logs/log_bowtie2_local.txt | samtools view -Sbh -o bam/22L007826.local.bam

 

Investigate the bowtie2log files

logs/log_bowtie2_default.txt
logs/log_bowtie2_endtoend.txt
logs/log_bowtie2_local.txt

 

Questions

  • Typical alignment rates are above 80%. Why is the overall alignment rate so low here?
  • Which setting gives the most uniquely aligned reads?
  • Which setting results in the highest overall alignment rate?
  • Which reads aligned more likely in local mode compared to the others?

Exercises

  • samtools flagstat can be used for further inspection of the alignments.
  • How many singleton reads are there in the default alignment file?
  • How many reads have a mate mapped to a different chromosome (default settings)?

 

Reading

Bowtie2 Manual

Samtools flagstat

 

SAM/BAM files

SAM is a text based file format for storing alignments to the reference genome. BAM is a binary, comparessed version of SAM files.

Columns of SAM files:

  • read name
  • flag
  • reference sequence name (chromosome)
  • leftmost genomic coordincate
  • MAPQ: mapping quality
  • CIGAR: Concise Idiosyncratic Gapped Alignment Report
  • reference name of next read
  • position of next read
  • TLEN: template length
  • read sequence
  • base call quality
  • additional fields:
    • AS: Alignment Score
    • XS: alignment score score for next best alignment
    • NM: edit distance

Print first 20 lines of the BAMfiles

samtools view bam/22L007826.default.bam | head -n 20

samtools view bam/22L007826.endtoend.bam | head -n 20

samtools view bam/22L007826.local.bam | head -n 20

Print first 20 fragment length values (9th column)

samtools view bam/22L007826.endtoend.bam | cut -f 9 | head -n 20

Exercises

  • What is the largest fragment length in the default bam file?
  • What is the flag of the alignment with largest fragment length in the default bam file? What does it mean?
  • What is the largest fragment length in the endtoend bam file?

Readings

 

Sort BAM files

It is recommeded to sort BAM files as most of downstream processing tools require sorted alignments.

Sorting modes:

  • genomic coordinates
  • read names

Sort alignments using samtools

samtools sort --threads 8 -o bam/22L007826.default.sorted.bam bam/22L007826.default.bam

samtools sort --threads 8 -o bam/22L007826.endtoend.sorted.bam bam/22L007826.endtoend.bam

samtools sort --threads 8 -o bam/22L007826.local.sorted.bam bam/22L007826.local.bam

Print first 10 lines of the sorted BAM files

samtools view bam/22L007826.default.sorted.bam | head -n 10

samtools view bam/22L007826.endtoend.sorted.bam | head -n 10

samtools view bam/22L007826.local.sorted.bam | head -n 10

Questions

  • What ist the default mode of sorting by samtools?

Exercises

  • What is the name of the first read when the endtoend.bam file is sorted by name?

Readings

Filter BAM files

Alignments are filtered by mapping quality (MAPQ). MAPQ is defined as the −10*log10 probability that the mapping position is wrong. MAPQ depends on the number of mismatches (MM) and the base calling quality (Q)

MAPQ >= X   #MM Q40   #MM Q20      #MM Q0    Description
0           5         7            15        All mappable reads
1           3         5            10        True multi w/ "good" AS, maxi of MAPQ >= 1
2           3         5            10        No true multi, maxi of MAPQ >= 2
3           3         5            10        No true multi,  maxi of MAPQ >= 3
8           2         4            8         No true multi, maxi of MAPQ >= 8
23          2         3            7         No true multi, maxi of MAPQ >= 23
30          1         2            4         No true multi, maxi of MAPQ >= 30
39          1         2            4         No true multi, maxi of MAPQ == 39*
40          1         2            4         No true multi, only true uni-reads
42          0         1            2         Only "perfect" true unireads

Table source: see Readings MAPQ explained.

 

Filter BAM files using samtools view -q

samtools view -q 12 --threads 8 -o bam/22L007826.default.filtered.bam bam/22L007826.default.sorted.bam

samtools view -q 12 --threads 8 -o bam/22L007826.endtoend.filtered.bam bam/22L007826.endtoend.sorted.bam

samtools view -q 12 --threads 8 -o bam/22L007826.local.filtered.bam bam/22L007826.local.sorted.bam

Exercises

  • How many reads are there before and after filtering in the endtoend files?

Readings

 

Remove duplicates

Duplicate reads can be marked and removed using Picard.

Parameters:

  • REMOVE_DUPLICATES if TRUE duplicates are removed, otherwise only marked
picard MarkDuplicates I=bam/22L007826.endtoend.filtered.bam O=bam/22L007826.endtoend.markdup.bam REMOVE_DUPLICATES=FALSE METRICS_FILE=logs/log_picard_markdup.txt

picard MarkDuplicates I=bam/22L007826.endtoend.filtered.bam O=bam/22L007826.endtoend.rmdup.bam REMOVE_DUPLICATES=TRUE METRICS_FILE=logs/log_picard_rmdup.txt 

Inspect the metrics files:

logs/log_picard_markdup.txt
logs/log_picard_rmdup.txt

Exercises

  • What is the percentage of duplicated reads?
  • How many read pairs are there after removing duplicates in the endtoend files?
  • What is the percentage of reads after all the step relative to the raw data?

Reading

Picard MarkDuplicates

Visualize Alignments in IGV

Several downstream tool require that the BAM files are indexed. This enables fast retrieval of partical regions.

Index BAM files using samtools index

samtools index bam/22L007826.endtoend.sorted.bam
samtools index bam/22L007826.endtoend.filtered.bam
samtools index bam/22L007826.endtoend.rmdup.bam

The index (BAI) files are automatically named and generated in the same folder.

ls -l bam/*.bai

IGV instructions

  • Download the BAMand BAI to a local directory.
  • Load the BAM files into IGV browser. (use File / Load from File)
  • Zoom in to the region: chr20:54,211,175-54,212,897
  • Right lick on the file name at the left panel. Set View as pairs.

Questions

  • What is the MAPQ and the edit distance of the white colored read (chr20:54,211,915-54,211,987) in the sorted.bam file?
  • How many duplicates are there for the read chr20:54,212,255-54,212,346?
  • What is the insert size of the read chr20:54,213,010-54,213,109?

Readings