Bioinformatics Omics Analysis Pipeline
Table of Contents
Content Structure #
omics-analysis-example/
│
├── data/
│ ├── raw/
│ │ ├── *_1.fastq
│ │ └── *_2.fastq
│ │
│ └── trimmed/
│ ├── clean_1.fastq
│ └── clean_2.fastq
│
├── reference/
│ ├── genome.fa
│ ├── annotation.gtf
│ └── index/
│ ├── *.1.ht2
│ ├── *.2.ht2
│ └── ...
│
├── results/
│ ├── alignment/
│ │ ├── aligned.sam
│ │ ├── aligned.bam
│ │ ├── aligned.sorted.bam
│ │ └── aligned.sorted.bam.bai
│ │
│ ├── counts/
│ │ ├── gene_counts.txt
│ │ └── gene_counts.summary
│ │
│ └── variants/
│
├── qc/
│ ├── raw/
│ ├── trimmed/
│ ├── alignment/
│ └── multiqc/
│ ├── raw/
│ ├── trimmed/
│ └── final/
│
├── logs/
│ ├── fastp.log
│ ├── hisat2.log
│ ├── samtools.log
│ └── featurecounts.log
│
├── scripts/
│ ├── run_fastp.sh
│ ├── run_alignment.sh
│ ├── run_counts.sh
│ └── pipeline.sh
│
└── env/
└── environment.yml
Conda Environment Setup #
conda env create -f env/environment.yml
conda activate omics-analysis
Data Preparation #
Step 1: Download Dataset (SRA) #
A bulk RNA-seq dataset usually comes from one biological sample. In paired-end sequencing, it produces two FASTQ files: _1.fastq and _2.fastq, both containing reads from the same sample.
Sometimes, multiple samples are sequenced together in one run. Each sample is tagged with a barcode so they can be pooled during sequencing. This is called multiplexing. After sequencing, a process called demultiplexing separates the reads back into individual samples based on these barcodes. Each sample then has its own pair of FASTQ files: _1.fastq and _2.fastq.
A small human RNA-seq dataset (one sample) is downloaded here and used to demonstrate how to process and quantify counts in a standard bioinformatics pipeline for human RNA-seq analysis.
fasterq-dump ERR188273 --split-files -O data/raw/
Step 2: Download Reference Genome (GRCh38) #
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
mv Homo_sapiens.GRCh38.dna.primary_assembly.fa reference/genome.fa
Step 3: Download Annotation #
GTF (Gene Transfer Format)
wget https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz
gunzip Homo_sapiens.GRCh38.111.gtf.gz
mv Homo_sapiens.GRCh38.111.gtf reference/annotation.gtf
Step 4: Build HISAT2 Index #
hisat2-build reference/genome.fa reference/index/grch38
Bioinformatics Pipeline #
Step 1: Quality Control (Raw Reads) #
fastqc data/raw/*.fastq -o qc/raw/
Step 2: Aggregate QC (Raw) #
multiqc qc/raw/ -o qc/multiqc/raw/
Step 3: Read Trimming #
fastp \
-i data/raw/ERR188273_1.fastq \
-I data/raw/ERR188273_2.fastq \
-o data/trimmed/clean_1.fastq \
-O data/trimmed/clean_2.fastq \
-h qc/trimmed/fastp.html \
-j qc/trimmed/fastp.json \
-w 10
Step 4: Quality Control (Trimmed Reads) #
fastqc data/trimmed/*.fastq -o qc/trimmed/
Step 5: Aggregate QC (Trimmed) #
multiqc qc/trimmed/ -o qc/multiqc/trimmed/
Step 6: Alignment (key step) #
hisat2 -p 10 -x reference/index/grch38 \
-1 data/trimmed/clean_1.fastq \
-2 data/trimmed/clean_2.fastq \
-S results/alignment/aligned.sam
Step 7: BAM Processing #
samtools view -bS results/alignment/aligned.sam | samtools sort -o results/alignment/aligned.sorted.bam
samtools index results/alignment/aligned.sorted.bam
rm results/alignment/aligned.sam
Step 8: Quality Control (Alignment) #
samtools flagstat results/alignment/aligned.sorted.bam > qc/alignment/flagstat.txt
samtools stats results/alignment/aligned.sorted.bam > qc/alignment/stats.txt
Step 9: Aggregate QC (Alignment) #
multiqc qc/ -o qc/multiqc/final/
Step 10: Gene Quantification (key step) #
featureCounts \
-T 10 \
-p \
-a reference/annotation.gtf \
-o results/counts/counts.txt \
results/alignment/aligned.sorted.bam