Mini-Projects for the Introduction to Bioinformatics Workshop

GROUP 1 PROJECT

Author: Itunuoluwa Isewon PhD

Email: itunu.isewon@covenantuniversity.edu.ng

Phylogenetic Analysis of Fungal 28S Sequences

Background

The 28S region of rDNA is widely used as a DNA barcode for fungi. In this project, students will analyse 28S sequences from five known fungal species, along with a set of unknown fungal sequences. The aim is to use phylogenetic approaches to determine which species each unknown sequence belongs to, and to compare different tree-building methods.

Objectives

• Perform multiple sequence alignment (MSA) of fungal 28S sequences.

• Construct phylogenetic trees using Neighbour Joining (NJ), Maximum Likelihood (ML), and Minimum Evolution (ME) methods.

• Compare phylogenetic tree topologies across methods.

• Identify the most likely species for the unknown sequences.

Dataset Provided

📥 Dataset:

FASTA file containing:

• 28S sequences from 5 known fungal species

• Several unknown 28S sequences to classify

Download dataset here

Tasks

  1. Load the sequences: import the FASTA file into MEGA.
  2. Multiple Sequence Alignment (MSA): use ClustalW to align the 28S sequences.
  3. Save the alignment file.
  4. Phylogenetic Tree Construction: build trees using three methods:
     • Neighbour Joining (NJ)
     • Maximum Likelihood (ML)
     • Minimum Evolution (ME)
  5. Visualize each tree and annotate species names.
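
The distance-based methods in the task list (NJ and ME) start from a matrix of pairwise distances between aligned sequences. The sketch below is only an illustration of that idea, not what MEGA does internally: it computes the simple p-distance (proportion of differing sites) on a hypothetical toy alignment. The sequence names, the sequences themselves, and the helper `nearest_known` are assumptions made for the example.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # gapped column: not comparable
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


# Hypothetical toy alignment; real input would be the saved ClustalW alignment.
alignment = {
    "species_A": "ACGTACGTAC",
    "species_B": "ACGTTCGTAC",
    "unknown_1": "ACGTACGTAA",
}
known = ["species_A", "species_B"]

# Pairwise distance for every pair of sequences in the alignment.
distances = {
    (x, y): p_distance(alignment[x], alignment[y])
    for x, y in combinations(alignment, 2)
}


def nearest_known(unknown):
    """Known sequence with the smallest p-distance to the unknown (hypothetical helper)."""
    return min(known, key=lambda name: p_distance(alignment[unknown], alignment[name]))


for pair, distance in distances.items():
    print(pair, round(distance, 3))
print("unknown_1 is closest to", nearest_known("unknown_1"))
```

A smallest distance is only a first hint; the trees built in MEGA remain the basis for the classification.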

Analysis

• Compare the positions of the unknown sequences in each tree.

• Check if all methods agree on the classification of unknowns.

• Note any differences in tree topology between methods.
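
One possible way to make the comparison concrete is to list the clades (groups of tips below each internal node) in each tree and see which are shared. The sketch below assumes rooted trees written as nested tuples; the two toy trees, the tip names, and the helper functions are hypothetical. Trees from MEGA may be unrooted, in which case they would need rooting (for example with an outgroup) before this kind of comparison is meaningful.

```python
def clades(tree):
    """Return the set of tip sets, one per internal node of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])  # a tip
        tips = frozenset().union(*(walk(child) for child in node))
        found.add(tips)
        return tips

    walk(tree)
    return found


def sister_group(tree, tip):
    """Other tips in the smallest clade that contains the given tip."""
    smallest = min((c for c in clades(tree) if tip in c), key=len)
    return smallest - {tip}


# Hypothetical toy trees for two methods; "unk1" is an unknown sequence.
nj_tree = ((("A", "unk1"), "B"), ("C", "D"))
ml_tree = ((("A", "B"), "unk1"), ("C", "D"))

shared = clades(nj_tree) & clades(ml_tree)
only_nj = clades(nj_tree) - clades(ml_tree)
only_ml = clades(ml_tree) - clades(nj_tree)

print("shared clades:", len(shared))
print("clades only in NJ tree:", [sorted(c) for c in only_nj])
print("clades only in ML tree:", [sorted(c) for c in only_ml])
print("sister of unk1 (NJ):", sorted(sister_group(nj_tree, "unk1")))
print("sister of unk1 (ML):", sorted(sister_group(ml_tree, "unk1")))
```

Here the two toy trees disagree on whether the unknown groups with A alone or with the A+B clade, which is the kind of difference worth reporting in the presentation.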

Expected Output

  1. A ClustalW alignment file
  2. Three phylogenetic trees (NJ, ML, ME)
  3. A presentation including:
     • Figures of the phylogenetic trees
     • Identification of unknown sequences
     • Discussion of agreement/disagreement among methods

GROUP 2 PROJECT

Transcriptomics (RNA-Seq Differential Expression Analysis)

Introduction

In this mini-project, participants will perform differential gene expression analysis using RNA-seq data from the NCBI Gene Expression Omnibus (GEO) dataset GSE292521. Raw counts will be provided in CSV format along with sample group information. This exercise will give participants hands-on experience with transcriptomics data analysis.

Summary of GSE292521 Experiment

Title: Genomics and Transcriptomics of 3ANX (NX-2) and NX (NX-3) producing isolates of Fusarium graminearum

Organism: Fusarium graminearum

Scope: Analysis of 20 fungal isolates from different regions in Manitoba, characterizing both genomic variations and gene expression profiles linked to mycotoxin chemotypes (3ANX and NX). The data illuminate differential expression patterns related to pathogenicity and suggest the 3ANX chemotype may be more widespread in Canada than previously recognised.

Platform: Illumina NovaSeq 6000 sequencing.

Objectives

By the end of this mini-project, participants will be able to:

• Import RNA-seq count data and metadata into R.

• Conduct differential expression analysis using DESeq2.

• Visualize results using MA plots, heatmaps, and volcano plots.

• Interpret biological significance of differentially expressed genes.

📥 Dataset:

Dataset: GSE292521 (available on GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE292521)

Provided Files:

Download dataset here

Tasks

Participants are expected to carry out the following steps:

  1. Load count data and metadata into R.
  2. Normalize data using DESeq2.
  3. Conduct differential expression analysis.
  4. Generate visualization plots (volcano plot, heatmap).
  5. Identify top differentially expressed genes and their biological roles.
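
To give a feel for what the normalization step does, the following Python sketch illustrates the median-of-ratios idea behind DESeq2 size factors. It is a simplified illustration on a hypothetical toy count table, not a replacement for running DESeq2 in R; the sample names, counts, and the function `size_factors` are assumptions for the example.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts maps sample name -> list of raw counts, with genes in the same order.
    Genes with a zero count in any sample are skipped.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples (None if any count is zero).
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # Size factor = median ratio of a sample's counts to the per-gene geometric mean.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Hypothetical toy counts: sample_2 was sequenced about twice as deeply as sample_1.
raw_counts = {
    "sample_1": [10, 20, 30, 0],
    "sample_2": [20, 40, 60, 5],
}

factors = size_factors(raw_counts)
normalized = {
    s: [count / factors[s] for count in raw_counts[s]] for s in raw_counts
}
print(factors)
print(normalized)
```

After dividing by the size factors, the first three genes have equal normalized counts in both toy samples, which is the intended effect of correcting for sequencing depth.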

Required R Packages

The following R packages will be needed for this project:

Expected Output

By completing this project, participants should produce:

• A heatmap of top differentially expressed genes.

• A volcano plot highlighting significant genes.

• A list of significantly upregulated and downregulated genes.

• A group presentation showing the rationale, all plots, the top differentially expressed genes and their associated enriched KEGG pathways, and linking the findings to their relevance for mycology/mycotoxicology (e.g., stress response, secondary metabolite genes, toxin biosynthesis genes if present).
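
The up/down gene list and the volcano plot both come from the same two values per gene: the log2 fold change and the adjusted p-value. The sketch below shows one way to label genes and compute volcano-plot coordinates. The cutoffs (|log2FC| ≥ 1, adjusted p < 0.05) are common conventions assumed here, not values set by this project, and the gene IDs and numbers are hypothetical.

```python
import math


def classify(log2_fold_change, padj, lfc_cutoff=1.0, alpha=0.05):
    """Label a gene as 'up', 'down', or 'not significant' (cutoffs are assumptions)."""
    if padj is None or padj >= alpha:
        return "not significant"  # missing or non-significant adjusted p-value
    if log2_fold_change >= lfc_cutoff:
        return "up"
    if log2_fold_change <= -lfc_cutoff:
        return "down"
    return "not significant"


# Hypothetical results table; a real one would be exported from DESeq2.
results = [
    {"gene": "gene_1", "log2FoldChange": 2.5, "padj": 0.001},
    {"gene": "gene_2", "log2FoldChange": -1.8, "padj": 0.01},
    {"gene": "gene_3", "log2FoldChange": 0.4, "padj": 0.0001},
    {"gene": "gene_4", "log2FoldChange": 3.0, "padj": None},
]

labels = {
    row["gene"]: classify(row["log2FoldChange"], row["padj"]) for row in results
}

# Volcano plot coordinates: x = log2 fold change, y = -log10(adjusted p-value).
volcano_points = [
    (row["log2FoldChange"], -math.log10(row["padj"]))
    for row in results
    if row["padj"] is not None
]

print(labels)
print(volcano_points)
```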

GROUP 3 PROJECT

End-to-End Genomic Data Analysis of Fungal Isolates (Galaxy)

Background

Whole-genome sequencing (WGS) enables variant discovery and comparative genomics across fungal isolates. In this project, students will process short-read FASTQ files from 5 known fungal species plus several unknown isolates. They will perform quality control, read trimming, reference-guided alignment, variant calling, and annotation in Galaxy.

Objectives

Dataset Provided

A project folder containing:

  1. 5 known fungal species (≥2 isolates each if available).
  2. Several unknown isolates to classify.
  3. A metadata sheet (samples.tsv) with columns: sample_id, status (Known/Unknown), species (for known), SRA_accession (if applicable), library_layout, read_group, notes.

Alternatively, only SRA accessions may be provided; students will then fetch the reads inside Galaxy using “NCBI SRA Tools”.
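
Before uploading anything to Galaxy, it can help to check that the metadata sheet is consistent. The sketch below reads a tab-separated sheet with the columns listed above and separates known from unknown isolates. The example rows, sample IDs, and species placeholders are hypothetical, and the function `load_samples` is an assumption for illustration only.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "sample_id", "status", "species", "SRA_accession",
    "library_layout", "read_group", "notes",
]

# Hypothetical sheet content; a real run would open samples.tsv instead.
rows_text = [
    REQUIRED_COLUMNS,
    ["iso_01", "Known", "species_A", "", "paired", "rg1", ""],
    ["iso_02", "Known", "species_B", "", "paired", "rg2", ""],
    ["unk_01", "Unknown", "", "", "paired", "rg3", "to classify"],
]
example_tsv = "\n".join("\t".join(row) for row in rows_text) + "\n"


def load_samples(handle):
    """Read the sheet and check the columns and the Known/Unknown status values."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"metadata sheet is missing columns: {missing}")
    rows = list(reader)
    for row in rows:
        if row["status"] not in {"Known", "Unknown"}:
            raise ValueError(f"{row['sample_id']}: unexpected status {row['status']!r}")
        if row["status"] == "Known" and not row["species"]:
            raise ValueError(f"{row['sample_id']}: known sample without a species")
    return rows


samples = load_samples(io.StringIO(example_tsv))
known_samples = [r["sample_id"] for r in samples if r["status"] == "Known"]
unknown_samples = [r["sample_id"] for r in samples if r["status"] == "Unknown"]
print("known:", known_samples)
print("unknown:", unknown_samples)
```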

Tasks

  1. Database Navigation (NCBI/ENA/SRA & Ensembl Fungi)

Goal: identify and download all inputs reproducibly.

Steps (students document each):

  1. Reference genome:
  2. Reads (if not pre-provided):
  3. Document database pages/screenshots + accessions in a short “Data Provenance” note.

  2. Load Data into Galaxy

Goal: organise the project as reproducible Galaxy histories & collections.

Steps:

  3. Quality Control (Galaxy)

Tools & sequence (typical choices in parentheses):

  1. FastQC on all raw reads.

  2. MultiQC to summarize FastQC results.

  3. Adapter/quality trimming (e.g., Trim Galore! or fastp):

  4. FastQC (post-trim) → MultiQC (compare improvement).

Output: MultiQC HTML reports for raw and trimmed reads.
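
To help interpret the FastQC per-base quality plots, the sketch below decodes FASTQ quality strings into Phred scores and reports the mean quality per read. It assumes the Phred+33 encoding used by current Illumina data; the two toy reads and the helper names are hypothetical, and the code is an illustration rather than a substitute for FastQC.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip()
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


def mean_quality(quality):
    scores = phred_scores(quality)
    return sum(scores) / len(scores)


# Hypothetical toy reads: 'I' encodes Q40, '#' encodes Q2.
toy_fastq = "@read_1\nACGT\n+\nIIII\n@read_2\nACGT\n+\nII##\n"

mean_by_read = {
    name: mean_quality(quality)
    for name, _sequence, quality in read_fastq(io.StringIO(toy_fastq))
}
print(mean_by_read)
```

The second toy read shows the typical pattern that trimming targets: good quality at the start and a poor-quality tail.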
  4. Alignment & Post-Processing (Galaxy)

Reference choice:

Tools & steps (BWA-MEM2 pipeline example):

  1. BWA-MEM/MEM2: index reference.fasta; map paired reads → SAM.

  2. Samtools sort → BAM; Samtools index.

  3. Picard MarkDuplicates (or GATK MarkDuplicates).

  4. Alignment metrics:

  5. Variant Calling & Joint Genotyping (Galaxy)

Route 1: bcftools

  1. bcftools mpileup (per sample) with -Ou -f reference.fasta.

  2. bcftools call (per sample) with -mv (variants only) → per-sample VCF.

  3. bcftools merge (multi-sample) to form a joint VCF.
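
For readers who want to see how the three bcftools steps fit together outside Galaxy, the sketch below only assembles the equivalent command lines as strings; it does not run bcftools. The file names are hypothetical. The `-Oz -o` output options and the `bcftools index` step are additions assumed here because `bcftools merge` expects compressed, indexed inputs.

```python
import shlex


def per_sample_pipeline(reference, bam, out_vcf):
    """Command line for mpileup piped into call for one sample (not executed)."""
    mpileup = ["bcftools", "mpileup", "-Ou", "-f", reference, bam]
    # -m: multiallelic caller, -v: output variant sites only
    call = ["bcftools", "call", "-mv", "-Oz", "-o", out_vcf]
    return f"{shlex.join(mpileup)} | {shlex.join(call)}"


def index_command(vcf):
    """Index a compressed VCF so that it can be merged."""
    return shlex.join(["bcftools", "index", vcf])


def merge_command(vcfs, out_vcf):
    """Merge per-sample VCFs into one multi-sample VCF."""
    return shlex.join(["bcftools", "merge", "-Oz", "-o", out_vcf, *vcfs])


# Hypothetical sample names and file paths.
sample_names = ["sample_01", "sample_02"]
per_sample_vcfs = [f"{name}.vcf.gz" for name in sample_names]

commands = []
for name, vcf in zip(sample_names, per_sample_vcfs):
    commands.append(per_sample_pipeline("reference.fasta", f"{name}.bam", vcf))
    commands.append(index_command(vcf))
commands.append(merge_command(per_sample_vcfs, "joint.vcf.gz"))

for command in commands:
    print(command)
```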

Route 2: FreeBayes (population calling)

Filtering (either route):

Output: a filtered multi-sample VCF.
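
As an illustration of what a hard filter does to a VCF, the sketch below keeps header lines and drops records with low QUAL or low INFO/DP. The thresholds (QUAL ≥ 30, DP ≥ 10) are placeholder assumptions, not values prescribed by this project, and the toy records are hypothetical. In Galaxy the same effect would normally come from a dedicated filtering tool.

```python
def parse_info(info):
    """Turn 'DP=30;MQ=60' into {'DP': '30', 'MQ': '60'}; flags get an empty value."""
    pairs = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def passes(line, min_qual=30.0, min_depth=10):
    """True if a VCF record meets the (assumed) QUAL and INFO/DP thresholds."""
    fields = line.rstrip("\n").split("\t")
    qual = fields[5]  # column 6 of a VCF record is QUAL
    if qual == ".":
        return False  # missing quality is treated as a failure here
    depth = parse_info(fields[7]).get("DP")  # column 8 is INFO
    if not depth:
        return False
    return float(qual) >= min_qual and int(depth) >= min_depth


# Hypothetical toy VCF (sites only, no genotype columns).
vcf_lines = [
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]),
    "\t".join(["chr1", "100", ".", "A", "G", "50", ".", "DP=30"]),
    "\t".join(["chr1", "200", ".", "C", "T", "12", ".", "DP=40"]),
    "\t".join(["chr1", "300", ".", "G", "A", "80", ".", "DP=4"]),
    "\t".join(["chr1", "400", ".", "T", "C", ".", ".", "DP=20"]),
]

filtered = [line for line in vcf_lines if line.startswith("#") or passes(line)]
for line in filtered:
    print(line)
```

Trying a stricter and a looser pair of thresholds on the same file is a simple way to explore sensitivity to filters.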

  6. Variant Annotation (Galaxy)

Output: functionally annotated VCF + summary.

  7. Core SNP Matrix (Galaxy)

Goal: derive a core SNP alignment.
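
As a rough picture of what a core SNP alignment is, the sketch below turns a small multi-sample VCF into one sequence per isolate, keeping only biallelic SNP sites where every isolate has a clear, non-missing genotype. It is a simplified illustration under those assumptions, not the algorithm of vcf2phylip or SNP-sites; the isolate names and records are hypothetical.

```python
def core_snp_alignment(vcf_lines):
    """Return {sample: sequence} built from biallelic SNPs called in every sample."""
    samples = []
    columns = {}
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            columns = {sample: [] for sample in samples}
            continue
        ref, alt = fields[3], fields[4]
        if len(ref) != 1 or len(alt) != 1 or alt == ".":
            continue  # skip indels, multiallelic sites, and sites without ALT
        alleles = [ref, alt]
        gt_index = fields[8].split(":").index("GT")
        bases = []
        for sample_field in fields[9:]:
            genotype = sample_field.split(":")[gt_index].replace("|", "/").split("/")
            if "." in genotype or len(set(genotype)) != 1:
                bases = None  # missing or heterozygous call: not a core site
                break
            bases.append(alleles[int(genotype[0])])
        if bases is None:
            continue
        for sample, base in zip(samples, bases):
            columns[sample].append(base)
    return {sample: "".join(column) for sample, column in columns.items()}


header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
          "iso_1", "iso_2", "unk_1"]
# Hypothetical toy records.
records = [
    ["chr1", "100", ".", "A", "G", "50", ".", "DP=30", "GT", "0/0", "1/1", "1/1"],
    ["chr1", "200", ".", "C", "T", "60", ".", "DP=28", "GT", "1/1", "0/0", "./."],
    ["chr1", "300", ".", "G", "GA", "70", ".", "DP=25", "GT", "0/0", "1/1", "1/1"],
    ["chr1", "400", ".", "T", "C", "55", ".", "DP=31", "GT", "0/0", "0/0", "1/1"],
]
vcf_lines = ["##fileformat=VCFv4.2", "\t".join(header)]
vcf_lines += ["\t".join(record) for record in records]

alignment = core_snp_alignment(vcf_lines)
fasta = "\n".join(f">{name}\n{sequence}" for name, sequence in alignment.items())
print(fasta)
```

In the toy data, the site with a missing genotype and the indel are dropped, leaving two core SNP columns.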

  1. Extract biallelic SNPs only (e.g., bcftools view -v snps -m2 -M2).

  2. Create alignment:

vcf2phylip (Galaxy wrapper) or SNP-sites to convert VCF → PHYLIP/FASTA SNP alignment.

Analysis

Expected Output

  1. Galaxy artifacts
  2. Presentation (10–12 slides). Figures: MultiQC screenshots, pipeline schematic. Discussion of agreement/disagreement; sensitivity to filters; limitations.
  3. Reproducibility bundle: Galaxy Workflow export (.ga), History export (.tar), and a README with tool versions/parameters.

Practical Tips & Parameter Hints