BIOINFORMATICS AND BIOSTATISTICS
Academic year and teacher
- Academic year
- 2021/2022
- Teacher
- ANDREA BENAZZO
- Credits
- 6
- Didactic period
- Second semester
- SSD
- BIO/18
Training objectives
- The course aims to provide the skills needed to understand how genomic information is organized within prokaryotic and eukaryotic cells, together with the bioinformatic and statistical tools used to characterize it. In particular, students will learn the state-of-the-art sequencing techniques used to analyze whole genomes, with a specific focus on library preparation from biological samples, on Illumina/PacBio/Nanopore sequencing chemistry and on the bioinformatic pipelines developed for processing sequencing data. Students will learn how to process sequencing data bioinformatically, either using a reference genome or by performing a “de novo” assembly. Basic tools for the statistical analysis of complex biological processes and for the interpretation of experimental data will also be explained. The computer lab sessions will show how to apply the bioinformatic tools to real sequencing data in the Linux environment, and students will gain experience in how to
- perform quality checks on raw sequencing data,
- align reads to a reference genome,
- perform quality checks on the alignment,
- call and filter variants.
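The lab steps listed above can be sketched as a short sequence of command lines. The following is only an illustrative outline, not the pipeline actually used in the course: the file names, the sample name and the quality threshold of 20 are hypothetical, and optional refinement steps such as duplicate removal are left out. The helper `build_commands` is introduced here purely for illustration; it only assembles the command strings and does not run any tool.

```python
def build_commands(reference, reads_1, reads_2, sample, min_qual=20):
    """Assemble shell commands for a minimal reference-based variant-calling run.

    All file names and the min_qual threshold are illustrative assumptions.
    """
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    raw_vcf = f"{sample}.raw.vcf"
    filtered_vcf = f"{sample}.filtered.vcf"
    return [
        # 1. Quality check on raw reads
        f"fastqc {reads_1} {reads_2}",
        # 2. Index the reference and align the paired reads to it
        f"bwa index {reference}",
        f"bwa mem {reference} {reads_1} {reads_2} > {sam}",
        # 3. Sort and index the alignment, then summarize it as a basic check
        f"samtools sort -o {bam} {sam}",
        f"samtools index {bam}",
        f"samtools flagstat {bam}",
        # 4. Call variants and keep only those above the quality threshold
        f"freebayes -f {reference} {bam} > {raw_vcf}",
        f'vcffilter -f "QUAL > {min_qual}" {raw_vcf} > {filtered_vcf}',
    ]


# Hypothetical input files, used only to show the resulting commands.
commands = build_commands("ref.fa", "reads_1.fastq", "reads_2.fastq", "sample1")
for command in commands:
    print(command)
```

On a Linux machine with the tools installed, each string could be pasted into a bash shell in order, or passed to `subprocess.run(command, shell=True, check=True)`.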
Students will acquire theoretical knowledge of genomic organization in different organisms, of how to sequence a whole genome efficiently, of the bioinformatic processing of the resulting sequencing data for variant discovery, and of how to use statistical tools to draw conclusions from experimental data. Through practical activities in the computer lab, students will learn how to apply commonly used bioinformatic tools to real sequencing data and how to apply statistical theory to real biological examples.
Prerequisites
- No preparatory course is required. However, the bioinformatic analysis of genome sequencing data requires a good knowledge of genetics, in particular of the laws of inheritance and of mutational mechanisms. Moreover, the theoretical elements of advanced biostatistics require a good knowledge of basic statistics, and applying them to biological data requires basic computer skills.
Course programme
- Lectures (40 hours, 5 CFU) and computer laboratory (12 hours, 1 CFU) covering the following topics:
- From DNA to proteins (2 hours). Transcription and translation. Different types of mutation: nucleotide and amino acid substitutions. Generation of genetic variability by insertions, deletions and recombination mechanisms.
- Genomes: structure, content, and organization in prokaryotic and eukaryotic organisms (4 hours)
- The human genome: an example of a complex eukaryotic genome (2 hours)
- Next-generation sequencing technologies (8 hours). Library preparation, sequencing and signal detection using the Illumina sequencing platform.
- Single molecule sequencing technologies (4 hours). Introduction to PacBio and Nanopore sequencing.
- “De novo” genome assembly (8 hours)
- Calling variants from sequencing data using a reference genome (6 hours). Quality check on reads. Efficient alignment algorithms. PCR duplicate removal. Indel realignment. SNP and genotype calling. Validation of variants.
- Linear correlation between numerical variables (2 hours). Estimating the correlation coefficient, hypothesis testing, main assumptions, nonparametric correlation.
- Regression (4 hours). The concept of linear regression, reliability of predictions, hypothesis testing on the slope, main assumptions, variable transformations, measurement error, non-linear regression.
- Practical session on bioinformatics (12 hours). Introduction to Linux and the bash environment. Quality check of reads (with FastQC). Alignment of reads to a reference genome (using BWA). Basic operations on alignments (with SAMtools). Alignment refinement (with SAMtools and GATK). Variant and genotype calling (with freebayes). Quality control of called variants (with vcflib).
Didactic methods
- The course consists of theoretical lectures (40 hours) and practical sessions in the computer room (12 hours), for 52 hours in total. Each lecture is delivered using PowerPoint slides, with the blackboard used to explain theoretical concepts. Students will be introduced to the theoretical framework for studying whole genomes, from the production of sequencing data, through the bioinformatic pipelines, to variant discovery and statistical validation. All these concepts will be applied to real examples during the computer lab sessions using the Linux environment.
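As a small illustration of the statistical topics in the programme (linear correlation and linear regression), the sketch below computes the Pearson correlation coefficient, the t statistic used to test the null hypothesis of zero correlation (with n - 2 degrees of freedom), and the least-squares slope and intercept. It is a minimal example under stated assumptions, not course material: the function names are introduced here for illustration and the data values are made up.

```python
import math


def _sums_of_squares(x, y):
    """Return (sxx, syy, sxy, mean_x, mean_y) for paired samples x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxx, syy, sxy, mean_x, mean_y


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    sxx, syy, sxy, _, _ = _sums_of_squares(x, y)
    return sxy / math.sqrt(sxx * syy)


def correlation_t_statistic(x, y):
    """t statistic for H0: correlation = 0, with len(x) - 2 degrees of freedom."""
    r = pearson_r(x, y)
    n = len(x)
    return r * math.sqrt((n - 2) / (1 - r ** 2))


def linear_fit(x, y):
    """Least-squares slope and intercept of the line y = intercept + slope * x."""
    sxx, _, sxy, mean_x, mean_y = _sums_of_squares(x, y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative measurements (not real experimental data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
t = correlation_t_statistic(x, y)
slope, intercept = linear_fit(x, y)
print(f"r = {r:.4f}, t = {t:.2f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The t statistic would then be compared with a t distribution with n - 2 degrees of freedom to obtain a p-value; that step is omitted here to keep the example free of external libraries.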
Learning assessment procedures
- The aim of the exam is to verify the extent to which the learning objectives have been achieved. The exam consists of 30 questions, including short open questions, multiple-choice questions and small exercises. 25 questions cover the theoretical topics and the remaining 5 cover the practical session. To pass the exam, students must correctly answer at least 16 of the 30 questions within 1 hour and 30 minutes.
Reference texts
- Xinkun Wang, Next-Generation Sequencing Data Analysis, CRC Press
- Pascarella and Paiardini, Bioinformatica, Zanichelli