00230212: Intelligent Bio-computation (智能生物计算) (Fall 2024)

Course Information

When: Monday, 19:20 – 20:55.
Where: Tsinghua University Law Library, Room B112.
Instructor: Jianzhu Ma (马剑竹) Email: majianzhu at tsinghua dot edu dot cn.
Office hours: Tuesday, 19:30 – 20:30, or by appointment (send e-mail). Where: TBD

Course description:

This course covers the principal areas of bioinformatics and computational biology, making it an "all-you-need" program for those who want to design algorithms for analyzing and interpreting biological data. Students who complete this course will gain insight into a range of fundamental algorithmic strategies, drawn directly from seminal and highly cited recent literature. The course imparts foundational knowledge in statistics, mathematics, and machine learning, including deep learning. It combines lectures with student-led presentations, as detailed in the Syllabus. The curriculum goes beyond applying existing software: it aims to give a thorough understanding of the core principles behind today's cutting-edge techniques, equipping students with the skills to design the next generation of bioinformatics tools.

Prerequisites:

This course is intended to be a first course in Bioinformatics for both undergraduate and graduate students in Electronic Engineering, Computer Science, Mathematics, Statistics, and Life Sciences. Students from non-biological disciplines may wonder how much prior biological knowledge is required to study bioinformatics. The answer is very little: nothing beyond what is typically taught in high school biology is necessary. You will, however, need a willingness to learn the biological concepts behind the computational problems we study. We will begin with a quick review of some of these concepts and will learn more as needed.
Students are expected to have the following background:

  • Basic programming skills to write a reasonably non-trivial computer program in Python.

  • At least an undergraduate course in algorithms (a graduate course in algorithms is preferred).

  • Undergraduate or graduate level machine learning courses.

  • Students without this background should discuss their preparation with the instructor. The introduction session in the first week of the class will give an overview of the expected background.

Optional textbooks

“Biological sequence analysis” by Durbin, Eddy, Krogh, Mitchison
“An introduction to bioinformatics algorithms” by Jones, Pevzner
“Network Biology:” section in Topology of molecular interaction networks
“Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Online book “Bioinformatics Algorithms” [link], Phillip Compeau and Pavel Pevzner

Grading

  • Homework (50%)

  • Final projects (50%)

Assignments

There will be FOUR homework assignments including both theory and programming problems. Programming projects can be written in either R or Python. Deep learning final projects should be written in either TensorFlow or PyTorch.

Late policy

Assignments are to be submitted by the due date listed. Assignments will NOT be accepted if they are submitted late. Extensions will be granted only for serious and documented medical or family emergencies.

Final projects

Students are encouraged to work in teams of two or three. Each team can either choose one of the projects provided on the course website or propose a separate project related to Biology. If you choose to do your own project, please contact me beforehand to discuss it. Here are the bioinformatics projects you might consider:

  • Easy projects:

    • Using deep neural networks to predict protein secondary structures.

    • Using deep neural networks to predict protein contact/distance map.

    • Implement an RNA secondary structure prediction algorithm.

    • Implement a single cell analysis pipeline.

    • Predicting protein functions using multiple protein interaction networks.

    • Implement an algorithm to correct the batch effect in single cell data.

    • Predicting enhancers using genomics features.

  • Advanced projects:

    • Compression of protein multiple sequence alignment files.

    • Implement a (hierarchical) network community detection algorithm to recover Gene Ontology.

    • Protein function prediction from interaction network using graph.

Syllabus (tentative)

Sequence Analysis (3 classes)

Time Topic Contents
09/09 Introduction & Sequence alignment (1) Basic biology knowledge; (2) Syllabus and grading policy; (3) Final projects and presentations; (4) Dynamic programming; (5) Local alignment; (6) Global alignment
Optional Reading:
What Is the Role for Algorithmics and Computational Biology in Responding to the COVID-19 Pandemic? [paper link]
09/23 Homology search (I) (1) Homology; (2) PAM; (3) BLOSUM; (4) Extreme Value Theory; (5) BLAST; (6) PSI-BLAST; (7) Multiple sequence alignment
09/30 Homology search (II) (1) Hidden Markov Models; (2) HMM-based sequence alignment; (3) HMMER; (4) HHpred/HHsearch; (5) HHblits
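As an illustration of the dynamic programming and global alignment topics listed for this unit, here is a minimal sketch of a Needleman-Wunsch style scoring routine. The function name and the scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration, not the scheme used in the lectures; PAM or BLOSUM matrices would replace the match/mismatch constants in practice.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of strings a and b.

    Illustrative scoring only: a linear gap penalty and a flat
    match/mismatch score (assumed values, not course-specified).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATACA"))  # 6 matches, 1 gap: 4
```

Local alignment (Smith-Waterman) differs mainly in clamping each cell at zero and taking the maximum over the whole table rather than the bottom-right cell.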

Introduction to Machine Learning (2 classes)

Time Topic Contents
10/14 Introduction to Machine Learning (1) Introduction; (2) AI & Neural Network history; (3) ML task, Overfitting, Underfitting; (4) Linear Regression; (5) MLE; (6) Neural Network; (7) Gradient Descent; (8) Backpropagation algorithm
Optional reading:
(1) Deep Learning in Bioinformatics [pdf]
(2) Deep learning for computational biology [pdf]
(3) Machine Learning in Genomic Medicine [pdf]
(4) Awesome DeepBio [github]
(5) PyTorch Performance Tuning Guide [youtube]
(6) The graph neural network model [pdf]
(7) Graph Representation Learning Book [website]
10/21 Introduction to Deep Learning (1) Convolutional Neural Network; (2) Recurrent Neural Network; (3) Graph Neural Network; (4) Transformer
Optional reading:
(1) Attention Is All You Need [pdf]
(2) Transformer from scratch using pytorch [link]
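To give a sense of the linear regression and gradient descent topics in this unit, the following is a minimal sketch in plain Python that fits a line by minimizing mean squared error. The function name, learning rate, and step count are assumptions for illustration; course assignments would normally use PyTorch or TensorFlow autograd instead of hand-written gradients.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error.

    lr and steps are illustrative defaults and may need tuning for
    data on a different scale.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    # Points generated from y = 2x + 1, so the fit should approach w=2, b=1.
    print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
```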

Structural Biology (3 classes)

Time Topic Contents
10/28 Protein structure prediction (1) Template-based modelling; (2) Co-evolution; (3) Template-free folding; (4) RFdiffusion & ProteinMPNN; (5) Protein structure alignment
Optional Reading:
(1) Fold proteins by playing games [link]
(2) Fold proteins by deep learning [link]
(3) TM-align: a protein structure alignment algorithm based on the TM-score [pdf]
(4) Protein structure alignment beyond spatial proximity [pdf]
(5) Matt: Local Flexibility Aids Protein Multiple Structure Alignment [pdf]
11/04 RNA structure prediction (1) RNA structures and functions; (2) Nussinov Algorithm; (3) Zuker algorithm; (4) Stochastic context free grammars (SCFG)
Optional Reading:
RNA Secondary Structure Prediction By Learning Unrolled Algorithms [link]
11/11 AlphaFold (1) MSA transformer; (2) Structure Module; (3) FAPE loss; (4) Triangle Multiplication
Optional Reading:
(1) Improved protein structure prediction using potentials from deep learning [pdf]
(2) Highly accurate protein structure prediction with AlphaFold [pdf]
(3) Accurate structure prediction of biomolecular interactions with AlphaFold 3 [pdf]
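As a small illustration of the Nussinov algorithm listed under RNA structure prediction, here is a sketch that computes the maximum number of nested base pairs by dynamic programming. The function name, the inclusion of G-U wobble pairs, and the `min_loop` parameter (minimum number of unpaired bases between paired positions) are assumptions made for this example; the lecture formulation may differ.

```python
def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence.

    Assumes Watson-Crick plus G-U wobble pairs, and requires at least
    min_loop unpaired bases between the two partners of a pair.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j] (inclusive).
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: position j is unpaired
            for k in range(i, j - min_loop):  # case: j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    inner = dp[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    best = max(best, left + 1 + inner)
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    print(nussinov_max_pairs("GGGAAACCC"))  # three G-C pairs around the loop: 3
```

This counts pairs only; the Zuker algorithm replaces the pair count with a thermodynamic energy model.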

AI Drug Development (1 class)

Time Topic Contents
11/18 CADD & AIDD (1) Small Molecule Drug; (2) Antibody; (3) Mathematical Representation; (4) Structure-based drug design; (5) ADMET prediction; (6) Retrosynthesis algorithm; (7) Drug Repositioning

Single Cell in Bioinformatics (2 classes)

Time Topic Contents
11/25 Single cell analysis (I) (1) Intro to single cell technology; (2) Differential expression analysis; (3) Dimension reduction: PCA, t-SNE; (4) Missing data imputation: MAGIC, DCA algorithms; (5) Batch effect correction: MNN algorithm
Optional reading:
(1) Single-cell RNA sequencing technologies and bioinformatics pipelines [pdf]
(2) Comprehensive integration of single cell data [pdf]
(3) Single-cell RNA sequencing for the study of development, physiology and disease [pdf]
(4) A Systematic Evaluation of Single-cell RNA-sequencing Imputation Methods [pdf]
(5) A benchmark of batch-effect correction methods for single-cell RNA sequencing data [pdf]
12/02 Single cell analysis (II) (1) Trajectory embedding; (2) RNA velocity; (3) Bulk cell deconvolution; (4) Multi-omics integration; (5) Doublets and multiplet detection
Optional reading:
(1) Lineage tracing meets single-cell omics: opportunities and challenges [pdf]
(2) Sparse Inverse Covariance Estimation with the Graphical Lasso [pdf]
(3) Foldit computer game [link]

Network in Bioinformatics (2 classes)

Time Topic Contents
12/09 Network motif detection (1) Introduction of Network Biology; (2) Basic concepts about graphs; (3) Random graphs; (4) Network motifs in biological networks; (5) G-tries algorithm
12/16 Network Alignment (1) Network visualization (2) PathBLAST (3) IsoRank (4) Representation-based network alignments
Optional Reading:
(1) REGAL: Representation Learning-based Graph Alignment [pdf]
(2) Deep Adversarial Network Alignment [pdf]
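To make the network motif topic in this unit concrete, here is a minimal sketch that counts feed-forward loops (A→B, B→C, A→C), a motif commonly studied in regulatory networks. The function name and the edge-list input format are assumptions for illustration. It counts non-induced occurrences by brute-force neighbor lookup and is not the G-tries algorithm; a full motif analysis would also compare the count against random graphs.

```python
from collections import defaultdict


def count_feed_forward_loops(edges):
    """Count feed-forward loops in a directed graph given as (source, target) pairs.

    Self-loops and duplicate edges are ignored. Each (a, b, c) role
    assignment with edges a->b, b->c, a->c is counted once.
    """
    successors = defaultdict(set)
    for u, v in edges:
        if u != v:
            successors[u].add(v)

    count = 0
    for a in list(successors):
        for b in successors[a]:
            # .get avoids inserting new keys into the defaultdict mid-loop.
            for c in successors.get(b, ()):
                if c != a and c in successors[a]:
                    count += 1
    return count


if __name__ == "__main__":
    ffl = [("X", "Y"), ("Y", "Z"), ("X", "Z")]
    cycle = [("X", "Y"), ("Y", "Z"), ("Z", "X")]
    print(count_feed_forward_loops(ffl), count_feed_forward_loops(cycle))
```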

Project presentations (1 class)

Time Topic Presenters
12/23 Project presentations TBD