00230212: Intelligent Biocomputation (智能生物计算) (Fall 2024)
Course Information
When: Monday 19:20 – 20:55.
Where: Tsinghua University Law Library, Room B112 (清华大学法律图书馆B112).
Instructor: Jianzhu Ma (马剑竹)
Email: majianzhu at tsinghua dot edu dot cn.
Office Hours: Tuesday 19:30 – 20:30, or by appointment (send email). Where: TBD
Course description:
This course covers all the principal areas of bioinformatics and computational biology, making it an "all-you-need" program for those aspiring to craft algorithms for analyzing and interpreting biological data. Students who complete this course will gain insight into a range of fundamental algorithmic strategies, drawn directly from seminal and highly cited recent literature. The course imparts foundational knowledge in statistics, mathematics, and machine learning, including deep learning. It combines lectures with student-led presentations, as detailed in the Syllabus. The curriculum goes beyond applying existing software: it aims to impart a thorough understanding of the core principles behind today's cutting-edge techniques, equipping students to design the next generation of bioinformatics tools.
Prerequisites:
This course is intended to be a first course in Bioinformatics for both undergraduate and graduate students in Electronic Engineering, Computer Science, Mathematics, Statistics, and Life Sciences.
Students from non-biological disciplines may wonder how much prior biological knowledge is required to study bioinformatics. The answer is very little: nothing beyond what is typically taught in high school biology is necessary. However, you will need a willingness to learn the biological concepts behind the computational problems we study. We will begin with a quick review of some of the biological concepts needed for bioinformatics, and will learn more as needed.
Students are expected to have the following background:
Basic programming skills to write a reasonably nontrivial computer program in Python.
At least an undergraduate course in algorithms (a graduate course in algorithms is preferred).
Undergraduate or graduate level machine learning courses.
Students without this background should discuss their preparation with the instructor. The introduction session in the first week of the class will give an overview of the expected background.
Optional textbooks
“Biological sequence analysis” by Durbin, Eddy, Krogh, Mitchison
“An introduction to bioinformatics algorithms” by Jones, Pevzner
“Network Biology:” section in Topology of molecular interaction networks
“Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Online book “Bioinformatics Algorithms” [link], Phillip Compeau and Pavel Pevzner
Grading
Homework (50%)
Final projects (50%)
Assignments
There will be FOUR homework assignments including both theory and programming problems.
Programming projects can be written in either R or Python. Deep learning final projects should be written in either TensorFlow or PyTorch.
Late policy
Assignments are to be submitted by the due date listed. Assignments will NOT BE accepted if they are submitted late. Additional extensions will be granted only due to serious and documented medical or family emergencies.
Final projects
Students are encouraged to work as teams of two or three. Each team can either choose the provided projects on the course website or a separate project related to Biology. If you choose to do your own project, please contact me beforehand to discuss it. Here are the bioinformatics projects you might consider:
Easy projects:
Using deep neural networks to predict protein secondary structures.
Using deep neural networks to predict protein contact/distance map.
Implement an RNA secondary structure prediction algorithm.
Implement a single-cell analysis pipeline.
Predicting protein functions using multiple protein interaction networks.
Implement an algorithm to correct the batch effect in single-cell data.
Predicting enhancers using genomics features.
Advanced projects:
Compression of protein multiple sequence alignment files.
Implement a (hierarchical) network community detection algorithm to recover Gene Ontology.
Protein function prediction from interaction network using graph.
Syllabus (tentative)
Sequence Analysis (3 classes)
Time  Topic  Contents 
09/09  Introduction & Sequence alignment  (1) Basic biology knowledge; (2) Syllabus and grading policy; (3) Final projects and presentations; (4) Dynamic programming; (5) Local alignment; (6) Global alignment. Optional Reading: What Is the Role for Algorithmics and Computational Biology in Responding to the COVID-19 Pandemic? [paper link]
09/23  Homology search (I)  (1) Homology; (2) PAM; (3) BLOSUM; (4) Extreme Value Theory; (5) BLAST; (6) PSI-BLAST; (7) Multiple sequence alignment
09/30  Homology search (II)  (1) Hidden Markov Models; (2) HMM-based sequence alignment; (3) HMMER; (4) HHpred/HHsearch; (5) HHblits
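
As a rough illustration of the dynamic programming, global alignment, and local alignment topics listed for the first class, here is a minimal Python sketch of score-only Needleman-Wunsch and Smith-Waterman recurrences. The function names and the scoring parameters (match = 1, mismatch = -1, linear gap = -2) are assumptions chosen for illustration, not the course's reference implementation; in practice a substitution matrix such as BLOSUM would replace the match/mismatch scores.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best score aligning all of a with all of b."""
    n, m = len(a), len(b)
    # dp[i][j] = best score aligning the prefixes a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        dp[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                dp[i - 1][j] + gap,      # gap in b
                dp[i][j - 1] + gap,      # gap in a
            )
    return dp[n][m]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            dp[i][j] = max(
                0,
                dp[i - 1][j - 1] + sub,
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
            best = max(best, dp[i][j])
    return best


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
    print(local_alignment_score("XXACGTXX", "YYACGTYY"))
```

The only difference between the two recurrences is the boundary condition and the extra `0` option, which is why both are usually taught together. Recovering the alignment itself would require a traceback through `dp`, which this sketch omits.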

Introduction to Machine Learning (2 classes)
Time  Topic  Contents 
10/14  Introduction to Machine Learning  (1) Introduction; (2) AI & Neural Network history; (3) ML task, Overfitting, Underfitting; (4) Linear Regression; (5) MLE; (6) Neural Network; (7) Gradient Descent; (8) Backpropagation algorithm. Optional reading: (1) Deep Learning in Bioinformatics [pdf] (2) Deep learning for computational biology [pdf] (3) Machine Learning in Genomic Medicine [pdf] (4) Awesome DeepBio [github] (5) PyTorch Performance Tuning Guide [youtube] (6) The graph neural network model [pdf] (7) Graph Representation Learning Book [website]
10/21  Introduction to Deep Learning  (1) Convolutional Neural Network; (2) Recurrent Neural Network; (3) Graph Neural Network; (4) Transformer Optional reading: (1) Attention Is All You Need [pdf] (2) Transformer from scratch using pytorch [link]
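
To give a sense of how the linear regression and gradient descent topics fit together, the following is a minimal pure-Python sketch that fits a line by gradient descent on the mean squared error. The function name `fit_line`, the learning rate, and the step count are illustrative assumptions; course assignments would more likely use NumPy or PyTorch.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current model on every training point.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of (1/n) * sum(error^2) with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    # Noise-free data generated from y = 2x + 1 (an assumed toy example).
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0, 3.0, 5.0, 7.0, 9.0]
    w, b = fit_line(xs, ys)
    print(round(w, 4), round(b, 4))
```

Backpropagation generalizes the two hand-written gradient lines: a deep learning framework computes them automatically for models with many layers.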

Structural Biology (3 classes)
Time  Topic  Contents 
10/28  Protein structure prediction  (1) Template-based modelling; (2) Coevolution; (3) Template-free folding; (4) RFdiffusion & ProteinMPNN; (5) Protein structure alignment. Optional Reading: (1) Fold proteins by playing games [link] (2) Fold proteins by deep learning [link] (3) TM-align: a protein structure alignment algorithm based on the TM-score [pdf] (4) Protein structure alignment beyond spatial proximity [pdf] (5) Matt: Local Flexibility Aids Protein Multiple Structure Alignment [pdf]
11/04  RNA structure prediction  (1) RNA structures and functions; (2) Nussinov algorithm; (3) Zuker algorithm; (4) Stochastic context-free grammars (SCFG). Optional Reading: RNA Secondary Structure Prediction By Learning Unrolled Algorithms [link]
11/11  AlphaFold  (1) MSA transformer; (2) Structure Module; (3) FAPE loss; (4) Triangle Multiplication. Optional Reading: (1) Improved protein structure prediction using potentials from deep learning [pdf] (2) Highly accurate protein structure prediction with AlphaFold [pdf] (3) Accurate structure prediction of biomolecular interactions with AlphaFold 3 [pdf]
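
For the Nussinov algorithm listed under RNA structure prediction, a minimal Python sketch of the base-pair maximization recurrence is given below. The function name, the inclusion of G-U wobble pairs, and the minimum hairpin loop length of 3 are assumptions for illustration; the sketch returns only the maximum number of pairs and omits the traceback that recovers the structure.

```python
def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style DP)."""
    # Watson-Crick pairs plus G-U wobble pairs (an assumed pairing rule).
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = max number of pairs within seq[i..j], inclusive
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            # case 2: j pairs with some k, leaving at least min_loop
            # unpaired bases between them
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    print(nussinov_max_pairs("GGGAAACCC"))
```

The Zuker algorithm keeps the same interval-based dynamic programming structure but minimizes a free-energy model instead of counting pairs.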

AI Drug Development (1 class)
Time  Topic  Contents 
11/18  CADD & AIDD  (1) Small Molecule Drug; (2) Antibody; (3) Mathematical Representation; (4) Structure-based drug design; (5) ADMET prediction; (6) Retrosynthesis algorithm; (7) Drug Repositioning

Single Cell in Bioinformatics (2 classes)
Time  Topic  Contents 
11/25  Single cell analysis (I)  (1) Intro to single-cell technology; (2) Differential expression analysis; (3) Dimension reduction: PCA, t-SNE; (4) Missing data imputation: MAGIC, DCA algorithms; (5) Batch effect correction: MNN algorithm. Optional reading: (1) Single-cell RNA sequencing technologies and bioinformatics pipelines [pdf] (2) Comprehensive integration of single cell data [pdf] (3) Single-cell RNA sequencing for the study of development, physiology and disease [pdf] (4) A Systematic Evaluation of Single-cell RNA-sequencing Imputation Methods [pdf] (5) A benchmark of batch-effect correction methods for single-cell RNA sequencing data [pdf]
12/02  Single cell analysis (II)  (1) Trajectory embedding; (2) RNA velocity; (3) Bulk cell deconvolution; (4) Multi-omics integration; (5) Doublet and multiplet detection. Optional reading: (1) Lineage tracing meets single-cell omics: opportunities and challenges [pdf] (2) Sparse Inverse Covariance Estimation with the Graphical Lasso [pdf] (3) Foldit computer game [link]

Network in Bioinformatics (2 classes)
Time  Topic  Contents 
12/09  Network motif detection  (1) Introduction of Network Biology; (2) Basic concepts about graphs; (3) Random graphs; (4) Network motifs in biological networks; (5) G-tries algorithm
12/16  Network Alignment  (1) Network visualization; (2) PathBLAST; (3) IsoRank; (4) Representation-based network alignments. Optional Reading: (1) REGAL: Representation Learning-based Graph Alignment [pdf] (2) Deep Adversarial Network Alignment [pdf]

Project presentations (1 class)
Time  Topic  Presenters
12/23  Project presentations  TBD

