CS 59000: Bioinformatics Algorithms (Fall 2020)

Course Information

When: Monday/Wednesday 4:30 pm – 5:45 pm.
Where: LWSN B134.
Instructor: Jianzhu Ma, majianzhu@purdue.edu.
Office Hour: Thursday. 4:30 pm – 5:45 pm.or by appointment (send e-mail) Where: TBD
Online discussion is available at Blackboard (mycourses.purdue.edu).

Course description:

This course covers almost all the major research directions in bioinformatics and computational biology. It is a type of “all-you-need” class if you want to develop algorithms to process and understand biological data. By completing this course, students will encounter a handful of fundamental algorithmic approaches deriving straight from very widely cited primary literature, much of which has been published in recent years. The course also introduces basic concepts in statistics, mathematics, and machine learning including deep learning, which are all necessary to understand these approaches. Classes consist of both instructor and student presentations of their final projects (see Syllabus). It goes beyond just invoking existing popular programs and packages. Instead, the aim is for you to understand the underlying principles of the most successful techniques that are currently in use, and provide you with the capacity to design and implement the next generation of tools. Another goal of the course is to tackle the research frontiers of computational biology, and give you a glimpse of how research works, expose you to current bioinformatics research directions and guide you to find the problems most interesting to you.

Prerequisites:

This course is intended to be a first course in Bioinformatics algorithms for graduate students in computer science, computer engineering, mathematics, statistics, computational life sciences, and related fields. Students from computer science, the engineering disciplines, and mathematical sciences may be concerned about how much previous exposure to biological sciences they need to study bioinformatics. The answer is, not much—students do not need any previous knowledge of biology beyond high school. But you will need a willingness to learn biological concepts needed to understand the computational problems we study. We will begin with a quick review of some of the biological concepts needed for bioinformatics, and will learn more as needed.
Students are expected to have the following background:

  • Basic programming skills to write a reasonably non-trivial computer program in Python or C (e.g., CS 38003 or equivalent are recommended).

  • At least an undergraduate course in algorithms (a graduate course in algorithms is preferred).

  • Undergraduate or graduate level machine learning courses (e.g., CS 37300 and CS 578 are sufficient).

  • Students without this background should discuss their preparation with the instructor. The introduction session in the first week of the class will give an overview of the expected background.

Target Students

This course is intended to be a first course in Bioinformatics algorithms for graduate students in computer science, computer engineering, mathematics, statistics, computational life sciences, and related fields. This is suitable as a foundational course for students planning to do research in bioinformatics with a focus on algorithms; it is also suitable for those who wish to acquire some breadth in algorithms during graduate study and learn about important recent developments in bioinformatics.

Optional textbooks

“Biological sequence analysis” by Durbin, Eddy, Krogh, Mitchison
“An introduction to bioinformatics algorithms” by Jones, Pevzner
“Network Biology:” section in Topology of molecular interaction networks
“Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Online book “Bioinformatics Algorithms” [link], Phillip Compeau and Pavel Pevzner

Grading

  • Homework (50%)

  • Final projects (50%)

Assignments and exams

There will be FIVE homework assignments including both theory and programming problems. Assignments should be submitted online in Brightspace, piazza or Blackboard systems. Programming projects can be written in both R and Python. Deep learning final projects should be written in either Tensorflow and PyTorch.

Late policy

Assignments are to be submitted by the due date listed. Assignments will NOT BE accepted if they are submitted late. Additional extensions will be granted only due to serious and documented medical or family emergencies.

Final projects

Students are encouraged to work as teams of two or three. Each team can either choose the provided projects on the course website or a separate project related to Biology. If you choose to do your own project, please contact me beforehand to discuss it. Here are the bioinformatics projects you might consider:

  • Easy projects:

    • Using deep neural networks to predict protein secondary structures.

    • Using deep neural networks to predict protein contact/distance map.

    • Implement a RNA secondary structure prediction algorithm.

    • Implement a single cell analysis pipeline.

    • Predicting protein functions prediction using multiple protein interaction networks.

    • Implement an algorithm to correct the batch effect in single cell data.

    • Predicting enhancers using genomics features.

  • Advanced projects:

    • Implement a (hierarchical) network community detection algorithms to recover Gene Ontology.

    • Using Contralign algorithm to improve the consistency of multiple network alignment.

    • Using network and deep learning to predict disease related genes.

    • Using deep learning to predict drug response for cancer.

    • Protein function prediction from interaction network using graph.

    • Comparison and analysis of methods in 3D genome reconstruction from Hi-C data.

    • Translation of genotype to phenotype by using the molecular networks.

Syllabus (tentative)

Sequence analysis in Bioinformatics (9 classes)

Time Topic Contents
08/24 Introduction (1) Motivation (2) Basic biology knowledge (3) Syllabus and grading policy (4) Final projects and presentations
Optional Reading:
What Is the Role for Algorithmics and Computational Biology in Responding to the COVID-19 Pandemic? [paper link]
08/26 Dynamic programming and sequence alignment (1) Dynamic programming (2) Local alignment (3) Global alignment
09/02 Homology search I (1) PAM score (2) BLOSUM score (3) Extreme Value Theory (4) Blast
09/07 Multiple sequence alignment (1) Star algorithms (2) CLUSTALW (3) T-coffee (4) M-coffee (5) MUSCLE (6) Probcons
09/09 Advanced sequence alignment algorithms (1) Protein sequence alignment (2) CNFpred (3) co-evolutation alignment (4) ILP-based alignment
09/14 Comparative Genomics (1) Whole genome alignment (2) Suffix tries (3) MUMmer
Optional Reading:
MUMmer 1.0: Alignment of Whole Genomes.
MUMmer 2.0: Fast Algorithms for Large-scale Genome Alignment and Comparision.
MUMmer 3.0: Versatile and open software for comparing large genomes.
MUMmer 4.0: "MUMmer4: A fast and versatile genome alignment system
09/16 Hidden Markov Models (1) Basic concepts (2) Probability estimation (3) Finding maximum assignment (4) Parameter estimation of HMM
09/21 Homology search II (1) HMM-based sequence alignment (2) HMMER (3) HHpred/HHsearch (4) HHblits (5) ILP-based methods
09/23 Molecular Evolution and Phylogenetics (1) Basics and definitions (2) Jukes Cantor distance (3) UPGMA and Neighbor joining algorithms (4) Sankoff algorithm (5) Fitch algorithm
Optional Reading:
Phylogenetic tree building in the genomic age [review article]

Structure analysis in Bioinformatics (4 classes)

Time Topic Contents
09/28 Protein structure alignment (1) Basic knowledge of protein structures (2) RMSD (3) SSAP algorithm, (4) DALI algorithm (4) LOCK algorithm (5) Structure databases
Optional reading:
TM-align: a protein structure alignment algorithm based on the TM-score [pdf]
Protein structure alignment beyond spatial proximity [pdf]
Matt: Local Flexibility Aids Protein Multiple Structure Alignment [pdf]
09/30 Protein structure prediction (1) Template-based modelling (2) Raptor (3) Template-free folding (4) Rosetta
Optional Reading:
(1) Fold proteins by playing games [link]
(2) Fold proteins by deep learning [link]
10/05 RNA structure prediction (1) RNA structures and functions (2) Nussinov Algorithm (3) Zucker algorithm (4) Stochastic context free grammars (SCFG)
Optional Reading:
RNA Secondary Structure Prediction By Learning Unrolled Algorithms [link]
10/07 3D genome analysis (1) Basic knowledge about HiC techniques (2) Algorithms to detect Topologically Associating Domains (TAD) (3) Algorithms to Predict Enhancer-Promoter interactions (PEP)
Optional reading:
1. Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome [pdf]
2. Topological Domains in Mammalian Genomes Identified by Analysis of Chromatin Interactions [pdf]
3. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping [pdf]
4. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure [pdf]
5. 3D structures of individual mammalian genomes studied by single-cell Hi-C [pdf]

Machine Learning in Bioinformatics (5 classes)

Time Topic Contents
10/12 Research talks Given by Prof. Majid Kazemian, Prof. Daisuke Kihara and Prof. Jianzhu Ma
10/14 Introduction to Deep Learning (1) Basic concepts, MLP (2) Convolutional Neural Network (3) Recurrent Neural Network (4) Long Short Term Memory (5) Resnet
Optional reading:
(1) Deep Learning in Bioinformatics [pdf]
(2) Deep learning for computational biology [pdf]
(3) Machine Learning in Genomic Medicine [pdf]
(4) Awesome DeepBio [github]
(5) PyTorch Performance Tuning Guide [youtube]
10/19 Deep Learning in genomics & protein structures (1) 1D CNN and DeepSea (2) Point cloud modelling and 3D CNN (3) Model interpretations Optional reading:
(1) Adaptive Deconvolutional Networks for Mid and High Level Feature Learning [pdf]
(2) Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps [pdf]
(3) The Mythos of Model Interpretability [pdf]
10/21 Deep Learning on biological networks (1) GraphSage (2) Graph Convolutional Neural Network (3) Graph Attention Models (4) Popular datasets
Optional reading:
(1) The graph neural network model [pdf]
(2) Gated Graph Sequence Neural Networks [pdf]
(3) Graph Representation Learning Book [website]
10/26 Deep Learning on small moleculars (1) (2) (3)

Matrix analysis in Bioinformatics (4 classes)

Time Topic Contents
10/28 Gene expression analysis (1) Differential expression analysis (2) Dimension reduction: PCA, TSNE
11/02 Single cell analysis I (1) Single cell techniques (2) Missing data imputation: MAGIC, DCA algorithms (3) Batch effect correction: MNN algorithm
Optional reading:
(1) Single-cell RNA sequencing technologies and bioinformatics pipelines [pdf]
(2) Comprehensive integration of single cell data [pdf]
(3) Single-cell RNA sequencing for the study of development, physiology and disease [pdf]
(4) A Systematic Evaluation of Single-cell RNA-sequencing Imputation Methods [pdf]
(5) A benchmark of batch-effect correction methods for single-cell RNA sequencing data [pdf]
11/04 Single cell analysis II (1) Trajectory embedding (2) RNA velocity (3) Bulk cell deconvolution
Optional reading:
(1) Lineage tracing meets single-cell omics: opportunities and challenges [pdf]
11/09 Single cell analysis III (1) Single cell multi-omics: single cell epigenetics, single cell HiC (2) Doublets and multiplets
Optional reading:
(1) Sparse Inverse Covariance Estimation with the Graphical Lasso [pdf]
(5) Foldit computer game [link]

Network analysis in Bioinformatics (4 classes)

Time Topic Contents
11/11 Network motif detection (1) Introduction of Network Biology (2) Basic concepts about graphs (3) Random graphs (4) Network motifs in biological networks (25) G-tries algorithm
11/16 Network Alignment (1) Network visualization (2) PathBLAST (3) IsoRank (4) Representation-based network alignments
Optional Reading:
(1) REGAL: Representation Learning-based Graph Alignment [pdf]
(2) Deep Adversarial Network Alignment [pdf]
11/18 Graphical Models I (1) General concepts (2) Exact inference (3) Sum-product algorithm (4) Max-product algorithm (5) Conditional Random Fields
Optional textbooks:
(1) “Probabilistic Graphical Models” by By Daphne Koller and Nir Friedman
(2) “Graphical Models, Exponential Families, and Variational Inference” by Martin J. Wainwright and Michael I. Jordan
(3) Chapter 8 of “Pattern Recognition and Machine Learning” by Christopher M. Bishop
11/23 Graphical Models II (1) Structure learning (2) Gaussian Graphical Model (3) Pseudo-likelihood approximation (4) Protein contact prediction (5) Deep-Learning-based structure learning
Optional reading:
(1) Sparse Inverse Covariance Estimation with the Graphical Lasso [pdf]
(2) High-Dimensional Graphs and Variable Selection with the Lasso [pdf]
(3) Protein 3D Structure Computed from Evolutionary Sequence Variation [pdf]
(4) AlphaFold: Using AI for scientific discovery [BLOG POST]
(5) Foldit computer game [link]

Project presentations (2 classes)

Time Topic Presentors
11/30 Project presentations TBD
12/02 Project presentations TBD