Mondays include a didactic lecture and a new homework assignment. Fridays include team presentations of the homework and interactive activities based on that week’s topic.
Part 1: Statistical Foundations in Excel
Week 1 – Course Introduction
We introduce the structure and goals of the course, discuss the importance of data literacy in biomedical science, and explore a set of real-world datasets (dogs, music, movies, etc.) that will serve as our training material. Students select their dataset for analysis and apply a UMAP app for visualization.
Week 2 – Understanding Probability Distributions
This week reviews histograms and the distributions that underlie the selection of statistical tests. We explore how bin sizes and log/linear scales change the appearance of a distribution, and we define critical descriptive statistics: mean, standard deviation, mode, p-value, and z-score.
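As a rough illustration of the descriptive statistics listed above, the following sketch computes a mean, a mode, and z-scores. It is written in Python with the standard library only; the course itself works in Excel at this stage, and the `weights` values and the `z_scores` helper are hypothetical, invented for this example.

```python
import statistics

def z_scores(values):
    """Return the z-score of each value: (x - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(x - mean) / sd for x in values]

# Hypothetical measurements, invented for illustration only
weights = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print("mean:", statistics.mean(weights))
print("mode:", statistics.mode(weights))
print("z-scores:", [round(z, 2) for z in z_scores(weights)])
```

A z-score expresses each value as a number of standard deviations from the mean, so the z-scored values always have mean 0 and standard deviation 1 regardless of the original units.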
Week 3 – Statistical Tests
This week reviews visualization and testing of multiple groups within large datasets/distributions. We distinguish inferential from descriptive statistics and examine how the major statistical tests evaluate categories, numeric measurements, or combinations of the two. Emphasis is placed on the t-test due to its general applicability given the central limit theorem.
Week 4 – Correlation and Linear Models
We cover scatter plots, correlation patterns, joint distributions, and linear regression. Emphasis is placed on visualizing noisy data, interpreting trends, and understanding the statistical assumptions behind regression. We define essential statistics that describe the effect size and the predictive power of one measurement for another.
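The relationship between correlation and a fitted line can be sketched as follows. This is a minimal Python example, not the course's Excel workflow; it assumes Pearson correlation and ordinary least squares, and the function names and data points are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical noisy measurements, invented for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
slope, intercept = fit_line(x, y)
print("r:", round(r, 4), "r squared:", round(r * r, 4))
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
```

Here the slope serves as a measure of effect size (the change in y per unit of x), while r squared is the fraction of the variance in y that the line accounts for.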
Week 5 – Matrix Visualization and Dimensionality Reduction
This week introduces heatmaps, matrix organization, z-scoring, and the concepts behind PCA and UMAP. Students learn how to interpret UMAP and PCA plots and how to build heatmaps manually in Excel.
Part 2: Data Science in R
Week 6 – Intro to R: Coding and Distributions
Students are introduced to the R programming environment and re-implement histogram-based statistical testing in R. The concept of empirical p-values and the automation of comparisons using loops are introduced.
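One common way to obtain an empirical p-value with a loop is a permutation test, sketched below. The syllabus does not say which procedure the course uses, so the permutation approach is an assumption; the example is in Python rather than R, and the function name, parameters, and data are hypothetical.

```python
import random

def empirical_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the absolute difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)

# Hypothetical measurements, invented for illustration only
control = [1, 2, 3, 4, 5, 6]
treated = [11, 12, 13, 14, 15, 16]
print("empirical p-value:", empirical_p_value(control, treated))
```

The p-value here is simply the fraction of label shufflings that produce a difference at least as large as the observed one, which makes no assumption about the shape of the underlying distribution.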
Week 7 – Statistical Tests in R
Students revisit visualization of groups and applications of statistical tests in R (to compare with previous Excel-based activities). Large-scale automation of statistical tests is introduced using loops. False discovery rate is discussed as a new concept that emerges from the large-scale application of statistical tests.
Week 8 – Linear Regression in R
Students revisit linear regression and evaluation of predictive models in R (to compare with previous Excel-based activities). Multiple linear regression is introduced along with essential machine learning concepts that emerge from multivariate analyses (e.g., feature selection, overfitting).
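To show why running many tests calls for a correction, the sketch below adjusts a list of p-values with the Benjamini-Hochberg procedure. The syllabus does not name a specific false discovery rate method, so the choice of Benjamini-Hochberg is an assumption; the example is in Python rather than R, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values from four independent tests, invented for illustration
raw = [0.01, 0.04, 0.03, 0.005]
print([round(q, 4) for q in benjamini_hochberg(raw)])
```

Calling a test significant when its adjusted value falls below 0.05 aims to keep the expected share of false positives among the reported hits at or below 5 percent.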
Week 9 – Heatmaps and Multi-Omic Datasets in R
Students revisit dimensionality reduction and heatmap generation in R (to compare with previous Excel-based activities). Different methods to cluster or select rows/columns in matrices are discussed to facilitate analyses of multiple measurements and multiple samples.
Part 3: Biomedical Data Science
Week 10 – Biomedical Data Types & Technologies
This week surveys major categories of biomedical data including omics, drug screening, and literature-derived data. Students explore gene expression data, perform gene-specific visualizations using apps, and apply heatmap tools from the Human Protein Atlas. Emphasis is on pathway analysis and working with large expression matrices.
Week 11 – Bulk RNA-Seq
This week introduces RNA-seq processing pipelines including data download, normalization, enrichment analysis, and deconvolution. Students practice basic bulk RNA-seq workflows in R.
Weeks 12-13 – Single-Cell RNA-Seq
Students learn how to process and explore single-cell RNA-seq data using Seurat, define clusters, and identify communication networks. Methods include pseudobulk strategies and network-based analysis.
Weeks 14-15 – Spatial Transcriptomics & Final Projects
Spatial data is introduced with a focus on comparing scRNA-seq and spatial data, analyzing spatial neighborhoods, and identifying spatial laws that govern gene expression. Students use R-based spatial apps to explore structure. Students apply single-cell or spatial workflows to a dataset directly relevant to their thesis project.