Differential Expression with Seurat: All in One View

Content from Introduction

Last updated on 2025-12-02 | Edit this page

Estimated time: 12 minutes

Overview

Questions

How to perform differential expression analysis with Seurat?

Objectives

Gain more familiarity with standard scRNA-seq Quality Control (QC) steps.
Understand and get comfortable using various integration strategies (harmoy and seuratCCA).
Understand when and how to use all of the differential expression functions offered by Seurat: FindMarkers(), FindConservedMarkers(), and FindAllMarkers().
Learn how to use differential expression tools meant for bulk data, like DESeq2, for single-cell ‘pseudobulk’ data and understand why you might choose this approach.
Learn different ways to visualise both in-built Seurat functions and external packages like pheatmap.

Callout

This tutorial is partially based on existing material from:

Content from Setup, Quality Control and Sample Integration

Last updated on 2025-12-02 | Edit this page

Estimated time: 12 minutes

Overview

Questions

How do we identify and remove low-quality cells in scRNA-seq data?
What signs suggest batch effects between treatment conditions?
When and why do we need to integrate datasets before downstream analysis?
How do Harmony and Seurat CCA compare in aligning similar cell types?

Objectives

Load and inspect a Seurat single-cell dataset (e.g., ifnb).
Perform basic quality control and filtering using mitochondrial content and feature counts.
Visualise data distributions with violin plots and scatterplots.
Recognise when integration is needed and apply both Harmony and CCA integration methods.
Perform initial clustering and visualise condition alignment in UMAP space.

Step 1. Load the packages and data

Today we’ll be working with Seurat (a popular scRNA-seq analysis package). SeuratData will be used to load in the experimental data we’re analysing. Tidyverse is a fundamental and very popularly used set of tools to wrangle and visualise data.

We’ll need to load the DESeq2 R package for when we explore pseudobulk DE approaches

pheatmap and grid are two really useful packages for creating custom heatmaps with our scRNA-seq data and exporting figures, respectively.

R

library(Seurat)
library(SeuratData)
library(tidyverse)
library(DESeq2)
library(patchwork)
library(pheatmap)
library(grid)
library(metap)
library(harmony)
library(DropletUtils)
library(ggplot2)
library(SingleR)
library(Celldex)

set.seed(4242) # Set Seed for Reproducibility

We’re using the ifnb public dataset provided by Seurat. This dataset contains PBMC data from 8 lupus patients before and after interferon beta treatment.

I strongly encourage you to explore the other datasets offered by the SeuratData package, it can be really good practice in your spare time.

The ifnb Seurat object we’re loading in here was originally made in Seurat v4, there have since been a lot of changes from Seurat v4 to v5 so we’ll use the UpdateSeuratObject() function to update the Seurat object so that it is compatible for today.

R

head(AvailableData()) # if you want to see the available SeuratData datasets use this function

R

InstallData("ifnb") # install our treatment vs control dataset for today

data("ifnb") # Load the dataset into our current R script
ifnb <- UpdateSeuratObject(ifnb) # Make sure the seurat object is in the format of Seurat v5

str(ifnb) # we can use this to take a look at the information in our Seurat Object

Challenge

Looking at the output from the str() function above, can you tell whether this seurat object is processed or unprocessed?

Show me the solution

When loading in seurat objects, we can have a look at what processing steps have been performed on it by using the str() function. In the output we can tell that the ifnb Seurat object is unprocessed because the scale.data slot is empty, no variable features have been identified, and no dimensionality reduction functions have been performed.

Step 2: Run QC, filter out low quality cells

Lets start by processing our data (run the standard seurat workflow steps including preprocessing and filtering).

First we need to take a look at QC metrics, then decide on the thresholds for filtering.

Challenge

QC for droplet-based protocols

Show me the solution

In droplet-based protocols (e.g., 10x Genomics), millions of droplets are formed, but only some droplets contain exactly one real cell. Several types of “bad droplets” appear:

Empty droplets (no real cell)
Doublets (two cells in one droplet)

R

# Step 2a: QC and filtering
ifnb$percent.mt <- PercentageFeatureSet(object = ifnb, pattern = "^MT-") # First let's annotate the mitochondrial percentage for each cell

head((ifnb@meta.data)) # we can take a look mitochondrial percentages for the seurat object by viewing the seurat objects metadata

OUTPUT

                  orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL          CD14 Mono
AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL          CD14 Mono
AAACATACCTCGCT.1 IMMUNE_CTRL       3420          850 CTRL          CD14 Mono
AAACATACCTGGTA.1 IMMUNE_CTRL       3156         1109 CTRL                pDC
AAACATACGATGAA.1 IMMUNE_CTRL       1868          634 CTRL       CD4 Memory T
AAACATACGGCATT.1 IMMUNE_CTRL       1581          557 CTRL          CD14 Mono
                 percent.mt
AAACATACATTTCC.1          0
AAACATACCAGAAA.1          0
AAACATACCTCGCT.1          0
AAACATACCTGGTA.1          0
AAACATACGATGAA.1          0
AAACATACGGCATT.1          0

R

# Step 2b: Visualise QC metrics and identify filtering thresholds
qc.metric.plts <- VlnPlot(ifnb, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) + 
  ggtitle("Before Filtering")

WARNING

Warning in SingleExIPlot(type = type, data = data[, x, drop = FALSE], idents =
idents, : All cells have the same value of percent.mt.

R

association.plt.raw <- FeatureScatter(ifnb, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = "lm") +
  ggtitle("Before Filtering")

qc.metric.plts

R

association.plt.raw

OUTPUT

`geom_smooth()` using formula = 'y ~ x'

Discussion

Looking at the violin plots of QC metrics, what do you think about the overall quality of the ifnb dataset?

After visualising QC metrics, we’ll move on to the actual filtering

R

# Step 2c: filter out low-quality cells + visualise the metrics for our filtered seurat object
ifnb.filtered <- subset(ifnb, subset = nCount_RNA > 800 & 
                          nCount_RNA < 5000 &
                          nFeature_RNA > 200 &
                          nFeature_RNA < 1200 &
                          percent.mt < 5)

qc.metric.plts.filtered <- VlnPlot(ifnb.filtered, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) +
  ggtitle("After Filtering")
association.plt.filtered <- FeatureScatter(ifnb.filtered, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = "lm") +
  ggtitle("After Filtering")

qc.metric.plts.filtered

R

association.plt.filtered

Let’s check how many cells we’ve filtered out (looks like ~400 cells were removed):

R

## Defining a couple helper functions to standardise x and y axis for two plots
get_plot_range <- function(plot) {
  data <- layer_data(plot)
  list(
    x = range(data$x, na.rm = TRUE),
    y = range(data$y, na.rm = TRUE)
  )
}
standardise_plt_scale <- function(plt1, plt2){

  # Get ranges for both plots
  range_raw <- get_plot_range(plt1)
  range_filtered <- get_plot_range(plt2)
  
  # Calculate overall range
  x_range <- range(c(range_raw$x, range_filtered$x))
  y_range <- range(c(range_raw$y, range_filtered$y))
  
  suppressMessages({
  # Update both plots with the same x and y scales
  association.plt.raw <- association.plt.raw +
    scale_x_continuous(limits = x_range) +
    scale_y_continuous(limits = y_range)
  
  association.plt.filtered <- association.plt.filtered +
    scale_x_continuous(limits = x_range) +
    scale_y_continuous(limits = y_range)
  })
  
  # Wrap the plots
  wrapped_plots <- wrap_plots(list(association.plt.raw, association.plt.filtered), 
                              ncol = 2)

  return(wrapped_plots)
}

wrap_plots(list(qc.metric.plts, qc.metric.plts.filtered), 
           ncol = 1)

R

association.plts <- standardise_plt_scale(association.plt.raw,
                                          association.plt.filtered)
association.plts

Let’s check how many cells we’ve filtered out (looks like ~400 cells were removed):

R

ifnb

OUTPUT

An object of class Seurat
14053 features across 13999 samples within 1 assay
Active assay: RNA (14053 features, 0 variable features)
 2 layers present: counts, data

R

ifnb.filtered

OUTPUT

An object of class Seurat
14053 features across 13548 samples within 1 assay
Active assay: RNA (14053 features, 0 variable features)
 2 layers present: counts, data

Next we need to split our count matrices based on conditions. This step stores stimulated versus unstimulated expression information separately, creating a list of RNA assays grouped by the “stim” condition. Note: this is important for downstream integration steps in Seurat v5.

R

ifnb.filtered[["RNA"]] <- split(ifnb.filtered[["RNA"]], f = ifnb.filtered$stim) # Lets split our count matrices based on conditions (stored within different layers) -> needed for integration steps in Seurat v5

Step 3: Before performing differential expression between the two conditions, let’s assess whether we need to integrate our data

After filtering out low quality cells, we want to visualise our data to see how cells group by condition and if we need to perform batch-effect correction (integration)

R

ifnb.filtered <- NormalizeData(ifnb.filtered)

OUTPUT

Normalizing layer: counts.CTRL

OUTPUT

Normalizing layer: counts.STIM

R

ifnb.filtered <- FindVariableFeatures(ifnb.filtered)

OUTPUT

Finding variable features for layer counts.CTRL

OUTPUT

Finding variable features for layer counts.STIM

R

ifnb.filtered <- ScaleData(ifnb.filtered)

OUTPUT

Centering and scaling data matrix

R

## Centering and scaling data matrix

ifnb.filtered <- RunPCA(ifnb.filtered)

OUTPUT

PC_ 1
Positive:  TYROBP, C15orf48, FCER1G, CST3, SOD2, ANXA5, FTL, TYMP, TIMP1, CD63
	   LGALS1, CTSB, S100A4, KYNU, LGALS3, FCN1, PSAP, NPC2, ANXA2, IGSF6
	   S100A11, LYZ, SPI1, APOBEC3A, CD68, CTSL, NINJ1, HLA-DRA, CCL2, SDCBP
Negative:  NPM1, CCR7, CXCR4, GIMAP7, LTB, CD3D, CD7, SELL, TMSB4X, CD2
	   TRAT1, IL7R, PTPRCAP, IL32, ITM2A, RGCC, LEF1, CD3G, ALOX5AP, CREM
	   PASK, MYC, SNHG8, TSC22D3, BIRC3, GPR171, NOP58, CD27, RARRES3, CD8B
PC_ 2
Positive:  ISG15, ISG20, IFIT3, IFIT1, LY6E, TNFSF10, IFIT2, MX1, IFI6, RSAD2
	   CXCL10, OAS1, CXCL11, IFITM3, MT2A, OASL, TNFSF13B, IDO1, IL1RN, APOBEC3A
	   CCL8, GBP1, HERC5, FAM26F, GBP4, RABGAP1L, HES4, WARS, VAMP5, DEFB1
Negative:  IL8, CLEC5A, CD14, VCAN, S100A8, IER3, MARCKSL1, IL1B, PID1, CD9
	   GPX1, INSIG1, PHLDA1, PLAUR, PPIF, THBS1, OSM, SLC7A11, CTB-61M7.2, GAPDH
	   LIMS1, S100A9, GAPT, ACTB, CXCL3, C19orf59, MGST1, OLR1, CEBPB, FTH1
PC_ 3
Positive:  HLA-DQA1, CD83, HLA-DQB1, CD74, HLA-DRA, HLA-DPA1, HLA-DRB1, CD79A, HLA-DPB1, IRF8
	   MS4A1, SYNGR2, MIR155HG, HERPUD1, REL, HSP90AB1, ID3, HLA-DMA, TVP23A, FABP5
	   NME1, HSPE1, PMAIP1, BANK1, CD70, HSPD1, TSPAN13, EBI3, TCF4, CCR7
Negative:  ANXA1, GNLY, NKG7, GIMAP7, TMSB4X, PRF1, CD7, CCL5, RARRES3, CD3D
	   CD2, KLRD1, GZMH, GZMA, CTSW, GZMB, FGFBP2, CLIC3, IL32, MT2A
	   FASLG, KLRC1, CST7, RGCC, CD8A, GCHFR, OASL, GZMM, CXCR3, KLRB1
PC_ 4
Positive:  LTB, SELL, CCR7, LEF1, IL7R, CD3D, TRAT1, GIMAP7, ADTRP, PASK
	   CD3G, TSHZ2, CMTM8, SOCS3, TSC22D3, NPM1, CCL2, MYC, CCL7, CCL8
	   CTSL, SNHG8, TXNIP, CD27, S100A9, CA6, C12orf57, TMEM204, HPSE, GPR171
Negative:  NKG7, GZMB, GNLY, CST7, PRF1, CCL5, CLIC3, KLRD1, APOBEC3G, GZMH
	   GZMA, CTSW, FGFBP2, KLRC1, FASLG, C1orf21, HOPX, SH2D1B, TNFRSF18, CXCR3
	   LINC00996, SPON2, RAMP1, ID2, GCHFR, IGFBP7, HLA-DPA1, CD74, XCL2, HLA-DPB1
PC_ 5
Positive:  CCL2, CCL7, CCL8, PLA2G7, TXN, LMNA, SDS, S100A9, CSTB, ATP6V1F
	   CAPG, CCR1, EMP1, FABP5, CCR5, IDO1, TPM4, LILRB4, MGST1, CTSB
	   HPSE, CCNA1, GCLM, PDE4DIP, HSPA1A, CD63, SLC7A11, HSPA5, VIM, HSP90B1
Negative:  VMO1, FCGR3A, MS4A4A, CXCL16, MS4A7, PPM1N, HN1, LST1, SMPDL3A, ATP1B3
	   CASP5, CDKN1C, AIF1, CH25H, PLAC8, SERPINA1, TMSB4X, LRRC25, CD86, GBP5
	   HCAR3, RP11-290F20.3, COTL1, RGS19, VNN2, PILRA, STXBP2, LILRA5, C3AR1, FCGR3B

R

ElbowPlot(ifnb.filtered) # Visualise the dimensionality of the data, looks like 15 PCs is adequate to capture the majority of the variation in the data, but we'll air on the higher side and consider all 20 dimensions.

R

DimPlot(ifnb.filtered, reduction = 'pca', group.by = 'stim') # lets see how our cells separate by condition and whether integration is necessary

Challenge

Cell Cycle Check 1 — BEFORE integration (after PCA / pre-Harmony / CCA)

Seurat also features a function, CellCyleScoring to calculate which phase each individual cell is in the cell cycle using canonical markers. You can read more about it here.

Which phase in the cell cycle are the clusters in primarily? Are they different or the same between clusters?

Show me the solution

R

# ---- Cell-cycle check (PRE-integration) ----
if (!exists("cc.genes.updated.2019")) data("cc.genes.updated.2019", package = "Seurat")

# Score S/G2M
ifnb.filtered <- CellCycleScoring(
  ifnb.filtered,
  s.features   = cc.genes.updated.2019$s.genes,
  g2m.features = cc.genes.updated.2019$g2m.genes,
  set.ident    = FALSE,
  search       = TRUE
)

# Quick UMAP on PCA (if you haven't already run it)
if (!"umap" %in% Reductions(ifnb.filtered)) {
  ifnb.filtered <- RunUMAP(ifnb.filtered, dims = 1:20, reduction = "pca")
}

# Visual + quick quant
DimPlot(ifnb.filtered, reduction = "umap", group.by = "Phase", pt.size = 0.3)

R

emb_pca <- Embeddings(ifnb.filtered, "pca")[,1:20]
pc_cor_S   <- sapply(1:20, \(i) cor(emb_pca[,i], ifnb.filtered$S.Score))
pc_cor_G2M <- sapply(1:20, \(i) cor(emb_pca[,i], ifnb.filtered$G2M.Score))
print(cbind(PC=1:20, r_S=round(pc_cor_S,3), r_G2M=round(pc_cor_G2M,3)))

OUTPUT

      PC    r_S  r_G2M
 [1,]  1 -0.262 -0.245
 [2,]  2  0.065  0.048
 [3,]  3  0.023  0.038
 [4,]  4  0.002 -0.020
 [5,]  5 -0.003  0.015
 [6,]  6 -0.003  0.090
 [7,]  7 -0.045 -0.038
 [8,]  8  0.007 -0.010
 [9,]  9 -0.014  0.012
[10,] 10 -0.021 -0.061
[11,] 11 -0.006  0.064
[12,] 12 -0.008  0.008
[13,] 13  0.031  0.042
[14,] 14  0.008 -0.027
[15,] 15  0.011  0.015
[16,] 16 -0.014 -0.007
[17,] 17 -0.010  0.011
[18,] 18  0.009  0.006
[19,] 19  0.003  0.027
[20,] 20 -0.015  0.026

These are PBMCs before and after treatment, there should be cells that are similar between both conditions, it looks like we’ll have to run some batch effect correction to overlay similar cell-types from both conditions to perform downstream analysis.

Discussion

Do you think we need to integrate our data? Hint: Look at the UMAP and PC1/PC2 plots we made above

What do you think would happen if we were to perform unsupervised clustering right now, without integrating our data (or overlaying similar cells on top of each other from both conditions)?

Step 4: Integrating our data using the harmony method

Seurat v5 has made it really easy to test different integration methods quickly, let’s use a really popular approach (harmony) first.

R

# code adapted from: https://satijalab.org/seurat/articles/seurat5_integration
ifnb.filtered <- IntegrateLayers(object = ifnb.filtered,
                                 method = HarmonyIntegration,
                                 orig.reduction = "pca", 
                                 new.reduction = "harmony")

OUTPUT

The `features` argument is ignored by `HarmonyIntegration`.
Transposing data matrix

Using automatic lambda estimation

Initializing state using k-means centroids initialization

Harmony 1/10

Harmony 2/10

Harmony 3/10

Harmony converged after 3 iterations

This message is displayed once per session.

R

ifnb.filtered <- RunUMAP(ifnb.filtered, reduction = "harmony", dims = 1:20, reduction.name = "umap.harmony")

OUTPUT

00:44:41 UMAP embedding parameters a = 0.9922 b = 1.112
00:44:41 Read 13548 rows and found 20 numeric columns
00:44:41 Using Annoy for neighbor search, n_neighbors = 30
00:44:41 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
00:44:43 Writing NN index file to temp file /tmp/RtmpKAZjB2/file272b67635406
00:44:43 Searching Annoy index using 1 thread, search_k = 3000
00:44:47 Annoy recall = 100%
00:44:48 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
00:44:50 Initializing from normalized Laplacian + noise (using RSpectra)
00:44:51 Commencing optimization for 200 epochs, with 586822 positive edges
00:44:51 Using rng type: pcg
00:44:57 Optimization finished

R

after.harmony <- DimPlot(ifnb.filtered, reduction = "umap.harmony", group.by = "stim") + 
  ggtitle("After Harmony Integration")
before.integration <- DimPlot(ifnb.filtered, reduction = "umap", group.by = "stim") +
  ggtitle("Before Integration")

before.integration | after.harmony

Discussion

Looking at the UMAPs above, do you think integration was successful?

Step 5: Integrating our data using an alternative Seurat CCA method

R

ifnb.filtered <- IntegrateLayers(object = ifnb.filtered,
                                 method = CCAIntegration,
                                 orig.reduction = "pca", 
                                 new.reduction = "integrated.cca")

ifnb.filtered <- RunUMAP(ifnb.filtered, reduction = "integrated.cca", dims = 1:20, reduction.name = "umap.cca")

after.seuratCCA <- DimPlot(ifnb.filtered, reduction = "umap.cca", group.by = "stim") +
  ggtitle("After Seurat CCA Integration")

before.integration | after.seuratCCA

R

after.harmony | after.seuratCCA

R

## Show example slide of integration 'failing' but due to different cell types in each sample ***

Discussion

What do you think of the integration results now?

Hint: Also look at the PC1 and PC2 plots for each integration method.

Challenge

Cell Cycle Check 2 — AFTER integration (after umap.cca + clustering)

Now that we have integrated the data, do you think the results will be the same or different?

Show me the solution

R

# ---- Cell-cycle check (POST-integration) ----

# Visual on integrated embedding
DimPlot(ifnb.filtered, reduction = "umap.cca", group.by = "Phase", pt.size = 0.3)

R

# Phase composition by cluster and by condition
tab_phase_cluster <- prop.table(table(ifnb.filtered$seurat_clusters, ifnb.filtered$Phase), 1) * 100

ERROR

Error in `x[[i, drop = TRUE]]` at SeuratObject/R/seurat.R:2939:3:
! 'seurat_clusters' not found in this Seurat object
 Did you mean "seurat_annotations"?

R

tab_phase_cond    <- prop.table(table(ifnb.filtered$stim,            ifnb.filtered$Phase), 1) * 100
pheatmap(tab_phase_cluster,
         main = "Phase (%) by cluster",
         display_numbers = TRUE,
         number_format = "%.1f")

ERROR

Error: object 'tab_phase_cluster' not found

R

pheatmap(tab_phase_cond,    main = "Phase (%) by condition (stim)")

Step 6: Perform standard clustering steps after integration

This step collapses individual control and treatment datasets together and needs to be done before differential expression analysis

R

ifnb.filtered <- FindNeighbors(ifnb.filtered, reduction = "integrated.cca", dims = 1:20)

OUTPUT

Computing nearest neighbor graph

OUTPUT

Computing SNN

R

ifnb.filtered <- FindClusters(ifnb.filtered, resolution = 0.5)

OUTPUT

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 13548
Number of edges: 521570

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9002
Number of communities: 13
Elapsed time: 2 seconds

R

ifnb.filtered <- JoinLayers(ifnb.filtered)

Additional Challenges

Challenge

You can also use K-means clustering to cluster the data to compare to other clustering methods. How can you use the kmeans() function from stats to cluster the data and visualise it using DimPlot()?

In this example, we used k = 5 purely for illustration. As you can see, it produces fewer clusters compared to the default Louvain algorithm. You are welcome to try different k values in your own time to explore whether k-means clustering is a suitable option in this context.

Show me the solution

R

# K-means
emb <- Embeddings(ifnb.filtered, "pca")[, 1:20]
set.seed(1)
km <- kmeans(emb, centers = 12, nstart = 50)

ifnb.filtered$kmeans_k12 <- factor(km$cluster)

# Compare labelings
p1 <- DimPlot(ifnb.filtered, reduction = "umap.cca", group.by = "seurat_clusters") + ggtitle("Louvain")
p2 <- DimPlot(ifnb.filtered, reduction = "umap.cca", group.by = "kmeans_k12") + ggtitle("k-means (K=12)")
p1 | p2

R

# If you decide to proceed with k-means downstream:
Idents(ifnb.filtered) <- "kmeans_k12"

Key Points

QC filtering removes low-quality cells (e.g., low gene count or high mitochondrial %).
Integration corrects sample-to-sample variation so cells group by biology, not by batch.
Harmony and CCA both align shared cell states but use different mathematical strategies.

Content from Differential Gene Expression when dealing with two treatment conditions

Last updated on 2025-12-02 | Edit this page

Estimated time: 12 minutes

Overview

Questions

How do conserved markers help us label clusters reliably across conditions?
What exactly do avg_log2FC, pct.1, pct.2, and p_val_adj mean in FindMarkers?
Why must DE be run within a cell type (e.g., CD16 Mono_STIM vs CD16 Mono_CTRL) rather than “all cells”?

Objectives

Use FindConservedMarkers() to pick markers and label clusters.
Set identities to annotations and create compound identities (celltype.and.stim) for clean contrasts.
Run FindMarkers() to get DEGs between conditions within a cell type and interpret key columns.
Visualize DEGs (FeaturePlot with split.by, DoHeatmap / pheatmap) and export results for downstream use.
Recognize pitfalls (composition effects, inappropriate contrasts, overly lenient thresholds).

OUTPUT

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 13548
Number of edges: 521570

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9002
Number of communities: 13
Elapsed time: 2 seconds

R

DimPlot(ifnb.filtered, reduction = "umap.cca", label = T)

R

DimPlot(ifnb.filtered, reduction = "umap.cca", group.by = "stim")

Step 1. Find Conserved Markers to label our celltypes

R

## Let's look at conserved markers in cluster 4 across our two conditions (compared to all other clusters)
markers.cluster.4 <- FindConservedMarkers(ifnb.filtered, ident.1 = 4,
                     grouping.var = 'stim')

OUTPUT

Testing group CTRL: (4) vs (0, 11, 2, 7, 1, 5, 10, 9, 3, 6, 8, 12)

OUTPUT

For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
(default method for FindMarkers) please install the presto package
--------------------------------------------
install.packages('devtools')
devtools::install_github('immunogenomics/presto')
--------------------------------------------
After installation of presto, Seurat will automatically use the more
efficient implementation (no further action necessary).
This message will be shown once per session

OUTPUT

Testing group STIM: (4) vs (5, 11, 1, 3, 0, 7, 9, 2, 6, 8, 10, 12)

R

head(markers.cluster.4)

OUTPUT

          CTRL_p_val CTRL_avg_log2FC CTRL_pct.1 CTRL_pct.2 CTRL_p_val_adj
VMO1    0.000000e+00        6.020340      0.843      0.060   0.000000e+00
FCGR3A  0.000000e+00        4.127801      0.980      0.204   0.000000e+00
MS4A7   0.000000e+00        3.734967      0.957      0.196   0.000000e+00
MS4A4A  0.000000e+00        5.200846      0.587      0.025   0.000000e+00
CXCL16 1.976263e-323        2.928346      0.949      0.234  2.777242e-319
LST1   8.070733e-289        2.861452      0.929      0.251  1.134180e-284
       STIM_p_val STIM_avg_log2FC STIM_pct.1 STIM_pct.2 STIM_p_val_adj
VMO1            0        7.585467      0.721      0.022              0
FCGR3A          0        5.121272      0.989      0.128              0
MS4A7           0        3.916774      0.992      0.219              0
MS4A4A          0        4.824831      0.901      0.073              0
CXCL16          0        3.854043      0.924      0.148              0
LST1            0        3.059542      0.887      0.193              0
            max_pval minimump_p_val
VMO1    0.000000e+00              0
FCGR3A  0.000000e+00              0
MS4A7   0.000000e+00              0
MS4A4A  0.000000e+00              0
CXCL16 1.976263e-323              0
LST1   8.070733e-289              0

Let’s visualise the top upregulated, conserved between condition, marker genes identified in cluster 4 using the FeaturePlot() function.

Discussion

Try running the function in the code block below without defining a min.cutoff, or changing the value of the min.cutoff parameter.

R

# Try looking up some of these markers here: https://www.proteinatlas.org/
FeaturePlot(ifnb.filtered, reduction = "umap.cca", 
            features = c('VMO1', 'FCGR3A', 'MS4A7', 'CXCL16'), min.cutoff = 'q10')

R

# I happen to know that the cells in cluster 4 are CD16 monocytes - lets rename this cluster
# Idents(ifnb.filtered) # Let's look at the identities of our cells at the moment

R

ifnb.filtered <- RenameIdents(ifnb.filtered, '4' = 'CD16 Mono') # Let's rename cells in cluster 4 with a new cell type label
# Idents(ifnb.filtered) # we can take a look at the cell identities again

R

DimPlot(ifnb.filtered, reduction = "umap.cca", label = T) +
  ggtitle("After changing the identity of cluster 4")

Step 2: Set the identity of our clusters to the annotations provided

R

Idents(ifnb.filtered) <- ifnb.filtered@meta.data$seurat_annotations
# Idents(ifnb.filtered)

R

DimPlot(ifnb.filtered, reduction = "umap.cca", label = T)

Callout

If you want to perform cell-type identification on your own data when you don’t have a ground-truth, using automatic cell type annotation methods can be a good starting point. This approach is discussed in more detail in the Intro to scRNA-seq workshop material.

Challenge

Automated Cell Type Annotation

Show me the solution

R

# Load reference data 
# Blood & immune lineages
ref.set <- celldex::BlueprintEncodeData()

ifnb.v4 <- JoinLayers(ifnb.filtered)
sce.ifnb.filtered <- as.SingleCellExperiment(ifnb.v4)
sce.ifnb.filtered <- logNormCounts(sce.ifnb.filtered)

pred.cnts <- SingleR(
  test = sce.ifnb.filtered,
  ref = ref.set,
  labels = ref.set$label.main
)

lbls.keep <- table(pred.cnts$labels)>10
# Add SingleR labels to Seurat metadata
ifnb.filtered$SingleR.labels <- sce.ifnb.filtered$SingleR.labels
ifnb.filtered$SingleR.labels <- ifelse(lbls.keep[pred.cnts$labels], pred.cnts$labels, 'Other')

# Run UMAP (based on PCA)
ifnb.filtered <- RunUMAP(ifnb.filtered, dims = 1:20)
DimPlot(ifnb.filtered, reduction='umap.cca', group.by='SingleR.labels',  label = TRUE, label.size = 3 )

Step 3: Find differentially expressed genes (DEGs) between our two conditions, using CD16 Mono cells as an example

R

# Make another column in metadata showing what cells belong to each treatment group (This will make more sense in a bit)
ifnb.filtered$celltype.and.stim <- paste0(ifnb.filtered$seurat_annotations, '_', ifnb.filtered$stim)
# (ifnb.filtered@meta.data)

Idents(ifnb.filtered) <- ifnb.filtered$celltype.and.stim

DimPlot(ifnb.filtered, reduction = "umap.cca", label = T) # each cluster is now made up of two labels (control or stimulated)

R

DimPlot(ifnb.filtered, reduction = "umap.cca", 
        label = T, split.by = "stim") # Lets separate by condition to see what we've done a bit more clearly

We’ll now leverage these new identities to compare DEGs between our treatment groups

R

treatment.response.CD16 <- FindMarkers(ifnb.filtered, ident.1 = 'CD16 Mono_STIM', 
                                       ident.2 = 'CD16 Mono_CTRL')
head(treatment.response.CD16) # These are the genes that are upregulated in the stimulated versus control group

OUTPUT

               p_val avg_log2FC pct.1 pct.2     p_val_adj
IFIT1  1.379187e-176   5.834216 1.000 0.094 1.938172e-172
ISG15  6.273887e-166   5.333771 1.000 0.478 8.816694e-162
IFIT3  1.413978e-164   4.412990 0.992 0.314 1.987063e-160
ISG20  6.983755e-164   4.088510 1.000 0.448 9.814270e-160
IFITM3 1.056793e-161   3.191513 1.000 0.634 1.485111e-157
IFIT2  7.334976e-159   4.622453 0.974 0.162 1.030784e-154

Step 4: Lets plot conserved features vs DEGs between conditions

R

FeaturePlot(ifnb.filtered, reduction = 'umap.cca', 
            features = c('VMO1', 'FCGR3A', 'IFIT1', 'ISG15'),
            split.by = 'stim', min.cutoff = 'q10')

Step 5: Create a Heatmap to visualise DEGs between our two conditions + cell types

R

# Find upregulated genes in each group (cell type and condition)
ifnb.treatVsCtrl.markers <- FindAllMarkers(ifnb.filtered,
                                          only.pos = TRUE)

R

saveRDS(ifnb.treatVsCtrl.markers, "ifnb_stimVsCtrl_markers.rds")

If the top code block takes too long to run - you can download the rds file of the output using the code below:

Seurat’s in-built heatmap function can be quite messy and hard to interpret sometimes (we’ll learn how to make better and clearer custom heatmaps using the pheatmap package from our Seurat expression data later on).

R

ifnb.treatVsCtrl.markers <- readRDS(url("https://github.com/manveerchauhan/Seurat_DE_Workshop/raw/refs/heads/main/ifnb_stimVsCtrl_markers.rds"))

top5 <- ifnb.treatVsCtrl.markers %>%
  group_by(cluster) %>%
  dplyr::filter(avg_log2FC > 1) %>%
  slice_head(n = 5) %>%
  ungroup()

DEG.heatmap <- DoHeatmap(ifnb.filtered, features = top5$gene,
          label = FALSE)

DEG.heatmap

Key Points

QC filtering removes low-quality cells (e.g., low gene count or high mitochondrial %).
Integration corrects sample-to-sample variation so cells group by biology, not by batch.
Harmony and CCA both align shared cell states but use different mathematical strategies.

Content from Differential Expression using a pseudobulk approach and DESeq2

Last updated on 2025-12-02 | Edit this page

Estimated time: 12 minutes

Overview

Questions

Why aggregate to pseudobulk instead of testing each cell?
How does donor act as the biological replicate in DESeq2?
When do pseudobulk and single-cell DE agree or disagree, and why (e.g., zero inflation, composition, variance models)?
What minimum sampling per group (cells per donor, donors per group) is sensible before aggregating?

Objectives

Attach donor IDs to cells and create pseudobulk matrices with AggregateExpression().
Run DESeq2 from Seurat via FindMarkers(test.use=“DESeq2”) on pseudobulk objects.
Inspect and interpret key outputs (adjusted p-values, log2FC) and compare to single-cell DE.
Visualize consensus/unique DEGs and make a publication-ready heatmap with pheatmap.
Recognize pitfalls (uneven donor coverage, tiny groups, composition shifts) and report limitations.

OUTPUT

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 13548
Number of edges: 521570

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9002
Number of communities: 13
Elapsed time: 1 seconds

Step 1: We need to import sample information for each cell from the original paper

Discussion

Have a look at the ifnb.filtered seurat metadata, can you spot what we’ve done here?

R

# defining a function here to retrieve that information (code from https://satijalab.org/seurat/articles/de_vignette)
loadDonorMetadata <- function(seu.obj){
  # load the inferred sample IDs of each cell
  ctrl <- read.table(url("https://raw.githubusercontent.com/yelabucsf/demuxlet_paper_code/master/fig3/ye1.ctrl.8.10.sm.best"), head = T, stringsAsFactors = F)
  stim <- read.table(url("https://raw.githubusercontent.com/yelabucsf/demuxlet_paper_code/master/fig3/ye2.stim.8.10.sm.best"), head = T, stringsAsFactors = F)
  info <- rbind(ctrl, stim)
  
  # rename the cell IDs by substituting the '-' into '.'
  info$BARCODE <- gsub(pattern = "\\-", replacement = "\\.", info$BARCODE)
  
  # only keep the cells with high-confidence sample ID
  info <- info[grep(pattern = "SNG", x = info$BEST), ]
  
  # remove cells with duplicated IDs in both ctrl and stim groups
  info <- info[!duplicated(info$BARCODE) & !duplicated(info$BARCODE, fromLast = T), ]
  
  # now add the sample IDs to ifnb 
  rownames(info) <- info$BARCODE
  info <- info[, c("BEST"), drop = F]
  names(info) <- c("donor_id")
  seu.obj <- AddMetaData(seu.obj, metadata = info)
  
  # remove cells without donor IDs
  seu.obj$donor_id[is.na(seu.obj$donor_id)] <- "unknown"
  seu.obj <- subset(seu.obj, subset = donor_id != "unknown")
  
  return(seu.obj)
}

ifnb.filtered <- loadDonorMetadata(ifnb.filtered)
#ifnb.filtered@meta.data

Step 2: Aggregate our counts based on treatment group, cell-type, and donor id

Collapsing our single-cell matrix in this manner is referred to as creating a ‘pseudo-bulk’ count matrix.. We’re making a condensed count matrix that looks more like a bulk matrix so that we can use bulk differential expression algorithms like DESeq2. We can see this clearly when we have a look at the ifnb.pseudbulk.df we make in the following code block.

R

ifnb.pseudobulk <- AggregateExpression(ifnb.filtered, assays = "RNA",
                                   group.by = c("stim", "donor_id", "seurat_annotations"),
                                   return.seurat = TRUE)

OUTPUT

Centering and scaling data matrix

R

## Centering and scaling data matrix
# If you want the pseudobulk matrix as a dataframe you can do this:
ifnb.pseudobulk.df <- AggregateExpression(ifnb.filtered, assays = "RNA",
                                          group.by = c("stim", "donor_id", "seurat_annotations")) %>% 
  as.data.frame()

We can view the top rows here:

R

head(ifnb.pseudobulk.df)

Step 3: Perform Differential Expression using DESeq2

Just like before, lets make a new column containing the cell type and treatment group before DE with DESeq2. We do this so that we can leverage the Idents() function combined with the FindMarkers function.

To use DESeq2 in older versions of Seurat (prior to v5) we would have to perform a myriad of convoluted data wrangling steps to convert our Seurat pseudobulk matrix into a format compatible with DESeq2 (or whatever bulk DE package being used). Fortunately in Seurat v5, these different bulk DE algorithms/packages have been wrapped into the FindMarker() functions - simplifying pseudobulk DE approaches immensely.

It is worth taking a look at the Seurat DE vignette on your own to see the other pseudobulk DE methods you can use. Do keep in mind that to use some of the DE tests, you need to install that package and load it as a library separately to Seurat, as we’ve done here with DESeq2 at the start.

R

ifnb.pseudobulk$celltype.and.stim <- paste(ifnb.pseudobulk$seurat_annotations, ifnb.pseudobulk$stim, sep = "_")
Idents(ifnb.pseudobulk) <- "celltype.and.stim"


# Lets run a DEG test between treated and control CD 16 monocytes using the same FindMarkers function but with DESeq2
treatment.response.CD16.pseudo <- FindMarkers(object = ifnb.pseudobulk, 
                                      ident.1 = 'CD16 Mono_STIM', 
                                      ident.2 = 'CD16 Mono_CTRL',
                                      test.use = "DESeq2")

OUTPUT

converting counts to integer mode

OUTPUT

gene-wise dispersion estimates

OUTPUT

mean-dispersion relationship

OUTPUT

final dispersion estimates

R

head(treatment.response.CD16.pseudo)

OUTPUT

               p_val avg_log2FC pct.1 pct.2     p_val_adj
IFIT3  1.564083e-134   4.601427     1     1 2.198006e-130
IFIT2   2.122691e-84   4.613021     1     1  2.983017e-80
ISG20   1.401656e-81   4.038272     1     1  1.969747e-77
DDX58   1.366535e-73   3.448721     1     1  1.920392e-69
NT5C3A  5.127048e-66   3.942571     1     1  7.205040e-62
OASL    1.186412e-63   4.025025     1     1  1.667265e-59

R

# How are we able to use the same findmarkers function here?
head(Cells(ifnb.pseudobulk)) # our 'cells' are no longer barcodes, but have been renamed according to stim-donor-annotation when we aggregated our data earlier

OUTPUT

[1] "CTRL_SNG-101_CD14 Mono"    "CTRL_SNG-101_CD4 Naive T"
[3] "CTRL_SNG-101_CD4 Memory T" "CTRL_SNG-101_CD16 Mono"
[5] "CTRL_SNG-101_B"            "CTRL_SNG-101_CD8 T"

Step 4: Assessing differences between our pseudbulk DEGs and single-cell DEGs

First lets take a look at the DEG dataframes we made for both.

Discussion

Having a look at them, can you identify any differences? Can you think of any underlying reasons behind those differences?

Hint: Look at the p_val and p_val_adj columns.

R

head(treatment.response.CD16)

OUTPUT

               p_val avg_log2FC pct.1 pct.2     p_val_adj
IFIT1  1.379187e-176   5.834216 1.000 0.094 1.938172e-172
ISG15  6.273887e-166   5.333771 1.000 0.478 8.816694e-162
IFIT3  1.413978e-164   4.412990 0.992 0.314 1.987063e-160
ISG20  6.983755e-164   4.088510 1.000 0.448 9.814270e-160
IFITM3 1.056793e-161   3.191513 1.000 0.634 1.485111e-157
IFIT2  7.334976e-159   4.622453 0.974 0.162 1.030784e-154

R

head(treatment.response.CD16.pseudo)

OUTPUT

               p_val avg_log2FC pct.1 pct.2     p_val_adj
IFIT3  1.564083e-134   4.601427     1     1 2.198006e-130
IFIT2   2.122691e-84   4.613021     1     1  2.983017e-80
ISG20   1.401656e-81   4.038272     1     1  1.969747e-77
DDX58   1.366535e-73   3.448721     1     1  1.920392e-69
NT5C3A  5.127048e-66   3.942571     1     1  7.205040e-62
OASL    1.186412e-63   4.025025     1     1  1.667265e-59

Next let’s take a look at the degree of overlap between the actual DEGs in both approaches.

To help with this, I’m just going to define a couple helper functions below.

R

Merge_DEG_dataframes <- function(pseudobulk.de,
                                 singlecell.de){
  names(pseudobulk.de) <- paste0(names(pseudobulk.de), ".bulk")
  pseudobulk.de$gene <- rownames(pseudobulk.de)
  
  names(singlecell.de) <- paste0(names(singlecell.de), ".sc")
  singlecell.de$gene <- rownames(singlecell.de)
  
  merge_dat <- merge(singlecell.de, pseudobulk.de, by = "gene")
  merge_dat <- merge_dat[order(merge_dat$p_val.bulk), ]
  
  return(merge_dat)
}
Visualise_Overlapping_DEGs <- function(pseudobulk.de,
                                       singlecell.de){
  
  merge_dat <- Merge_DEG_dataframes(pseudobulk.de,
                                    singlecell.de)
  
  # Number of genes that are marginally significant in both; marginally significant only in bulk; and marginally     significant only in single-cell
  common <- merge_dat$gene[which(merge_dat$p_val_adj.bulk < 0.05 & 
                                   merge_dat$p_val_adj.sc < 0.05)]
  only_sc <- merge_dat$gene[which(merge_dat$p_val_adj.bulk > 0.05 & 
                                    merge_dat$p_val_adj.sc < 0.05)]
  only_pseudobulk <- merge_dat$gene[which(merge_dat$p_val_adj.bulk < 0.05 & 
                                      merge_dat$p_val_adj.sc > 0.05)]
  
  # Create a dataframe to plot overlaps
  overlap.info <- data.frame(
    category = c("Common", 
                 "Only in single-cell", 
                 "Only in pseudobulk"),
    count = c(length(common), length(only_sc), length(only_pseudobulk))
  )
  
  overlap.info$category <- factor(overlap.info$category, 
                                  levels = c("Common", 
                                             "Only in single-cell", 
                                             "Only in pseudobulk"))
  # Create the bar plot
  overlap.bar.plt <- ggplot(overlap.info, aes(x = category, y = count, fill = category)) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = count), vjust = -0.5, size = 4) +
    theme_minimal() +
    labs(title = "Number of Overlapping and Unique DEGs from Single-Cell and Pseudobulk Tests",
         x = "",
         y = "Number of Genes") +
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 45, hjust = 1))
  
  return(overlap.bar.plt)
}

Let’s use the helper functions above to plot and view the overlap/agreement between our pseudobulk versus single-cell approaches

R

merged_deg_data <- Merge_DEG_dataframes(pseudobulk.de = treatment.response.CD16.pseudo,
                                        singlecell.de = treatment.response.CD16)
merged_deg_data %>% 
  dplyr::select(gene, 
                p_val.sc, p_val.bulk,
                p_val_adj.sc, p_val_adj.bulk) %>%
  head(10)

OUTPUT

       gene      p_val.sc    p_val.bulk  p_val_adj.sc p_val_adj.bulk
2866  IFIT3 1.413978e-164 1.564083e-134 1.987063e-160  2.198006e-130
2865  IFIT2 7.334976e-159  2.122691e-84 1.030784e-154   2.983017e-80
2992  ISG20 6.983755e-164  1.401656e-81 9.814270e-160   1.969747e-77
1626  DDX58 2.340153e-109  1.366535e-73 3.288617e-105   1.920392e-69
4129 NT5C3A 5.806396e-117  5.127048e-66 8.159728e-113   7.205040e-62
4185   OASL 5.497910e-147  1.186412e-63 7.726213e-143   1.667265e-59
4544 PLSCR1 6.905174e-116  3.783863e-61 9.703841e-112   5.317463e-57
3859 MYL12A  2.476988e-83  7.623767e-61  3.480912e-79   1.071368e-56
2859  IFI35  4.701828e-99  2.527378e-58  6.607479e-95   3.551724e-54
2661  HERC5  1.578848e-92  1.829060e-56  2.218755e-88   2.570378e-52

R

# How many DEGs overlap between our two methods? Is there anything in the merged_deg_data frame that stands out to you?
overlap.bar.plt <- Visualise_Overlapping_DEGs(pseudobulk.de = treatment.response.CD16.pseudo,
                                              singlecell.de = treatment.response.CD16)

overlap.bar.plt

Discussion

Can you explain why we’re seeing the discrepancies we see between the two methods here?

Step 5: Investigate the differences between pseudobulk DE and single-cell DE closer

Let’s create lists of genes in each of our categories first

R

common <- merged_deg_data$gene[which(merged_deg_data$p_val.bulk < 0.05 & 
                                       merged_deg_data$p_val.sc < 0.05)]
only_sc <- merged_deg_data$gene[which(merged_deg_data$p_val.bulk > 0.05 & 
                                        merged_deg_data$p_val.sc < 0.05)]
only_pseudobulk <- merged_deg_data$gene[which(merged_deg_data$p_val.bulk < 0.05 & 
                                          merged_deg_data$p_val.sc > 0.05)]

Now I want to look at the expression of genes that only appear in our sc deg test and not in our pseudobulk deg test I’ve picked two genes from the ‘only_sc’ variable we just defined

R

# create a new column to annotate sample-condition-celltype in the single-cell dataset
ifnb.filtered$donor_id.and.stim <- paste0(ifnb.filtered$stim, "-", ifnb.filtered$donor_id)
Idents(ifnb.filtered) <- "celltype.and.stim"

# Explore some genes that only appear in the sc deg test---
VlnPlot(ifnb.filtered, features = c("PABPC1", "SRGN"), 
        idents = c("CD16 Mono_CTRL", "CD16 Mono_STIM"), 
        group.by = "stim")

R

VlnPlot(ifnb.filtered, features = c("PABPC1", "SRGN"), 
        idents = c("CD16 Mono_CTRL", "CD16 Mono_STIM"), 
        group.by = "donor_id.and.stim", ncol = 1)

Discussion

How would you interpret the plots above? What does this tell you about some of the pitfalls of single-cell DE approaches?

Now lets take a look at some genes that show agreement across both sc and pseudobulk deg tests

R

VlnPlot(ifnb.filtered, features = c("IFIT2", "PSMA4"), 
        idents = c("CD16 Mono_CTRL", "CD16 Mono_STIM"), 
        group.by = "stim")

R

VlnPlot(ifnb.filtered, features = c("IFIT2", "PSMA4"), 
        idents = c("CD16 Mono_CTRL", "CD16 Mono_STIM"), 
        group.by = "donor_id.and.stim", ncol = 1)

Step 6: Creating our own custom visualisations for DEG analysis between cell-types in two different experimental groups

So far, we’ve been using functions wrapped within Seurat to plot and visualise our data. But we’re not limited to those functions. Let’s use the pheatmap (‘pretty heatmap’) package to visualise our DEGs with adjusted p values < 0.05.

First lets look at our significant DEGs as defined by our pseudobulk approach

R

CD16.sig.markers <- treatment.response.CD16.pseudo %>% 
  dplyr::filter(p_val_adj < 0.05) %>%
  dplyr::mutate(gene = rownames(.))

This is how we can pull our average (scaled) pseudobulk expression values from our seurat obj:

R

ifnb.filtered$celltype.stim.donor_id <- paste0(ifnb.filtered$seurat_annotations, "-",
                                               ifnb.filtered$stim, "-", ifnb.filtered$donor_id)
Idents(ifnb.filtered) <- "celltype.stim.donor_id"

all.sig.avg.Expression.mat <- AverageExpression(ifnb.filtered, 
                         features = CD16.sig.markers$gene, 
                         layer = 'scale.data')

OUTPUT

As of Seurat v5, we recommend using AggregateExpression to perform pseudo-bulk analysis.
This message is displayed once per session.

R

## As of Seurat v5, we recommend using AggregateExpression to perform pseudo-bulk analysis.
## This message is displayed once per session.

# View(all.sig.avg.Expression.mat %>%
#  as.data.frame())

Now lets make sure we’re only using data from the CD16 cell type

R

CD16.sig.avg.Expression.mat <- all.sig.avg.Expression.mat$RNA %>%
  as.data.frame() %>%
  dplyr::select(starts_with("CD16 Mono"))

# View(CD16.sig.avg.Expression.mat)

Let’s finally view our heatmap with averaged pseudobulk scaled expression values for our signficant DEGs

R

pheatmap::pheatmap(CD16.sig.avg.Expression.mat,
         cluster_rows = TRUE,
         show_rownames = FALSE, 
         border_color = NA, 
         fontsize = 10, 
         scale = "row", 
         fontsize_row = 10, 
         height = 20)

The cool thing about pheatmap is that we can customise our heatmap with added metadata

R

cluster_metadata <- data.frame(
  row.names = colnames(CD16.sig.avg.Expression.mat)
) %>% 
  dplyr::mutate(
    Cell_Type = "CD16 Mono",
    Treatment_Group = ifelse(str_detect(row.names(.), "STIM|CTRL"), 
                      str_extract(row.names(.), "STIM|CTRL")))

sig.DEG.heatmap <- pheatmap::pheatmap(CD16.sig.avg.Expression.mat,
         cluster_rows = TRUE,
         show_rownames = FALSE,
         annotation = cluster_metadata[, c("Treatment_Group", "Cell_Type")], 
         border_color = NA, 
         fontsize = 10, 
         scale = "row", 
         fontsize_row = 10, 
         height = 20,
         annotation_names_col = FALSE)

sig.DEG.heatmap

Lets save our heatmap using the grid library (very useful package!) Code from: https://stackoverflow.com/questions/43051525/how-to-draw-pheatmap-plot-to-screen-and-also-save-to-file

R

save_pheatmap_pdf <- function(x, filename, width=7, height=7) {
   stopifnot(!missing(x))
   stopifnot(!missing(filename))
   pdf(filename, width=width, height=height)
   grid::grid.newpage()
   grid::grid.draw(x$gtable)
   dev.off()
}

save_pheatmap_pdf(sig.DEG.heatmap, "sig_DEG_pseudo.pdf")

Challenge

Do the same thing with significant DEGs from the sc approach. Do you see any differences?

Show me the solution

Run the same code as above but with the treatment.response.CD16 variable instead of treatment.response.CD16.pseudo variable.

R

CD16.sig.markers <- treatment.response.CD16 %>% 
  dplyr::filter(p_val_adj < 0.05) %>%
  dplyr::mutate(gene = rownames(.))

ifnb.filtered$celltype.stim.donor_id <- paste0(ifnb.filtered$seurat_annotations, "-",
                                               ifnb.filtered$stim, "-", ifnb.filtered$donor_id)
Idents(ifnb.filtered) <- "celltype.stim.donor_id"

all.sig.avg.Expression.mat <- AverageExpression(ifnb.filtered, 
                         features = CD16.sig.markers$gene, 
                         layer = 'scale.data')

## As of Seurat v5, we recommend using AggregateExpression to perform pseudo-bulk analysis.
## This message is displayed once per session.

# View(all.sig.avg.Expression.mat %>%
#  as.data.frame())

CD16.sig.avg.Expression.mat <- all.sig.avg.Expression.mat$RNA %>%
  as.data.frame() %>%
  dplyr::select(starts_with("CD16 Mono"))

# View(CD16.sig.avg.Expression.mat)

pheatmap::pheatmap(CD16.sig.avg.Expression.mat,
         cluster_rows = TRUE,
         show_rownames = FALSE, 
         border_color = NA, 
         fontsize = 10, 
         scale = "row", 
         fontsize_row = 10, 
         height = 20)

R

cluster_metadata <- data.frame(
  row.names = colnames(CD16.sig.avg.Expression.mat)
) %>% 
  dplyr::mutate(
    Cell_Type = "CD16 Mono",
    Treatment_Group = ifelse(str_detect(row.names(.), "STIM|CTRL"), 
                      str_extract(row.names(.), "STIM|CTRL")))

sig.DEG.heatmap <- pheatmap::pheatmap(CD16.sig.avg.Expression.mat,
         cluster_rows = TRUE,
         show_rownames = FALSE,
         annotation = cluster_metadata[, c("Treatment_Group", "Cell_Type")], 
         border_color = NA, 
         fontsize = 10, 
         scale = "row", 
         fontsize_row = 10, 
         height = 20,
         annotation_names_col = FALSE)

sig.DEG.heatmap

You have now completed the workshop!

Hopefully you’ve found this workshop useful and you’re now feeling more confident about using different integration and differential expression approaches in single-cell data analysis.

Discussion

What kind of analyses do you think we can do next after obtaining a list of differentially expressed genes (DEGs) for each cell type and treatment groups?

Key Points

Collapse single-cell counts to pseudobulk per donor × condition × cell type.
Compare single-cell DE vs pseudobulk DE.