Setup, Quality Control and Sample Integration

Last updated on 2025-12-02 | Edit this page

Estimated time: 12 minutes

Overview

Questions

How do we identify and remove low-quality cells in scRNA-seq data?
What signs suggest batch effects between treatment conditions?
When and why do we need to integrate datasets before downstream analysis?
How do Harmony and Seurat CCA compare in aligning similar cell types?

Objectives

Load and inspect a Seurat single-cell dataset (e.g., ifnb).
Perform basic quality control and filtering using mitochondrial content and feature counts.
Visualise data distributions with violin plots and scatterplots.
Recognise when integration is needed and apply both Harmony and CCA integration methods.
Perform initial clustering and visualise condition alignment in UMAP space.

Step 1. Load the packages and data

Today we’ll be working with Seurat (a popular scRNA-seq analysis package). SeuratData will be used to load in the experimental data we’re analysing. Tidyverse is a fundamental and very popularly used set of tools to wrangle and visualise data.

We’ll need to load the DESeq2 R package for when we explore pseudobulk DE approaches

pheatmap and grid are two really useful packages for creating custom heatmaps with our scRNA-seq data and exporting figures, respectively.

R

library(Seurat)
library(SeuratData)
library(tidyverse)
library(DESeq2)
library(patchwork)
library(pheatmap)
library(grid)
library(metap)
library(harmony)
library(DropletUtils)
library(ggplot2)
library(SingleR)
library(Celldex)

set.seed(4242) # Set Seed for Reproducibility

We’re using the ifnb public dataset provided by Seurat. This dataset contains PBMC data from 8 lupus patients before and after interferon beta treatment.

I strongly encourage you to explore the other datasets offered by the SeuratData package, it can be really good practice in your spare time.

The ifnb Seurat object we’re loading in here was originally made in Seurat v4, there have since been a lot of changes from Seurat v4 to v5 so we’ll use the UpdateSeuratObject() function to update the Seurat object so that it is compatible for today.

R

head(AvailableData()) # if you want to see the available SeuratData datasets use this function

R

InstallData("ifnb") # install our treatment vs control dataset for today

data("ifnb") # Load the dataset into our current R script
ifnb <- UpdateSeuratObject(ifnb) # Make sure the seurat object is in the format of Seurat v5

str(ifnb) # we can use this to take a look at the information in our Seurat Object

Challenge

Looking at the output from the str() function above, can you tell whether this seurat object is processed or unprocessed?

Show me the solution

When loading in seurat objects, we can have a look at what processing steps have been performed on it by using the str() function. In the output we can tell that the ifnb Seurat object is unprocessed because the scale.data slot is empty, no variable features have been identified, and no dimensionality reduction functions have been performed.

Step 2: Run QC, filter out low quality cells

Lets start by processing our data (run the standard seurat workflow steps including preprocessing and filtering).

First we need to take a look at QC metrics, then decide on the thresholds for filtering.

Challenge

QC for droplet-based protocols

Show me the solution

In droplet-based protocols (e.g., 10x Genomics), millions of droplets are formed, but only some droplets contain exactly one real cell. Several types of “bad droplets” appear:

Empty droplets (no real cell)
Doublets (two cells in one droplet)

R

# Step 2a: QC and filtering
ifnb$percent.mt <- PercentageFeatureSet(object = ifnb, pattern = "^MT-") # First let's annotate the mitochondrial percentage for each cell

head((ifnb@meta.data)) # we can take a look mitochondrial percentages for the seurat object by viewing the seurat objects metadata

OUTPUT

                  orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL          CD14 Mono
AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL          CD14 Mono
AAACATACCTCGCT.1 IMMUNE_CTRL       3420          850 CTRL          CD14 Mono
AAACATACCTGGTA.1 IMMUNE_CTRL       3156         1109 CTRL                pDC
AAACATACGATGAA.1 IMMUNE_CTRL       1868          634 CTRL       CD4 Memory T
AAACATACGGCATT.1 IMMUNE_CTRL       1581          557 CTRL          CD14 Mono
                 percent.mt
AAACATACATTTCC.1          0
AAACATACCAGAAA.1          0
AAACATACCTCGCT.1          0
AAACATACCTGGTA.1          0
AAACATACGATGAA.1          0
AAACATACGGCATT.1          0

R

# Step 2b: Visualise QC metrics and identify filtering thresholds
qc.metric.plts <- VlnPlot(ifnb, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) + 
  ggtitle("Before Filtering")

WARNING

Warning in SingleExIPlot(type = type, data = data[, x, drop = FALSE], idents =
idents, : All cells have the same value of percent.mt.

R

association.plt.raw <- FeatureScatter(ifnb, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = "lm") +
  ggtitle("Before Filtering")

qc.metric.plts

R

association.plt.raw

OUTPUT

`geom_smooth()` using formula = 'y ~ x'

Discussion

Looking at the violin plots of QC metrics, what do you think about the overall quality of the ifnb dataset?

After visualising QC metrics, we’ll move on to the actual filtering

R

# Step 2c: filter out low-quality cells + visualise the metrics for our filtered seurat object
ifnb.filtered <- subset(ifnb, subset = nCount_RNA > 800 & 
                          nCount_RNA < 5000 &
                          nFeature_RNA > 200 &
                          nFeature_RNA < 1200 &
                          percent.mt < 5)

qc.metric.plts.filtered <- VlnPlot(ifnb.filtered, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) +
  ggtitle("After Filtering")
association.plt.filtered <- FeatureScatter(ifnb.filtered, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = "lm") +
  ggtitle("After Filtering")

qc.metric.plts.filtered

R

association.plt.filtered

Let’s check how many cells we’ve filtered out (looks like ~400 cells were removed):

R

## Defining a couple helper functions to standardise x and y axis for two plots
get_plot_range <- function(plot) {
  data <- layer_data(plot)
  list(
    x = range(data$x, na.rm = TRUE),
    y = range(data$y, na.rm = TRUE)
  )
}
standardise_plt_scale <- function(plt1, plt2){

  # Get ranges for both plots
  range_raw <- get_plot_range(plt1)
  range_filtered <- get_plot_range(plt2)
  
  # Calculate overall range
  x_range <- range(c(range_raw$x, range_filtered$x))
  y_range <- range(c(range_raw$y, range_filtered$y))
  
  suppressMessages({
  # Update both plots with the same x and y scales
  association.plt.raw <- association.plt.raw +
    scale_x_continuous(limits = x_range) +
    scale_y_continuous(limits = y_range)
  
  association.plt.filtered <- association.plt.filtered +
    scale_x_continuous(limits = x_range) +
    scale_y_continuous(limits = y_range)
  })
  
  # Wrap the plots
  wrapped_plots <- wrap_plots(list(association.plt.raw, association.plt.filtered), 
                              ncol = 2)

  return(wrapped_plots)
}

wrap_plots(list(qc.metric.plts, qc.metric.plts.filtered), 
           ncol = 1)

R

association.plts <- standardise_plt_scale(association.plt.raw,
                                          association.plt.filtered)
association.plts

Let’s check how many cells we’ve filtered out (looks like ~400 cells were removed):

R

ifnb

OUTPUT

An object of class Seurat
14053 features across 13999 samples within 1 assay
Active assay: RNA (14053 features, 0 variable features)
 2 layers present: counts, data

R

ifnb.filtered

OUTPUT

An object of class Seurat
14053 features across 13548 samples within 1 assay
Active assay: RNA (14053 features, 0 variable features)
 2 layers present: counts, data

Next we need to split our count matrices based on conditions. This step stores stimulated versus unstimulated expression information separately, creating a list of RNA assays grouped by the “stim” condition. Note: this is important for downstream integration steps in Seurat v5.

R

ifnb.filtered[["RNA"]] <- split(ifnb.filtered[["RNA"]], f = ifnb.filtered$stim) # Lets split our count matrices based on conditions (stored within different layers) -> needed for integration steps in Seurat v5

Step 3: Before performing differential expression between the two conditions, let’s assess whether we need to integrate our data

After filtering out low quality cells, we want to visualise our data to see how cells group by condition and if we need to perform batch-effect correction (integration)

R

ifnb.filtered <- NormalizeData(ifnb.filtered)

OUTPUT

Normalizing layer: counts.CTRL

OUTPUT

Normalizing layer: counts.STIM

R

ifnb.filtered <- FindVariableFeatures(ifnb.filtered)

OUTPUT

Finding variable features for layer counts.CTRL

OUTPUT

Finding variable features for layer counts.STIM

R

ifnb.filtered <- ScaleData(ifnb.filtered)

OUTPUT

Centering and scaling data matrix

R

## Centering and scaling data matrix

ifnb.filtered <- RunPCA(ifnb.filtered)

OUTPUT

PC_ 1
Positive:  TYROBP, C15orf48, FCER1G, CST3, SOD2, ANXA5, FTL, TYMP, TIMP1, CD63
	   LGALS1, CTSB, S100A4, KYNU, LGALS3, FCN1, PSAP, NPC2, ANXA2, IGSF6
	   S100A11, LYZ, SPI1, APOBEC3A, CD68, CTSL, NINJ1, HLA-DRA, CCL2, SDCBP
Negative:  NPM1, CCR7, CXCR4, GIMAP7, LTB, CD3D, CD7, SELL, TMSB4X, CD2
	   TRAT1, IL7R, PTPRCAP, IL32, ITM2A, RGCC, LEF1, CD3G, ALOX5AP, CREM
	   PASK, MYC, SNHG8, TSC22D3, BIRC3, GPR171, NOP58, CD27, RARRES3, CD8B
PC_ 2
Positive:  ISG15, ISG20, IFIT3, IFIT1, LY6E, TNFSF10, IFIT2, MX1, IFI6, RSAD2
	   CXCL10, OAS1, CXCL11, IFITM3, MT2A, OASL, TNFSF13B, IDO1, IL1RN, APOBEC3A
	   CCL8, GBP1, HERC5, FAM26F, GBP4, RABGAP1L, HES4, WARS, VAMP5, DEFB1
Negative:  IL8, CLEC5A, CD14, VCAN, S100A8, IER3, MARCKSL1, IL1B, PID1, CD9
	   GPX1, INSIG1, PHLDA1, PLAUR, PPIF, THBS1, OSM, SLC7A11, CTB-61M7.2, GAPDH
	   LIMS1, S100A9, GAPT, ACTB, CXCL3, C19orf59, MGST1, OLR1, CEBPB, FTH1
PC_ 3
Positive:  HLA-DQA1, CD83, HLA-DQB1, CD74, HLA-DRA, HLA-DPA1, HLA-DRB1, CD79A, HLA-DPB1, IRF8
	   MS4A1, SYNGR2, MIR155HG, HERPUD1, REL, HSP90AB1, ID3, HLA-DMA, TVP23A, FABP5
	   NME1, HSPE1, PMAIP1, BANK1, CD70, HSPD1, TSPAN13, EBI3, TCF4, CCR7
Negative:  ANXA1, GNLY, NKG7, GIMAP7, TMSB4X, PRF1, CD7, CCL5, RARRES3, CD3D
	   CD2, KLRD1, GZMH, GZMA, CTSW, GZMB, FGFBP2, CLIC3, IL32, MT2A
	   FASLG, KLRC1, CST7, RGCC, CD8A, GCHFR, OASL, GZMM, CXCR3, KLRB1
PC_ 4
Positive:  LTB, SELL, CCR7, LEF1, IL7R, CD3D, TRAT1, GIMAP7, ADTRP, PASK
	   CD3G, TSHZ2, CMTM8, SOCS3, TSC22D3, NPM1, CCL2, MYC, CCL7, CCL8
	   CTSL, SNHG8, TXNIP, CD27, S100A9, CA6, C12orf57, TMEM204, HPSE, GPR171
Negative:  NKG7, GZMB, GNLY, CST7, PRF1, CCL5, CLIC3, KLRD1, APOBEC3G, GZMH
	   GZMA, CTSW, FGFBP2, KLRC1, FASLG, C1orf21, HOPX, SH2D1B, TNFRSF18, CXCR3
	   LINC00996, SPON2, RAMP1, ID2, GCHFR, IGFBP7, HLA-DPA1, CD74, XCL2, HLA-DPB1
PC_ 5
Positive:  CCL2, CCL7, CCL8, PLA2G7, TXN, LMNA, SDS, S100A9, CSTB, ATP6V1F
	   CAPG, CCR1, EMP1, FABP5, CCR5, IDO1, TPM4, LILRB4, MGST1, CTSB
	   HPSE, CCNA1, GCLM, PDE4DIP, HSPA1A, CD63, SLC7A11, HSPA5, VIM, HSP90B1
Negative:  VMO1, FCGR3A, MS4A4A, CXCL16, MS4A7, PPM1N, HN1, LST1, SMPDL3A, ATP1B3
	   CASP5, CDKN1C, AIF1, CH25H, PLAC8, SERPINA1, TMSB4X, LRRC25, CD86, GBP5
	   HCAR3, RP11-290F20.3, COTL1, RGS19, VNN2, PILRA, STXBP2, LILRA5, C3AR1, FCGR3B

R

ElbowPlot(ifnb.filtered) # Visualise the dimensionality of the data, looks like 15 PCs is adequate to capture the majority of the variation in the data, but we'll air on the higher side and consider all 20 dimensions.

R

DimPlot(ifnb.filtered, reduction = 'pca', group.by = 'stim') # lets see how our cells separate by condition and whether integration is necessary

Challenge

Cell Cycle Check 1 — BEFORE integration (after PCA / pre-Harmony / CCA)

Seurat also features a function, CellCyleScoring to calculate which phase each individual cell is in the cell cycle using canonical markers. You can read more about it here.

Which phase in the cell cycle are the clusters in primarily? Are they different or the same between clusters?

Show me the solution

R

# ---- Cell-cycle check (PRE-integration) ----
if (!exists("cc.genes.updated.2019")) data("cc.genes.updated.2019", package = "Seurat")

# Score S/G2M
ifnb.filtered <- CellCycleScoring(
  ifnb.filtered,
  s.features   = cc.genes.updated.2019$s.genes,
  g2m.features = cc.genes.updated.2019$g2m.genes,
  set.ident    = FALSE,
  search       = TRUE
)

# Quick UMAP on PCA (if you haven't already run it)
if (!"umap" %in% Reductions(ifnb.filtered)) {
  ifnb.filtered <- RunUMAP(ifnb.filtered, dims = 1:20, reduction = "pca")
}

# Visual + quick quant
DimPlot(ifnb.filtered, reduction = "umap", group.by = "Phase", pt.size = 0.3)

R

emb_pca <- Embeddings(ifnb.filtered, "pca")[,1:20]
pc_cor_S   <- sapply(1:20, \(i) cor(emb_pca[,i], ifnb.filtered$S.Score))
pc_cor_G2M <- sapply(1:20, \(i) cor(emb_pca[,i], ifnb.filtered$G2M.Score))
print(cbind(PC=1:20, r_S=round(pc_cor_S,3), r_G2M=round(pc_cor_G2M,3)))

OUTPUT

      PC    r_S  r_G2M
 [1,]  1 -0.262 -0.245
 [2,]  2  0.065  0.048
 [3,]  3  0.023  0.038
 [4,]  4  0.002 -0.020
 [5,]  5 -0.003  0.015
 [6,]  6 -0.003  0.090
 [7,]  7 -0.045 -0.038
 [8,]  8  0.007 -0.010
 [9,]  9 -0.014  0.012
[10,] 10 -0.021 -0.061
[11,] 11 -0.006  0.064
[12,] 12 -0.008  0.008
[13,] 13  0.031  0.042
[14,] 14  0.008 -0.027
[15,] 15  0.011  0.015
[16,] 16 -0.014 -0.007
[17,] 17 -0.010  0.011
[18,] 18  0.009  0.006
[19,] 19  0.003  0.027
[20,] 20 -0.015  0.026

These are PBMCs before and after treatment, there should be cells that are similar between both conditions, it looks like we’ll have to run some batch effect correction to overlay similar cell-types from both conditions to perform downstream analysis.

Discussion

Do you think we need to integrate our data? Hint: Look at the UMAP and PC1/PC2 plots we made above

What do you think would happen if we were to perform unsupervised clustering right now, without integrating our data (or overlaying similar cells on top of each other from both conditions)?

Step 4: Integrating our data using the harmony method

Seurat v5 has made it really easy to test different integration methods quickly, let’s use a really popular approach (harmony) first.

R

# code adapted from: https://satijalab.org/seurat/articles/seurat5_integration
ifnb.filtered <- IntegrateLayers(object = ifnb.filtered,
                                 method = HarmonyIntegration,
                                 orig.reduction = "pca", 
                                 new.reduction = "harmony")

OUTPUT

The `features` argument is ignored by `HarmonyIntegration`.
Transposing data matrix

Using automatic lambda estimation

Initializing state using k-means centroids initialization

Harmony 1/10

Harmony 2/10

Harmony 3/10

Harmony converged after 3 iterations

This message is displayed once per session.

R

ifnb.filtered <- RunUMAP(ifnb.filtered, reduction = "harmony", dims = 1:20, reduction.name = "umap.harmony")

OUTPUT

00:44:41 UMAP embedding parameters a = 0.9922 b = 1.112
00:44:41 Read 13548 rows and found 20 numeric columns
00:44:41 Using Annoy for neighbor search, n_neighbors = 30
00:44:41 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
00:44:43 Writing NN index file to temp file /tmp/RtmpKAZjB2/file272b67635406
00:44:43 Searching Annoy index using 1 thread, search_k = 3000
00:44:47 Annoy recall = 100%
00:44:48 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
00:44:50 Initializing from normalized Laplacian + noise (using RSpectra)
00:44:51 Commencing optimization for 200 epochs, with 586822 positive edges
00:44:51 Using rng type: pcg
00:44:57 Optimization finished

R

after.harmony <- DimPlot(ifnb.filtered, reduction = "umap.harmony", group.by = "stim") + 
  ggtitle("After Harmony Integration")
before.integration <- DimPlot(ifnb.filtered, reduction = "umap", group.by = "stim") +
  ggtitle("Before Integration")

before.integration | after.harmony

Discussion

Looking at the UMAPs above, do you think integration was successful?

Step 5: Integrating our data using an alternative Seurat CCA method

R

ifnb.filtered <- IntegrateLayers(object = ifnb.filtered,
                                 method = CCAIntegration,
                                 orig.reduction = "pca", 
                                 new.reduction = "integrated.cca")

ifnb.filtered <- RunUMAP(ifnb.filtered, reduction = "integrated.cca", dims = 1:20, reduction.name = "umap.cca")

after.seuratCCA <- DimPlot(ifnb.filtered, reduction = "umap.cca", group.by = "stim") +
  ggtitle("After Seurat CCA Integration")

before.integration | after.seuratCCA

R

after.harmony | after.seuratCCA

R

## Show example slide of integration 'failing' but due to different cell types in each sample ***

Discussion

What do you think of the integration results now?

Hint: Also look at the PC1 and PC2 plots for each integration method.

Challenge

Cell Cycle Check 2 — AFTER integration (after umap.cca + clustering)

Now that we have integrated the data, do you think the results will be the same or different?

Show me the solution

R

# ---- Cell-cycle check (POST-integration) ----

# Visual on integrated embedding
DimPlot(ifnb.filtered, reduction = "umap.cca", group.by = "Phase", pt.size = 0.3)

R

# Phase composition by cluster and by condition
tab_phase_cluster <- prop.table(table(ifnb.filtered$seurat_clusters, ifnb.filtered$Phase), 1) * 100

ERROR

Error in `x[[i, drop = TRUE]]` at SeuratObject/R/seurat.R:2939:3:
! 'seurat_clusters' not found in this Seurat object
 Did you mean "seurat_annotations"?

R

tab_phase_cond    <- prop.table(table(ifnb.filtered$stim,            ifnb.filtered$Phase), 1) * 100
pheatmap(tab_phase_cluster,
         main = "Phase (%) by cluster",
         display_numbers = TRUE,
         number_format = "%.1f")

ERROR

Error: object 'tab_phase_cluster' not found

R

pheatmap(tab_phase_cond,    main = "Phase (%) by condition (stim)")

Step 6: Perform standard clustering steps after integration

This step collapses individual control and treatment datasets together and needs to be done before differential expression analysis

R

ifnb.filtered <- FindNeighbors(ifnb.filtered, reduction = "integrated.cca", dims = 1:20)

OUTPUT

Computing nearest neighbor graph

OUTPUT

Computing SNN

R

ifnb.filtered <- FindClusters(ifnb.filtered, resolution = 0.5)

OUTPUT

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 13548
Number of edges: 521570

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9002
Number of communities: 13
Elapsed time: 2 seconds

R

ifnb.filtered <- JoinLayers(ifnb.filtered)

Additional Challenges

Challenge

You can also use K-means clustering to cluster the data to compare to other clustering methods. How can you use the kmeans() function from stats to cluster the data and visualise it using DimPlot()?

In this example, we used k = 5 purely for illustration. As you can see, it produces fewer clusters compared to the default Louvain algorithm. You are welcome to try different k values in your own time to explore whether k-means clustering is a suitable option in this context.

Show me the solution

R

# K-means
emb <- Embeddings(ifnb.filtered, "pca")[, 1:20]
set.seed(1)
km <- kmeans(emb, centers = 12, nstart = 50)

ifnb.filtered$kmeans_k12 <- factor(km$cluster)

# Compare labelings
p1 <- DimPlot(ifnb.filtered, reduction = "umap.cca", group.by = "seurat_clusters") + ggtitle("Louvain")
p2 <- DimPlot(ifnb.filtered, reduction = "umap.cca", group.by = "kmeans_k12") + ggtitle("k-means (K=12)")
p1 | p2

R

# If you decide to proceed with k-means downstream:
Idents(ifnb.filtered) <- "kmeans_k12"

Key Points

QC filtering removes low-quality cells (e.g., low gene count or high mitochondrial %).
Integration corrects sample-to-sample variation so cells group by biology, not by batch.
Harmony and CCA both align shared cell states but use different mathematical strategies.