Differential expression analysis

Last updated on 2025-11-14 | Edit this page

Overview

Questions

  • How can we identify the differentially expressed proteins between our experimental groups?

Objectives

  • Conduct differential expression analysis using limpa
  • Visualise the top differentially expressed proteins

Differential Expression Analysis (DEA)


Now it is time to analyse our data and investigate proteins that are differentially expressed between our two experimental groups (control vs active Crohn’s disease). We will make use of the limma package and closely related limpa functions specifically designed for proteomics data.

First, we will fit a linear model that includes both Class and Batch effects.

R

# Set the control group as our reference
Class <- relevel(Class, ref = "Ctrl")
## We also need to reset our colors to match the correct labels
Class.color <- Class
levels(Class.color) <- hcl.colors(nlevels(Class), palette = "cividis")

# Fit a linear model
design <- model.matrix(~Class + Batch)
design

OUTPUT

   (Intercept) ClassaCD BatchB3
1            1        1       0
2            1        1       0
3            1        0       0
4            1        1       0
5            1        0       0
6            1        0       0
7            1        1       0
8            1        1       0
9            1        0       0
10           1        0       0
11           1        0       0
12           1        1       1
13           1        0       1
14           1        1       1
15           1        1       1
16           1        0       1
17           1        0       1
18           1        1       1
19           1        0       1
20           1        0       1
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$Class
[1] "contr.treatment"

attr(,"contrasts")$Batch
[1] "contr.treatment"

We can see both Class and Batch accounted for in our design matrix.

Now we run the limpa differential expression analysis function dpcDE() to identify the top differentially expressed proteins in our dataset.

R

fit <- dpcDE(y.protein, design, plot = TRUE)

R

fit <- eBayes(fit)

# Save our DEA results as a dataframe
results <- topTable(fit, coef = 2, number = Inf)
head(results, n=10)

OUTPUT

            Protein.Group Protein.Names   Genes NPeptides   PropObs    logFC
PRTN3_HUMAN        P24158   PRTN3_HUMAN   PRTN3        11 0.5318182 4.466373
S10A9_HUMAN        P06702   S10A9_HUMAN  S100A9        23 0.6086957 3.809771
S10A8_HUMAN        P05109   S10A8_HUMAN  S100A8        17 0.6500000 3.811580
TRFL_HUMAN         P02788    TRFL_HUMAN     LTF        46 0.3152174 3.998157
CEAM8_HUMAN        P31997   CEAM8_HUMAN CEACAM8         6 0.3583333 3.130768
SAMP_HUMAN         P02743    SAMP_HUMAN    APCS         8 0.5500000 2.720497
A1AG1_HUMAN        P02763   A1AG1_HUMAN    ORM1        11 0.4454545 3.349033
CAP7_HUMAN         P20160    CAP7_HUMAN    AZU1         6 0.4666667 3.200933
FRIH_HUMAN         P02794    FRIH_HUMAN    FTH1        12 0.5625000 2.528228
RNAS2_HUMAN        P10153   RNAS2_HUMAN  RNASE2         6 0.5666667 2.887339
              AveExpr        t      P.Value    adj.P.Val         B
PRTN3_HUMAN 11.146376 6.909177 8.925046e-08 2.499013e-05 7.9144156
S10A9_HUMAN 12.809535 6.354519 4.259264e-07 5.962969e-05 6.4365522
S10A8_HUMAN 12.865666 6.070836 9.561924e-07 8.924462e-05 5.6692556
TRFL_HUMAN   9.061431 5.755654 2.361294e-06 1.601370e-04 4.8072839
CEAM8_HUMAN 10.078482 5.689096 2.859590e-06 1.601370e-04 4.6265751
SAMP_HUMAN  11.028982 5.317255 8.351615e-06 3.897420e-04 3.6088844
A1AG1_HUMAN 10.284907 5.228167 1.079921e-05 4.319682e-04 3.3650142
CAP7_HUMAN  10.422828 4.613155 6.335616e-05 2.217466e-03 1.6820861
FRIH_HUMAN  11.341144 4.227543 1.897414e-04 5.396242e-03 0.6493737
RNAS2_HUMAN 11.228811 4.222021 1.927229e-04 5.396242e-03 0.6366061
Discussion

A closer look at the results

Our output table is ordered by B statistic. Run View(results) to take a closer look at some of the lower ranked proteins.

Are there any proteins with an absolute fold change > 2 (|logFC| > 1) that are not statistically significant? Can you think of some explanations for this?

Hint: Take note of the PropObs value and remember its relationship to standard error.

Visualise differentially expressed proteins


Expression values and standard error by sample

We can visualise the log2 expression values for individual proteins, together with standard errors.

R

plotProtein(y.protein, "PRTN3_HUMAN", col = as.character(Class.color))
legend('topleft', legend = levels(Class), fill = levels(Class.color))

R

plotProtein(y.protein, "S10A9_HUMAN", col = as.character(Class.color))
legend('topleft', legend = levels(Class), fill = levels(Class.color))
Challenge

Challenge

This simple function plots the protein values from our expression matrix, which we did not directly edit to mitigate batch effects. Try running the limma function to remove batch effects from our dataset and see how this changes your plot.

R

y.protein.rbe <- y.protein
y.protein.rbe$E <- removeBatchEffect(y.protein,batch = y.protein$targets$Batch, group=y.protein$targets$Class)

plotProtein(y.protein.rbe, "S10A9_HUMAN", col = as.character(Class.color))
legend('topleft', legend = levels(Class), fill = levels(Class.color))

Heatmap

We can filter our results to the up to 50 top most significant differentially expressed proteins, then visualise via a clustered heatmap.

R

# Filter for the up to 50 most significant results
sig_proteins <- results %>%
  filter(adj.P.Val < 0.05) %>% # filter by p value
  top_n(50, wt = abs(logFC)) %>% # filter by absolute log fold change
  pull(Protein.Names)
expr_matrix <- y.protein$E[sig_proteins, ]

# Scale the data and visualise via heatmap
scaled_expr <- t(scale(t(expr_matrix)))

pheatmap(scaled_expr,
         cluster_rows = TRUE,
         cluster_cols = TRUE,
         show_rownames = TRUE,
         show_colnames = FALSE,
         annotation_col = y.protein$targets,
         clustering_method = 'ward.D')
Discussion

Challenge

How would you interpret this heatmap?

Volcano plot

We can also visualise our results as a volcano plot using EnhancedVolcano().

R

EnhancedVolcano(results,
                lab = results$Genes,
                x = 'logFC',
                y = 'adj.P.Val',
                pCutoff = 0.05,
                FCcutoff = 1.0,
                pointSize = 2.0,
                labSize = 3.5,
                drawConnectors = TRUE,
                maxoverlapsConnectors = Inf,
                lengthConnectors = unit(0.001, 'npc'),
                typeConnectors ='closed',
                ylim = c(0, 5),
                xlim = c(-5, 5))
Key Points
  • We can fit a linear model to our data and conduct differential expression analysis using limpa and limma functions.
  • There are many different ways to visualise differentially expressed proteins.