Content from Exploring the dataset


Last updated on 2025-11-14 | Edit this page

Overview

Questions

  • What is the data we are using in this tutorial?

Objectives

  • Understand the dataset we are using in this tutorial.
  • Learn how to download publicly available proteomics data from within R.

The data


The data we are using in this tutorial comes from a study done exploring a non-invasive diagnostic approach for Inflammatory Bowel Disease (IBD) using mass spectrometry and machine learning to analyse the stool proteome. Traditional diagnostic methods like colonoscopies are invasive, prompting the need for alternatives. The researchers collected 123 stool samples and identified 48 differentially expressed proteins, narrowing them down to 7 key proteins using feature selection. A Support Vector Machine (SVM) model was developed, achieving 96% sensitivity and 76% specificity in distinguishing active IBD patients from symptomatic non-IBD patients. This approach demonstrates the potential for accurate, non-invasive IBD diagnosis, improving patient management and reducing the need for invasive procedures.

For the purpose of this tutorial, we will be using a subset of the total data presented in the paper, including:

  • 11 control (Ctrl) samples

    • 6 from batch one + 5 from batch three
  • 9 active Crohn’s disease (aCD) samples

    • 5 from batch one + 4 from batch three

Reference:

Shajari, E.; Gagné, D.; Malick, M.; Roy, P.; Noël, J.-F.; Gagnon, H.; Brunet, M.A.; Delisle, M.; Boisvert, F.-M.; Beaulieu, J.-F. Application of SWATH Mass Spectrometry and Machine Learning in the Diagnosis of Inflammatory Bowel Disease Based on the Stool Proteome. Biomedicines 2024, 12, 333. https://doi.org/10.3390/biomedicines12020333

Challenge

Challenge

Can you see where to find the publicly available data in the paper?

Publicly available data can be found under the ‘Data Availability Statement’ in the paper.

Loading the data


Let’s start by loading the packages required to access data stored on PRIDE.

PRIDE is part of the ProteomeXchange Consortium, which provides open access to proteomics data.

For more background on navigating PRIDE data, you can look through this course: PRIDE Quick Tour (EMBL-EBI Training)

R

library(rpx)
library(dplyr)

The rpx package gives us programmatic access to the ProteomeXchange without having to leave R.

R

# Set the dataset identifier and load
px_id <- 'PXD047585' 
px <- PXDataset(px_id)

We can then perform several functions to extract the contents of the px object:

R

# Print the dataset title
pxtitle(px)

OUTPUT

[1] "Application of SWATH Mass Spectrometry and Machine Learning in Diagnosis of Inflammatory Bowel Disease Based on Stool Proteome"

R

# Print a link to the dataset on PRIDE
pxurl(px)

OUTPUT

[1] "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2024/02/PXD047585"

R

# Print the taxonomy (organism) information for the dataset
pxtax(px)

OUTPUT

[1] "Homo sapiens (human)"

Exploring dataset files

Let’s see which files are included in this dataset.

R

# Retrieve a list of dataset files
px_files <- pxfiles(px)

R

# Display the first few files
head(px_files)

OUTPUT

[1] "20201016_UdSjfb_20190115_freshstool_003.mzML"
[2] "20201016_UdSjfb_20190115_freshstool_003.wiff"
[3] "20201016_UdSjfb_20190115_freshstool_003.wiff.scan"
[4] "20201016_UdSjfb_20190115_freshstool_005.mzML"
[5] "20201016_UdSjfb_20190115_freshstool_005.wiff"
[6] "20201016_UdSjfb_20190115_freshstool_005.wiff.scan"

PRIDE datasets often include a mix of raw data, search results, processed outputs, and metadata.

To understand the data composition, we’ll examine file extensions.

R

# Count the number of files by extension
table(sapply(strsplit(px_files, "\\."), tail, 1))

OUTPUT


    fas    mzML     pdf    scan speclib     tsv     txt    wiff    xlsx
      1      78       1      78       2      13       1      78       3 

Alternatively, we can get the same result with the tools package:

R

library(tools)
table(file_ext(px_files))

OUTPUT


    fas    mzML     pdf    scan speclib     tsv     txt    wiff    xlsx
      1      78       1      78       2      13       1      78       3 

Each file extension represents a specific proteomics data type. For example, .raw may indicate vendor instrument output and .mzML indicates an open format for MS data.

For more information about different proteomics file formats, check out:

PRIDE File Formats Documentation

Paper on mass spec file formats (PMC3518119)

Challenge

Challenge

Try visiting the dataset page on the PRIDE website and compare:

  • The number of files listed there versus those returned by rpx

  • The types of files (e.g., .raw, .mzML, .txt, etc.)

Questions to consider:

  • Why might the PRIDE web interface and R output differ?

  • What are the advantages of using programmatic access via rpx?

Programmatic access allows you to automate each step of your analysis, making your workflow fully reproducible.

Downloading required files

Good news! We have already downloaded and processed the data for you, as described in the next lesson. You should have downloaded this from the Setup page.

If you wish to download files from another dataset, you can use the curl package as demonstrated below.

R

library(curl)

curl::curl_download(url='https://ftp.pride.ebi.ac.uk/pride/data/archive/2024/02/PXD047585/SampleAnnotation.xlsx',destfile = 'data/SampleAnnotation.xlsx')

Note:

  • We retrieved the url in an earlier step using pxurl().

    • We have replaced ftp:// at the start with https://
    • We have specified the name of the file we wish to download (SampleAnnotation.xlsx), from the list we sourced earlier using pxfiles()
  • We set destfile to our desired name and location for the downloaded file.


Key Points
  • The dataset used in this tutorial includes stool samples from a study investigating non-invasive diagnostic methods for Inflammatory Bowel Disease.
  • We can use the rpx package to download and explore data from PRIDE or the ProteomeXchange Consortium without leaving R.

Content from Processing data with DIA-NN


Last updated on 2025-11-14 | Edit this page

Overview

Questions

  • How can I process my output files from a mass spectrometry-based proteomics experiment?

Objectives

  • Learn how to process proteomics data using DIA-NN.

Processing proteomics data


There are many software options for processing mass spectrometry-based proteomics output. These software match mass spectra against an empirical or predicted spectral library to identify peptides and infer the protein to which they belong.

Some software you may have heard of include Spectronaut, Fragpipe, MaxQuant and DIA-NN. Each software have slightly different functionality and attributes, and are frequently updated. In this tutorial, we are demonstrating the use of DIA-NN, a software designed to process DIA proteomics data.

Proteomics software like DIA-NN match observed spectra against a spectral library to identify peptides and assign them to a protein or protein group. Source: Galaxy Training Network
Proteomics software like DIA-NN match observed spectra against a spectral library to identify peptides and assign them to a protein or protein group. Source: Galaxy Training Network


Why DIA-NN?

Key features of DIA-NN software include:

  • Neural network–based scoring: Improves identification confidence.
  • Interference correction: Enhances quantification accuracy.
  • Library-free analysis: Can work without a spectral library (generating one directly from DIA data).
  • High speed and efficiency: Suitable for large-scale datasets.
  • Cross-run normalisation: Ensures consistent quantification across experiments.
  • Command-line executable: DIA-NN can be run from a graphical user interface (GUI) or directly from the command line, making it easy to integrate into an automated pipeline.

DIA-NN is free to download. Detailed operating instructions and a link to install the latest version of DIA-NN can be found on the GitHub page.

DIA-NN workflow


Callout

Important Note

Due to the amount of time it will take to process the data, you do not need to install or operate DIA-NN for the purposes of this tutorial. Pre-processed data is provided, which has been prepared following the procedures described below.

Part 1: Creating a spectral library

DIA-NN analysis consists of two steps. The first step is to generate a predicted spectral library for your experimental organism.

  1. Download a sequence database from UniProt.

    1. Go to https://www.uniprot.org/uniprotkb?query=*

    2. Select the relevant organism for your sample and Reviewed (Swiss-Prot) status to filter results.

    3. Click Download

    4. Change the format to FASTA (canonical and isoform)

    5. Select No under Compressed and download the FASTA file

Callout

FOR SPECIFIC SEARCHES

If you need to search for a variant protein or peptide that differs from the standard organism proteome, you can manually edit the protein sequences in the UniProt FASTA using a text editor.

  1. Download a contaminant FASTA file. For most experiments, the Universal Contaminants.fasta will be appropriate, but you may prefer to use one of the sample-type specific contaminant FASTAs depending on your experiment. More information about this resource and contaminants will be discussed in the next lesson.

  2. Open the DIA-NN GUI or a command line interface. If there are multiple versions of DIA-NN installed on your device, ensure to update DIA-NN exe with the file path to the version you wish to use.

  3. Add the UniProt and Contaminant FASTA files you just downloaded.

Caution

At the time of writing, DIA-NN 2.2.0 cannot read or write files saved in an external drive. Please ensure all input and output files are saved on a local drive to avoid error.

  1. Examples are included below for the additional settings we recommend changing from the default. Depending on your experiment and research goals, you may wish to change additional settings from the default - read more in the DIA-NN documentation.

  2. Run. DIA-NN will output a predicted spectral library in the .speclib file type.

Example of the DIA-NN GUI used to generate a predicted spectral library. Settings which have been altered from the default are highlighted in red.
Example of the DIA-NN GUI used to generate a predicted spectral library. Settings which have been altered from the default are highlighted in red.

BASH

C:\DIA-NN\2.2.0\diann.exe ` # Replace with file path to DIA-NN on your device
--lib "" `
--out "C:\path\to\output\report.parquet" ` # Replace with desired file path to output report.parquet
--out-lib "C:\path\to\output\report-lib.parquet" ` # Replace with desired file path to output report-lib.parquet
--fasta "C:\path\to\UniprotFasta\.fasta" ` # Replace with file path to Uniprot FASTA
--fasta "C:\path\to\ContaminantsFasta\.fasta" ` # Replace with file path to contaminants FASTA
--threads 30 ` # Replace with the number of threads to use for this analysis
--verbose 1 `
--qvalue 0.01 `
--matrices  `
--gen-spec-lib `
--predictor `
--reannotate `
--fasta-search `
--min-fr-mz 200 `
--max-fr-mz 1800 `
--met-excision `
--min-pep-len 7 `
--max-pep-len 30 `
--min-pr-mz 300 `
--max-pr-mz 1800 `
--min-pr-charge 1 `
--max-pr-charge 4 `
--cut K*,R* `
--missed-cleavages 1 `
--unimod4 `
--rt-profiling 

This spectral library can be re-used for all analyses of samples from the same organism. It is recommended to re-download the UniProt FASTA and re-run this step approximately once a month to ensure your library is up-to-date.


Part 2: Analysis

The second step is to search your raw files (mass spectrometry output files) against the spectral library.

  1. Open the DIA-NN GUI or a command line interface. If there are multiple versions of DIA-NN installed on your device, ensure to update DIA-NN exe with the file path to the version you wish to use.

  2. Upload your raw files.

    • If using the GUI, under Input select Raw then select all of the files you wish to search.
    • If using the command line, you can use --f multiple times to list each file individually, or (recommended) use --dir [folder] to instruct DIA-NN to process all raw files saved in a particular folder.
  3. Upload your spectral library.

Callout

Empirical vs predicted spectral library

You may wish to generate an empirical spectral library from your data, and use this for future analyses. An empirical library will be smaller than a predicted library and therefore increase the speed of analysis; however, may result in fewer identifications. We only recommend using an empirical library if generated from high quality data.

You can generate an empirical library from your samples by selecting Generate spectral library and specifying the location of its output.

  1. Adjust settings according to your experiment and research goals. Again, examples are included below for the additional settings we recommend changing from the default. You may wish to change additional settings from the default - read more in the DIA-NN documentation.

    • In this example, the protease Trypsin was used to generate peptides during sample preparation.
    • Leaving Mass accuracy and MS1 accuracy on the default 0 means DIA-NN will optimise these parameters automatically for the first run in the experiment and then reuse the optimised settings for other runs. The DIA-NN documentation recommends different settings depending on the type of mass spectrometer used to generate your data. In this example, data were generated on a TripleTOF 6600, so we set Mass accuracy to 20.0 and MS1 accuracy to 12.0
    • Set the number of Threads to be no more than the number of logical cores on your computer.
    • In this experiment, we have set the maximum number of variable modifications to 0; however, it is common for this setting to be relaxed in the range of 1 - 3.
    • DIA-NN’s default setting is to activate match-between-runs. From the documentation: In MBR mode, DIA-NN does two passes over the data. During the first pass, it creates an empirical spectral library from the data. During the second pass, it reanalyses the experiment with this empirical library, which may result in much improved identification numbers, data completeness and quantification accuracy.
Example of the DIA-NN GUI used to analyse the data described in this tutorial. Settings which have been altered from the default are highlighted in red.
Example of the DIA-NN GUI used to analyse the data described in this tutorial. Settings which have been altered from the default are highlighted in red.

BASH

C:\DIA-NN\2.2.0\diann.exe ` # Replace with file path to DIA-NN on your device
--dir "C:\path\to\folder\of\raw\files" ` # Replace with file path to folder containing all raw files to be analysed
--lib "C:\path\to\SpectralLibrary\report-lib.predicted.speclib" ` # Replace with file path to spectral library
--out "C:\path\to\output\report.parquet" ` # Replace with desired file path to output report.parquet
--threads 30 ` # Replace with the number of threads to use for this analysis
--verbose 1 `
--qvalue 0.01 `
--matrices  `
--unimod4 `
--mass-acc 20 `
--mass-acc-ms1 12.0 `
--reanalyse `
--rt-profiling 
Challenge

Challenge

If you want to run DIA-NN from the command line but are unsure how to write the code, you can update your settings in the GUI, click ‘Run’, and copy the commands output in the log. That’s how we got the example code above!

Can you match each line of code to the corresponding setting in the GUI?

Check the description of available commands in the DIA-NN documentation

DIA-NN output


DIA-NN output includes several files, including:

Output file Description
log.txt A log of what DIA-NN has run. You should always use ctrl + F to check for any ‘ERROR’ or ‘WARNING’ messages issued by DIA-NN during your run.
.parquet This is what we will be using for subsequent analyses in this tutorial.
pg_matrix.tsv Matrix of protein values generated by DIA-NN.
pr_matrix.tsv Matrix of peptide values generated by DIA-NN.
trends.pdf A PDF report listing the number of peptides, proteins, and other quality control measures detected for each sample.
Challenge

Challenge

If DIA-NN is run through the command line rather than the GUI, the trends.pdf and runs.pdf reports are not automatically generated. Using the DIA-NN documentation, can you work out how to generate this report via the command line?

Run the diann-stats.py Python script provided when you download DIA-NN.

BASH

python C:\DIA-NN\2.2.0\diann-stats.py C:\path\to\report.parquet

We have now processed our raw files, identifying the peptides present in our samples and inferring the proteins to which they belong.

For more background information, see the original publication:
Demichev, V., et al. (2020). DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nature Methods. https://doi.org/10.1038/s41592-019-0638-x.

For further details about the different DIA-NN settings, see the documentation.

If you come across any issues, the creators are fairly active at responding to issues and discussions on their GitHub.

Key Points
  • Raw files from a mass spectrometry-based DIA proteomics experiment can be processed using DIA-NN (peptide identification and protein inference).
  • The DIA-NN workflow has two key steps: generating a predicted spectral library, and processing raw files.
  • The DIA-NN documentation contains detailed information about changing default settings as appropriate for your experiment.

Content from Data cleaning with R


Last updated on 2025-11-14 | Edit this page

Overview

Questions

  • How do we convert our processed proteomics data (DIA-NN output) into an analysis-ready format?

Objectives

  • Understand the steps required to clean proteomics data and generate analysis-ready protein values using limpa.

Load processed data and sample annotation


After processing raw proteomics data, further cleaning and analysis can be conducted using R, Python, or by importing data to other platforms like Perseus or Skyline.

In this tutorial, we will be teaching you one method of cleaning and analysing proteomics data using R. To follow along the remainder of this tutorial, you should open a new R script in RStudio and run the example code provided.

Callout

The limpa package

In this tutorial, we are using the limpa package to load and clean our data. This package was published in 2025 by the Smyth Lab, based at the Walter and Eliza Hall Institute of Medical Research (WEHI).

For background information, see the original publications:

Li, M., Cobbold, S. A., & Smyth, G. K. (2025). Quantification and differential analysis of mass spectrometry proteomics data with probabilistic recovery of information from missing values. bioRxiv, 2025.2004.2028.651125. https://doi.org/10.1101/2025.04.28.651125

Li, M., & Smyth, G. K. (2023). Neither random nor censored: estimating intensity-dependent probabilities for missing values in label-free proteomics. Bioinformatics, 39(5). https://doi.org/10.1093/bioinformatics/btad200

You can also find further information in the vignette or reference manual accessed via the Bioconductor page.

Setup

First, load the required R packages.

R

library(limpa)
library(readxl)
library(pheatmap)
library(EnhancedVolcano)
library(STRINGdb)
library(clusterProfiler)
library(org.Hs.eg.db)
library(arrow)
library(rpx)
library(dplyr)
library(stringr)

Load DIA-NN output

Load the peptide-level data processed by DIA-NN (.parquet file), filtering observations by quality metrics as recommended in the limpa vignette.

The peptide values will automatically be converted to log2 upon import.

R

y.peptide <- readDIANN(file='data/MBIntroToProteomics.parquet',format="parquet",
                       q.columns = c("Q.Value","Lib.Q.Value","Lib.PG.Q.Value"),
                       q.cutoffs = 0.01)

OUTPUT

Length of q-value columns does not match with length of q-value cutoffs. Use q.cutoffs[1] for all columns.
Challenge

Challenge

These quality filtering metrics are recommended in the limpa vignette for data searched in DIA-NN with match-between-runs.

What q.columns are recommended for data searched without MBR?

R

y.peptide <- readDIANN(file='data/MBIntroToProteomics.parquet',format="parquet",
                       q.columns = c("Q.Value","Global.Q.Value","Global.PG.Q.Value"),
                       q.cutoffs = 0.01)

Let’s look at the structure of the data.

R

names(y.peptide)

OUTPUT

[1] "E"     "genes"

R

head(y.peptide$E)
head(y.peptide$genes)

As you can see, our data has been imported as a limma EList object, with components ‘E’ (log2 peptide matrix) and ‘genes’ (feature annotation) containing information about each peptide.


Load sample annotation

We also need to import information about our samples for later analyses.

R

samples_stool <- as.data.frame(readxl::read_excel('data/SampleAnnotation.xlsx'))
head(samples_stool)

OUTPUT

  file_order Sample_id-new Batch Class
1         33       B1_29_c    B1   CDr
2         93       B1_87_e    B1   UCr
3         71       B1_66_e    B1   UCr
4          9       B1_08_b    B1   aCD
5         26       B1_22_c    B1   CDr
6         31       B1_27_c    B1   CDr
                                                                                                                                         file_name
1 D:\\Elmira\\IBD\\01-MSconverter\\00-DIA_NN input-3 batches\\mix.mzMLpeakpicking(batch 1,2)\\Mix\\20201016_UdSjfb_20190115_freshstool_067-29.mzML
2    D:\\Elmira\\IBD\\01-MSconverter\\00-DIA_NN input-3 batches\\mix.mzMLpeakpicking(batch 1,2)\\Mix\\20201016_UdSjfb_20190115_freshstool_187.mzML
3 D:\\Elmira\\IBD\\01-MSconverter\\00-DIA_NN input-3 batches\\mix.mzMLpeakpicking(batch 1,2)\\Mix\\20201016_UdSjfb_20190115_freshstool_143-66.mzML
4  D:\\Elmira\\IBD\\01-MSconverter\\00-DIA_NN input-3 batches\\mix.mzMLpeakpicking(batch 1,2)\\Mix\\20201016_UdSjfb_20190115_freshstool_019-8.mzML
5 D:\\Elmira\\IBD\\01-MSconverter\\00-DIA_NN input-3 batches\\mix.mzMLpeakpicking(batch 1,2)\\Mix\\20201016_UdSjfb_20190115_freshstool_053-22.mzML
6 D:\\Elmira\\IBD\\01-MSconverter\\00-DIA_NN input-3 batches\\mix.mzMLpeakpicking(batch 1,2)\\Mix\\20201016_UdSjfb_20190115_freshstool_063-27.mzML

We need to clean our sample annotation data to add an identifying column which matches the sample column names in our peptide matrix.

Given we are only using a subset of the main dataset in this tutorial, we can also filter the sample annotation dataframe to only include rows pertaining to the samples in our subset.

R

# Create sample_name column that matches peptide matrix column names
samples_stool$fixed_file_name <- gsub("\\\\", "/", samples_stool$file_name)
samples_stool$fixed_file_name <- basename(samples_stool$fixed_file_name)
samples_stool$sample_name <- gsub(".mzML", "",samples_stool$fixed_file_name)

# Filter to only samples present in our dataset
samples_stool <- samples_stool %>% filter(samples_stool$sample_name %in% colnames(y.peptide$E))

We can attach our sample annotation data to the EList object for later analyses.

R

# Ensure sample row order matches column order of the peptide matrix
sample_info <- samples_stool[samples_stool$sample_name %in% colnames(y.peptide$E),]
rownames(sample_info) <- sample_info$sample_name
y.peptide$E <- y.peptide$E[,rownames(sample_info)]

# Add experimental metadata to EList
y.peptide$targets <- sample_info[, c('Class', 'Batch')]

Quality filtering


After reading in the data, it is common to conduct additional filtering to improve the quality of your data and assist with downstream interpretation of results.

  • Filtering out non-proteotypic peptides: Removes peptides that belong to more than one protein.
  • Filtering out compound proteins: Removes peptides belonging to compound protein groups consisting of multiple proteins separated by “;” delimiters.
  • Filtering out singleton peptides: Removes peptides that are the only peptide belonging to that protein (i.e. keeps only proteins with multiple peptides).

R

y.peptide <- filterNonProteotypicPeptides(y.peptide)
y.peptide <- filterCompoundProteins(y.peptide)
y.peptide <- filterSingletonPeptides(y.peptide, min.n.peptides = 2)

These filtering steps are optional; if they are not run, limpa will still be able to conduct protein quantification. For more information about these peptide-level filters and whether they are appropriate for your experiment, please consult the limpa documentation.

Removing contaminants


Contaminant background signals can influence mass spectrometry-based proteomics data. It is essential to remove contaminant proteins from the sample preparation process to avoid confounding results.

Frankenfield et al. (2022) created a new set of sample-type specific and universal contaminant FASTA files which they showed reduced false identifications, increased protein identifications and did not influence protein quantification for DIA workflows. These FASTA files updated upon widely used contaminant protein lists from MaxQuant and the common Repository of Adventitious Proteins (cRAP).

In the previous lesson, we incorporated the universal contaminants FASTA into our spectral library. This FASTA is structured such that all contaminant proteins are named beginning with ‘Cont_’, which means we can simply remove any proteins from our analysis containing this string text in their name.

R

# Remove contaminant proteins from the genes dataframe
y.peptide$genes <- y.peptide$genes %>% filter(!str_detect(Protein.Group, "Cont_"))

# Match peptides in the matrix to only those remaining in the dataframe
y.peptide$E <- y.peptide$E[rownames(y.peptide$genes), ]
Callout

More about contaminants

You can download the contaminant FASTA files from GitHub.

If you are analysing data that has been processed without a contaminant FASTA, you can also download an Excel spreadsheet contaminant list from the same link and use this to identify and remove contaminant proteins from your dataset.

For more information, see the original publication:

Frankenfield, A. M., Ni, J., Ahmed, M., & Hao, L. (2022). Protein Contaminants Matter: Building Universal Protein Contaminant Libraries for DDA and DIA Proteomics. Journal of proteome research, 21(9), 2104–2113. https://doi.org/10.1021/acs.jproteome.2c00145

Protein quantification


Missing data has been a longstanding issue impacting the accuracy and reproducibility of mass spectrometry-based proteomic analyses, complicating downstream analyses. Most established imputation methods are not designed to account for the specific attributes of proteomics data, and there is no consensus in the literature for how missing values should be managed in proteomics experiments. This presents an issue with using the protein values directly generated by platforms like DIA-NN, which contain many missing values.

In this tutorial, we are using the dpcQuant() method of protein quantification from the limpa package, or Linear Models for Proteomics Data (Li, Cobbold & Smyth, 2025). The detection probability curve (DPC) models the relationship between peptide intensity and missingness, showing that almost all missing values in label-free proteomics data are not missing at random and should be accounted for (Li & Smyth, 2023). The dpcQuant() uses Bayesian statistical methods based on peptide intensities and the DPC coefficients for a particular dataset to generate a complete matrix of protein values without the need for imputation. This method outputs a linear model object suitable for downstream analysis with the limma package.

To learn more about the detection probability curve and the limpa package, please read the original publications.

Using limpa and the DPC for protein quantification

First, we will generate and visualise our detection probability curve.

The DPC coefficients are related to the dataset and software used for processing. A steeper slope means there is more statistical information in the pattern of missing data. A slope closer to 0 means there is more data missing at random.

R

# Estimate the DPC for our data
dpcfit <- dpcON(y.peptide, robust=TRUE)

# Print the intercept and slope of our DPC
dpcfit$dpc

OUTPUT

     beta0      beta1
-7.2678650  0.6033267 

R

# Plot the DPC
plotDPC(dpcfit)

Next, we can use the DPC to generate our log2 protein values. This function requires the following input:

  • a limma EList object, or a numeric matrix, of peptide-level log2 expression values, with samples as columns and peptides as rows (y.peptide)
  • vector of protein IDs ("Protein.Names")
  • numeric vector providing the slope and intercept of the DPC (dpcfit)

R

y.protein <- dpcQuant(y.peptide, "Protein.Names", dpc=dpcfit)

OUTPUT

Estimating hyperparameters ...

OUTPUT

Quantifying proteins ...

OUTPUT

Proteins: 280 Peptides: 2672

Note: our dataset is quite small, so this process should only take a few seconds. For larger datasets, it may take minutes or hours for this command to run.

Callout

Alternate versions of the DPC

The base dpc() assumes observed log-intensities are normally distributed.

A modified version, used in this tutorial, dpcON(robust=TRUE) downweighs outliers.

Alternatively, dpcCN() assumes the complete set of log-intensities are normally distributed. Using this method may underestimate the DPC slope if there is a non-small proportion of peptides missing by chance.

You can also choose your own slope by passing the argument dpc.slope= to dpcQuant() instead of dpc=dpcfit. Using this method, an intercept value will automatically be generated. You can check this value by using estimateDPCIntercept(). The authors recommend a slope of 0.6 - 0.8 is usually appropriate for data generated using DIA-NN.

You may wish to conduct further statistical analyses to determine which method is most appropriate for your experiment; however, the authors suggest that downstream analysis should not be significantly affected by differing DPC coefficients.

You can read more about the differences between each version of the DPC in the paper or the limpa manual.

Challenge

Challenge

The limpa package also provides a function to impute missing peptide values and generate a complete peptide matrix using the DPC. Can you search the limpa documentation to identify the command used to impute and quantify missing values in a rowwise manner?

dpcImpute()

dpcQuant() output

We have now generated an EList object for our protein values. Let’s investigate the other components of this object:

R

names(y.protein)

OUTPUT

[1] "E"       "genes"   "other"   "targets" "dpc"    

R

names(y.protein$other)

OUTPUT

[1] "n.observations" "standard.error"

R

names(y.protein$genes)

OUTPUT

[1] "Protein.Group" "Protein.Names" "Genes"         "NPeptides"
[5] "PropObs"      
EList component Description
E A complete matrix of log2 protein values with no missing data, with samples as columns and proteins as rows
genes Feature annotation dataframe, with columns for: UniProt IDs (Protein.Group), proteins (Protein.Names), gene names (Genes), number of peptides used to quantify each protein (NPeptides), and the proportion of possible observations used to generate each protein value (PropObs)
other List of two matrices with columns and rows corresponding to protein matrix E: number of peptides observed for each protein in each sample (n.observations), and the standard error associated with each protein value for each sample (standard.error)
targets Sample annotation dataframe
dpc DPC intercept and slope coefficient used for protein quantification
Challenge

Challenge

Compare the first few rows of the protein (E), n.observations and standard.error matrices. Take a look at the NPeptides and PropObs values for those proteins also. What do you notice about the relationship between these quality metrics?

Proteins with smaller NPeptides and PropObs values, and smaller n.observations for individual samples, have higher standard error.

The more peptide data available for a particular protein in a particular sample, the greater the confidence in the protein value, and therefore the lower the standard error.

If you are interested in a particular protein, it is important to look at these quality metrics for each of your samples.

Dimensionality Reduction and QC


Let’s take a look at our dataset. The plotMDSUsingSEs() function, or Multidimensional Scaling Plot of Gene Expression Profiles, Using Standard Errors, is similar to plotMDS() in the limma package, but takes account of the standard errors generated by dpcQuant(). This is essentially a PCA plot for the newly generated protein values that takes into consideration their standard error.

R

plotMDSUsingSEs(y.protein)

That plot is extremely messy and doesn’t tell us much! Let’s clean it up.

R

# Factorise and set colors for class and batch
Class <- factor(y.peptide$targets$Class)
Class.color <- Class
levels(Class.color) <- hcl.colors(nlevels(Class), palette = "cividis")

Batch <- factor(y.peptide$targets$Batch)
Batch.color <- Batch
levels(Batch.color) <- hcl.colors(nlevels(Batch), palette = "Set 2")

R

# Class visualisation
plotMDSUsingSEs(y.protein, pch=16, col=as.character(Class.color))

R

# Batch visualisation
plotMDSUsingSEs(y.protein, pch=16, col=as.character(Batch.color))

R

# Class (labels) and batch (colors) visualisation
plotMDSUsingSEs(y.protein, labels=Class, col=as.character(Batch.color))

We can already see a distinction between the aCD and Ctrl groups, which will be explored further in the next lesson.

It also looks like there are some differences between the batches. We should add batch as a covariate to our analyses in the next lesson.

Callout

Managing batch effects

Explaining the different methods of identifying, accounting and correcting for batch effects in proteomics data is an entire tutorial itself! For further information, we recommend the Melbourne Integrative Genomics Workshop: Managing batch effects in biological studies.

If you are concerned about batch effects in your data, University of Melbourne staff and students or affiliates can also contact us for a free consultation.

Normalisation

Normalisation is another key element of the proteomics data cleaning workflow. Normalisation techniques aim to remove non-biological variation in the data, for example, variation due to sample loading or batch effects.

Log transformation (which was automatically applied when we imported our peptide data) helps to normalise intensity values. The default RT-dependent normalisation applied during DIA-NN processing is also quite effective and often sufficient for DIA data; however, each experimental environment and techniques will have a different optimal approach. You may wish to apply an additional method of normalisation to your data.

To learn more about normalisation, keep an eye on the Melbourne Bioinformatics Eventbrite page for a normalisation workshop coming in 2026.

To learn more about different normalisation techniques for proteomics data, we recommend the following papers:

Peng, H., Wang, H., Kong, W., Li, J., & Goh, W. W. B. (2024). Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference. Nature communications, 15(1), 3922. https://doi.org/10.1038/s41467-024-47899-w

Tseng, C. Y., Salguero, J. A., Breidenbach, J. D., Solomon, E., Sanders, C. K., Harvey, T., Thornhill, M. G., Palmisano, S. J., Sasiene, Z. J., Blackwell, B. R., McBride, E. M., Luchini, K. A., LeBrun, E. S., Alvarez, M., Mach, P. M., Rivera, E. S., & Glaros, T. G. (2025). Evaluation of normalization strategies for mass spectrometry-based multi-omics datasets. Metabolomics : Official journal of the Metabolomic Society, 21(4), 98. https://doi.org/10.1007/s11306-025-02297-1

Välikangas, T., Suomi, T., & Elo, L. L. (2018). A systematic evaluation of normalization methods in quantitative label-free proteomics. Briefings in bioinformatics, 19(1), 1–11. https://doi.org/10.1093/bib/bbw095

Good work! We now have a clean dataset ready for analysis.


Key Points
  • Key elements of the proteomics data cleaning workflow include: quality filtering, removing contaminants, protein quantification, managing batch effects, and normalisation.
  • We can use the limpa package in R to clean our peptide data and generate a complete, analysis-ready protein matrix with corresponding standard error values.

Content from Differential expression analysis


Last updated on 2025-11-14 | Edit this page

Overview

Questions

  • How can we identify the differentially expressed proteins between our experimental groups?

Objectives

  • Conduct differential expression analysis using limpa
  • Visualise the top differentially expressed proteins

Differential Expression Analysis (DEA)


Now it is time to analyse our data and investigate proteins that are differentially expressed between our two experimental groups (control vs active Crohn’s disease). We will make use of the limma package and closely related limpa functions specifically designed for proteomics data.

First, we will fit a linear model that includes both Class and Batch effects.

R

# Set the control group as our reference
Class <- relevel(Class, ref = "Ctrl")
## We also need to reset our colors to match the correct labels
Class.color <- Class
levels(Class.color) <- hcl.colors(nlevels(Class), palette = "cividis")

# Fit a linear model
design <- model.matrix(~Class + Batch)
design

OUTPUT

   (Intercept) ClassaCD BatchB3
1            1        1       0
2            1        1       0
3            1        0       0
4            1        1       0
5            1        0       0
6            1        0       0
7            1        1       0
8            1        1       0
9            1        0       0
10           1        0       0
11           1        0       0
12           1        1       1
13           1        0       1
14           1        1       1
15           1        1       1
16           1        0       1
17           1        0       1
18           1        1       1
19           1        0       1
20           1        0       1
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$Class
[1] "contr.treatment"

attr(,"contrasts")$Batch
[1] "contr.treatment"

We can see both Class and Batch accounted for in our design matrix.

Now we run the limpa differential expression analysis function dpcDE() to identify the top differentially expressed proteins in our dataset.

R

fit <- dpcDE(y.protein, design, plot = TRUE)

R

fit <- eBayes(fit)

# Save our DEA results as a dataframe
results <- topTable(fit, coef = 2, number = Inf)
head(results, n=10)

OUTPUT

            Protein.Group Protein.Names   Genes NPeptides   PropObs    logFC
PRTN3_HUMAN        P24158   PRTN3_HUMAN   PRTN3        11 0.5318182 4.466373
S10A9_HUMAN        P06702   S10A9_HUMAN  S100A9        23 0.6086957 3.809771
S10A8_HUMAN        P05109   S10A8_HUMAN  S100A8        17 0.6500000 3.811580
TRFL_HUMAN         P02788    TRFL_HUMAN     LTF        46 0.3152174 3.998157
CEAM8_HUMAN        P31997   CEAM8_HUMAN CEACAM8         6 0.3583333 3.130768
SAMP_HUMAN         P02743    SAMP_HUMAN    APCS         8 0.5500000 2.720497
A1AG1_HUMAN        P02763   A1AG1_HUMAN    ORM1        11 0.4454545 3.349033
CAP7_HUMAN         P20160    CAP7_HUMAN    AZU1         6 0.4666667 3.200933
FRIH_HUMAN         P02794    FRIH_HUMAN    FTH1        12 0.5625000 2.528228
RNAS2_HUMAN        P10153   RNAS2_HUMAN  RNASE2         6 0.5666667 2.887339
              AveExpr        t      P.Value    adj.P.Val         B
PRTN3_HUMAN 11.146376 6.909177 8.925046e-08 2.499013e-05 7.9144156
S10A9_HUMAN 12.809535 6.354519 4.259264e-07 5.962969e-05 6.4365522
S10A8_HUMAN 12.865666 6.070836 9.561924e-07 8.924462e-05 5.6692556
TRFL_HUMAN   9.061431 5.755654 2.361294e-06 1.601370e-04 4.8072839
CEAM8_HUMAN 10.078482 5.689096 2.859590e-06 1.601370e-04 4.6265751
SAMP_HUMAN  11.028982 5.317255 8.351615e-06 3.897420e-04 3.6088844
A1AG1_HUMAN 10.284907 5.228167 1.079921e-05 4.319682e-04 3.3650142
CAP7_HUMAN  10.422828 4.613155 6.335616e-05 2.217466e-03 1.6820861
FRIH_HUMAN  11.341144 4.227543 1.897414e-04 5.396242e-03 0.6493737
RNAS2_HUMAN 11.228811 4.222021 1.927229e-04 5.396242e-03 0.6366061
Discussion

A closer look at the results

Our output table is ordered by B statistic. Run View(results) to take a closer look at some of the lower ranked proteins.

Are there any proteins with an absolute fold change > 2 (|logFC| > 1) that are not statistically significant? Can you think of some explanations for this?

Hint: Take note of the PropObs value and remember its relationship to standard error.

Visualise differentially expressed proteins


Expression values and standard error by sample

We can visualise the log2 expression values for individual proteins, together with standard errors.

R

plotProtein(y.protein, "PRTN3_HUMAN", col = as.character(Class.color))
legend('topleft', legend = levels(Class), fill = levels(Class.color))

R

plotProtein(y.protein, "S10A9_HUMAN", col = as.character(Class.color))
legend('topleft', legend = levels(Class), fill = levels(Class.color))
Challenge

Challenge

This simple function plots the protein values from our expression matrix, which we did not directly edit to mitigate batch effects. Try running the limma function to remove batch effects from our dataset and see how this changes your plot.

R

y.protein.rbe <- y.protein
y.protein.rbe$E <- removeBatchEffect(y.protein,batch = y.protein$targets$Batch, group=y.protein$targets$Class)

plotProtein(y.protein.rbe, "S10A9_HUMAN", col = as.character(Class.color))
legend('topleft', legend = levels(Class), fill = levels(Class.color))

Heatmap

We can filter our results to the up to 50 top most significant differentially expressed proteins, then visualise via a clustered heatmap.

R

# Filter for the up to 50 most significant results
sig_proteins <- results %>%
  filter(adj.P.Val < 0.05) %>% # filter by p value
  top_n(50, wt = abs(logFC)) %>% # filter by absolute log fold change
  pull(Protein.Names)
expr_matrix <- y.protein$E[sig_proteins, ]

# Scale the data and visualise via heatmap
scaled_expr <- t(scale(t(expr_matrix)))

pheatmap(scaled_expr,
         cluster_rows = TRUE,
         cluster_cols = TRUE,
         show_rownames = TRUE,
         show_colnames = FALSE,
         annotation_col = y.protein$targets,
         clustering_method = 'ward.D')
Discussion

Challenge

How would you interpret this heatmap?

Volcano plot

We can also visualise our results as a volcano plot using EnhancedVolcano().

R

EnhancedVolcano(results,
                lab = results$Genes,
                x = 'logFC',
                y = 'adj.P.Val',
                pCutoff = 0.05,
                FCcutoff = 1.0,
                pointSize = 2.0,
                labSize = 3.5,
                drawConnectors = TRUE,
                maxoverlapsConnectors = Inf,
                lengthConnectors = unit(0.001, 'npc'),
                typeConnectors ='closed',
                ylim = c(0, 5),
                xlim = c(-5, 5))
Key Points
  • We can fit a linear model to our data and conduct differential expression analysis using limpa and limma functions.
  • There are many different ways to visualise differentially expressed proteins.

Content from Network and Enrichment Analysis


Last updated on 2025-11-14 | Edit this page

Overview

Questions

  • How can DEA results be used to identify biologically meaningful patterns?
  • How can we visualise and interpret protein–protein interaction (PPI) networks and pathway enrichment results?

Objectives

  • Visualise significant proteins using network analysis (STRING)
  • Identify enriched biological processes and pathways (GO and KEGG)

Now that we have a list of differentially expressed (DE) proteins, we can explore how these proteins interact with each other and what biological functions or pathways they are involved in.

This step helps translate statistical significance into biological meaning by integrating our results with known protein–protein interactions (PPIs) and functional annotation databases.

STRING protein–protein interaction network analysis


The STRING database integrates known and predicted PPIs from multiple sources (e.g. experimental data, text mining, prediction methods, and curated databases).

We can visualise DE proteins within the context of their interaction network to identify clusters or modules potentially corresponding to biological pathways.

R

# Create an instance of the STRINGdb reference class for humans (species 9606)
string_db <- STRINGdb$new(version = "11.5", species = 9606, score_threshold = 400)

# Map STRING identifiers to gene identifiers in our DEA results dataframe
mapped <- string_db$map(results, "Protein.Names", removeUnmappedRows = TRUE)

# Only map the proteins of significance
de_proteins_mapped <- mapped %>% filter(adj.P.Val < 0.05)

# Plot network of significant proteins
string_db$plot_network(de_proteins_mapped$STRING_id)

Note: if the above code takes too long to load, you can download an RData that contains the string_db and mapped functions here.

Discussion

How many clusters are there?

How many of these interactions are predicted versus experimentally validated? (See image below for legend)

You can explore how these proteins interact in an interactive setting by going to https://string-db.org/ and further exploring your proteins of interest.

Network plot for PRTN3 from string-db.org
Network plot for PRTN3 from string-db.org
Callout

More data exploration using STRINGdb

There are many ways of exploring your data using the STRING database - too many to cover in one tutorial!

You can learn more by reading the vignette or inspecting available functions within the STRINGdb package by running:

R

STRINGdb$methods()

OUTPUT

 [1] ".objectPackage"                      ".objectParent"
 [3] "add_diff_exp_color"                  "add_proteins_description"
 [5] "benchmark_ppi"                       "benchmark_ppi_pathway_view"
 [7] "callSuper"                           "copy"
 [9] "enrichment_heatmap"                  "export"
[11] "field"                               "get_aliases"
[13] "get_annotations"                     "get_bioc_graph"
[15] "get_clusters"                        "get_enrichment"
[17] "get_graph"                           "get_homologs"
[19] "get_homologs_besthits"               "get_homology_graph"
[21] "get_interactions"                    "get_link"
[23] "get_neighbors"                       "get_paralogs"
[25] "get_pathways_benchmarking_blackList" "get_png"
[27] "get_ppi_enrichment"                  "get_ppi_enrichment_full"
[29] "get_proteins"                        "get_pubmed"
[31] "get_pubmed_interaction"              "get_subnetwork"
[33] "get_summary"                         "get_term_proteins"
[35] "getClass"                            "getRefClass"
[37] "import"                              "initFields"
[39] "initialize"                          "load"
[41] "load_all"                            "map"
[43] "mp"                                  "plot_network"
[45] "plot_ppi_enrichment"                 "post_payload"
[47] "ppi_enrichment"                      "remove_homologous_interactions"
[49] "set_background"                      "show"
[51] "show#envRefClass"                    "trace"
[53] "untrace"                             "usingMethods"                       

Szklarczyk, D., Nastou, K., Koutrouli, M., Kirsch, R., Mehryary, F., Hachilif, R., Hu, D., Peluso, M. E., Huang, Q., Fang, T., Doncheva, N. T., Pyysalo, S., Bork, P., Jensen, L. J., & von Mering, C. (2025). The STRING database in 2025: protein networks with directionality of regulation. Nucleic acids research, 53(D1), D730–D737. https://doi.org/10.1093/nar/gkae1113

Functional enrichment analysis (GO and KEGG)


Using the significant DE proteins, we can perform functional enrichment analysis to determine whether certain biological processes (BP), cellular components (CC), or molecular functions (MF) are over-represented.

Two common types of enrichment exploration are:

  • Gene Ontology (GO): describes BPs, CCs, amd MFs
  • KEGG pathways: curated molecular interactions


GO enrichment analysis

R

#Subset genes from results table
de_proteins <- results %>% filter(adj.P.Val < 0.05)

# Convert UniProt IDs to Entrez IDs for enrichment analysis
converted <- bitr(de_proteins$Protein.Group,
                  fromType = "UNIPROT",
                  toType = "ENTREZID",
                  OrgDb = org.Hs.eg.db)

OUTPUT

'select()' returned 1:1 mapping between keys and columns

WARNING

Warning in bitr(de_proteins$Protein.Group, fromType = "UNIPROT", toType =
"ENTREZID", : 7.69% of input gene IDs are fail to map...

R

# GO Biological Process enrichment
ego <- enrichGO(gene = converted$ENTREZID,
                OrgDb = org.Hs.eg.db,
                ont = "ALL", # looking at all 3 subontologies
                pAdjustMethod = "BH",
                pvalueCutoff = 0.05,
                readable = TRUE)

# Show plot
dotplot(ego, showCategory = 10)

KEGG pathway enrichment analysis

R

# KEGG pathway enrichment
ekegg <- enrichKEGG(gene = converted$ENTREZID,
                    organism = 'hsa',
                    pvalueCutoff = 0.05)

OUTPUT

Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...

OUTPUT

Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...

R

dotplot(ekegg, showCategory = 10)
Challenge

Challenge

How do these figures compare to Figure 4 in the original paper?

While there are many similar pathways, the exact pathways are not identical. This is likely due to the fact we are using a subset of the larger dataset in this tutorial, so we have a slightly different list of DE proteins.

Key Points
  • Differential expression analysis identifies statistically significant changes in protein abundance between conditions, but does not necessarily tell us the biological significance of such identified proteins.
  • Network analysis (STRING) can reveal physical or functional interactions among DE proteins.
  • Enrichment analysis (GO/KEGG) can link those proteins to known biological processes or pathways.
  • Using multiple approaches can help you interpret proteomics data to measure reliability and consistency of results.