Data cleaning with R

Last updated on 2025-11-14 | Edit this page

Overview

Questions

How do we convert our processed proteomics data (DIA-NN output) into an analysis-ready format?

Objectives

Understand the steps required to clean proteomics data and generate analysis-ready protein values using limpa.

Load processed data and sample annotation

After processing raw proteomics data, further cleaning and analysis can be conducted using R, Python, or by importing data to other platforms like Perseus or Skyline.

In this tutorial, we will be teaching you one method of cleaning and analysing proteomics data using R. To follow along the remainder of this tutorial, you should open a new R script in RStudio and run the example code provided.

Callout

The limpa package

In this tutorial, we are using the limpa package to load and clean our data. This package was published in 2025 by the Smyth Lab, based at the Walter and Eliza Hall Institute of Medical Research (WEHI).

For background information, see the original publications:

Li, M., Cobbold, S. A., & Smyth, G. K. (2025). Quantification and differential analysis of mass spectrometry proteomics data with probabilistic recovery of information from missing values. bioRxiv, 2025.2004.2028.651125. https://doi.org/10.1101/2025.04.28.651125

Li, M., & Smyth, G. K. (2023). Neither random nor censored: estimating intensity-dependent probabilities for missing values in label-free proteomics. Bioinformatics, 39(5). https://doi.org/10.1093/bioinformatics/btad200

You can also find further information in the vignette or reference manual accessed via the Bioconductor page.

Setup

First, load the required R packages.

R

library(limpa)
library(readxl)
library(pheatmap)
library(EnhancedVolcano)
library(STRINGdb)
library(clusterProfiler)
library(org.Hs.eg.db)
library(arrow)
library(rpx)
library(dplyr)
library(stringr)

Load DIA-NN output

Load the peptide-level data processed by DIA-NN (.parquet file), filtering observations by quality metrics as recommended in the limpa vignette.

The peptide values will automatically be converted to log2 upon import.

R

y.peptide <- readDIANN(file='data/MBIntroToProteomics.parquet',format="parquet",
                       q.columns = c("Q.Value","Lib.Q.Value","Lib.PG.Q.Value"),
                       q.cutoffs = 0.01)

OUTPUT

Length of q-value columns does not match with length of q-value cutoffs. Use q.cutoffs[1] for all columns.

Challenge

These quality filtering metrics are recommended in the limpa vignette for data searched in DIA-NN with match-between-runs.

What q.columns are recommended for data searched without MBR?

Show me the solution

R

y.peptide <- readDIANN(file='data/MBIntroToProteomics.parquet',format="parquet",
                       q.columns = c("Q.Value","Global.Q.Value","Global.PG.Q.Value"),
                       q.cutoffs = 0.01)

Let’s look at the structure of the data.

R

names(y.peptide)

OUTPUT

[1] "E"     "genes"

R

head(y.peptide$E)
head(y.peptide$genes)

As you can see, our data has been imported as a limma EList object, with components ‘E’ (log2 peptide matrix) and ‘genes’ (feature annotation) containing information about each peptide.

Load sample annotation

We also need to import information about our samples for later analyses.

R

samples_stool <- as.data.frame(readxl::read_excel('data/SampleAnnotation.xlsx'))
head(samples_stool)

OUTPUT

  file_order Sample_id-new Batch Class
1         33       B1_29_c    B1   CDr
2         93       B1_87_e    B1   UCr
3         71       B1_66_e    B1   UCr
4          9       B1_08_b    B1   aCD
5         26       B1_22_c    B1   CDr
6         31       B1_27_c    B1   CDr
                                                                                                                                         file_name
1 D:\\Elmira\\IBD\\01-MSconverter\\00-DIA_NN input-3 batches\\mix.mzMLpeakpicking(batch 1,2)\\Mix\\20201016_UdSjfb_20190115_freshstool_067-29.mzML
2    D:\\Elmira\\IBD\\01-MSconverter\\00-DIA_NN input-3 batches\\mix.mzMLpeakpicking(batch 1,2)\\Mix\\20201016_UdSjfb_20190115_freshstool_187.mzML
3 D:\\Elmira\\IBD\\01-MSconverter\\00-DIA_NN input-3 batches\\mix.mzMLpeakpicking(batch 1,2)\\Mix\\20201016_UdSjfb_20190115_freshstool_143-66.mzML
4  D:\\Elmira\\IBD\\01-MSconverter\\00-DIA_NN input-3 batches\\mix.mzMLpeakpicking(batch 1,2)\\Mix\\20201016_UdSjfb_20190115_freshstool_019-8.mzML
5 D:\\Elmira\\IBD\\01-MSconverter\\00-DIA_NN input-3 batches\\mix.mzMLpeakpicking(batch 1,2)\\Mix\\20201016_UdSjfb_20190115_freshstool_053-22.mzML
6 D:\\Elmira\\IBD\\01-MSconverter\\00-DIA_NN input-3 batches\\mix.mzMLpeakpicking(batch 1,2)\\Mix\\20201016_UdSjfb_20190115_freshstool_063-27.mzML

We need to clean our sample annotation data to add an identifying column which matches the sample column names in our peptide matrix.

Given we are only using a subset of the main dataset in this tutorial, we can also filter the sample annotation dataframe to only include rows pertaining to the samples in our subset.

R

# Create sample_name column that matches peptide matrix column names
samples_stool$fixed_file_name <- gsub("\\\\", "/", samples_stool$file_name)
samples_stool$fixed_file_name <- basename(samples_stool$fixed_file_name)
samples_stool$sample_name <- gsub(".mzML", "",samples_stool$fixed_file_name)

# Filter to only samples present in our dataset
samples_stool <- samples_stool %>% filter(samples_stool$sample_name %in% colnames(y.peptide$E))

We can attach our sample annotation data to the EList object for later analyses.

R

# Ensure sample row order matches column order of the peptide matrix
sample_info <- samples_stool[samples_stool$sample_name %in% colnames(y.peptide$E),]
rownames(sample_info) <- sample_info$sample_name
y.peptide$E <- y.peptide$E[,rownames(sample_info)]

# Add experimental metadata to EList
y.peptide$targets <- sample_info[, c('Class', 'Batch')]

Quality filtering

After reading in the data, it is common to conduct additional filtering to improve the quality of your data and assist with downstream interpretation of results.

Filtering out non-proteotypic peptides: Removes peptides that belong to more than one protein.
Filtering out compound proteins: Removes peptides belonging to compound protein groups consisting of multiple proteins separated by “;” delimiters.
Filtering out singleton peptides: Removes peptides that are the only peptide belonging to that protein (i.e. keeps only proteins with multiple peptides).

R

y.peptide <- filterNonProteotypicPeptides(y.peptide)
y.peptide <- filterCompoundProteins(y.peptide)
y.peptide <- filterSingletonPeptides(y.peptide, min.n.peptides = 2)

These filtering steps are optional; if they are not run, limpa will still be able to conduct protein quantification. For more information about these peptide-level filters and whether they are appropriate for your experiment, please consult the limpa documentation.

Removing contaminants

Contaminant background signals can influence mass spectrometry-based proteomics data. It is essential to remove contaminant proteins from the sample preparation process to avoid confounding results.

Frankenfield et al. (2022) created a new set of sample-type specific and universal contaminant FASTA files which they showed reduced false identifications, increased protein identifications and did not influence protein quantification for DIA workflows. These FASTA files updated upon widely used contaminant protein lists from MaxQuant and the common Repository of Adventitious Proteins (cRAP).

In the previous lesson, we incorporated the universal contaminants FASTA into our spectral library. This FASTA is structured such that all contaminant proteins are named beginning with ‘Cont_’, which means we can simply remove any proteins from our analysis containing this string text in their name.

R

# Remove contaminant proteins from the genes dataframe
y.peptide$genes <- y.peptide$genes %>% filter(!str_detect(Protein.Group, "Cont_"))

# Match peptides in the matrix to only those remaining in the dataframe
y.peptide$E <- y.peptide$E[rownames(y.peptide$genes), ]

Callout

More about contaminants

You can download the contaminant FASTA files from GitHub.

If you are analysing data that has been processed without a contaminant FASTA, you can also download an Excel spreadsheet contaminant list from the same link and use this to identify and remove contaminant proteins from your dataset.

For more information, see the original publication:

Frankenfield, A. M., Ni, J., Ahmed, M., & Hao, L. (2022). Protein Contaminants Matter: Building Universal Protein Contaminant Libraries for DDA and DIA Proteomics. Journal of proteome research, 21(9), 2104–2113. https://doi.org/10.1021/acs.jproteome.2c00145

Protein quantification

Missing data has been a longstanding issue impacting the accuracy and reproducibility of mass spectrometry-based proteomic analyses, complicating downstream analyses. Most established imputation methods are not designed to account for the specific attributes of proteomics data, and there is no consensus in the literature for how missing values should be managed in proteomics experiments. This presents an issue with using the protein values directly generated by platforms like DIA-NN, which contain many missing values.

In this tutorial, we are using the dpcQuant() method of protein quantification from the limpa package, or Linear Models for Proteomics Data (Li, Cobbold & Smyth, 2025). The detection probability curve (DPC) models the relationship between peptide intensity and missingness, showing that almost all missing values in label-free proteomics data are not missing at random and should be accounted for (Li & Smyth, 2023). The dpcQuant() uses Bayesian statistical methods based on peptide intensities and the DPC coefficients for a particular dataset to generate a complete matrix of protein values without the need for imputation. This method outputs a linear model object suitable for downstream analysis with the limma package.

To learn more about the detection probability curve and the limpa package, please read the original publications.

Using limpa and the DPC for protein quantification

First, we will generate and visualise our detection probability curve.

The DPC coefficients are related to the dataset and software used for processing. A steeper slope means there is more statistical information in the pattern of missing data. A slope closer to 0 means there is more data missing at random.

R

# Estimate the DPC for our data
dpcfit <- dpcON(y.peptide, robust=TRUE)

# Print the intercept and slope of our DPC
dpcfit$dpc

OUTPUT

     beta0      beta1
-7.2678650  0.6033267

R

# Plot the DPC
plotDPC(dpcfit)

Next, we can use the DPC to generate our log2 protein values. This function requires the following input:

a limma EList object, or a numeric matrix, of peptide-level log2 expression values, with samples as columns and peptides as rows (y.peptide)
vector of protein IDs ("Protein.Names")
numeric vector providing the slope and intercept of the DPC (dpcfit)

R

y.protein <- dpcQuant(y.peptide, "Protein.Names", dpc=dpcfit)

OUTPUT

Estimating hyperparameters ...

OUTPUT

Quantifying proteins ...

OUTPUT

Proteins: 280 Peptides: 2672

Note: our dataset is quite small, so this process should only take a few seconds. For larger datasets, it may take minutes or hours for this command to run.

Callout

Alternate versions of the DPC

The base dpc() assumes observed log-intensities are normally distributed.

A modified version, used in this tutorial, dpcON(robust=TRUE) downweighs outliers.

Alternatively, dpcCN() assumes the complete set of log-intensities are normally distributed. Using this method may underestimate the DPC slope if there is a non-small proportion of peptides missing by chance.

You can also choose your own slope by passing the argument dpc.slope= to dpcQuant() instead of dpc=dpcfit. Using this method, an intercept value will automatically be generated. You can check this value by using estimateDPCIntercept(). The authors recommend a slope of 0.6 - 0.8 is usually appropriate for data generated using DIA-NN.

You may wish to conduct further statistical analyses to determine which method is most appropriate for your experiment; however, the authors suggest that downstream analysis should not be significantly affected by differing DPC coefficients.

You can read more about the differences between each version of the DPC in the paper or the limpa manual.

Challenge

The limpa package also provides a function to impute missing peptide values and generate a complete peptide matrix using the DPC. Can you search the limpa documentation to identify the command used to impute and quantify missing values in a rowwise manner?

Show me the solution

dpcImpute()

dpcQuant() output

We have now generated an EList object for our protein values. Let’s investigate the other components of this object:

R

names(y.protein)

OUTPUT

[1] "E"       "genes"   "other"   "targets" "dpc"

R

names(y.protein$other)

OUTPUT

[1] "n.observations" "standard.error"

R

names(y.protein$genes)

OUTPUT

[1] "Protein.Group" "Protein.Names" "Genes"         "NPeptides"
[5] "PropObs"

EList component	Description
E	A complete matrix of log2 protein values with no missing data, with samples as columns and proteins as rows
genes	Feature annotation dataframe, with columns for: UniProt IDs (Protein.Group), proteins (Protein.Names), gene names (Genes), number of peptides used to quantify each protein (NPeptides), and the proportion of possible observations used to generate each protein value (PropObs)
other	List of two matrices with columns and rows corresponding to protein matrix E: number of peptides observed for each protein in each sample (n.observations), and the standard error associated with each protein value for each sample (standard.error)
targets	Sample annotation dataframe
dpc	DPC intercept and slope coefficient used for protein quantification

Challenge

Compare the first few rows of the protein (E), n.observations and standard.error matrices. Take a look at the NPeptides and PropObs values for those proteins also. What do you notice about the relationship between these quality metrics?

Show me the solution

Proteins with smaller NPeptides and PropObs values, and smaller n.observations for individual samples, have higher standard error.

The more peptide data available for a particular protein in a particular sample, the greater the confidence in the protein value, and therefore the lower the standard error.

If you are interested in a particular protein, it is important to look at these quality metrics for each of your samples.

Dimensionality Reduction and QC

Let’s take a look at our dataset. The plotMDSUsingSEs() function, or Multidimensional Scaling Plot of Gene Expression Profiles, Using Standard Errors, is similar to plotMDS() in the limma package, but takes account of the standard errors generated by dpcQuant(). This is essentially a PCA plot for the newly generated protein values that takes into consideration their standard error.

R

plotMDSUsingSEs(y.protein)

That plot is extremely messy and doesn’t tell us much! Let’s clean it up.

R

# Factorise and set colors for class and batch
Class <- factor(y.peptide$targets$Class)
Class.color <- Class
levels(Class.color) <- hcl.colors(nlevels(Class), palette = "cividis")

Batch <- factor(y.peptide$targets$Batch)
Batch.color <- Batch
levels(Batch.color) <- hcl.colors(nlevels(Batch), palette = "Set 2")

R

# Class visualisation
plotMDSUsingSEs(y.protein, pch=16, col=as.character(Class.color))

R

# Batch visualisation
plotMDSUsingSEs(y.protein, pch=16, col=as.character(Batch.color))

R

# Class (labels) and batch (colors) visualisation
plotMDSUsingSEs(y.protein, labels=Class, col=as.character(Batch.color))

We can already see a distinction between the aCD and Ctrl groups, which will be explored further in the next lesson.

It also looks like there are some differences between the batches. We should add batch as a covariate to our analyses in the next lesson.

Callout

Managing batch effects

Explaining the different methods of identifying, accounting and correcting for batch effects in proteomics data is an entire tutorial itself! For further information, we recommend the Melbourne Integrative Genomics Workshop: Managing batch effects in biological studies.

If you are concerned about batch effects in your data, University of Melbourne staff and students or affiliates can also contact us for a free consultation.

Normalisation

Normalisation is another key element of the proteomics data cleaning workflow. Normalisation techniques aim to remove non-biological variation in the data, for example, variation due to sample loading or batch effects.

Log transformation (which was automatically applied when we imported our peptide data) helps to normalise intensity values. The default RT-dependent normalisation applied during DIA-NN processing is also quite effective and often sufficient for DIA data; however, each experimental environment and techniques will have a different optimal approach. You may wish to apply an additional method of normalisation to your data.

To learn more about normalisation, keep an eye on the Melbourne Bioinformatics Eventbrite page for a normalisation workshop coming in 2026.

More about normalisation

To learn more about different normalisation techniques for proteomics data, we recommend the following papers:

Peng, H., Wang, H., Kong, W., Li, J., & Goh, W. W. B. (2024). Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference. Nature communications, 15(1), 3922. https://doi.org/10.1038/s41467-024-47899-w

Tseng, C. Y., Salguero, J. A., Breidenbach, J. D., Solomon, E., Sanders, C. K., Harvey, T., Thornhill, M. G., Palmisano, S. J., Sasiene, Z. J., Blackwell, B. R., McBride, E. M., Luchini, K. A., LeBrun, E. S., Alvarez, M., Mach, P. M., Rivera, E. S., & Glaros, T. G. (2025). Evaluation of normalization strategies for mass spectrometry-based multi-omics datasets. Metabolomics : Official journal of the Metabolomic Society, 21(4), 98. https://doi.org/10.1007/s11306-025-02297-1

Välikangas, T., Suomi, T., & Elo, L. L. (2018). A systematic evaluation of normalization methods in quantitative label-free proteomics. Briefings in bioinformatics, 19(1), 1–11. https://doi.org/10.1093/bib/bbw095

Good work! We now have a clean dataset ready for analysis.

Key Points

Key elements of the proteomics data cleaning workflow include: quality filtering, removing contaminants, protein quantification, managing batch effects, and normalisation.
We can use the limpa package in R to clean our peptide data and generate a complete, analysis-ready protein matrix with corresponding standard error values.