Algorithms - Alignment

MST Alignment Module

MST Alignment

class msproteomicstoolslib.algorithms.alignment.AlignmentMST.TreeConsensusAlignment(max_rt_diff, fdr_cutoff, aligned_fdr_cutoff, rt_diff_isotope=-1, correctRT_using_pg=False, stdev_max_rt_per_run=None, use_local_stdev=False, verbose=False)

Multiple run alignment using a minimum spanning tree (MST).

This class will align features across multiple runs using a strictly local approach. It uses a minimum spanning tree (MST) as input which is expected to allow traversal of all runs, only connecting the most similar runs. This should allow accurate local alignment between two very similar runs and transfer of identification with high accuracy. Specifically, this should for scalability to a large number of dissimilar runs where approaches that rely on a single reference run for alignment might give less accurate results.

Briefly, the algorithm will choose the best scoring peakgroup as a seed and start to traverse the MST from this seed. At each node, it will add the best matching peakgroup (by score, within a specified retention time window) to the result. After traversing all nodes, a new seed can be chosen among the peakgroups not yet belonging to a cluster and the process can be repeated to produce multiple clusters.

For example, consider a case of 5 LC-MS/MS runs where 6 different feature (peakgroups) were found in each run (not all peakgroups were found in all runs):

Run 1:
---------- pg1_1 ------- pg1_2 --- pg1_3 - pg1_4 --------- pg1_5 ------- pg1_6

Run 2:
----------- pg2_1 ------- pg2_2 --- pg2_3 - pg2_4 --------- pg2_5 ------ pg2_6

Run 3:
----- pg3_0 ---------- pg3_1 ---- pg3_2 - pg3_3 - pg3_4 ----- pg3_5 ---- pg3_6

Run 4:
----------- pg4_0 ---------- pg4_1 ------- pg4_2 --- pg4_3 - pg4_4 --------- pg4_5

Run 5:
-- pg5_0 ---------- pg5_1 ------- pg5_2 --- pg5_3 - pg5_4 --------- pg5_5

Assume that the corresponding MST looks like this:

                       /-- Run4
Run1 -- Run2 -- Run3 --
                       \-- Run5

This is a case where Run1 and Run2 are very similar and Run3 and Run4 are rather similar and should be easy to align. The algorithm will start with the “best” peakgroup overall (having the best probability score), assume this peakgroups is pg1_1 from Run 1. The algorithm will then use the alignment Run1-Run2 to infer that pg2_1 is the same signal as pg1_1 and add it to the group. Specifically, it will select the highest-scoring peakgroup within a narrow RT-window (max_rt_diff) in Run2 - note that if the RT-window is too wide, there is a certain chance of mis-matching, e.g. pg2_2 will be selected instead of pg2_1. The alignment Run2-Run3 will be used to add pg3_1, assuming that the RT transformation and the scoring both indicate that it should be selected. Note how a null RT transformation will actually prefer pg3_0 as there is a RT shift between Run 2 and 3. So a too narrow RT-window (max_rt_diff) may lead to the wrong peak group being selected. This can be rectified with a larger RT tolerance or by using a suitable RT transformation (linear, lowess etc) which will lead to the selection of the correct peak group pg3_1.

Then a bifurcation in the tree occurs and Run3-Run4 as well as Run3-Run5 will be used to infer the identity of pg4_1 and pg5_1 and add them to the cluster. In the end, the algorithm will report (pg1_1, pg2_1, pg3_1, pg4_1, pg5_1) as a consistent cluster across multiple runs. This process can be repeated with the next best peakgroup that is not yet part of a cluster (e.g. pg1_2) until no more peakgroups are left (no more peakgroups having a score below fdr_cutoff).

Note how the algorithm only used binary alignments and purely local alignments of the runs that are most close to each other. This stands in contrast to approaches where a single reference is picked and then used for alignment which might align runs that are substantially different. On the other hand, a single error at one edge in the tree will propagate itself and could lead to whole subtrees that are wrongly aligned (e.g. if at one point pg3_0 got selected instead of pg3_1 it is likely that in the following steps, pg4_0 and not pg4_1 will be selected).

__init__(max_rt_diff, fdr_cutoff, aligned_fdr_cutoff, rt_diff_isotope=-1, correctRT_using_pg=False, stdev_max_rt_per_run=None, use_local_stdev=False, verbose=False)

Initialization with parameters

Parameters:
  • max_rt_diff (float) – maximal difference in retention time to be used to look for a matching peakgroup in an adjacent run
  • fdr_cutoff (float) – maximal FDR that at least one peakgroup needs to reach (seed FDR)
  • aligned_fdr_cutoff (float) – maximal FDR that a peakgroup needs to reach to be considered for extension (extension FDR)
  • rt_diff_isotope (float) – maximal difference in retention time between two isotopic pairs or precursors with the same charge state
  • correctRT_using_pg (bool) – use the apex of the aligned peak group as the input for the next alignment during MST traversal (opposed to using the transformed RT plain)
  • stdev_max_rt_per_run (float) – use a different maximal RT tolerance for each alignment, depending on the goodness of the alignment. The RT tolerance used by the algorithm will be the standard deviation times stdev_max_rt_per_run.
  • use_local_stdev (float) – use RT-local standard deviation (experimental)
__module__ = 'msproteomicstoolslib.algorithms.alignment.AlignmentMST'
alignAllCluster(multipeptides, tree, tr_data)

Use the MST to report all cluster.

Briefly, the algorithm will choose the best scoring peakgroup as a seed and start to traverse the MST from this seed. At each node, it will add the best matching peakgroup (by score, within a specified retention time window) to the result. After traversing all nodes, a new seed can be chosen among the peakgroups not yet belonging to a cluster and the process can be repeated to produce multiple clusters. It will add clusters until no more peptides with an fdr score better than self._fdr_cutoff are left.

Parameters:
  • multipeptides (list(Multipeptide) – a list of multipeptides on which the alignment should be performed. After alignment, each peakgroup that should be quantified can be retrieved by calling get_selected_peakgroups() on the multipeptide.
  • tree (list(tuple) – a minimum spanning tree (MST) represented as list of edges (for example [(‘0’, ‘1’), (‘1’, ‘2’)] ). Node names need to correspond to run ids.
  • tr_data (LightTransformationData) – structure to hold binary transformations between two different retention time spaces
Returns:

None

alignBestCluster(multipeptides, tree, tr_data)

Use the MST to report the first cluster containing the best peptide (overall).

The algorithm will go through all multipeptides and mark those peakgroups which it deems to belong to the best peakgroup cluster (only the first cluster will be reported).

Parameters:
  • multipeptides (list of Multipeptide) – a list of multipeptides on which the alignment should be performed. After alignment, each peakgroup that should be quantified can be retrieved by calling get_selected_peakgroups() on the multipeptide.
  • tree (list of tuple) – a minimum spanning tree (MST) represented as list of edges (for example [(‘0’, ‘1’), (‘1’, ‘2’)] ). Node names need to correspond to run ids.
  • tr_data (format.TransformationCollection.LightTransformationData) – structure to hold binary transformations between two different retention time spaces
Returns:

None

alignBestCluster_legacy(multipeptides, tree, tr_data)
msproteomicstoolslib.algorithms.alignment.AlignmentMST.getDistanceMatrix(exp, multipeptides, spl_aligner, singleRowId=None)

Compute distance matrix of all runs.

Computes a n x n distance matrix between all runs of an experiment. The reported distance is 1 minus the Rsquared value (1-R^2) from the linear regression.

Parameters:
  • exp (MRExperiment) – a collection of runs
  • multipeptides (list(Multipeptide) – a list of multipeptides containing the matching of precursors across runs.
  • initial_alignment_cutoff (float) – a filtering cutoff (in q-value) to specify which points should be used for the calculation of the distance. In general, only identification which are very certain should be used for this and a q-value of 0.0001 is recommended – given that there are enough points.
Returns:

numpy – distance matrix

Return type:

n x n

Simple Alignment Algorithm

class msproteomicstoolslib.algorithms.alignment.AlignmentAlgorithm.Cluster(peakgroups)

A helper class representation of a cluster (used in AlignmentAlgorithm)

getMedianRT()

Calculate the median retention time of a cluster

getRTstd()

Calculate the standard deviation of the retention times

getTotalScore()

Calculate the total score of a cluster (multiplication of probabilities)

select_one_per_run(verbose=False)

Ensure that only one peakgroup is selected per run.

If there are multiple peakgroups selected, only the best one is retained.

class msproteomicstoolslib.algorithms.alignment.AlignmentAlgorithm.AlignmentAlgorithm

A class of alignment algorithms

align_features(multipeptides, rt_diff_cutoff, fdr_cutoff, aligned_fdr_cutoff, method='best_overall')

Perform the alignment on a set of multipeptides

Parameters:
  • multipeptides (list(Multipeptide) – a list of multipeptides on which the alignment should be performed. After alignment, each peakgroup that should be quantified can be retrieved by calling get_selected_peakgroups() on the multipeptide.
  • rt_diff_cutoff (float) – maximal allowed RT difference used in the clustering or the
  • fdr_cutoff (float) – FDR cutoff for “seeds” (e.g. the “known good” cutoff, for example 0.01)
  • aligned_fdr_cutoff (float) – maximal FDR cutoff to still be considered for the clustering
  • method (String) – either best_overall or best_cluster_score (or global_best_cluster_score, global_best_overall)
Returns:

Alignment object with alignment statistics

SplineAligner

class msproteomicstoolslib.algorithms.alignment.SplineAligner.SplineAligner(alignment_fdr_threshold=0.0001, smoother='lowess', external_r_tmpdir=None, maxdata=-1, experiment=None)

Use the datasmoothing part of msproteomicstoolslib to align two runs in retention times using splines.

>>> spl_aligner = SplineAligner()
>>> transformations = spl_aligner.rt_align_all_runs(this_exp, multipeptides, options.alignment_score, options.use_scikit)
getTransformationError()

Get the error of the transformation

Returns:transformation_error – the error of the transformation
Return type:TransformationError
rt_align_all_runs(experiment, multipeptides)

Align all runs contained in an MRExperiment

Parameters:
  • experiment (MRExperiment) – a collection of runs
  • multipeptides (list(multipeptides) – a list of Multipeptide derived from the above expriment
class msproteomicstoolslib.algorithms.alignment.SplineAligner.TransformationError

Class to store the error of a given transformation

getStdev()

Retrieve the standard deviation as a measurement of the error of a transformation

BorderIntegration

msproteomicstoolslib.algorithms.alignment.BorderIntegration.integrationBorderShortestPath(selected_pg, target_run, transformation_collection_, tree)

Determine the optimal integration border by using the shortest path in the MST

Parameters:
  • selected_pg (list(GeneralPeakGroup) – list of selected peakgroups (e.g. those passing the quality threshold)
  • target_run (String) – run id of the target run (where value is missing)
  • transformation_collection (LightTransformationData) – structure to hold binary transformations between two different retention time spaces
  • tree (list(tuple) – a minimum spanning tree (MST) represented as list of edges (for example [(‘0’, ‘1’), (‘1’, ‘2’)] ). Node names need to correspond to run ids.
Returns:

A tuple of (left_integration_border, right_integration_border)

msproteomicstoolslib.algorithms.alignment.BorderIntegration.integrationBorderShortestDistance(selected_pg, target_run, transformation_collection_, mat, rmap)

Determine the optimal integration border by using the shortest distance (direct transformation)

Parameters:
  • selected_pg (list(GeneralPeakGroup) – list of selected peakgroups (e.g. those passing the quality threshold)
  • target_run (String) – run id of the target run (where value is missing)
  • transformation_collection (LightTransformationData) – structure to hold binary transformations between two different retention time spaces
  • mat (numpy matrix) – distance matrix for all runs (as returned by algorithms.alignment.AlignmentMST.getDistanceMatrix)
  • rmap (dict) – mapping run ids to matrix columns (e.g. {“run_0” : 0, “run_1” : 1})
Returns:

A tuple of (left_integration_border, right_integration_border)

msproteomicstoolslib.algorithms.alignment.BorderIntegration.integrationBorderReference(new_exp, selected_pg, rid, transformation_collection_, border_option)

Determine the optimal integration border by taking the mean of all other peakgroup boundaries using a reference run.

Parameters:
  • new_exp (AlignmentExperiment) – experiment containing the aligned peakgroups
  • selected_pg (list(GeneralPeakGroup) – list of selected peakgroups (e.g. those passing the quality threshold)
  • rid (String) – current run id
  • transformation_collection (TransformationCollection) – specifying how to transform between retention times of different runs
  • border_option (String) – one of the following options (“mean”, “median” “max_width”), determining how to aggregate multiple peak boundary information
Returns:

A tuple of (left_integration_border, right_integration_border) in the retention time space of the _reference_ run

Alignment Datastructures Module

Multipeptide

class msproteomicstoolslib.algorithms.alignment.Multipeptide.Multipeptide

Bases: object

A collection of the same precursors (chromatograms) across multiple runs.

It contains individual precursors that can be accessed by their run id.

all_above_cutoff(cutoff)
all_selected()

Returns True if all peakgroups are selected

find_best_peptide_pg()

Find best peakgroup across all peptides

getAllPeptides()
getPrecursorGroup(runid)

Get precursor group for the given run

:param : :type : param str runid: Run id of the group :param : :type : rtype: PrecursorGroup: Precursor group from the corresponding run

getPrecursorGroups()

Get all precursor groups

Return type:list(PrecursorGroup): All Precursor group from the corresponding run
get_decoy()

Whether the current peptide is a decoy or not

Return type:bool: Whether the peptide is decoy or not
get_id()
get_nr_runs()
get_peptide(runid)
get_peptides()
get_selected_peakgroups()

Get all peakgroups that were selected across all runs and precursor groups

hasPrecursorGroup(runid)

Checks whether a given run has a precursor group

Parameters:runid (str) – Run id to check
Return type:bool: Whether the given run has a precursor group
has_null_peptides()

Whether there are runs in which no peptide was detected (peptide is Null)

Returns:has_null – Whether there are Null peptides in this object (not detected in some runs)
Return type:bool
has_peptide(runid)
insert(runid, precursor_group)

Insert a PrecursorGroup into the Multipeptide

Parameters:
  • runid (str) – Run id of the group
  • precursor_group (PrecursorGroup) – Precursor group to be inserted
Raises:

Exception – If self.hasPrecursorGroup(runid) is true

more_than_fraction_selected(fraction)
set_nr_runs(v)

MRExperiment

class msproteomicstoolslib.algorithms.alignment.MRExperiment.MRExperiment

Bases: object

An MR (multirun) Experiment is a container for multiple experimental runs.

In some of the runs the same peptidde precursors may be identified and the job of this object is to keep track of these experiments and the identified precursors across multiple runs.

Example usage:

>>> # Read the files
>>> fdr_cutoff = 0.01 # 1% FDR
>>> reader = SWATHScoringReader.newReader(infiles, "openswath")
>>> this_exp = Experiment()
>>> this_exp.set_runs( reader.parse_files(options.realign_runs) )
>>> multipeptides = this_exp.get_all_multipeptides(fdr_cutoff)
get_all_multipeptides(fdr_cutoff, verbose=False, verbosity=0)

Match all precursors in different runs to each other.

Find all precursors that are above the fdr cutoff in each run and build a union of those precursors. Then search for each of those precursors in all the other runs and build a multipeptide / multiprecursor.

Parameters:
  • fdr_cutoff (float) – A cutoff in fdr (between 0 and 1) to use for the alignment. Each generated Multipeptide needs to have at least one member who is below the cutoff.
  • verbose (bool) – Whether to be verbose or not
  • verbosity (int) – How verbose to be
set_runs(runs)

Initialize with a set of runs.

Parameters:runs (list of Run) – A set of runs

Alignment Helper Module

FDREstimation

class msproteomicstoolslib.algorithms.alignment.FDRParameterEstimation.ParamEst(min_runs=1, max_recursion=10, verbose=False)

Bases: object

Parameter estimation object

In a first step the percentage of decoys of all peakgroups at the target fdr is computed (which is then taken as the “aim”). For this “aim” of decoy percentage, the class will try to estimate an fdr_cutoff such that the percentage of decoy precursors in the final reported result will correspond to the “aim”.

If the parameter min_runs (at initialization) is higher than 1, only precursors that are identified in min_runs above the fdr_cutoff will be reported.

>>> p = ParamEst()
>>> decoy_frac = p.compute_decoy_frac(multipeptides, target_fdr)
>>> fdr_cutoff_calculated = p.find_iterate_fdr(multipeptides, decoy_frac)
compute_decoy_frac(multipeptides, target_fdr)

Calculate how many of the peakgroups are decoy for a given cutoff.

This provides an estimator of the ration of false positives to decoy peak groups which the mProphet / pyProphet algorithm has previously estimated using the Storey-Tibshirani method.

find_iterate_fdr(multipeptides, decoy_frac, recursion=0)

Iteratively find an q-value cutoff to reach the specified decoy fraction

This function will step through multiple q-value thresholds and evaluate how many peptides have at least one peakgroup whose q-value (using get_fdr_score) is below that threshold. The q-value is then adapted until the fraction of decoys in the result is equal to the specified fraction given as input.

Parameters:
  • multipeptides (list(Multipeptide) – a list of multipeptides on which the alignment should be performed. After alignment, each peakgroup that should be quantified can be retrieved by calling get_selected_peakgroups() on the multipeptide.
  • decoy_frac (float) – fraction of decoys that should be present in the result (e.g. 0.01)
  • recursion (int) – recursion level (should be set to zero)
Returns:

None

msproteomicstoolslib.algorithms.alignment.AlignmentHelper.write_out_matrix_file(matrix_outfile, allruns, multipeptides, fraction_needed_selected, style='none', write_requant=True, aligner_mscore_treshold=1.0)
msproteomicstoolslib.algorithms.alignment.AlignmentHelper.addDataToTrafo(tr_data, run_0, run_1, spl_aligner, multipeptides, realign_method, max_rt_diff, topN=5, sd_max_data_length=5000, force=False)