Algorithms - Alignment¶
MST Alignment
Module¶
MST Alignment¶
-
class
msproteomicstoolslib.algorithms.alignment.AlignmentMST.
TreeConsensusAlignment
(max_rt_diff, fdr_cutoff, aligned_fdr_cutoff, rt_diff_isotope=-1, correctRT_using_pg=False, stdev_max_rt_per_run=None, use_local_stdev=False, verbose=False)¶ Multiple run alignment using a minimum spanning tree (MST).
This class will align features across multiple runs using a strictly local approach. It uses a minimum spanning tree (MST) as input which is expected to allow traversal of all runs, only connecting the most similar runs. This should allow accurate local alignment between two very similar runs and transfer of identification with high accuracy. Specifically, this should for scalability to a large number of dissimilar runs where approaches that rely on a single reference run for alignment might give less accurate results.
Briefly, the algorithm will choose the best scoring peakgroup as a seed and start to traverse the MST from this seed. At each node, it will add the best matching peakgroup (by score, within a specified retention time window) to the result. After traversing all nodes, a new seed can be chosen among the peakgroups not yet belonging to a cluster and the process can be repeated to produce multiple clusters.
For example, consider a case of 5 LC-MS/MS runs where 6 different feature (peakgroups) were found in each run (not all peakgroups were found in all runs):
Run 1: ---------- pg1_1 ------- pg1_2 --- pg1_3 - pg1_4 --------- pg1_5 ------- pg1_6 Run 2: ----------- pg2_1 ------- pg2_2 --- pg2_3 - pg2_4 --------- pg2_5 ------ pg2_6 Run 3: ----- pg3_0 ---------- pg3_1 ---- pg3_2 - pg3_3 - pg3_4 ----- pg3_5 ---- pg3_6 Run 4: ----------- pg4_0 ---------- pg4_1 ------- pg4_2 --- pg4_3 - pg4_4 --------- pg4_5 Run 5: -- pg5_0 ---------- pg5_1 ------- pg5_2 --- pg5_3 - pg5_4 --------- pg5_5
Assume that the corresponding MST looks like this:
/-- Run4 Run1 -- Run2 -- Run3 -- \-- Run5
This is a case where Run1 and Run2 are very similar and Run3 and Run4 are rather similar and should be easy to align. The algorithm will start with the “best” peakgroup overall (having the best probability score), assume this peakgroups is
pg1_1
from Run 1. The algorithm will then use the alignment Run1-Run2 to infer thatpg2_1
is the same signal aspg1_1
and add it to the group. Specifically, it will select the highest-scoring peakgroup within a narrow RT-window (max_rt_diff) in Run2 - note that if the RT-window is too wide, there is a certain chance of mis-matching, e.g.pg2_2
will be selected instead ofpg2_1
. The alignment Run2-Run3 will be used to addpg3_1
, assuming that the RT transformation and the scoring both indicate that it should be selected. Note how a null RT transformation will actually preferpg3_0
as there is a RT shift between Run 2 and 3. So a too narrow RT-window (max_rt_diff) may lead to the wrong peak group being selected. This can be rectified with a larger RT tolerance or by using a suitable RT transformation (linear, lowess etc) which will lead to the selection of the correct peak grouppg3_1
.Then a bifurcation in the tree occurs and Run3-Run4 as well as Run3-Run5 will be used to infer the identity of
pg4_1
andpg5_1
and add them to the cluster. In the end, the algorithm will report (pg1_1
,pg2_1
,pg3_1
,pg4_1
,pg5_1
) as a consistent cluster across multiple runs. This process can be repeated with the next best peakgroup that is not yet part of a cluster (e.g.pg1_2
) until no more peakgroups are left (no more peakgroups having a score below fdr_cutoff).Note how the algorithm only used binary alignments and purely local alignments of the runs that are most close to each other. This stands in contrast to approaches where a single reference is picked and then used for alignment which might align runs that are substantially different. On the other hand, a single error at one edge in the tree will propagate itself and could lead to whole subtrees that are wrongly aligned (e.g. if at one point
pg3_0
got selected instead ofpg3_1
it is likely that in the following steps,pg4_0
and notpg4_1
will be selected).-
__init__
(max_rt_diff, fdr_cutoff, aligned_fdr_cutoff, rt_diff_isotope=-1, correctRT_using_pg=False, stdev_max_rt_per_run=None, use_local_stdev=False, verbose=False)¶ Initialization with parameters
Parameters: - max_rt_diff (float) – maximal difference in retention time to be used to look for a matching peakgroup in an adjacent run
- fdr_cutoff (float) – maximal FDR that at least one peakgroup needs to reach (seed FDR)
- aligned_fdr_cutoff (float) – maximal FDR that a peakgroup needs to reach to be considered for extension (extension FDR)
- rt_diff_isotope (float) – maximal difference in retention time between two isotopic pairs or precursors with the same charge state
- correctRT_using_pg (bool) – use the apex of the aligned peak group as the input for the next alignment during MST traversal (opposed to using the transformed RT plain)
- stdev_max_rt_per_run (float) – use a different maximal RT tolerance for each alignment, depending on the goodness of the alignment. The RT tolerance used by the algorithm will be the standard deviation times stdev_max_rt_per_run.
- use_local_stdev (float) – use RT-local standard deviation (experimental)
-
__module__
= 'msproteomicstoolslib.algorithms.alignment.AlignmentMST'¶
-
alignAllCluster
(multipeptides, tree, tr_data)¶ Use the MST to report all cluster.
Briefly, the algorithm will choose the best scoring peakgroup as a seed and start to traverse the MST from this seed. At each node, it will add the best matching peakgroup (by score, within a specified retention time window) to the result. After traversing all nodes, a new seed can be chosen among the peakgroups not yet belonging to a cluster and the process can be repeated to produce multiple clusters. It will add clusters until no more peptides with an fdr score better than self._fdr_cutoff are left.
Parameters: - multipeptides (list(Multipeptide) – a list of multipeptides on which the alignment should be performed. After alignment, each peakgroup that should be quantified can be retrieved by calling get_selected_peakgroups() on the multipeptide.
- tree (list(tuple) – a minimum spanning tree (MST) represented as list of edges (for example [(‘0’, ‘1’), (‘1’, ‘2’)] ). Node names need to correspond to run ids.
- tr_data (
LightTransformationData
) – structure to hold binary transformations between two different retention time spaces
Returns: None
-
alignBestCluster
(multipeptides, tree, tr_data)¶ Use the MST to report the first cluster containing the best peptide (overall).
The algorithm will go through all multipeptides and mark those peakgroups which it deems to belong to the best peakgroup cluster (only the first cluster will be reported).
Parameters: - multipeptides (list of
Multipeptide
) – a list of multipeptides on which the alignment should be performed. After alignment, each peakgroup that should be quantified can be retrieved by calling get_selected_peakgroups() on the multipeptide. - tree (list of tuple) – a minimum spanning tree (MST) represented as list of edges (for example [(‘0’, ‘1’), (‘1’, ‘2’)] ). Node names need to correspond to run ids.
- tr_data (
format.TransformationCollection.LightTransformationData
) – structure to hold binary transformations between two different retention time spaces
Returns: None
- multipeptides (list of
-
alignBestCluster_legacy
(multipeptides, tree, tr_data)¶
-
-
msproteomicstoolslib.algorithms.alignment.AlignmentMST.
getDistanceMatrix
(exp, multipeptides, spl_aligner, singleRowId=None)¶ Compute distance matrix of all runs.
Computes a n x n distance matrix between all runs of an experiment. The reported distance is 1 minus the Rsquared value (1-R^2) from the linear regression.
Parameters: - exp (MRExperiment) – a collection of runs
- multipeptides (list(Multipeptide) – a list of multipeptides containing the matching of precursors across runs.
- initial_alignment_cutoff (float) – a filtering cutoff (in q-value) to specify which points should be used for the calculation of the distance. In general, only identification which are very certain should be used for this and a q-value of 0.0001 is recommended – given that there are enough points.
Returns: numpy – distance matrix
Return type: n x n
Simple Alignment Algorithm¶
-
class
msproteomicstoolslib.algorithms.alignment.AlignmentAlgorithm.
Cluster
(peakgroups)¶ A helper class representation of a cluster (used in AlignmentAlgorithm)
-
getMedianRT
()¶ Calculate the median retention time of a cluster
-
getRTstd
()¶ Calculate the standard deviation of the retention times
-
getTotalScore
()¶ Calculate the total score of a cluster (multiplication of probabilities)
-
select_one_per_run
(verbose=False)¶ Ensure that only one peakgroup is selected per run.
If there are multiple peakgroups selected, only the best one is retained.
-
-
class
msproteomicstoolslib.algorithms.alignment.AlignmentAlgorithm.
AlignmentAlgorithm
¶ A class of alignment algorithms
-
align_features
(multipeptides, rt_diff_cutoff, fdr_cutoff, aligned_fdr_cutoff, method='best_overall')¶ Perform the alignment on a set of multipeptides
Parameters: - multipeptides (list(Multipeptide) – a list of multipeptides on which the alignment should be performed. After alignment, each peakgroup that should be quantified can be retrieved by calling get_selected_peakgroups() on the multipeptide.
- rt_diff_cutoff (float) – maximal allowed RT difference used in the clustering or the
- fdr_cutoff (float) – FDR cutoff for “seeds” (e.g. the “known good” cutoff, for example 0.01)
- aligned_fdr_cutoff (float) – maximal FDR cutoff to still be considered for the clustering
- method (String) – either best_overall or best_cluster_score (or global_best_cluster_score, global_best_overall)
Returns: Alignment object with alignment statistics
-
SplineAligner¶
-
class
msproteomicstoolslib.algorithms.alignment.SplineAligner.
SplineAligner
(alignment_fdr_threshold=0.0001, smoother='lowess', external_r_tmpdir=None, maxdata=-1, experiment=None)¶ Use the datasmoothing part of msproteomicstoolslib to align two runs in retention times using splines.
>>> spl_aligner = SplineAligner() >>> transformations = spl_aligner.rt_align_all_runs(this_exp, multipeptides, options.alignment_score, options.use_scikit)
-
getTransformationError
()¶ Get the error of the transformation
Returns: transformation_error – the error of the transformation Return type: TransformationError
-
rt_align_all_runs
(experiment, multipeptides)¶ Align all runs contained in an MRExperiment
Parameters: - experiment (MRExperiment) – a collection of runs
- multipeptides (list(multipeptides) – a list of Multipeptide derived from the above expriment
-
BorderIntegration¶
-
msproteomicstoolslib.algorithms.alignment.BorderIntegration.
integrationBorderShortestPath
(selected_pg, target_run, transformation_collection_, tree)¶ Determine the optimal integration border by using the shortest path in the MST
Parameters: - selected_pg (list(GeneralPeakGroup) – list of selected peakgroups (e.g. those passing the quality threshold)
- target_run (String) – run id of the target run (where value is missing)
- transformation_collection (
LightTransformationData
) – structure to hold binary transformations between two different retention time spaces - tree (list(tuple) – a minimum spanning tree (MST) represented as list of edges (for example [(‘0’, ‘1’), (‘1’, ‘2’)] ). Node names need to correspond to run ids.
Returns: A tuple of (left_integration_border, right_integration_border)
-
msproteomicstoolslib.algorithms.alignment.BorderIntegration.
integrationBorderShortestDistance
(selected_pg, target_run, transformation_collection_, mat, rmap)¶ Determine the optimal integration border by using the shortest distance (direct transformation)
Parameters: - selected_pg (list(GeneralPeakGroup) – list of selected peakgroups (e.g. those passing the quality threshold)
- target_run (String) – run id of the target run (where value is missing)
- transformation_collection (
LightTransformationData
) – structure to hold binary transformations between two different retention time spaces - mat (numpy matrix) – distance matrix for all runs (as returned by algorithms.alignment.AlignmentMST.getDistanceMatrix)
- rmap (dict) – mapping run ids to matrix columns (e.g. {“run_0” : 0, “run_1” : 1})
Returns: A tuple of (left_integration_border, right_integration_border)
-
msproteomicstoolslib.algorithms.alignment.BorderIntegration.
integrationBorderReference
(new_exp, selected_pg, rid, transformation_collection_, border_option)¶ Determine the optimal integration border by taking the mean of all other peakgroup boundaries using a reference run.
Parameters: - new_exp (AlignmentExperiment) – experiment containing the aligned peakgroups
- selected_pg (list(GeneralPeakGroup) – list of selected peakgroups (e.g. those passing the quality threshold)
- rid (String) – current run id
- transformation_collection (
TransformationCollection
) – specifying how to transform between retention times of different runs - border_option (String) – one of the following options (“mean”, “median” “max_width”), determining how to aggregate multiple peak boundary information
Returns: A tuple of (left_integration_border, right_integration_border) in the retention time space of the _reference_ run
Alignment Datastructures
Module¶
Multipeptide¶
-
class
msproteomicstoolslib.algorithms.alignment.Multipeptide.
Multipeptide
¶ Bases:
object
A collection of the same precursors (chromatograms) across multiple runs.
It contains individual precursors that can be accessed by their run id.
-
all_above_cutoff
(cutoff)¶
-
all_selected
()¶ Returns True if all peakgroups are selected
-
find_best_peptide_pg
()¶ Find best peakgroup across all peptides
-
getAllPeptides
()¶
-
getPrecursorGroup
(runid)¶ Get precursor group for the given run
:param : :type : param str runid: Run id of the group :param : :type : rtype:
PrecursorGroup
: Precursor group from the corresponding run
-
getPrecursorGroups
()¶ Get all precursor groups
Return type: list( PrecursorGroup
): All Precursor group from the corresponding run
-
get_decoy
()¶ Whether the current peptide is a decoy or not
Return type: bool: Whether the peptide is decoy or not
-
get_id
()¶
-
get_nr_runs
()¶
-
get_peptide
(runid)¶
-
get_peptides
()¶
-
get_selected_peakgroups
()¶ Get all peakgroups that were selected across all runs and precursor groups
-
hasPrecursorGroup
(runid)¶ Checks whether a given run has a precursor group
Parameters: runid (str) – Run id to check Return type: bool: Whether the given run has a precursor group
-
has_null_peptides
()¶ Whether there are runs in which no peptide was detected (peptide is Null)
Returns: has_null – Whether there are Null peptides in this object (not detected in some runs) Return type: bool
-
has_peptide
(runid)¶
-
insert
(runid, precursor_group)¶ Insert a
PrecursorGroup
into the MultipeptideParameters: - runid (str) – Run id of the group
- precursor_group (
PrecursorGroup
) – Precursor group to be inserted
Raises: Exception
– If self.hasPrecursorGroup(runid) is true
-
more_than_fraction_selected
(fraction)¶
-
set_nr_runs
(v)¶
-
MRExperiment¶
-
class
msproteomicstoolslib.algorithms.alignment.MRExperiment.
MRExperiment
¶ Bases:
object
An MR (multirun) Experiment is a container for multiple experimental runs.
In some of the runs the same peptidde precursors may be identified and the job of this object is to keep track of these experiments and the identified precursors across multiple runs.
Example usage:
>>> # Read the files >>> fdr_cutoff = 0.01 # 1% FDR >>> reader = SWATHScoringReader.newReader(infiles, "openswath") >>> this_exp = Experiment() >>> this_exp.set_runs( reader.parse_files(options.realign_runs) ) >>> multipeptides = this_exp.get_all_multipeptides(fdr_cutoff)
-
get_all_multipeptides
(fdr_cutoff, verbose=False, verbosity=0)¶ Match all precursors in different runs to each other.
Find all precursors that are above the fdr cutoff in each run and build a union of those precursors. Then search for each of those precursors in all the other runs and build a multipeptide / multiprecursor.
Parameters: - fdr_cutoff (float) – A cutoff in fdr (between 0 and 1) to use for the alignment. Each generated Multipeptide needs to have at least one member who is below the cutoff.
- verbose (bool) – Whether to be verbose or not
- verbosity (int) – How verbose to be
-
Alignment Helper
Module¶
FDREstimation¶
-
class
msproteomicstoolslib.algorithms.alignment.FDRParameterEstimation.
ParamEst
(min_runs=1, max_recursion=10, verbose=False)¶ Bases:
object
Parameter estimation object
In a first step the percentage of decoys of all peakgroups at the target fdr is computed (which is then taken as the “aim”). For this “aim” of decoy percentage, the class will try to estimate an fdr_cutoff such that the percentage of decoy precursors in the final reported result will correspond to the “aim”.
If the parameter min_runs (at initialization) is higher than 1, only precursors that are identified in min_runs above the fdr_cutoff will be reported.
>>> p = ParamEst() >>> decoy_frac = p.compute_decoy_frac(multipeptides, target_fdr) >>> fdr_cutoff_calculated = p.find_iterate_fdr(multipeptides, decoy_frac)
-
compute_decoy_frac
(multipeptides, target_fdr)¶ Calculate how many of the peakgroups are decoy for a given cutoff.
This provides an estimator of the ration of false positives to decoy peak groups which the mProphet / pyProphet algorithm has previously estimated using the Storey-Tibshirani method.
-
find_iterate_fdr
(multipeptides, decoy_frac, recursion=0)¶ Iteratively find an q-value cutoff to reach the specified decoy fraction
This function will step through multiple q-value thresholds and evaluate how many peptides have at least one peakgroup whose q-value (using get_fdr_score) is below that threshold. The q-value is then adapted until the fraction of decoys in the result is equal to the specified fraction given as input.
Parameters: - multipeptides (list(Multipeptide) – a list of multipeptides on which the alignment should be performed. After alignment, each peakgroup that should be quantified can be retrieved by calling get_selected_peakgroups() on the multipeptide.
- decoy_frac (float) – fraction of decoys that should be present in the result (e.g. 0.01)
- recursion (int) – recursion level (should be set to zero)
Returns: None
-
-
msproteomicstoolslib.algorithms.alignment.AlignmentHelper.
write_out_matrix_file
(matrix_outfile, allruns, multipeptides, fraction_needed_selected, style='none', write_requant=True, aligner_mscore_treshold=1.0)¶
-
msproteomicstoolslib.algorithms.alignment.AlignmentHelper.
addDataToTrafo
(tr_data, run_0, run_1, spl_aligner, multipeptides, realign_method, max_rt_diff, topN=5, sd_max_data_length=5000, force=False)¶