GSoC Week 3: TOPP Tool Coding
This week was all about writing the database suitability TOPP tool: starting with the core body and input parameters, then coding the metric algorithms and designing an output.
A quick reminder on how the two metrics will work:
1. Database Suitability
- calculate a decoy cut-off
- get cross correlation score (xcorr) from the top two decoy hits
- normalize their difference by molecular weight (mw)
- count the number of target de novo hits and the number of target database hits
- if the top hit is a target de novo hit, check whether the second hit is a database hit with a score difference less than or equal to the decoy cut-off → if so, count it as a database hit
- the ratio #db_hits / (#db_hits + #de_novo_hits) gives the database suitability
2. Spectral Quality
- the ratio #de_novo_peptides / #ms2_spectra gives the spectral quality
- this is basically the MS2 ID rate, but only for de novo peptides
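To make the counting step more concrete, here is a rough sketch (not the final implementation) of how the top hits could be classified once a decoy cut-off is known. It assumes the hits of each identification are already sorted by score, uses a hypothetical helper isNovoHit() (its accession check is described further below), and normalizes the score difference by the peptides' molecular weights, which is my reading of the paper.

```cpp
#include <OpenMS/METADATA/PeptideIdentification.h>
#include <OpenMS/METADATA/PeptideHit.h>
#include <vector>

using namespace OpenMS;

bool isNovoHit(const PeptideHit& hit); // hypothetical helper, sketched further below

// Sketch: classify each top hit as database or de novo hit (with re-ranking)
// and compute the two metrics from the resulting counts.
void computeMetrics(const std::vector<PeptideIdentification>& pep_ids,
                    double decoy_cutoff, Size num_novo_peptides, Size num_ms2_spectra)
{
  Size db_hits = 0, novo_hits = 0;
  for (const PeptideIdentification& id : pep_ids)
  {
    const std::vector<PeptideHit>& hits = id.getHits();
    if (hits.empty()) continue;

    if (!isNovoHit(hits[0])) { ++db_hits; continue; } // top hit was found in the database

    // top hit is a de novo hit: does a database hit follow within the decoy cut-off?
    if (hits.size() >= 2 && !isNovoHit(hits[1]))
    {
      double diff = hits[0].getScore() / hits[0].getSequence().getMonoWeight()
                  - hits[1].getScore() / hits[1].getSequence().getMonoWeight();
      if (diff <= decoy_cutoff) { ++db_hits; continue; } // re-ranked: counts as a database hit
    }
    ++novo_hits;
  }

  double suitability = double(db_hits) / double(db_hits + novo_hits);
  double spectral_quality = double(num_novo_peptides) / double(num_ms2_spectra);
}
```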
To implement a new TOPP tool in OpenMS, one just needs to follow the developer guide. This was pretty straightforward and there's not much to talk about.
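For illustration, a bare-bones skeleton along the lines of that guide could look something like this; the class name, tool name and description are placeholders, not necessarily what the final tool will use.

```cpp
#include <OpenMS/APPLICATIONS/TOPPBase.h>

using namespace OpenMS;

class TOPPDatabaseSuitability : public TOPPBase
{
public:
  TOPPDatabaseSuitability() :
    TOPPBase("DatabaseSuitability", "Calculates the suitability of a database for a peptide search.", false)
  {}

protected:
  void registerOptionsAndFlags_() override
  {
    // input files, options and flags get registered here (see the sketch further below)
  }

  ExitCodes main_(int, const char**) override
  {
    // load the inputs, compute the two metrics, write the output
    return EXECUTION_OK;
  }
};

int main(int argc, const char** argv)
{
  TOPPDatabaseSuitability tool;
  return tool.main(argc, argv);
}
```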
Input parameters will be:
- idXML file from the peptide search (post FDR) - needed for database suitability
- idXML file containing de novo peptides - needed for spectral quality
- mzML file containing the original spectra - needed for spectral quality
In addition to the input files, other options and flags should be added for more advanced control over the tool. For a start, two options were obvious to add.
The first option is also described in the paper. The question is: which decoy cut-off should be calculated? The simple answer would be the one that captures all cases where a de novo hit scores slightly above a db hit. A more conservative solution would be to capture only a fraction of those de novo hits. An input parameter will be implemented to control this fraction, though the default should be '1'.
The second additional input option became clear while thinking about corner cases. What if a search engine just reports the top hit? Or what if for some reason not enough decoy hits are found? Basically, what happens if the tool can't compute a decoy cut-off? Well, if that's the case we can't re-rank. But re-ranking isn't necessary for calculating the database suitability. Therefore a flag is added to force the tool to run without re-ranking. This could of course result in an underperforming score, but the option itself doesn't hurt. Especially since, for the time being, only the xcorr is supported, and just turning off the re-ranking enables the use of any search engine with any other scoring scheme.
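Building on the skeleton above, the parameter registration could roughly look like this; all parameter names here are placeholders chosen for illustration, not the final ones.

```cpp
void registerOptionsAndFlags_() override
{
  registerInputFile_("in_id", "<file>", "", "Peptide identifications from the database + de novo search (after FDR)");
  setValidFormats_("in_id", {"idXML"});
  registerInputFile_("novo_id", "<file>", "", "Peptide identifications from the de novo search engine");
  setValidFormats_("novo_id", {"idXML"});
  registerInputFile_("in_spectra", "<file>", "", "mzML file containing the original MS2 spectra");
  setValidFormats_("in_spectra", {"mzML"});

  // fraction of the 'de novo just above db' cases the decoy cut-off should capture (default: all)
  registerDoubleOption_("novo_fract", "<value>", 1.0, "Fraction of de novo hits the decoy cut-off should capture", false);
  setMinFloat_("novo_fract", 0.0);
  setMaxFloat_("novo_fract", 1.0);

  // run without re-ranking, e.g. if the search engine only reports the top hit
  registerFlag_("force_no_rerank", "Calculate the database suitability without re-ranking");
}
```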
Let's go over some implementation choices/problems of the two metrics:
1. Database Suitability
The Python scripts provided by the authors of the paper were a big help here, but they missed some corner cases.
The first question I raised was: how many peptide identifications with two decoy hits do we need to calculate a reliable decoy cut-off? If, for example, only 10 % of identifications have two decoy hits in their top ten, is the assumption that this decoy difference captures the case of similar de novo and db hits still valid? It would probably be worth compiling proper statistics for this, but in order to get the tool into a working state first, we decided after consulting with my mentors that 20 % would be the minimum. This is half of the percentage we saw in some example files.
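A minimal sketch of the cut-off calculation, assuming the hits carry the 'target_decoy' meta value set by PeptideIndexer and are sorted by score. The quantile-like behaviour of the fraction option and the 20 % sanity check are my interpretation; the exact normalization follows the paper.

```cpp
#include <OpenMS/METADATA/PeptideIdentification.h>
#include <OpenMS/METADATA/PeptideHit.h>
#include <algorithm>
#include <vector>

using namespace OpenMS;

// Sketch: collect the mw-normalized xcorr difference of the top two decoy hits of
// every identification and pick the cut-off according to the fraction option.
// Returns -1 if too few identifications have two decoy hits.
double computeDecoyCutoff(const std::vector<PeptideIdentification>& pep_ids, double novo_fract)
{
  std::vector<double> diffs;
  for (const PeptideIdentification& id : pep_ids)
  {
    std::vector<double> decoy_scores; // already normalized by molecular weight
    for (const PeptideHit& hit : id.getHits())
    {
      // 'target+decoy' hits are ignored here for simplicity
      if (hit.getMetaValue("target_decoy").toString() == "decoy")
      {
        decoy_scores.push_back(hit.getScore() / hit.getSequence().getMonoWeight());
      }
    }
    if (decoy_scores.size() < 2) continue; // the top two decoy hits are needed
    diffs.push_back(decoy_scores[0] - decoy_scores[1]);
  }

  // if fewer than ~20 % of the identifications contribute, the cut-off is not trusted
  if (diffs.size() < 0.2 * pep_ids.size()) return -1.0;

  // fraction of 1 keeps the largest difference, smaller fractions are more conservative
  std::sort(diffs.begin(), diffs.end());
  return diffs[Size((diffs.size() - 1) * novo_fract)];
}
```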
Secondly, to check whether a hit is considered a de novo hit or a db hit, I look at the protein accessions of each hit and check for the presence of the custom string used to identify the novo peptide. But a peptide hit can have multiple accessions. A hit is therefore only considered a de novo hit if all of its accessions contain the custom string; if at least one accession doesn't, it is considered a db hit.
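Sketched as a helper, with the marker string being a placeholder for the actual custom string:

```cpp
#include <OpenMS/METADATA/PeptideHit.h>
#include <OpenMS/METADATA/PeptideEvidence.h>

using namespace OpenMS;

// Sketch: a hit only counts as a de novo hit if *all* of its protein accessions
// contain the custom marker string; "NOVO_PROTEIN_MARKER" is a placeholder.
bool isNovoHit(const PeptideHit& hit)
{
  for (const PeptideEvidence& evidence : hit.getPeptideEvidences())
  {
    if (!evidence.getProteinAccession().hasSubstring("NOVO_PROTEIN_MARKER"))
    {
      return false; // at least one accession points to a real database protein
    }
  }
  return true;
}
```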
2. Spectral Quality
Because only the number of MS2 spectra is needed for this metric and literally nothing else from the mzML, some memory can be saved by loading only the MS2 spectra and skipping the actual peak data (i.e. RT, m/z, charge, ...). Luckily, this is already implemented in OpenMS.
Furthermore, the mzML loading can be done inside its own scope. After the mzML is loaded into an OpenMS PeakMap object, that object contains all the MS2 spectra as a vector (without the peak data, of course). The PeakMap can then be asked for its size, and after leaving the scope the object is destroyed and the memory is freed.
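Put together, the counting could look roughly like this:

```cpp
#include <OpenMS/FORMAT/MzMLFile.h>
#include <OpenMS/KERNEL/StandardTypes.h>

using namespace OpenMS;

// Sketch: count the MS2 spectra without keeping any peak data around.
Size countMS2Spectra(const String& mzml_file)
{
  Size num_ms2 = 0;
  { // scope: the PeakMap is destroyed (and its memory freed) on exit
    PeakMap exp;
    MzMLFile f;
    f.getOptions().setMSLevels({2});   // only load MS2 spectra ...
    f.getOptions().setFillData(false); // ... and skip the actual peak data
    f.load(mzml_file, exp);
    num_ms2 = exp.size();
  }
  return num_ms2;
}
```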
For counting the peptide identifications there is no such trick, though. I just load the idXML into a vector of peptide identifications and loop over it, because identifications without hits should not be counted and the sequences themselves are of interest.
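A corresponding sketch for the de novo peptide count; the unique sequences are collected along the way because their number should show up in the output later.

```cpp
#include <OpenMS/FORMAT/IdXMLFile.h>
#include <OpenMS/METADATA/PeptideIdentification.h>
#include <OpenMS/METADATA/ProteinIdentification.h>
#include <set>
#include <vector>

using namespace OpenMS;

// Sketch: count the de novo peptide identifications from the idXML.
// Identifications without hits are skipped.
Size countNovoPeptides(const String& idxml_file)
{
  std::vector<ProteinIdentification> prot_ids;
  std::vector<PeptideIdentification> pep_ids;
  IdXMLFile().load(idxml_file, prot_ids, pep_ids);

  Size count = 0;
  std::set<String> unique_sequences;
  for (const PeptideIdentification& id : pep_ids)
  {
    if (id.getHits().empty()) continue; // nothing was identified for this spectrum
    ++count;
    unique_sequences.insert(id.getHits()[0].getSequence().toString());
  }
  // unique_sequences.size() gives the number of unique de novo sequences
  return count;
}
```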
Output:
The tool itself will write most of its output directly to the command line, but there should be an option to also get a tsv output so that post-processing tools can access the information. It just needs to be decided what to put out.
The database suitability score and the spectral quality are the obvious candidates, but the values those metrics are calculated from can be exported as well (i.e. the number of top hits found in the database and in the novo peptide, how often a de novo hit scored just above a db hit, how often re-ranking happened in those cases and with which cut-off, ... and for the spectral quality: the number of unique de novo sequences, the total number of de novo sequences and the number of MS2 spectra).
Next week I will probably open a PR for this and start writing some documentation and tests since both of those are still missing.