GSoC Week 4: Tool Fine-Tuning
After basically finishing the core body of the TOPP tool last week, I started this week by opening a PR and getting some feedback. There isn't a lot to say about the PR and the suggested changes themselves, since they were mostly minor fixes. If you're still interested, you can go see for yourself here.
Because the tool itself is now in a working state (it just needs a bit more polish), the next step is to check whether the calculations are correct. If we assume that the Python scripts by the original authors of the paper are correct (which they should be), we can simply compare their output to the tool's output.
Therefore, I downloaded the scripts and tried to get them working. First of all, I had to make sure that the tool and the scripts get the same input. The tool, being part of OpenMS, accepts only OpenMS file types. Because the authors didn't work with a library, they took the *.pep.xml output from Comet and hardcoded a script to read and write it. The CometAdapter, which integrates the Comet executable into OpenMS, writes an *.idXML instead. This idXML is generated from the original pep.xml and should contain the same information. To compare the tool with the scripts, I had the tool run its calculations on the idXML and the scripts on the original pep.xml.
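For reference, this is roughly how the identifications end up being read on the OpenMS side. This is only a minimal sketch assuming a standard OpenMS installation, and the file name is just a placeholder; the authors' scripts parse the *.pep.xml directly instead.

```cpp
#include <OpenMS/FORMAT/IdXMLFile.h>
#include <OpenMS/METADATA/ProteinIdentification.h>
#include <OpenMS/METADATA/PeptideIdentification.h>
#include <vector>

int main()
{
  // Protein and peptide identifications as produced by the CometAdapter.
  std::vector<OpenMS::ProteinIdentification> protein_ids;
  std::vector<OpenMS::PeptideIdentification> peptide_ids;

  // "comet_output.idXML" is a placeholder file name for this sketch.
  OpenMS::IdXMLFile().load("comet_output.idXML", protein_ids, peptide_ids);

  return 0;
}
```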
As it turns out, this was well worth doing, because I found some inconsistencies between my approach and theirs. The decoy cut-off was identical, but the number of re-ranked hits, and also the number of hits considered for re-ranking, was different. Not by a lot, but still a little off.
To find out what was happening there, I added a debug output to both the script and the tool. This debug output contains information about each hit that could potentially be re-ranked: the RT of the considered hit, followed by true/false depending on whether it was re-ranked or not. After that, a quick call of git diff yields the peptide identifications that are handled differently. I then looked into the idXML and pep.xml files to check what was special about these identifications.
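A minimal sketch of that debug output on the tool side; the ConsideredHit struct is made up for illustration, the real tool of course works on OpenMS peptide identifications:

```cpp
#include <fstream>
#include <iomanip>
#include <string>
#include <vector>

// Hypothetical record for one identification that is a candidate for re-ranking.
struct ConsideredHit
{
  double rt;       // retention time of the identification
  bool re_ranked;  // did the tool decide to re-rank it?
};

// Write one line per considered hit: "<RT>\t<true|false>".
// If the script dumps the same format, a plain diff (or git diff) on the two
// files points directly at the identifications that are handled differently.
void writeDebugOutput(const std::vector<ConsideredHit>& hits, const std::string& path)
{
  std::ofstream out(path);
  out << std::fixed << std::setprecision(4);
  for (const ConsideredHit& hit : hits)
  {
    out << hit.rt << '\t' << (hit.re_ranked ? "true" : "false") << '\n';
  }
}
```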
The first thing I found out was that if two de novo hits score top and second, the tool considered the whole peptide identification as de novo, while the paper keeps searching for the first database hit and checks that one for re-ranking. This seemed reasonable, so I changed the handling of that case in the tool. But the numbers were still not identical. So, same procedure as before: git diff on the debug output.
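Roughly, the corrected behaviour looks like the sketch below; Hit is a made-up struct, the real code works on ranked OpenMS peptide hits:

```cpp
#include <vector>

// Hypothetical peptide hit: already classified as de novo or database.
struct Hit
{
  double score;
  bool is_de_novo;
};

// Previously the tool skipped the whole identification if the top and second
// hit were both de novo. Following the authors' scripts, it now walks down
// the ranked hit list until it finds the first database hit and then checks
// whether that hit should be re-ranked.
const Hit* findFirstDatabaseHit(const std::vector<Hit>& ranked_hits)
{
  for (const Hit& hit : ranked_hits)
  {
    if (!hit.is_de_novo) return &hit;
  }
  return nullptr;  // only de novo hits, nothing to re-rank
}
```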
My second discovery was that the paper checks whether a hit counts as a de novo hit slightly more conservatively than the tool does. As I already explained in my last post, in the tool a peptide hit is considered a de novo hit only if all of its protein accessions contain the custom string. If at least one accession doesn't contain the string and therefore refers to a database protein, the peptide hit is considered a database hit. This was quite easy to check, since in the idXML format all protein accessions are grouped together. (Note: the protein accessions are given by their custom identifier and can be mapped to the protein identification.)
This isn't the case in the pep.xml, however. There, a main protein hit is given in addition to some alternative protein hits.
This results in a more conservative decision by the authors: they only check the main protein hit and ignore the alternative ones. Since there isn't a script to calculate the database suitability, I don't know how this was handled when counting de novo and database identifications after re-ranking.
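To make the difference concrete, here is a rough sketch of the two classification rules; the de novo tag is only a placeholder string, not the actual identifier used by the tool or the scripts:

```cpp
#include <string>
#include <vector>

// Placeholder for the custom string marking de novo "proteins".
static const std::string de_novo_tag = "DE_NOVO_PROTEIN";

// Tool: a peptide hit is de novo only if *every* accession carries the tag.
bool isDeNovoHitTool(const std::vector<std::string>& accessions)
{
  for (const std::string& acc : accessions)
  {
    // One accession without the tag means a database protein was matched.
    if (acc.find(de_novo_tag) == std::string::npos) return false;
  }
  return true;
}

// Paper scripts: only the main protein entry of the pep.xml hit is checked,
// the alternative proteins are ignored.
bool isDeNovoHitScripts(const std::string& main_accession)
{
  return main_accession.find(de_novo_tag) != std::string::npos;
}
```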
This difference will not be adopted in the tool, because it doesn't seem reasonable to do so. Letting a single database protein accession decide that the peptide hit belongs to the database is probably the right decision.
But just to be sure that this was the last difference, I temporarily changed it and checked for identical output. That was indeed the case, so the algorithm of the tool can probably be considered correct.
While all of this was happening, a question about the xcorr score arose somewhat randomly: how should the xcorr be normalized? In the paper they simply divide by the neutral mass of the peptide. This works, but seems a little cheap, and it's not clear whether it's the best way. Another approach was already presented in the paper that introduced PeptideProphet. That is technically a different xcorr score, but the normalization should still work. Which normalization works better for our case will be tested once the benchmarking phase begins.
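As a minimal sketch, the normalization used in the paper is just a division by the peptide's neutral mass; the alternative would swap in a different denominator or transformation, which is exactly what the benchmarking should compare:

```cpp
// Normalization as described in the paper: divide the raw xcorr by the
// neutral (uncharged) mass of the peptide.
double normalizeXCorr(double xcorr, double neutral_mass)
{
  return xcorr / neutral_mass;
}
```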
This process was quite time consuming, and apart from a little bit of documentation that was all I did this week. Next week I will look into how to use @dot in doxygen to draw the pipeline for the documentation, and hopefully get the documentation done.
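For anyone curious, a @dot block embeds a Graphviz graph directly into the doxygen documentation. The sketch below uses placeholder names for the pipeline steps; the actual graph for the tool's documentation is still to be worked out:

```cpp
/**
  @brief Example documentation block for the tool.

  @dot
  digraph suitability_pipeline {
    node [shape=box];
    "Comet search (database + de novo)" -> "de novo / database check";
    "de novo / database check" -> "re-ranking";
    "re-ranking" -> "database suitability";
  }
  @enddot
*/
```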