GSOC Week 9 and Week 10: Benchmarking III & Back to C++

The last two weeks were very stressful and I just didn't manage to write a blog entry. So I will summarize both weeks in this one entry.

Week 9:

Benchmarking is finished for now.
I implemented the sampling approach I talked about last week and also tested some different FDRs to check for a dependency.
For the sampling it should be noted that, to save some running time, I only did 1/ratio runs for each database ratio (e.g. ten runs for a ratio of 0.1, but only two for a ratio of 0.5).
The resulting plot looks like this:
In the top left you can see that the suitability is pretty much independent of the FDR used, which is, generally speaking, pretty good. Unfortunately, the differences between the paper workflow and the OpenMS workflow cannot be explained by FDR differences. They are probably a result of PeptideProphet vs. target/decoy FDR. To test this I would need to run the OpenMS workflow with PeptideProphet. Maybe I will do this later, but for now the reason behind the small differences isn't that important.

Top right, bottom left and bottom right are way more interesting and important. In the top right you can see the suitability plotted against the ratio of the database that was used for the identification search. The smaller the ratio, the more runs were done, because the variance between sampled databases can be much higher.
The curve looks pretty alright. The bigger the ratio gets, the higher the suitability gets. The only concern is that the curve isn't nice and linear, as we would want it to be and as one would expect, since the database gets linearly better. If we ask ourselves where this linearity went, we just need to look at the bottom plots. There we can see the top database hits increasing and the top deNovo hits decreasing in a linear fashion for increasing ratios. Why doesn't the suitability look that linear? That's because of the formula with which the suitability is calculated:

                    suitability = #db_hits / (#db_hits + #deNovo_hits)
 
If the database hits increase and the deNovo hits decrease linearly, the suitability can only increase linearly if the two kinds of hits have the same slope. In that case (#db_hits + t*x) + (#deNovo_hits - t*x) would be constant, the denominator wouldn't change, and the suitability would therefore increase linearly.
The two functions have very different slopes, though. If the database gets better, far more new database hits are found than old deNovo hits are replaced by higher-scoring database hits; thus the slope of the database hits is much steeper.
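To make this concrete, here is a toy calculation (made-up numbers, not the actual benchmark data) showing how a steep database-hit slope combined with a shallow deNovo-hit slope produces a curved suitability:

    #include <cstdio>

    // Toy numbers only: database hits grow much faster with the ratio than
    // deNovo hits shrink, so the denominator of the suitability changes with
    // the ratio and the resulting curve is not linear.
    int main()
    {
        for (int i = 1; i <= 10; ++i)
        {
            double ratio     = i / 10.0;
            double db_hits   = 500.0 + 4000.0 * ratio; // steep slope
            double novo_hits = 900.0 -  600.0 * ratio; // shallow slope
            double suitability = db_hits / (db_hits + novo_hits);
            std::printf("ratio %.1f -> suitability %.3f\n", ratio, suitability);
        }
        return 0;
    }

With equal slopes the denominator would stay constant (here 1400) and the suitability would rise in a straight line; with these slopes it rises quickly at first and then flattens out.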

To get the suitability curve to be more linear, we either need to correct the number of database hits down or the number of deNovo hits up. The latter seems more reasonable.
For this correction we can use the linearity we would expect for each kind of hit. If an unknown database produces n database hits and m deNovo hits, we can correct the number of deNovo hits. For this we need to know how many identifications are found when searching with only the deNovo sequences and how many are found with only the database. From these two numbers the factor by which the number of deNovo hits will be corrected can be calculated like this:

            #identifications with only deNovo / #identifications with only the database
 
#identifications with only deNovo should be constant and independent of the database. Therefore, the 'better' the database, the lower this factor will be, since #identifications with only the database increases. Correcting the deNovo hits like this should result in a flatter, more linear suitability curve.
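A minimal sketch of how this correction could look, assuming the factor is applied as a multiplicative scaling of the deNovo hits (function and variable names are mine, not the final implementation):

    #include <cstdio>

    // Sketch only: 'novo_only_ids' and 'db_only_ids' are the identification
    // counts from the two extra searches (deNovo sequences only, database only).
    double correctedSuitability(double db_hits, double novo_hits,
                                double novo_only_ids, double db_only_ids)
    {
        // the better the database, the larger db_only_ids and the smaller the factor
        double factor = novo_only_ids / db_only_ids;
        double corrected_novo = novo_hits * factor;
        return db_hits / (db_hits + corrected_novo);
    }

    int main()
    {
        // made-up numbers for a small database sample: the factor is > 1 here,
        // so the deNovo hits are corrected up and the suitability goes down
        std::printf("%.3f\n", correctedSuitability(900, 840, 1000, 700));
        return 0;
    }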

This discussion concludes week 9. In week 10 I'm finally going to code more C++!

Week 10:

To calculate the suitability as described above, more coding needs to happen.
There are two possibilities to do this. We could just ask the user to do these searches for us and hand the results to the tool. That is the lazy approach. It's probably better to perform the two Comet searches ourselves inside the tool.
This is mildly complicated and will be easier once the TOPP tool functions are exported into the library. So this is what I started with.

Exporting Suitability Calculations

I basically took all the functions from the TOPP tool and placed them in a new class which I called 'Suitability'. This is where the suitability calculations are done now. I designed the class for repeated use: you can compute multiple suitabilities with the same object, the results are stored internally, and you can then access all results at once. This is useful because you don't need a new object for each calculation.
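Roughly, the layout looks something like this (heavily simplified, with a placeholder input type; the real class works on OpenMS peptide identifications and also handles re-ranking, decoy cut-offs and so on):

    #include <cstddef>
    #include <vector>

    struct PeptideIdentificationStub { bool top_hit_is_db = true; }; // placeholder

    class Suitability
    {
    public:
        struct SuitabilityData
        {
            std::size_t num_top_db = 0;   // count of top-scoring database hits
            std::size_t num_top_novo = 0; // count of top-scoring deNovo hits
            double suitability = 0.0;     // num_top_db / (num_top_db + num_top_novo)
        };

        void compute(const std::vector<PeptideIdentificationStub>& ids)
        {
            SuitabilityData d;
            for (const auto& id : ids)
            {
                if (id.top_hit_is_db) ++d.num_top_db;
                else ++d.num_top_novo;
            }
            if (d.num_top_db + d.num_top_novo > 0)
            {
                d.suitability = double(d.num_top_db) / double(d.num_top_db + d.num_top_novo);
            }
            results_.push_back(d); // store internally, don't overwrite earlier runs
        }

        const std::vector<SuitabilityData>& getResults() const { return results_; }

    private:
        std::vector<SuitabilityData> results_; // one entry per compute() call
    };

Storing the results in a member vector is what allows one object to be reused for several inputs and all results to be read back at once.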

Now I needed to write a test for the new 'Suitability' class. The complicated part was to design a vector of identifications that tests all important cases. Those cases are: a top-scoring deNovo hit, a top-scoring database hit, and two cases where re-ranking can happen. These last two should differ in their xcorr score difference, so that both, one or none of them are re-ranked depending on the 'novo_fract' setting. This setting controls which decoy cut-off is used.
I also need to test cases that throw errors and cases that are handled without throwing. Those include: too few decoy hits to calculate a cut-off, missing target/decoy annotation, missing xcorr score, and a 'target+decoy' value as target/decoy information.
Not to forget, some decoy hits need to be added to calculate the different decoy cut-offs.
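As a rough overview (illustrative values only, not the actual test data; the real test builds OpenMS peptide identifications with xcorr scores and target/decoy annotations), the main cases look like this:

    struct TestCase
    {
        const char* description;
        bool top_hit_is_novo; // does a deNovo hit score on top?
        double xcorr_diff;    // score gap between deNovo and database hit
    };

    // depending on the 'novo_fract' setting, both, one or none
    // of the last two cases get re-ranked
    const TestCase cases[] = {
        {"top-scoring database hit",                     false, 0.0},
        {"top-scoring deNovo hit",                       true,  0.5},
        {"re-ranking candidate, small xcorr difference", true,  0.05},
        {"re-ranking candidate, larger xcorr difference",true,  0.15},
    };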

Exporting Spectral Quality Calculations

When it comes to the spectral quality, I could have also just made a new class. But that wouldn't be very reasonable, because there already exists a class called Ms2IdentificationRate which calculates an id-rate. The spectral quality is nothing else than the id-rate of the deNovo peptides. So just use that class, right? Well yes, but not without some changes.
This class was written for quality control calculations, which are done on a feature map and not on identification level. Hence I added a second compute function which does the same calculation but on a vector of peptide identifications. I then moved some functionality from the compute functions into private member functions to avoid code duplication.
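In rough outline the idea looks like this (placeholder forward declarations and simplified signatures, not the actual OpenMS header):

    #include <cstddef>
    #include <vector>

    // placeholders so the sketch stands alone; the real types come from OpenMS
    class FeatureMap;
    class MSExperiment;
    class PeptideIdentification;

    class Ms2IdentificationRate
    {
    public:
        // existing path: id-rate from the identifications attached to a feature map
        void compute(const FeatureMap& features, const MSExperiment& exp);

        // new path: the same calculation, directly on a vector of peptide identifications
        void compute(const std::vector<PeptideIdentification>& ids, const MSExperiment& exp);

    private:
        // shared logic factored out so both overloads can reuse it
        double calculateRate_(std::size_t identified_ms2, std::size_t total_ms2) const
        {
            return (total_ms2 == 0) ? 0.0 : double(identified_ms2) / double(total_ms2);
        }
    };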
After some documentation I added a test for this new function. This was pretty easy, because I could just repeat the test for the original compute function without adding the identifications to a feature map beforehand.


All of this was pretty time-consuming. I made a PR for my changes and that concluded week 10.

Next week I will add FDR calculations to the compute function, since it's easier to do this ourselves than to explain to the user what exactly we need.
After that is done, I will start on a new function in DefaultParamHandler which writes parameters to meta values. This will be used to export search engine settings into the output. Later tools can then look up the exact search parameters. In particular, Suitability can look up how the search was performed and run the two searches needed for the suitability correction with the same parameters.
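The rough idea, sketched here with plain standard-library types (the real version would work on OpenMS' Param and MetaInfoInterface; the names are placeholders):

    #include <map>
    #include <string>

    // Sketch of the planned helper: copy every parameter into the meta values
    // of a result object so that downstream tools can look the settings up later.
    struct MetaValueStore
    {
        std::map<std::string, std::string> meta; // stand-in for MetaInfoInterface
    };

    void writeParametersToMetaValues(const std::map<std::string, std::string>& params,
                                     MetaValueStore& target,
                                     const std::string& prefix)
    {
        for (const auto& [name, value] : params)
        {
            // prefix the keys so they don't clash with other meta values
            target.meta[prefix + name] = value;
        }
    }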




