Week 12: Suitability Correction (& some PR Work)

At the start of this week, only the PR containing the changes to the adapters had passed. I still had to work on some things for the suitability functions. But since these changes were not that complex and the review of my changes also took some time, I merged my suitability class into a local branch to start working on the corrected suitability. This is what I did while waiting for reviews or tests to finish:

Changes for Suitability Correction

For this correction to work, a lot of additional functions needed to be coded. The input of the compute function also had to change drastically, because the two internal identification searches need a lot of additional information: first, an mzML input, to have something to search at all; second, the two databases, the original database and the deNovo "database" (with one entry which is a concatenation of all deNovo sequences), to have something to search with; and third, the parameters, to know which search adapter to use with which settings.
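To give an idea of what that means in practice, here is a rough sketch of how the enlarged interface could look. The parameter names and types are illustrative only and not necessarily the exact signature that ended up in the class:

```cpp
// Illustrative only: the exact signature in OpenMS may differ.
#include <vector>
#include <OpenMS/KERNEL/MSExperiment.h>
#include <OpenMS/FORMAT/FASTAFile.h>
#include <OpenMS/METADATA/ProteinIdentification.h>
#include <OpenMS/METADATA/PeptideIdentification.h>

void compute(std::vector<OpenMS::PeptideIdentification> pep_ids,                    // ids from the combined (db + deNovo) search
             const OpenMS::MSExperiment& exp,                                       // the mzML input to re-search
             const std::vector<OpenMS::FASTAFile::FASTAEntry>& original_fasta,      // the original database
             const std::vector<OpenMS::FASTAFile::FASTAEntry>& novo_fasta,          // the concatenated deNovo "database"
             const OpenMS::ProteinIdentification::SearchParameters& search_params); // which adapter and which settings
```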

What the new functions need to do is pretty straightforward: extract the search adapter information, run an identification search followed by indexing (target/decoy information) and q-value calculation, and count the number of identifications found in the searches.

The most complex of these was by far the implementation of an internal identification search. But since some of the TOPP tools execute external tools, I could look up how to call a search adapter from within the library. To pass the parameters to the adapter I'm using an INI file. This can easily be written from a 'Param' object and then be given to the adapter call as a parameter. For this to work the INI file has to exist as an actual file, though. That's why I create a temporary folder where I save the INI file and the other needed inputs (mzML and FASTA) as well as the output of the identification search. I then load the output back into the library, and the temporary folder can be deleted.
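Condensed into a few lines, the idea looks roughly like this. It is a simplified sketch: the real code creates a unique temporary folder, checks return codes and uses proper process handling instead of std::system, and the file names here are made up:

```cpp
#include <cstdlib>
#include <vector>
#include <OpenMS/DATASTRUCTURES/Param.h>
#include <OpenMS/FORMAT/ParamXMLFile.h>
#include <OpenMS/FORMAT/IdXMLFile.h>
#include <OpenMS/METADATA/ProteinIdentification.h>
#include <OpenMS/METADATA/PeptideIdentification.h>

using namespace OpenMS;

void runInternalSearch(const Param& adapter_params,
                       const String& adapter_name, // e.g. "CometAdapter", taken from the search settings
                       const String& tmp_dir,      // temporary folder already containing input.mzML and database.fasta
                       std::vector<ProteinIdentification>& prot_ids,
                       std::vector<PeptideIdentification>& pep_ids)
{
  const String ini = tmp_dir + "/adapter.ini";
  const String out = tmp_dir + "/search_result.idXML";

  // a Param object can be written out as an INI file ...
  ParamXMLFile().store(ini, adapter_params);

  // ... which is then handed to the adapter call like on the command line
  const String call = adapter_name + " -ini " + ini +
                      " -in " + tmp_dir + "/input.mzML" +
                      " -database " + tmp_dir + "/database.fasta" +
                      " -out " + out;
  std::system(call.c_str());

  // load the results back into memory; after this the temporary folder can be deleted
  IdXMLFile().load(out, prot_ids, pep_ids);
}
```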

The indexing and FDR calculations are already exported to the library. I was able to just use the existing classes for this and didn't have to call those TOPP tools as well.
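For this part a sketch could look something like the following. Again, this is only meant to illustrate the flow; the default parameter settings and the counting at the end are my assumptions, not necessarily what the final implementation does:

```cpp
#include <vector>
#include <OpenMS/ANALYSIS/ID/PeptideIndexing.h>
#include <OpenMS/ANALYSIS/ID/FalseDiscoveryRate.h>
#include <OpenMS/FORMAT/FASTAFile.h>
#include <OpenMS/METADATA/ProteinIdentification.h>
#include <OpenMS/METADATA/PeptideIdentification.h>

using namespace OpenMS;

Size countSignificantIds(std::vector<FASTAFile::FASTAEntry>& fasta, // database including decoy entries
                         std::vector<ProteinIdentification>& prot_ids,
                         std::vector<PeptideIdentification>& pep_ids,
                         double cutoff = 0.01)
{
  // annotate target/decoy information (what PeptideIndexer does as a TOPP tool)
  PeptideIndexing indexer;
  indexer.run(fasta, prot_ids, pep_ids);

  // calculate q-values (what FalseDiscoveryRate does as a TOPP tool)
  FalseDiscoveryRate fdr;
  fdr.apply(pep_ids);

  // count the spectra whose best hit passes the cut-off
  Size n = 0;
  for (const PeptideIdentification& id : pep_ids)
  {
    if (!id.getHits().empty() && id.getHits()[0].getScore() <= cutoff) ++n;
  }
  return n;
}
```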

With that, the first attempt at correcting the suitability was done.

Testing the Correction

I then changed my benchmarking Python scripts so that they could test the new functionality. For this I basically just had to change the input of DatabaseSuitability and generate some additional FASTA files.

After I also adapted the plotting scripts, the following data presented itself:

You can see that the correction didn't really work that well. There are two reasons for that:

1. I switched the numerator and the denominator in the factor. It is supposed to be #ids with only the db / #ids with only deNovo, and not the other way around. I also had this wrong in my last two posts. But the bigger problem can be seen in the second plot.

2. The idea was that the correction factor would be constant. That is obviously not the case, since the curve of top deNovo hits isn't linear anymore. This happens because the search with only the database is always done with the given database; when this database gets worse, we get fewer hits.

The New (now better) Correction Factor

So, the question is how to make the factor constant. Well, the idea still stands:

#peptide identifications with a db-only search / #peptide identifications with a deNovo-only search

But we need to make the numerator constant, too. So the idea is to use the number of identifications found when searching with a ratio of "1.0" and to extrapolate some data to get there.

First we figure out which actual ratio corresponds to "1.0". This is done using the deNovo hits, because at "1.0" the number of deNovo hits should be 0 (at least that can be assumed). Once we have that, we can extrapolate the number of db hits to the corresponding ratio. For this we have to assume that at ratio "0" the number of db hits is 0. And then we have a constant factor.

For this to work we need to do one identification search with a smaller sample of the database (to get the data to extrapolate) and one identification search using only the deNovo "database" (to get the denominator of the correction factor). A small sketch of the extrapolation can be found below.
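Here is a small, self-contained sketch of that extrapolation, under the assumptions stated above (deNovo hits reach 0 at the "real 1.0", db hits are 0 at ratio 0, and everything in between is linear). Which measured points anchor the two lines is my reading of the plan, so treat the details and the numbers as illustrative:

```cpp
#include <iostream>

// db_full   : db ids from the search with the complete database (nominal ratio 1.0)
// novo_full : deNovo ids from that same search
// novo_sub  : deNovo ids from the search with the down-sampled database
// sub_ratio : fraction of the database used for that search (e.g. 0.5)
// returns   : extrapolated number of db ids at the ratio where the deNovo ids reach 0
double extrapolateDbHits(double db_full, double novo_full,
                         double novo_sub, double sub_ratio)
{
  // line through the two deNovo measurements; its zero crossing is the "real 1.0"
  const double novo_slope = (novo_full - novo_sub) / (1.0 - sub_ratio);
  const double real_one = sub_ratio - novo_sub / novo_slope;

  // line through (0, 0) and the db measurement at ratio 1.0 (assumption: 0 db hits at ratio 0)
  return db_full * real_one;
}

int main()
{
  // made-up numbers, just to show the mechanics
  const double corrected_numerator = extrapolateDbHits(1000, 50, 200, 0.5);
  const double novo_only_ids = 400; // from the deNovo-only search (the denominator)
  std::cout << "correction factor: " << corrected_numerator / novo_only_ids << '\n';
}
```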

All of this will be done next week (the last one!), though.

At the end of this week, the PR with the export of the suitability functions passed. So starting next week I will probably have to resolve some merge conflicts first.
