Week 13: The Final Week

In the final week of GSoC I implemented and tested the correction function for the database suitability.

Implementing

The implementation didn't take long, mostly because I had already built the main functionality during the earlier attempt at correcting the suitability. Only a few minor changes were needed to make it work here.

The only thing I needed to add was functionality to calculate some decoy entries. Since I want to sample the given database, it isn't really possible to use a file that already contains decoys. Unfortunately, the decoy calculations are not, as of now, exported into a library class. So I just copied some code from the DecoyDatabase UTIL. That's not good style and might be changed later on, but for now it was the fastest solution, and fast solutions are needed considering it's the last week.
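For context, decoy entries are typically generated by reversing (or shuffling) each target sequence and tagging the accession. Here is a minimal Python sketch of that idea; the names and the `DECOY_` prefix are illustrative, and this is not the actual OpenMS DecoyDatabase code:

```python
import random

def generate_decoys(entries, method="reverse", seed=0):
    """Create decoy entries from a list of (accession, sequence) tuples.

    method="reverse" reverses each sequence; method="shuffle" permutes it.
    All identifiers here are illustrative, not the OpenMS API.
    """
    rng = random.Random(seed)
    decoys = []
    for accession, sequence in entries:
        if method == "reverse":
            decoy_seq = sequence[::-1]
        else:
            chars = list(sequence)
            rng.shuffle(chars)
            decoy_seq = "".join(chars)
        # Tag the accession so decoy hits are recognizable downstream.
        decoys.append(("DECOY_" + accession, decoy_seq))
    return decoys

targets = [("P12345", "MKWVTFISLLF"), ("Q67890", "GSHMLEDPA")]
print(generate_decoys(targets)[0])  # → ('DECOY_P12345', 'FLLSIFTVWKM')
```

Reversing keeps the amino-acid composition and length identical to the target, which is why it is a common default for decoy generation.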

Benchmarking

I then tested the corrected suitability on the human data from before, and also on some data from a roundworm (C. elegans). The files for the latter also came from the PRIDE folder of the original paper.

Human:

[plot: suitability/de novo hits vs. database ratio for the human data]

C. elegans:

[plot: suitability/de novo hits vs. database ratio for the C. elegans data]

As you can see, the correction works pretty well!

Let's first ignore the outliers around a ratio of 0.5. Apart from those, the correction does a pretty good job of linearizing the suitability. The red lines depict the suitability/de novo hits before correction, the blue lines after correction.

For the human data the corrected suitability graph follows the diagonal almost perfectly. This is exactly what we wanted, because now the suitability corresponds to the ratio of the used database. That is far easier to explain to users: before, the suitability could only be used as a comparison tool, while now the score actually means something.
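To make "the suitability corresponds to the database ratio" concrete, here is a rough sketch of a ratio-based correction. It assumes the correction factor is derived from one extra search against a subsampled database, with the factor then used to re-weight the de novo hits; all function and parameter names are illustrative, not the actual OpenMS API:

```python
def correction_factor(db_hits_full, novo_hits_full, db_hits_sub, novo_hits_sub,
                      ratio_full=1.0, ratio_sub=0.5):
    """Estimate how many de novo hits appear per database hit lost when the
    database is subsampled from ratio_full down to ratio_sub. Illustrative."""
    d_ratio = ratio_full - ratio_sub
    slope_db = (db_hits_full - db_hits_sub) / d_ratio      # positive
    slope_novo = (novo_hits_full - novo_hits_sub) / d_ratio  # negative
    # De novo hits rise as database hits fall, so the factor comes out positive.
    return -slope_db / slope_novo

def corrected_suitability(db_hits, novo_hits, factor):
    # Re-weight the de novo hits so the score scales roughly linearly with
    # the database ratio instead of saturating near 1.
    return db_hits / (db_hits + factor * novo_hits)

# Hypothetical numbers: 950 db / 50 de novo hits with the full database,
# 600 db / 200 de novo hits with a 50% subsample.
f = correction_factor(950, 50, 600, 200)
print(corrected_suitability(950, 50, f))  # → 0.890625
```

This is only a sketch under the stated assumptions; the real logic lives in the OpenMS suitability class.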

Looking at the C. elegans plot, the graph is a bit more off. That's because the "right" database scores much lower there than in the human example.

Now to the outliers: I'm not quite sure why this happens. The internal extrapolation should run suitability calculations for ratios of 0.2 to 0.3 here, and looking at the non-corrected data, the values at those ratios are very stable. It could, of course, very well be a bug in my code. But since the outliers appear in both cases, and in both around a ratio of 0.5, it might also be a general problem with the approach. To find out what is happening I'm going to add some debug output to the suitability class. A fix could be to do multiple subsampling runs, but that's hard to say as long as I don't know exactly what causes the behaviour.
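The "multiple subsampling runs" idea mentioned above could be sketched like this: repeat the subsampled search with several random seeds and take the median of the resulting correction factors, so that one unlucky sample doesn't produce an outlier. Everything here is hypothetical; `run_suitability` is a stand-in for whatever actually performs a search against a subsampled database:

```python
import statistics

def averaged_factor(run_suitability, n_runs=5, ratio_sub=0.5):
    """Median correction factor over several random subsampling runs.

    run_suitability(seed, ratio) -> (db_hits, novo_hits); illustrative.
    """
    db_full, novo_full = run_suitability(seed=0, ratio=1.0)
    factors = []
    for seed in range(n_runs):
        db_sub, novo_sub = run_suitability(seed=seed, ratio=ratio_sub)
        slope_db = (db_full - db_sub) / (1.0 - ratio_sub)
        slope_novo = (novo_full - novo_sub) / (1.0 - ratio_sub)
        factors.append(-slope_db / slope_novo)
    # The median is robust against a single outlier run.
    return statistics.median(factors)
```

Whether this actually removes the outliers depends on what causes them, which is exactly what the debug output should reveal.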

 

Unfortunately, the week is over now, so this will be the final point of the project for the time being. I will write a summary of everything I did and post it on this blog.

Post-GSoC, I'll probably finish the suitability correction to get this functionality into a working state.
