GSOC Final Report

What was the goal? A recent publication (found here) describes a metric to evaluate the suitability of a database used for an identification search of LC-MS/MS data. This is useful because sometimes you don't know which organism the data comes from, and in that case it is very challenging to find the best database to search with. OpenMS (codebase located on GitHub) is a huge open-source library with many tools to analyze mass spectrometry data. The goal of this project was to implement the freshly proposed metric as a tool in OpenMS. The paper also describes another metric to score the quality of the recorded spectra. This metric was also to be implemented, but it is much simpler and didn't need as much attention. Was the goal achieved? Yes, it was. One week ago the final version of the database suitability tool was merged. This tool calculates database suitability and spectral quality according to the algorithms presented in the paper. All algorit

Week 13: The Final Week

In the final week of GSOC I implemented and tested the correction function for the database suitability. Implementing: This didn't take that long, mostly because I had already implemented the main functionality for the earlier attempt at correcting the suitability. I just had to change some minor things to make it work. The only thing I did need to add was the functionality to calculate some decoy entries. Since I want to sample the given database, it's not really possible to take a file that already contains decoys. Unfortunately the decoy calculations are, as of now, not exported into a library class. So I just copied some code from the DecoyDatabase UTIL. That's not good style and might be changed later on, but for now it was the fastest solution, and fast solutions are needed considering it's the last week. Benchmarking: I then tested the corrected suitability on the human data from before and also on some data from a roundworm (C. elegans). The files fo
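The two pieces described above can be sketched in a few lines: sampling a fraction of the target database and generating reversed-sequence decoys for the sampled entries. This is only a minimal Python sketch of the idea, not the OpenMS/DecoyDatabase implementation; the function names and the `DECOY_` prefix are assumptions for illustration.

```python
import random

def sample_entries(entries, ratio, seed=None):
    """Randomly sample a fraction of (header, sequence) database entries.

    'entries' and the sampling strategy are illustrative; OpenMS works on
    FASTA files, not Python tuples.
    """
    rng = random.Random(seed)
    k = max(1, round(len(entries) * ratio))
    return rng.sample(entries, k)

def add_reverse_decoys(entries, prefix="DECOY_"):
    """Append a reversed-sequence decoy for every target entry.

    Sequence reversal is one common decoy scheme; the actual
    DecoyDatabase UTIL supports more options.
    """
    decoys = [(prefix + header, seq[::-1]) for header, seq in entries]
    return entries + decoys
```

Because the database is sampled fresh for each run, the decoys have to be generated on the fly for the sampled subset, which is why a pre-built target+decoy file cannot be used here.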

Week 12: Suitability Correction (& some PR Work)

At the start of this week only the PR containing the changes to the adapters had passed. I still had to work on some things for the suitability functions. But since those changes were not that complex and reviewing my changes also took some time, I merged my suitability class on a local branch to start working on the corrected suitability. This is what I did while waiting for reviews or tests to finish: Changes for Suitability Correction: For this correction to work a lot of additional functions needed to be coded. The input of the compute function also had to change drastically, because the two internal identification searches need a lot of additional information. First of all, an mzML input, to have something to search at all. Second, the two databases: the original database and the deNovo "database" (with one entry which is a concatenation of all deNovo sequences), to have something to search with. And third, the parameters, to know which search adapter to use
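The one-entry deNovo "database" mentioned above can be illustrated with a short sketch: all deNovo sequences are concatenated into a single FASTA record. This is a hedged illustration only; the header name `DENOVO_CONCAT` is made up, and OpenMS may label the entry differently.

```python
def build_denovo_fasta(denovo_peptides, header="DENOVO_CONCAT"):
    """Build the one-entry deNovo 'database': all deNovo sequences
    concatenated into a single FASTA record.

    The header string is a placeholder, not the name OpenMS uses.
    """
    sequence = "".join(denovo_peptides)
    return ">{}\n{}\n".format(header, sequence)
```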

Week 11: Preparing the New Suitability Calculation

The plan for correcting the suitability to behave more linearly is to correct the number of top deNovo hits found. This is done with a factor which is calculated like this: #identifications with only deNovo / #identifications with only the database. To calculate this factor, two individual identification searches need to be performed: one with only the database, without the deNovo peptides, and one the other way around. It is crucial that those searches are done with the same settings as the original database+deNovo one. Currently that information is not completely exported by the OpenMS adapters that bind the search engines into the library; only a little of it is written. So that's what I did first this week. Export Adapter Parameters: I added a static function to DefaultParamHandler (this is the OpenMS class built for parameter handling) which takes a Param object (storage for parameters), a MetaInfoInterface object (here meta values can be written) and a string whi
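The correction factor described above can be written down as a small sketch. The factor itself is taken directly from the formula in the text; how exactly it enters the final suitability value is my reading (scaling the deNovo hit count in the suitability ratio), so treat `corrected_suitability` as an assumption, not the tool's exact formula.

```python
def correction_factor(n_ids_novo_only, n_ids_db_only):
    """Factor = #identifications with only deNovo /
                #identifications with only the database."""
    return n_ids_novo_only / n_ids_db_only

def corrected_suitability(n_db_hits, n_novo_hits, factor):
    """Suitability with the deNovo top-hit count scaled by the factor.

    Assumes the suitability definition #db / (#db + #novo); the precise
    placement of the factor is an assumption for illustration.
    """
    corrected_novo = n_novo_hits * factor
    return n_db_hits / (n_db_hits + corrected_novo)
```

Since both auxiliary searches must use the same engine settings as the combined search, the adapters need to export their full parameter set, which motivates the DefaultParamHandler change described next.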

GSOC Week 9 and Week 10: Benchmarking III & Back to C++

The last two weeks were very stressful and I just didn't manage to write a blog entry, so I will summarize both weeks in this one entry. Week 9: Benchmarking is finished for now. I did the sampling approach I talked about last week and I also tested some different FDRs to check for dependency. For the sampling it should be noted that I only did 1/ratio runs for each database ratio to save some running time. The resulting plot looks like this: In the top left you can see that the suitability is pretty much independent of the used FDR, which is generally speaking pretty good. Unfortunately the differences between the paper and the OpenMS workflow cannot be explained by FDR differences. They are probably a result of PeptideProphet vs. target/decoy FDR. To test this I would need to run the OpenMS workflow with PeptideProphet. Maybe I will do this later, but for now the reason behind the small differences isn't that important. Top right, bottom left and bottom right are way mor
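The "1/ratio runs" scheme above can be made concrete with a tiny sketch: smaller database ratios get proportionally more sampling runs, since small random subsets vary more between draws. This is my reading of the post, written as a hypothetical helper, not code from the actual benchmark script.

```python
def runs_for_ratio(ratio):
    """Number of sampling runs per database ratio (my reading: ~1/ratio),
    so that smaller, noisier subsets are resampled more often."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    return max(1, round(1 / ratio))
```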

GSOC Week 8: Benchmarking II

Finally some plots are ready! I finished the Python script that executes the paper workflow. Here I did the FDR-to-MinProb transition I wrote about in my last post. I then wrote another Python script to execute the OpenMS workflow. (That workflow can be found in my project plan.) Both of these scripts can be found here. After finishing both scripts I followed up with a quick plotting script, and done! The first comparison is ready: This looks promising! The more related the species are, the higher the suitability. It would be nice to increase the resolution at the top end of the suitability scale to be able to differentiate the "right" database better. We are not sure how, or if, this would be possible, since it may very well be that, despite the different genomes of related species, the proteomes are even more similar. It would be nice to get a similarity measure of proteomes for this. When comparing the paper results to the OpenMS results it can be noted that the paper consistently scores abo

GSOC Week 7: Benchmarking

I started this week where I ended last week: I finished the benchmarking series in which I test the database suitability tool with a human mzML file and databases from multiple other organisms. As I already described in my last blog post, the tested databases were from multiple primates, some other non-related species, and one shuffled human database. After the first run finished I was a little worried, because no database scored under 70 % suitability. This means something is not working properly. Luckily it didn't take me much time to figure out what it was: I forgot to filter for FDR. This obviously results in a wrong output, since all hits are counted regardless of their q-value. So, once again, I needed to tweak the tool. I added a user input for an FDR, checked against this FDR before counting hits as either 'db' or 'novo', and changed the test and documentation. I then created a PR for those changes. This PR was merged without much discussion. I then ran the benchmarking
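The fix described above, filtering by q-value before counting a top hit as 'db' or 'novo', can be sketched as follows. This is an illustrative Python snippet, not the OpenMS C++ code; the `NOVO_` accession prefix is a made-up convention for telling deNovo hits apart from database hits.

```python
def count_hits(top_hits, fdr_cutoff):
    """Count FDR-filtered top-scoring hits as database ('db') or
    deNovo ('novo') hits.

    top_hits: list of (accession, q_value) tuples, one per spectrum's
    best hit. Accessions starting with 'NOVO_' mark hits against the
    deNovo entry (an assumed convention for this sketch).
    """
    counts = {"db": 0, "novo": 0}
    for accession, q_value in top_hits:
        if q_value > fdr_cutoff:
            continue  # not significant at the chosen FDR: skip, don't count
        if accession.startswith("NOVO_"):
            counts["novo"] += 1
        else:
            counts["db"] += 1
    return counts
```

Without the `q_value > fdr_cutoff` check, every hit is counted, which is exactly the bug that pushed all databases above 70 % suitability in the first benchmark run.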