GSOC Final Report

What was the goal? A recent publication (found here) describes a metric to evaluate the suitability of a database used for an identification search of LC-MS/MS data. This is useful because sometimes you don't know which organism the data comes from, and in that case it is very challenging to find the best database to search with. OpenMS (codebase located on GitHub) is a huge open-source library with many tools to analyze mass spectrometry data. The goal of this project was to implement the freshly proposed metric as a tool in OpenMS. The paper also describes another metric to score the quality of the recorded spectra. This metric was also to be implemented, but it is much simpler and didn't need as much attention. Was the goal achieved? Yes, it was. One week ago the final version of the database suitability tool was merged. This tool calculates database suitability and spectral quality according to the algorithms presented in the paper. All algorit

Week 13: The Final Week

In the final week of GSOC I implemented and tested the correction function for the database suitability. Implementing: This didn't take that long, mostly because I had already implemented the main functionality for the earlier attempt at correcting the suitability. I just had to change some minor things to make it work. The only thing I did need to add was the functionality to calculate some decoy entries. Since I want to sample the given database, it's not really possible to take a file that already contains decoys. Unfortunately the decoy calculations are, as of now, not exported into a library class. So I just copied some code from the DecoyDatabase UTIL. That's not good style and might be changed later on, but for now it was the fastest solution, and fast solutions are needed considering it's the last week. Benchmarking: I then tested the corrected suitability on the human data from before and also on some data from a roundworm (C. elegans). The files fo
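The two pieces described above can be sketched in a few lines: sampling a fraction of the target database and generating reversed-sequence decoys for the sampled entries. This is only a minimal Python sketch of the idea, not the OpenMS/DecoyDatabase implementation; the function names and the `DECOY_` prefix are assumptions for illustration.

```python
import random

def sample_entries(entries, ratio, seed=None):
    """Randomly sample a fraction of (header, sequence) database entries.

    'entries' and the sampling strategy are illustrative; OpenMS works on
    FASTA files, not Python tuples.
    """
    rng = random.Random(seed)
    k = max(1, round(len(entries) * ratio))
    return rng.sample(entries, k)

def add_reverse_decoys(entries, prefix="DECOY_"):
    """Append a reversed-sequence decoy for every target entry.

    Sequence reversal is one common decoy scheme; the actual
    DecoyDatabase UTIL supports more options.
    """
    decoys = [(prefix + header, seq[::-1]) for header, seq in entries]
    return entries + decoys
```

Because the database is sampled fresh for each run, the decoys have to be generated on the fly for the sampled subset, which is why a pre-built target+decoy file cannot be used here.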

Week 12: Suitability Correction (& some PR Work)

At the start of this week only the PR containing the changes to the adapters had passed. I still had to work on some things for the suitability functions. But since those changes were not that complex and reviewing my changes also took some time, I merged my suitability class on a local branch to start working on the corrected suitability. This is what I did while waiting for reviews or tests to finish: Changes for Suitability Correction: For this correction to work a lot of additional functions needed to be coded. The input of the compute function also had to change drastically, because the two internal identification searches need a lot of additional information. First of all, an mzML input, to have something to search at all. Second, the two databases: the original database and the deNovo "database" (with one entry which is a concatenation of all deNovo sequences), to have something to search with. And third, the parameters, to know which search adapter to use
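The one-entry deNovo "database" mentioned above can be illustrated with a short sketch: all deNovo sequences are concatenated into a single FASTA record. This is a hedged illustration only; the header name `DENOVO_CONCAT` is made up, and OpenMS may label the entry differently.

```python
def build_denovo_fasta(denovo_peptides, header="DENOVO_CONCAT"):
    """Build the one-entry deNovo 'database': all deNovo sequences
    concatenated into a single FASTA record.

    The header string is a placeholder, not the name OpenMS uses.
    """
    sequence = "".join(denovo_peptides)
    return ">{}\n{}\n".format(header, sequence)
```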

Week 11: Preparing the New Suitability Calculation

The plan for correcting the suitability to behave more linearly is to correct the number of top deNovo hits found. This is done with a factor which is calculated like this: #identifications with only deNovo / #identifications with only the database. To calculate this factor, two individual identification searches need to be performed: one with only the database, without the deNovo peptides, and one the other way around. It is crucial that those searches are done with the same settings as the original database+deNovo one. Currently that information is not completely exported by the OpenMS adapters that bind the search engines into the library; only a little of it is written. So that's what I did first this week. Export Adapter Parameters: I added a static function to DefaultParamHandler (this is the OpenMS class built for parameter handling) which takes a Param object (storage for parameters), a MetaInfoInterface object (here meta values can be written) and a string whi
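The correction factor described above can be written down as a small sketch. The factor itself is taken directly from the formula in the text; how exactly it enters the final suitability value is my reading (scaling the deNovo hit count in the suitability ratio), so treat `corrected_suitability` as an assumption, not the tool's exact formula.

```python
def correction_factor(n_ids_novo_only, n_ids_db_only):
    """Factor = #identifications with only deNovo /
                #identifications with only the database."""
    return n_ids_novo_only / n_ids_db_only

def corrected_suitability(n_db_hits, n_novo_hits, factor):
    """Suitability with the deNovo top-hit count scaled by the factor.

    Assumes the suitability definition #db / (#db + #novo); the precise
    placement of the factor is an assumption for illustration.
    """
    corrected_novo = n_novo_hits * factor
    return n_db_hits / (n_db_hits + corrected_novo)
```

Since both auxiliary searches must use the same engine settings as the combined search, the adapters need to export their full parameter set, which motivates the DefaultParamHandler change described next.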

GSOC Week 9 and Week 10: Benchmarking III & Back to C++

The last two weeks were very stressful and I just didn't manage to write a blog entry, so I will summarize both weeks in this one entry. Week 9: Benchmarking is finished for now. I did the sampling approach I talked about last week and I also tested some different FDRs to check for dependency. For the sampling it should be noted that I only did 1/ratio runs for each database ratio to save some running time. The resulting plot looks like this: In the top left you can see that the suitability is pretty much independent of the used FDR, which is generally speaking pretty good. Unfortunately the differences between the paper and the OpenMS workflow cannot be explained by FDR differences. They are probably a result of PeptideProphet vs. target/decoy FDR. To test this I would need to run the OpenMS workflow with PeptideProphet. Maybe I will do this later, but for now the reason behind the small differences isn't that important. Top right, bottom left and bottom right are way mor
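The "1/ratio runs" scheme above can be made concrete with a tiny sketch: smaller database ratios get proportionally more sampling runs, since small random subsets vary more between draws. This is my reading of the post, written as a hypothetical helper, not code from the actual benchmark script.

```python
def runs_for_ratio(ratio):
    """Number of sampling runs per database ratio (my reading: ~1/ratio),
    so that smaller, noisier subsets are resampled more often."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    return max(1, round(1 / ratio))
```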

GSOC Week 8: Benchmarking II

Finally some plots are ready! I finished the Python script that executes the paper workflow. Here I did the FDR-to-MinProb transition I wrote about in my last post. I then wrote another Python script to execute the OpenMS workflow. (That workflow can be found in my project plan.) Both of these scripts can be found here. After finishing both scripts I followed up with a quick plotting script, and done! The first comparison is ready: This looks promising! The more related the species are, the higher the suitability. It would be nice to increase the resolution at the top end of the suitability scale to be able to differentiate the "right" database better. We are not sure how, or if, this would be possible, since it may very well be that, despite the different genomes of related species, the proteomes are even more similar. It would be nice to get a similarity measure of proteomes for this. When comparing the paper results to the OpenMS results it can be noted that the paper consistently scores abo

GSOC Week 7: Benchmarking

I started this week where I ended last week: I finished the benchmarking series in which I test the database suitability tool with a human mzML file and databases from multiple other organisms. As I already described in my last blog post, the tested databases were from multiple primates, some other non-related species, and one shuffled human database. After the first run finished I was a little worried, because no database scored under 70 % suitability. This means something is not working properly. Luckily it didn't take me much time to figure out what it was: I forgot to filter for FDR. This obviously results in a wrong output, since all hits are counted regardless of their q-value. So, once again, I needed to tweak the tool. I added a user input for an FDR, checked against this FDR before counting hits as either 'db' or 'novo', and changed the test and documentation. I then created a PR for those changes. This PR was merged without much discussion. I then ran the benchmarking
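The fix described above, filtering by q-value before counting a top hit as 'db' or 'novo', can be sketched as follows. This is an illustrative Python snippet, not the OpenMS C++ code; the `NOVO_` accession prefix is a made-up convention for telling deNovo hits apart from database hits.

```python
def count_hits(top_hits, fdr_cutoff):
    """Count FDR-filtered top-scoring hits as database ('db') or
    deNovo ('novo') hits.

    top_hits: list of (accession, q_value) tuples, one per spectrum's
    best hit. Accessions starting with 'NOVO_' mark hits against the
    deNovo entry (an assumed convention for this sketch).
    """
    counts = {"db": 0, "novo": 0}
    for accession, q_value in top_hits:
        if q_value > fdr_cutoff:
            continue  # not significant at the chosen FDR: skip, don't count
        if accession.startswith("NOVO_"):
            counts["novo"] += 1
        else:
            counts["db"] += 1
    return counts
```

Without the `q_value > fdr_cutoff` check, every hit is counted, which is exactly the bug that pushed all databases above 70 % suitability in the first benchmark run.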