GSOC Final Report

What was the goal?

A recent publication (found here) describes a metric for evaluating how suitable a database is for an identification search of LC-MS/MS data. This is useful because sometimes you don't know which organism the data comes from. In that case it is also very challenging to find the best database to search with.
OpenMS (codebase located on github) is a huge open source library with a lot of tools to analyze mass spectrometry data. The goal of this project was to implement the freshly proposed metric as a tool in OpenMS.

The paper also describes another metric to score the quality of the recorded spectra. This metric was also to be implemented, but it is a lot more trivial and doesn't need that much attention.

Was the goal achieved?

Yes, it was.
One week ago the final version of the database suitability tool was merged. This tool calculates database suitability and spectral quality according to the algorithms presented in the paper.
All algorithms are also part of a class in the library, so any other tool can use them as well.

Change needed for the right identification search to be possible:

First merged version of the tool:

FDR filtering in the tool rather than before by the user:

Final version (functions exported into a library class):

The documentation of the tool is (as of now) not yet included in the current OpenMS documentation, simply because the documentation hasn't been rebuilt recently. So here's a screenshot of the documentation for my tool:

Is the code working?

After the first version of the tool was merged, I started benchmarking with Python scripts. I tested an mzML file with multiple different databases, using both my implementation and the Python scripts from the paper.
My implementation worked about as well as the paper's.
If you want to read more about this and also see some plots, make sure to read the corresponding blog post.

The more interesting benchmark was a sampling approach. Here I calculated the database suitability for a human mzML file with a human database, but the database was repeatedly reduced by 10 % of its amino acids. (corresponding blog post)
Here I found out that the suitability doesn't behave linearly. There is a mathematical reason for that, though: since the suitability is calculated as a ratio of two linear functions, it has to behave hyperbolically.
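To make the hyperbolic behaviour a bit more tangible, here is a minimal toy sketch in Python. It is not the OpenMS implementation. It assumes the suitability is the share of database hits among database plus de novo hits, and that both hit counts change linearly with the fraction of the database that is kept. The function names and the slopes (1000 and 400) are made up purely for illustration.

```python
def suitability(db_hits: float, novo_hits: float) -> float:
    """Toy suitability: database hits divided by all hits (assumed form)."""
    total = db_hits + novo_hits
    return db_hits / total if total else 0.0


def toy_hits(fraction: float, db_slope: float = 1000.0, novo_slope: float = 400.0):
    """Hypothetical linear model: database hits rise and de novo hits fall
    as a larger fraction of the database is kept."""
    return db_slope * fraction, novo_slope * (1.0 - fraction)


if __name__ == "__main__":
    # Print the suitability for 0 %, 10 %, ..., 100 % of the database.
    for step in range(11):
        fraction = step / 10
        db, novo = toy_hits(fraction)
        print(f"{fraction:.1f}  {suitability(db, novo):.3f}")
```

Both the numerator and the denominator are linear in the kept fraction, but because the two slopes differ, the denominator is not constant and the ratio bends away from a straight line.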

Is something left to do?

Yes, and I already started with that.
Since I had some more time after the paper version of the tool was merged, I tried to fix the non-linear behaviour of the suitability.
The general idea is to make the denominator of the suitability ratio constant rather than linear. This is done by making sure that the number of database hits increases at the same rate as the number of de novo hits decreases. For that I need to calculate a correction factor. This is explained in the last section of this blog post.
How well this correction works can be seen and read here.
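The idea behind the correction can be sketched with a small toy model in Python. This is only an illustration under assumptions, not the code from the pull request: it assumes linear hit counts with made-up slopes, and it assumes the correction factor is simply the ratio of the two slopes, applied to the de novo hits. All names here are hypothetical.

```python
DB_SLOPE = 1000.0    # hypothetical: database hits gained per unit of database fraction
NOVO_SLOPE = 400.0   # hypothetical: de novo hits lost per unit of database fraction


def toy_hits(fraction: float):
    """Hypothetical linear hit counts for a given kept fraction of the database."""
    return DB_SLOPE * fraction, NOVO_SLOPE * (1.0 - fraction)


def correction_factor(db_slope: float, novo_slope: float) -> float:
    """Assumed form of the factor: scale de novo hits so that they fall
    at the same rate as the database hits rise."""
    return db_slope / novo_slope


def corrected_suitability(db_hits: float, novo_hits: float, factor: float) -> float:
    """Suitability with scaled de novo hits; the denominator stays constant
    in this toy model, so the result is linear in the kept fraction."""
    total = db_hits + factor * novo_hits
    return db_hits / total if total else 0.0


if __name__ == "__main__":
    factor = correction_factor(DB_SLOPE, NOVO_SLOPE)
    for step in range(11):
        fraction = step / 10
        db, novo = toy_hits(fraction)
        print(f"{fraction:.1f}  {corrected_suitability(db, novo, factor):.3f}")
```

In this toy setting the corrected suitability equals the kept fraction of the database, which is the linear behaviour the correction aims for. With real data the slopes are not known in advance and have to be estimated.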

Since I just started with this correction, there are some things left to be done. Some bugs need fixing, the documentation needs to be adapted to the changes, some code is not yet documented, and of course the tests need to be changed to work with the new correction.

I plan on doing this and therefore finishing the correction of database suitability post-GSOC.

Feature needed for database suitability correction:

The current status of the correction can be found at this draft pull request:

If you want to know more about this project I highly recommend reading this blog. :)

