GSoC Week 6: Manual Testing & Starting Benchmarking
Because the code part of the tool had already been reviewed, the only thing worth talking about in the PR was the testing.
Since I'm testing the tool with a full dataset and not a small, manually crafted file, the test files are fairly large (about 6 MB for the idXML). To keep the overall size of the library down, it would obviously be better to make these files smaller. This is what I tried at the beginning of this week.
First I tried to filter the original mzML file for a specific RT range. This didn't work that well. First of all, it didn't reduce the size of the resulting idXML by much, only to just above half of the original, probably because the protein section stays almost the same even though there are fewer peptide identifications. Second, it was actually quite hard to find a suitable RT range: it had to be small enough to actually reduce the size, but still big enough that some re-ranking occurred. After trying several different parameters and still not producing a suitable new test file, I consulted my mentors and we decided that this will probably just not work.
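For reference, this is roughly what the RT filtering boils down to. A minimal sketch assuming pyopenms (OpenMS also ships command-line tools that can do this); the file names and the RT window below are placeholders, not the values I actually tried:

```python
# Sketch: keep only spectra inside a retention-time window (pyopenms assumed).
from pyopenms import MSExperiment, MzMLFile

exp = MSExperiment()
MzMLFile().load("original.mzML", exp)

rt_min, rt_max = 2000.0, 3000.0  # seconds; placeholder window
filtered = MSExperiment()
for spec in exp:
    if rt_min <= spec.getRT() <= rt_max:
        filtered.addSpectrum(spec)

MzMLFile().store("filtered.mzML", filtered)
```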
Another idea was to manually pick some interesting re-ranked cases from the current test file, thus producing a manual test idXML with only selected peptide identifications. This would be very useful because it would ensure that corner cases are well handled. I was half-way done with that when I realized that this wouldn't work: the decoy cut-off is calculated over all peptide identifications, so it changes significantly if I only pick some IDs. The handpicked interesting cases are then no longer interesting, because their re-ranking was done based on a different decoy cut-off.
For this whole thing to work I would have to manually adjust all the scores so that I know which decoy cut-off will be calculated and therefore which IDs will be re-ranked. That is a lot of work and not feasible for a simple test.
I talked with my mentors and we decided that this can later be done more easily when the core algorithms are put into the library as classes.
I also discovered that the most frequent corner case, a peptide hit marked as target+decoy, already appears in the current test file.
So, no further changes were needed, and the tool, together with its documentation and tests, was merged into develop!
This took about half of the week. For the rest of it I started with some benchmarking.
In the paper they test their approach with a series of different databases for one known mzML file. To be more specific: they calculate the suitability of databases containing the proteome of different species for searching an mzML file coming from a human sample. They use multiple databases, but the most interesting ones are the primate databases and a shuffled human database. The primate databases score according to their relatedness to humans. That's what I would expect, but it's still pretty cool that this works with this score. The shuffled human database contains the human proteome, but between the trypsin cutting sites the amino acids are shuffled. This results in a database that has nothing to do with the mzML, and therefore the suitability score should be very low. In the paper that's the case.
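To illustrate what "shuffled between the trypsin cutting sites" means, here is a small sketch of my own (not the script used in the paper). Trypsin cleaves after K or R unless a proline follows, so each tryptic peptide is shuffled internally while the cleavage residues stay in place:

```python
import random
import re

def shuffle_between_tryptic_sites(protein: str, seed: int = 0) -> str:
    """Shuffle each tryptic peptide internally, keeping its terminal residue fixed,
    so the cleavage pattern of the shuffled protein stays the same (simplified)."""
    rng = random.Random(seed)
    # Split after K or R, unless the next residue is P (standard trypsin rule).
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    shuffled = []
    for pep in peptides:
        core, last = list(pep[:-1]), pep[-1:]  # keep the cleavage residue in place
        rng.shuffle(core)
        shuffled.append("".join(core) + last)
    return "".join(shuffled)

print(shuffle_between_tryptic_sites("MKTAYIAKQRQISFVK"))
```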
Now the goal of benchmarking is to check if my tool can produce the same results.
But first we had another idea: if we have a database that perfectly fits the mzML and one that is as far off as possible (i.e. a shuffled one), what would happen if we mix them together, e.g. 50 % from the first and 50 % from the second? Checking different mixing ratios could make a good test series.
To prepare this I wrote a Python script which merges two given databases at a provided ratio.
This is probably not the best way of doing this, but it is the most straightforward approach.
First I count how many entries the two files contain (call them N1 and N2). Multiplying these by the ratio and 1-ratio, respectively, gives the number of FASTA entries to export into the new merged database. After that, the first ratio*N1 entries from the first FASTA file and the first (1-ratio)*N2 entries from the second FASTA file are written to the output, and we are done.
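A sketch of that approach follows; the function and file names are made up for illustration and this is not the exact script I wrote:

```python
def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    entries, header, seq = [], None, []
    with open(path) as f:
        for line in f:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    entries.append((header, "".join(seq)))
                header, seq = line, []
            else:
                seq.append(line)
    if header is not None:
        entries.append((header, "".join(seq)))
    return entries

def merge_fasta(path_a, path_b, out_path, ratio):
    """Write the first ratio*N1 entries of A and the first (1-ratio)*N2 entries of B."""
    a, b = read_fasta(path_a), read_fasta(path_b)
    keep_a = round(len(a) * ratio)
    keep_b = round(len(b) * (1 - ratio))
    with open(out_path, "w") as out:
        for header, seq in a[:keep_a] + b[:keep_b]:
            out.write(header + "\n" + seq + "\n")

# Example: a 50/50 mix of a fitting and a shuffled database (placeholder file names).
merge_fasta("human.fasta", "human_shuffled.fasta", "merged_50_50.fasta", 0.5)
```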
But before I do the test series with the merged FASTA files, I started with the series from the paper based on relatedness. Unfortunately this takes quite a lot of time, mostly because of the Comet search. Therefore I could only finish the primates this week, but they look quite reasonable and I have high hopes for next week. Then I will finish this series and afterwards start with the merged FASTA files.