GSOC Week 1: The Beginning

June 08, 2020

Not much coding happened in the first week. Due to a local holiday, this week was also a day shorter than normal.

The main things that happened were:

I read the Python scripts written by the authors of the paper.
I created a Gantt chart to get an overview of the whole project. It can be found here.
I started working on the first part of the workflow (which you can find in my last blog post): the ability of IDFileConverter to concatenate peptide identifications into one FASTA entry.
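To make the concatenation idea concrete, here is a minimal Python sketch. It is not the IDFileConverter implementation (that is C++ and still in progress); the function name, the `DENOVO_CONCAT` identifier and the line width are assumptions chosen only for illustration.

```python
def concatenate_to_fasta(peptides, identifier="DENOVO_CONCAT", line_width=60):
    """Join peptide sequences into a single FASTA entry.

    `identifier` and `line_width` are hypothetical defaults for this sketch.
    """
    sequence = "".join(peptides)
    lines = [f">{identifier}"]
    # Wrap the concatenated sequence at a fixed width, as FASTA files usually do.
    for start in range(0, len(sequence), line_width):
        lines.append(sequence[start:start + line_width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(concatenate_to_fasta(["AAAAK", "PGGGG"]), end="")
```

All peptides end up in one entry, so a search engine sees them as a single artificial protein.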

While I was talking to my mentors about what I found in the Python scripts and creating a general plan of how to do what, the following interesting points were raised:

What happens if one just appends ALL de novo sequences to the database rather than only the high-scoring ones?
In theory, bad de novo sequences should just act as more decoys during the peptide identification search with Comet/X!Tandem or any other search engine.
This will be checked when the workflow is done!
What about post-translational modifications (PTMs)?
If important PTMs are not used during the search, the ID rate will be low. But the metric might still be high, because Novor is missing those PTMs too.

#db_hits / (#db_hits + #novor_hits)

Using this, it would be possible to tell the difference between crappy data (or a crappy database) and just some missed PTMs.
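As a small illustration of the ratio above, here is a Python sketch. The function name and the example counts are made up; what exactly counts as a "db hit" or a "novor hit" is decided by the workflow, not by this snippet.

```python
def database_hit_ratio(db_hits, novor_hits):
    """Return #db_hits / (#db_hits + #novor_hits)."""
    total = db_hits + novor_hits
    if total == 0:
        # Without any hits the ratio is undefined.
        raise ValueError("no hits to compare")
    return db_hits / total


if __name__ == "__main__":
    # Hypothetical counts, only to show the call.
    print(database_hit_ratio(db_hits=90, novor_hits=10))
```

A value close to 1 means that most top hits come from the database rather than from the de novo sequences.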

On a further note, if time allows, an open-modification search (e.g. MSFragger) could be added to the workflow.
The re-scoring described in the paper, which is done after the Comet search, is based on a "cross correlation score". This is just the score that Comet uses to score its hits. Therefore, other search engines and their scores should also be checked for compatibility. The 'DatabaseQuality' tool could then handle inputs according to the search engine that was used.
While I worked on giving IDFileConverter the option to concatenate the peptides into one FASTA entry, it came to our attention that this concatenation could cause some problems.
If a Novor peptide begins with a P (proline) and is simply appended to the previous peptide, and the search engine later uses trypsin with the P cutting rule as its enzyme, the previous peptide will never be found. The search engine will only search for the concatenation of the two peptides.

AAAAK + PGGGG -> AAAAKPGGGG
KP site will not be cut. AAAAK peptide is lost.
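The example above can be reproduced with a tiny in-silico digest. This is only a sketch of the usual trypsin rule (cut after K or R unless the next residue is P), without missed cleavages; it is not the code that Comet or any other search engine actually runs, and the function name is my own.

```python
def tryptic_digest(sequence):
    """Cut after K or R, except when the next residue is P (proline rule)."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Keep whatever is left after the last cleavage site.
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


if __name__ == "__main__":
    print(tryptic_digest("AAAAK" + "PGGGG"))  # KP site is not cut
    print(tryptic_digest("AAAAK" + "GGGGG"))  # K followed by G is cut
```

With a P-beginning peptide appended, the digest returns only the joined sequence, so AAAAK never shows up as a candidate on its own.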

Because no easy solution presented itself at the moment, the problem will be handled if we see a lot of P-beginning peptides.

Next week I will start by giving Novor some example files and counting the number of P-beginning peptides to decide if the problem needs further investigation.
After that I will probably finish the work on IDFileConverter and open a pull request for the feature.
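The counting step could look roughly like the Python sketch below. It assumes the Novor peptides have already been read into a list of plain sequence strings (parsing the Novor output is left out), and the function name is hypothetical.

```python
def count_proline_starts(peptides):
    """Return (number of peptides starting with P, their fraction of all peptides)."""
    # Ignore empty entries so they do not distort the fraction.
    peptides = [p for p in peptides if p]
    if not peptides:
        return 0, 0.0
    count = sum(1 for p in peptides if p.startswith("P"))
    return count, count / len(peptides)


if __name__ == "__main__":
    # Made-up sequences, only to show the call.
    print(count_proline_starts(["PGGGG", "AAAAK", "PEPTIDEK", "GGGR"]))
```

If the fraction turns out to be small, the concatenation problem can probably be ignored for now.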

GSOC2020 - Database suitability with de novo methodes
