Project Plan

The paper "Assessing protein sequence database suitability using de novo sequencing" describes two metrics that could both be useful. The first verifies the suitability of a database for protein identification in a given sample. The second scores the quality of the mass spectra with regard to their ability to produce high-scoring de novo sequences.

1. Database Suitability

Calculating the suitability of a given protein database for a given set of mass spectra consists of six main steps:

  1. Calculate de novo sequences from the mass spectra;
  2. Append high-quality (score >= 60) de novo sequences to one another to create a single de novo protein;
  3. Add this protein with a unique header to the given protein database;
  4. Run a standard protein search engine (e.g. Comet or X!Tandem);
  5. Re-rank instances where database hits and de novo hits rank 'close' to each other to ensure the database sequence is on top;
  6. Count the number of database and de novo sequences for a given FDR and calculate the quality as follows:

    database quality = #database sequences /
                    (#database sequences + #de novo sequences)

These steps are the ones described in the paper.
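
As a rough illustration of the ratio in step 6, here is a minimal Python sketch. It assumes the search results have already been filtered to the chosen FDR and reduced to one top-hit accession per spectrum. The accession name `DE_NOVO_CONCAT` and the function name are made up for this example and are not taken from the paper or from OpenMS.

```python
# Hypothetical header of the concatenated de novo protein (an assumption).
DE_NOVO_ACCESSION = "DE_NOVO_CONCAT"


def database_suitability(top_hit_accessions):
    """Return #database hits / (#database hits + #de novo hits).

    `top_hit_accessions` holds one protein accession per identified
    spectrum, already filtered to the desired FDR.
    """
    total = len(top_hit_accessions)
    if total == 0:
        raise ValueError("no identifications passed the FDR threshold")
    # A hit counts as de novo if it maps to the concatenated de novo protein.
    novo_hits = sum(1 for acc in top_hit_accessions if acc == DE_NOVO_ACCESSION)
    db_hits = total - novo_hits
    return db_hits / total


if __name__ == "__main__":
    hits = ["P12345", "Q67890", DE_NOVO_ACCESSION, "P12345"]
    print(database_suitability(hits))
```

A value close to 1 would mean that almost all identifications come from the database, while a low value would suggest that the de novo protein explains the spectra better than the database does.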

The target OpenMS workflow which calculates this metric as in the paper will look something like this:


The input files are highlighted in orange, the new tool in green and important intermediate files in purple.
Why IDFilter needs to be run might not be obvious. First, we will probably only use high-scoring sequences for the de novo protein. Second, peptides shorter than six amino acids can be discarded. Third, only peptides with a precursor charge state of two or three will be added. The last two conditions are standard assumptions that are also made in the paper: although not mentioned directly, a quick look at the Python scripts used by the authors reveals them.
As the name implies, IDFilter will filter the given peptide identifications accordingly.
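
To make the three filter conditions concrete, here is a small Python sketch. It is not how IDFilter is implemented. The `DeNovoHit` class and the `keep_hit` function are hypothetical names for this example, and the defaults simply mirror the thresholds mentioned above (score >= 60, at least six amino acids, charge two or three).

```python
from dataclasses import dataclass


@dataclass
class DeNovoHit:
    """Hypothetical container for one de novo peptide identification."""
    sequence: str
    score: float
    charge: int


def keep_hit(hit, min_score=60.0, min_length=6, charges=(2, 3)):
    """Return True if the hit satisfies all three filter conditions."""
    return (
        hit.score >= min_score
        and len(hit.sequence) >= min_length
        and hit.charge in charges
    )


if __name__ == "__main__":
    hits = [
        DeNovoHit("PEPTIDEK", 82.0, 2),  # passes all conditions
        DeNovoHit("PEPK", 90.0, 2),      # too short
        DeNovoHit("PEPTIDEK", 45.0, 3),  # score too low
        DeNovoHit("PEPTIDEK", 82.0, 4),  # charge state not accepted
    ]
    print([h.sequence for h in hits if keep_hit(h)])
```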

What needs to be done to get this workflow, well, working?
  • IDFileConverter can convert an idXML to a FASTA file; however, it just writes each peptide hit as one FASTA entry. Therefore, a flag needs to be added that concatenates the first hit of each peptide identification and writes only one FASTA entry.
  • The 'DatabaseQuality' tool does not exist for the time being and needs to be coded.
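
What the concatenation in the first bullet would produce can be sketched in Python as follows. This only illustrates the intended output and is not the planned IDFileConverter flag. The function name, the header `DE_NOVO_CONCAT` and the line width of 60 are assumptions made for this example.

```python
def build_de_novo_fasta_entry(sequences, header="DE_NOVO_CONCAT", line_width=60):
    """Concatenate peptide sequences into a single FASTA entry.

    `sequences` is expected to hold the top-scoring sequence of each
    peptide identification, in the order they should be appended.
    """
    protein = "".join(sequences)
    lines = [">" + header]
    # Wrap the concatenated sequence to a fixed line width.
    lines.extend(
        protein[i:i + line_width] for i in range(0, len(protein), line_width)
    )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_de_novo_fasta_entry(["PEPTIDEK", "ELVISR"], line_width=10), end="")
```

The resulting entry could then be appended to the protein database before running the search engine.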

2. Spectral Quality

Calculating the spectral quality should be a little simpler. The paper describes it as the ratio of spectra that produced a sequence with a Novor score >= 60 to all MS2 spectra. This is easy to calculate and will also be done in the 'DatabaseQuality' tool.
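
A minimal Python sketch of this ratio, assuming the best Novor score per spectrum and the total number of MS2 spectra are already known (the function and parameter names are made up for this example):

```python
def spectral_quality(best_scores, num_ms2_spectra, min_score=60.0):
    """Return the fraction of MS2 spectra with a de novo score >= min_score.

    `best_scores` holds the best de novo score of every spectrum that
    produced a sequence; spectra without a sequence only count towards
    `num_ms2_spectra`.
    """
    if num_ms2_spectra <= 0:
        raise ValueError("number of MS2 spectra must be positive")
    high_scoring = sum(1 for score in best_scores if score >= min_score)
    return high_scoring / num_ms2_spectra


if __name__ == "__main__":
    print(spectral_quality([75.0, 59.9, 60.0, 12.0], num_ms2_spectra=8))
```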
