GSOC Week 2: 'P' Peptides and First Merged PR

It's starting to take shape. My changes to the IDFileConverter were merged this week and now the work on the actual database suitability tool can start!

But first things first: What about those peptides that start with 'P'?

Well, I started this week by trying to determine how big of a problem those pose. To do this I simply counted how many de novo peptides start with 'P' in some idXML files after filtering. I basically ran this pipeline with all the raw files the paper used (they can be found on PRIDE):


FileConverter needs to be run first to convert the Thermo raw files into mzML. After that, NovorAdapter calculates de novo peptides, IDFilter keeps only the meaningful ones (i.e. length >= 6, charge 2 - 3, score >= 60), and a custom bash script counts the ones starting with 'P'.
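For reference, the per-file TOPP calls look roughly like this. This is only a sketch (the helper function is mine, and the exact IDFilter options for the length/charge/score criteria are best taken from its --help output), so the snippet just prints the commands instead of running them:

```shell
# Illustrative sketch: prints the TOPP commands for one raw file.
# FileConverter, NovorAdapter and IDFilter are real OpenMS TOPP tools,
# but the filter flags for length >= 6, charge 2-3 and score >= 60
# are omitted here -- look them up via 'IDFilter --help'.
pipeline_for() {
    base=${1%.raw}
    echo "FileConverter -in $1 -out $base.mzML"
    echo "NovorAdapter -in $base.mzML -out $base.idXML"
    echo "IDFilter -in $base.idXML -out ${base}_filtered.idXML"
}

pipeline_for example.raw
```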

count_P_peptides.sh

for f in *.idXML
do
    number_Ppeptides=0
    number_peptides=0
    ratio=0
    hit=false
    while read -r line
    do
        # a new <PeptideIdentification> starts; its first <PeptideHit> is the top hit
        if [[ "$line" =~ "<PeptideId" ]] && [[ "$hit" == false ]]; then
            hit=true
            number_peptides=$((number_peptides+1))
        fi
        # only inspect the first (best-scoring) hit of each identification
        if [[ "$line" =~ "<PeptideHit" ]] && [[ "$hit" == true ]]; then
            hit=false
            if [[ "$line" =~ "sequence=\"P" ]]; then
                number_Ppeptides=$((number_Ppeptides+1))
            fi
        fi
    done < "$f"
    ratio=$(awk "BEGIN {print 100*$number_Ppeptides/$number_peptides}")
    echo "${f} : ${number_Ppeptides}/${number_peptides} (${ratio}%) begin with P" >> P_Peptides.txt
done

                                                                                                                                   

This script just goes through all idXMLs in the folder it's in and counts how many top peptide hits begin with 'P'. The result is written to a txt file. This could of course be done in Python, but I thought bash was the better (and quicker) choice for such a simple task.
Running this script yields this output:


As you can see, the ratios aren't that high, but still, around 1 to 3 % is quite a lot considering we will use these peptides for a scoring metric. Therefore a solution had to be found.
After consulting with my mentors we decided to simply put all the P-peptides at the beginning of the concatenated sequence. This doesn't solve the issue of [KR]|P sites being created, but it ensures that no other peptides are lost during the identification search.
This solution is a bit non-generic, though, and doesn't work if one uses another enzyme with a different cutting rule, but it is a solution for the time being.
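The reordering idea itself is simple. Here is a toy illustration (the peptide sequences are made up, and the real logic lives in IDFileConverter in C++; this just shows the principle): emit the P-starting peptides first, then the rest, and concatenate.

```shell
# Toy illustration of the reordering: P-peptides first, then the rest.
# Peptide sequences are invented for the example.
peptides="LSEGK PEPTIDER AAGK PLMNR"

first=""
rest=""
for p in $peptides; do
    case $p in
        P*) first="$first$p" ;;   # starts with P -> goes to the front
        *)  rest="$rest$p" ;;     # everything else keeps its order
    esac
done
concat="$first$rest"
echo "$concat"
```

Because no P-peptide ever follows a K or R from another peptide this way, trypsin (which does not cut before proline) cannot "merge" a preceding peptide with a P-peptide and silently remove a cleavage site.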

With that out of the way I could finish the changes to IDFileConverter. The changes are mainly:
  • an additional input flag to give the concatenation option
  • an additional integer option to control how many hits will be exported
The implementation was quite straightforward. The only real decision was how the FASTA header should be designed. Because other de novo search engines exist, it's not a good idea to hard-code a ">nv|000000|NOVOR_NOVOR" header like in the paper. Instead, the search engine name is extracted from the protein identification and a custom string is appended. This string is then placed in the library so that the database suitability tool is able to search for it.

Some of those decisions were made while the PR was already open, and some smaller fixes are not mentioned here. If you want more information you can read the resolved conversations on the PR: https://github.com/OpenMS/OpenMS/pull/4781

That's basically it for the second week. In week 3 I'll finally start working on the database suitability tool itself.
