GSOC Week 2: 'P' Peptides and First Merged PR

It's starting to take shape. My changes to the IDFileConverter were merged this week and now the work on the actual database suitability tool can start!

But first things first: What about those peptides that start with 'P'?

Well, I started this week by trying to determine how big of a problem those pose. To do this I simply counted how many de novo peptides start with 'P' in some idXML files after filtering. I basically ran this pipeline with all the raw files the paper used (they can be found on PRIDE):


FileConverter needs to be run first to convert the Thermo raw files into mzML. After that, NovorAdapter calculates de novo peptides, IDFilter keeps only the meaningful ones (i.e. length >= 6, charge 2 - 3, score >= 60), and a custom bash script counts the ones starting with 'P'.
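For reference, the per-file TOPP calls look roughly like this. This is only a sketch (the helper function is mine, and the exact IDFilter options for the length/charge/score criteria are best taken from its --help output), so the snippet just prints the commands instead of running them:

```shell
# Illustrative sketch: prints the TOPP commands for one raw file.
# FileConverter, NovorAdapter and IDFilter are real OpenMS TOPP tools,
# but the filter flags for length >= 6, charge 2-3 and score >= 60
# are omitted here -- look them up via 'IDFilter --help'.
pipeline_for() {
    base=${1%.raw}
    echo "FileConverter -in $1 -out $base.mzML"
    echo "NovorAdapter -in $base.mzML -out $base.idXML"
    echo "IDFilter -in $base.idXML -out ${base}_filtered.idXML"
}

pipeline_for example.raw
```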

count_P_peptides.sh

for f in *.idXML
do
    number_Ppeptides=0
    number_peptides=0
    ratio=0
    hit=false
    while read -r line
    do
        # a new <PeptideIdentification> starts; its first <PeptideHit> is the top hit
        if [[ "$line" =~ "<PeptideId" ]] && [[ "$hit" == false ]]; then
            hit=true
            number_peptides=$((number_peptides+1))
        fi
        # only inspect the first (best-scoring) hit of each identification
        if [[ "$line" =~ "<PeptideHit" ]] && [[ "$hit" == true ]]; then
            hit=false
            if [[ "$line" =~ "sequence=\"P" ]]; then
                number_Ppeptides=$((number_Ppeptides+1))
            fi
        fi
    done < "$f"
    ratio=$(awk "BEGIN {print 100*$number_Ppeptides/$number_peptides}")
    echo "${f} : ${number_Ppeptides}/${number_peptides} (${ratio}%) begin with P" >> P_Peptides.txt
done

                                                                                                                                   

This script just goes through all idXMLs in the folder it's in and counts how many top peptide hits begin with 'P'. The result is written to a txt file. This could of course be done in Python, but I thought bash was the better (and quicker) choice for such a simple task.
Running this script yields this output:


As you can see, the ratios aren't that high, but still, around 1 to 3 % is quite a lot considering we will use these peptides for a scoring metric. Therefore a solution had to be found.
After consulting with my mentors we decided to simply put all the P-peptides at the beginning of the concatenated sequence. This doesn't solve the issue of [KR]|P sites being created, but it ensures that no other peptides are lost during the identification search.
This solution is a bit non-generic, though, and doesn't work if one uses another enzyme with a different cutting rule, but it is a solution for the time being.
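The reordering idea itself is simple. Here is a toy illustration (the peptide sequences are made up, and the real logic lives in IDFileConverter in C++; this just shows the principle): emit the P-starting peptides first, then the rest, and concatenate.

```shell
# Toy illustration of the reordering: P-peptides first, then the rest.
# Peptide sequences are invented for the example.
peptides="LSEGK PEPTIDER AAGK PLMNR"

first=""
rest=""
for p in $peptides; do
    case $p in
        P*) first="$first$p" ;;   # starts with P -> goes to the front
        *)  rest="$rest$p" ;;     # everything else keeps its order
    esac
done
concat="$first$rest"
echo "$concat"
```

Because no P-peptide ever follows a K or R from another peptide this way, trypsin (which does not cut before proline) cannot "merge" a preceding peptide with a P-peptide and silently remove a cleavage site.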

With that out of the way I could finish the changes to IDFileConverter. The changes are mainly:
  • an additional input flag to give the concatenation option
  • an additional integer option to control how many hits will be exported
The implementation was quite straightforward. The only real decision was how the FASTA header should be designed. Because other de novo search engines exist, it's not a good idea to hard-code a ">nv|000000|NOVOR_NOVOR" header like in the paper. Instead, the search engine name is extracted from the protein identification and a custom string is appended. This string is then placed in the library so that the database suitability tool is able to search for it.

Some of those decisions were made while the PR was already open, and some smaller fixes are not mentioned here. If you want more information you can read the resolved conversations on the PR: https://github.com/OpenMS/OpenMS/pull/4781

That's basically it for the second week. In week 3 I'll finally start working on the database suitability tool itself.
