GSOC Week 2: 'P' Peptides and First Merged PR
It's starting to take shape. My changes to the IDFileConverter were
merged this week and now the work on the actual database suitability
tool can start!
But first things first: What about those peptides that start with
'P'?
Well, I started this week by trying to determine how big of a problem
those pose. To do this I simply counted how many
de novo peptides start with 'P' in some
idXML files after filtering. I basically ran this pipeline with all the raw
files the paper used (those can be found at
PRIDE):
FileConverter needs to be run first to convert the Thermo raw files into mzML.
After that NovorAdapter calculates
de novo peptides, IDFilter filters for
the meaningful ones (i.e. length >= 6, charge 2 - 3, score >= 60), and
a custom bash script counts the ones starting with 'P'.
count_P_peptides.sh
#!/bin/bash
# Count, for every idXML in the current directory, how many
# top peptide hits start with 'P'.
for f in *.idXML
do
    number_Ppeptides=0
    number_peptides=0
    ratio=0
    hit=false
    while read -r line
    do
        # a new <PeptideIdentification> means a new spectrum
        if [[ "$line" =~ "<PeptideId" ]] && [[ "$hit" == false ]]; then
            hit=true
            number_peptides=$((number_peptides+1))
        fi
        # only the first <PeptideHit> after it is the top hit
        if [[ "$line" =~ "<PeptideHit" ]] && [[ "$hit" == true ]]; then
            hit=false
            if [[ "$line" =~ "sequence=\"P" ]]; then
                number_Ppeptides=$((number_Ppeptides+1))
            fi
        fi
    done < "$f"
    ratio=$(awk "BEGIN {print 100*$number_Ppeptides/$number_peptides}")
    echo "${f} : ${number_Ppeptides}/${number_peptides} (${ratio}%) begin with P" >> P_Peptides.txt
done
This script just goes through all idXMLs in the folder it's run in and counts
how many top peptide hits begin with 'P'. The result is written to a text
file. This could of course be done in Python, but I thought Bash was the better
(and faster) choice for such a simple task.
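As a quick sanity check of the counting logic, a grep-based version gives similar numbers. Note that it counts every hit whose sequence starts with 'P', not only the top hit per identification. The idXML snippet below is hypothetical toy data, not real Novor output:

```shell
# Create a minimal idXML-like file (toy data for illustration only)
cat > demo.idXML <<'EOF'
<PeptideIdentification>
  <PeptideHit sequence="PEPTIDER" score="70"/>
  <PeptideHit sequence="AKLMNR" score="65"/>
</PeptideIdentification>
<PeptideIdentification>
  <PeptideHit sequence="SAMPLEK" score="80"/>
</PeptideIdentification>
EOF
total=$(grep -c '<PeptideIdentification' demo.idXML)   # number of identifications
p_count=$(grep -c 'sequence="P' demo.idXML)            # hits starting with 'P' (all hits!)
echo "$p_count/$total begin with P"
```

Because it looks at all hits instead of only the top one, this over-counts slightly, which is why the script above tracks a `hit` flag.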
Running this script yields this output:
As you can see, the ratios aren't that high, but still, around 1 to 3 % is
quite a lot considering we will be using those peptides for a scoring metric.
Therefore a solution had to be found.
After consulting with my mentors about this we decided to just put all the
P-peptides at the beginning of the concatenated sequence. This doesn't solve
the issue of [KR]|P sites being created, but it ensures that no other
peptides will be lost during identification search.
This solution is a bit non-generic, though, and doesn't work when one
tries another enzyme with a different cutting rule, but it is a workable
solution for the time being.
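The reordering itself is simple. A minimal sketch in Bash with a hypothetical peptide list (the real implementation lives in C++ inside IDFileConverter):

```shell
# Toy peptide list; in practice these come from the filtered idXML
peptides="AKLMNR PEPTIDER SAMPLEK PGHIK"
p_first=""   # peptides starting with 'P' go to the front
rest=""      # all other peptides follow
for p in $peptides; do
  case $p in
    P*) p_first="$p_first$p" ;;
    *)  rest="$rest$p" ;;
  esac
done
# With P-peptides leading, any missed [KR]|P cleavage site can only
# affect P-peptides themselves, never the other peptides behind them.
concat="${p_first}${rest}"
echo "$concat"
```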
With that out of the way I could finish the changes to IDFileConverter.
Those changes mainly are:
- an additional input flag to give the concatenation option
- an additional integer option to control how many hits will be exported
The implementation was quite straightforward. The only real decision was how the
FASTA header should be designed. Because other de novo search
engines exist, it's not ideal to hard-code a ">nv|000000|NOVOR_NOVOR"
header like in the paper. Therefore the search engine name is extracted from the
protein identification and appended with a custom string. This string is
then placed in the library so that the database suitability tool is able to
search for it.
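To illustrate the scheme, here is a sketch of what such a header could look like. The engine name would be read from the protein identification in the idXML; the suffix used here is a made-up placeholder, not the actual string used in OpenMS:

```shell
engine="Novor"      # in the real tool: extracted from the idXML protein identification
suffix="DENOVO"     # hypothetical marker string the suitability tool would search for
header=">nv|000000|${engine}_${suffix}"
echo "$header"
```

This way a run with a different de novo engine would still produce a recognizable header, without hard-coding any one engine's name.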
Some of those decisions were made while the PR was already open, and some
smaller fixes are not mentioned here. If you want more information you can
just read the resolved conversations on the PR: https://github.com/OpenMS/OpenMS/pull/4781
That's basically it for the second week. In week 3 I'll finally start working
on the database suitability tool itself.