One of the most frequently used approaches in proteomics, called
bottom-up or shotgun proteomics, relies on tandem mass spectrometry of peptides
after enzymatic protein digestion and subsequent correlation of the obtained spectral
data with amino acid sequences of a given protein database (Eng et al., 1994).
The final protein identifications have to be inferred from the resulting peptide-spectrum matches (PSMs). This can be surprisingly difficult because peptides can be shared by many different proteins, for example as part of conserved motifs (for review see Nesvizhskii and Aebersold, 2005). The proportion of PSMs that can be assigned to more than one protein is especially high in metaproteomic analyses, where target databases naturally contain many homologous proteins from closely related organisms. To handle this, proteins can be clustered based on their shared PSMs (e.g. as described by Koskinen et al., 2011). Each of these clusters (or protein groups) is represented by a master protein, which is selected based on PSM coverage and probability scores. However, the algorithms involved are highly diverse, so please refer to the manual of the software you are using for more information on its protein grouping.
In metaproteomics, highly complex protein mixtures are analyzed to gain new insights into the taxonomic and functional diversity of more or less defined ecosystems. As described in I.1, protein groups resulting from such analyses can consist of many proteins that share PSMs but belong to numerous taxonomic origins and are involved in various functions. Thus, it is difficult or even impossible to choose a single master protein that represents the whole group on both the taxonomic and the functional level. With Prophane we provide a fully automatic (but highly adaptable) workflow that does not rely on master protein information but considers all protein members within the protein groups (first described in Schneider et al., 2011). Each group is analyzed for commonalities between its members on both the taxonomic and the functional level. Additionally, Prophane eases data inspection and interpretation by organizing all relevant information and analysis results in intuitive, interactive tables.
The flow chart above shows a simplified model of the workflow provided by the Prophane bioinformatics pipeline. You can move the mouse cursor over the different elements to get more information.
So far, Prophane accepts protein report files exported by Scaffold or Scaffold viewer (Proteome Software, http://www.proteomesoftware.com). Such protein reports are essentially tab-delimited text files with the extension xls, which can be opened by Microsoft Excel, OpenOffice Calc or any text editor. Scaffold’s protein reports are divided into two parts: first, an introductory list of the parameters used for the database search and, second, the data table. Please make sure that the data table contains columns headed by
FASTA files meeting the official standard. There is no length limit
for either header lines or sequence lines. Multiple headers have to be
separated by SOH (ASCII control character 001, "start of heading", used here as the header separator).
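As an illustration of the header convention described above, the following is a minimal sketch of a FASTA reader that splits each header line at SOH into its subheaders. It is not Prophane's own parser; the function name `read_fasta` and the example records are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

SOH = "\x01"  # ASCII control character 001, used as header separator


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (subheaders, sequence) for every record in a FASTA stream.

    No length limit is imposed on header or sequence lines.
    """
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header.split(SOH), "".join(chunks)
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None and line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header.split(SOH), "".join(chunks)


# Hypothetical example: the first record carries two subheaders.
example = [
    ">gi|111|ref|NP_000001.1| protein A\x01gi|222|gb|AAA00002.1| protein A\n",
    "MKTAYIAKQR\n",
    "QISFVKSHFS\n",
    ">gi|333| protein B\n",
    "MSEQ\n",
]
records = list(read_fasta(example))
```

Because `read_fasta` accepts any iterable of lines, an open file handle can be passed in directly.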
The figure demonstrates the tag-dependent accession recognition provided by Prophane. Accession numbers have to be introduced by an accession type tag (e.g. gi) that is followed by | and preceded by |, the header start (>) or a multiple-header separator (SOH). If an accession (defined by accession string and accession type) listed in the submitted protein report is found, Prophane stores the header information and the amino acid sequence. Importantly, if the header consists of multiple headers, Prophane also adds all accessions of those subheaders that do not share any accession with the respective protein group. This is necessary because many software suites performing MS database searches consider header information only up to the first SOH.
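The recognition rule can be approximated with a regular expression. The sketch below is an assumption-laden illustration rather than Prophane's implementation: the tag list `KNOWN_TAGS` is only an example set, and treating the accession string as ending at the next |, whitespace or SOH is an assumption not stated in this manual.

```python
import re
from typing import List, Tuple

# Illustrative tag set (assumption); the tags actually supported may differ.
KNOWN_TAGS = ("gi", "ref", "sp", "tr", "gb", "emb")

# A tag must be preceded by '|', '>', SOH or the start of the string,
# and must be followed by '|'. The accession string is assumed to run
# up to the next '|', whitespace or SOH.
ACCESSION_RE = re.compile(
    r"(?<![^|>\x01])(" + "|".join(KNOWN_TAGS) + r")\|([^|\s\x01]+)"
)


def find_accessions(header: str) -> List[Tuple[str, str]]:
    """Return (accession type, accession string) pairs found in a header."""
    return ACCESSION_RE.findall(header)


# Hypothetical header with two subheaders separated by SOH.
header = ">gi|111|ref|NP_000001.1| protein A\x01gi|222|gb|AAA00002.1| protein A"
found = find_accessions(header)
```

The negative lookbehind rejects tags embedded in longer words, so a string such as `xgi|999` yields no accession.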
Prophane tries to retrieve annotation data for every protein accession found in your protein report or in multiple headers (see II.3). Depending on the accession type, different sources are considered:
If annotation data have been retrieved from different sources, Prophane checks whether the taxonomic and sequence information is consistent (functional annotation is not checked since it is not standardized, see II.7). If it is not, or if data are missing, the user is informed and has to select the correct information or submit the missing information manually. If sequence information is missing or inconsistent, it is very important that you provide or select the sequence listed in the target database that was used for spectra correlation (see I.1).
Since taxonomic annotation is standardized, comparison is easy. Prophane considers seven taxonomic levels: superkingdom, phylum, class, order, family, genus, and species. The taxonomic lineage of each protein group is determined by comparing the taxonomic lineages of its protein members. If the taxonomic unit at a given level is shared by all group members, it is assigned to the group. If not, that taxonomic level and all subordinate levels are labeled “heterogeneous”.
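The lineage comparison can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Prophane's code: the function name `group_lineage` is hypothetical, each member is represented as a dictionary mapping level to taxon, and a level missing from any member is treated as heterogeneous.

```python
from typing import Dict, List

LEVELS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def group_lineage(members: List[Dict[str, str]]) -> Dict[str, str]:
    """Assign a taxon per level if all members agree, else 'heterogeneous'.

    Once one level disagrees, all lower levels are heterogeneous as well.
    A level missing from any member counts as disagreement (assumption).
    """
    result: Dict[str, str] = {}
    heterogeneous = False
    for level in LEVELS:
        if not heterogeneous:
            taxa = {member.get(level) for member in members}
            if len(taxa) != 1 or None in taxa:
                heterogeneous = True
        result[level] = "heterogeneous" if heterogeneous else members[0][level]
    return result


# Hypothetical group with two members that diverge at genus level.
shared = {
    "superkingdom": "Bacteria",
    "phylum": "Proteobacteria",
    "class": "Gammaproteobacteria",
    "order": "Enterobacterales",
    "family": "Enterobacteriaceae",
}
members = [
    {**shared, "genus": "Escherichia", "species": "Escherichia coli"},
    {**shared, "genus": "Salmonella", "species": "Salmonella enterica"},
]
lineage = group_lineage(members)
```

In this example the group keeps its assignment down to family level, while genus and species are reported as heterogeneous.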
In contrast to taxonomic annotation (see II.6), functional data are quite diverse and less standardized. To allow the comparison of protein group members on the functional level, Prophane performs functional predictions for each protein using RPSBLAST or HMMER3. Please read the manuals (RPSBLAST: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#RPSBWhat; HMMER3: ftp://selab.janelia.org/pub/software/hmmer/CURRENT/Userguide.pdf) for more information about these algorithms. Using RPSBLAST, COG and KOG classifications (Tatusov et al., 2003) can be assigned to prokaryotic and eukaryotic proteins, respectively. A maximal e-value can be defined by the user. Prophane automatically considers the taxonomic information of each protein to choose between the COG and the KOG collection. Using HMMER3, functional predictions are based on Hidden Markov Model profiles provided by TIGRFAMs (Haft et al., 2013) or PFAMs (Punta et al., 2012). In contrast to PFAMs, TIGRFAMs mainly consider prokaryotic proteins. The lowest classification levels of TIGRFAMs and PFAMs are functions and motifs/domains, respectively. Results can be restricted by a maximal e-value threshold or by the recommended gathering threshold. Finally, the functional prediction that is shared by all members is assigned to each protein group. If several different functional predictions are common to all group members, the prediction with the lowest overall e-value (calculated by summing all returned e-values of the respective prediction) is selected. If there is no common functional prediction, the respective group is labeled functionally “heterogeneous”.
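The final selection step can be sketched as follows. This is a minimal illustration, not Prophane's implementation: the function name `group_function`, the representation of each member as a dictionary mapping prediction to e-value, and the alphabetical tie-breaking are assumptions, and the identifiers in the example are placeholders.

```python
from typing import Dict, List


def group_function(members: List[Dict[str, float]]) -> str:
    """Pick the prediction shared by all members with the lowest summed e-value.

    Each member maps a prediction identifier to its e-value. If no
    prediction is common to all members, the group is 'heterogeneous'.
    """
    if not members:
        return "heterogeneous"
    common = set(members[0])
    for predictions in members[1:]:
        common &= set(predictions)
    if not common:
        return "heterogeneous"
    # sorted() only makes ties deterministic (assumption, not from the manual)
    return min(sorted(common), key=lambda p: sum(m[p] for m in members))


# Hypothetical group: two predictions are shared, the third is not.
members = [
    {"COG0001": 1e-30, "COG0002": 1e-10},
    {"COG0001": 1e-20, "COG0002": 1e-50, "COG0003": 1e-5},
]
assigned = group_function(members)
```

Here both shared predictions qualify, and the one whose e-values sum to the smaller total is assigned. Note that a sum of e-values is dominated by the weakest hit, so a single poor match can outweigh several excellent ones.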
estimating protein abundance based on spectral counts. The normalized spectral
abundance factor (introduced by Zybailov et al., 2006)
is calculated in a slightly modified form:
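For reference, the unmodified NSAF of Zybailov et al. (2006) divides the spectral count of each protein by its length and normalizes the result by the sum of these ratios over all proteins in the sample. The following sketch computes this original form only and does not reproduce Prophane's modification; the function name `nsaf` and the example values are hypothetical.

```python
from typing import Dict


def nsaf(spectral_counts: Dict[str, float], lengths: Dict[str, int]) -> Dict[str, float]:
    """Original NSAF: (SpC / L) of a protein divided by the sum of SpC / L."""
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    if total == 0:
        return {p: 0.0 for p in saf}
    return {p: value / total for p, value in saf.items()}


# Hypothetical sample with two proteins.
values = nsaf({"A": 10, "B": 20}, {"A": 100, "B": 400})
```

In this example protein B has twice as many spectra as protein A but is four times longer, so A receives the higher abundance estimate. The values of one sample always sum to 1.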
Prophane provides a single output file to the user which can be opened in any generic
web browser (works best in newest versions of Chrome and Firefox). In this file
all images, data and functionalities are embedded to simulate a full result
website which is intuitive to use.
The protein report is separated into different sections:
Eng, J., A. McCormack and J. Yates (1994). "An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database." Journal of the American Society for Mass Spectrometry 5(11): 976-989.
Haft, D. H., J. D. Selengut, R. A. Richter, D. Harkins, M. K. Basu and E. Beck (2013). "TIGRFAMs and Genome Properties in 2013." Nucleic acids research 41(D1): D387-395.
Koskinen, V. R., P. A. Emery, D. M. Creasy and J. S. Cottrell (2011). "Hierarchical clustering of shotgun proteomics data." Molecular & Cellular Proteomics 10(6): M110.003822.
Nesvizhskii, A. I. and R. Aebersold (2005). "Interpretation of shotgun proteomic data: the protein inference problem." Molecular & Cellular Proteomics 4(10): 1419-1440.
Punta, M., P. C. Coggill, R. Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E. L. Sonnhammer, S. R. Eddy, A. Bateman and R. D. Finn (2012). "The Pfam protein families database." Nucleic acids research 40(Database issue): D290-301.
Schneider, T., E. Schmid, J. V. de Castro Jr., M. Cardinale, L. Eberl, M. Grube, G. Berg and K. Riedel (2011). "Structure and function of the symbiosis partners of the lung lichen (Lobaria pulmonaria L. Hoffm.) analyzed by metaproteomics." Proteomics 11(13): 2752-2756.
Tatusov, R. L., N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin, E. V. Koonin, D. M. Krylov, R. Mazumder, S. L. Mekhedov, A. N. Nikolskaya, B. S. Rao, S. Smirnov, A. V. Sverdlov, S. Vasudevan, Y. I. Wolf, J. J. Yin and D. A. Natale (2003). "The COG database: an updated version includes eukaryotes." BMC Bioinformatics 4: 41.
Zybailov, B., A. L. Mosley, M. E. Sardiu, M. K. Coleman, L. Florens and M. P. Washburn (2006). "Statistical analysis of membrane proteome expression changes in Saccharomyces cerevisiae." Journal of Proteome Research 5(9): 2339-2347.