Title: Combining features in a graphical model to predict protein interaction sites

Authors:
Torsten Wierschin, torsten.wierschin [at] uni-greifswald.de
Keyu Wang, kwang [at] informatik.uni-goettingen.de
Marlon Welter, marlon.welter [at] cs.uni-goettingen.de
Stephan Waack, waack [at] informatik.uni-goettingen.de
Mario Stanke, mario.stanke [at] uni-greifswald.de

Key words: conditional random field, probabilistic interface labeling problem, free energy feature

Abstract: Considerable effort has gone into classifying residues as interaction interfaces in proteins using machine learning methods. The model class of conditional random fields (CRFs) has been used successfully for protein interaction site prediction. With CRFs, the prediction task translates into the computational challenge of assigning each residue the label interface or non-interface, given some observations. In this work, the observational data come from various, possibly highly correlated, sources. They include the structure of the protein but not the structure of the complex. We present a new method called DFE-CRF that models the dependencies between residues using a neighbourhood graph that, unlike earlier CRF models, is not a linear chain. We introduce a novel node feature, "change in free energy". Model parameters are trained by adapting the Online Large-Margin algorithm. Using the standard feature class relative accessible surface area, DFE-CRF achieves higher prediction accuracy than the linear-chain CRF of Li et al. on a data set of 1276 protein chains published by Keskin et al. Over a wide range of false positive rates, it performs significantly better than the PresCont method of Zellner et al. on a homodimer set containing 128 chains. DFE-CRF has a broader scope than PresCont since it is not constrained to protein subgroups and requires no multiple sequence alignment (MSA).
The improvement is attributed to the advantageous combination of the novel node feature with the standard feature and to the adopted parameter training.

References:
Keskin O, Tsai CJ, Wolfson H, Nussinov R: A new, structurally nonredundant, diverse data set of protein-protein interfaces and its implications. Protein Science 2004, 13:1043–1055.
Zellner H, Staudigel M, Trenner M, Bittkowski M, Wolowski V, Icking M, Merkl R: PresCont: Predicting Protein-Protein Interfaces Utilizing Four Residue Properties. Proteins: Structure, Function and Bioinformatics 2011.
Li MH, Lin L, Wang XL, Liu T: Protein-protein interaction site prediction based on conditional random fields. Bioinformatics 2007, 23(5):597–604.