2 Nucleic Acids Research, 2008
energy is the same for all DNA sequences and hence irrelevant for comparisons. This approach is so efficient that all possible sequences ($4^N$ for $N$ bases) for short DNA operator sites can be evaluated. Their results successfully identify the experimental consensus sequence for a variety of DNA-binding proteins (9,10), and the ordering of binding free energies for DNA point mutations in several complexes (9). In this context, it was also noted that the actual binding energy computed via minimizations is incorrect and cannot be compared to experiments quantitatively.
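To make the $4^N$ count concrete, the following minimal Python sketch enumerates every sequence of a given length. The function names are illustrative and are not taken from the cited studies; the sketch only shows why exhaustive evaluation is feasible for short operator sites and grows quickly with $N$.

```python
from itertools import product

BASES = "ACGT"


def all_sequences(n):
    """Yield every DNA sequence of length n (4**n sequences in total)."""
    for combo in product(BASES, repeat=n):
        yield "".join(combo)


def count_sequences(n):
    """Number of distinct DNA sequences of length n."""
    return len(BASES) ** n


if __name__ == "__main__":
    print(len(list(all_sequences(3))))  # 64 sequences for N = 3
    print(count_sequences(10))          # 1048576 sequences for N = 10
```

Because the count grows as $4^N$, exhaustive evaluation is practical only when the per-sequence energy estimate is cheap, as in the rigid-structure approach described above.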
Endres et al. (11) allowed protein side chains to explore rotamer conformations in their study of Zif268. Interestingly, the agreement with experiments becomes worse when rotamers are considered, which points to a potential bias of the approach towards sequences similar to the one on which the underlying experimental structure is based. Morozov et al. (12) also predict binding affinities using energy measurements; they either keep their structures rigid or allow them to relax, and compare the two approaches. However, instead of considering their binding energies to be approximately equal to free energies as we do, they fit their energies to a few experimentally known free energies. They assign different weights to the energies involved, e.g. the Lennard-Jones or the electrostatic energy, and optimize the weights so that the sum matches the free energy. They proceed to study several transcription factors (TFs) and even find consensus sequence logos for two TFs whose structures they construct by homology modeling. In recent work, Donald et al. (7) focus on direct protein–DNA interactions. They study and compare a number of potentials and propose some that outperform the standard Amber potential. All these efforts represent pioneering work in the emerging field of structure-based predictions of TF specificity.
Here, we explore whether widely available MD force fields can be used to calculate the binding free energy from all-atom models of the protein–DNA complex. In contrast to some of the previous studies, we: (i) assess the power and limitations of the method in dealing with the roughly $10^6$ decoy sites of bacterial genomes (by computing binding energies for representative mutations and assembling an energy-based weight matrix (EBWM), which is then used for the task); (ii) explore whether energy-minimization methods utilizing MD force fields can predict protein–DNA binding when DNA sites, or the protein, are mutated.
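The idea of an energy-based weight matrix can be illustrated with a short Python sketch. It assumes an additive model in which each position contributes independently to the energy of a site, relative to the consensus base; this assumption, the three-base toy consensus and all energy values are hypothetical and chosen only for illustration, not taken from this study.

```python
BASES = "ACGT"

# Hypothetical toy data: energy change (arbitrary units) for replacing the
# consensus base at a given position by another base. Invented values.
TOY_CONSENSUS = "ACG"
TOY_ENERGIES = {
    (0, "C"): 2.0, (0, "G"): 1.0, (0, "T"): 3.0,
    (1, "A"): 0.5, (1, "G"): 2.5, (1, "T"): 1.5,
    (2, "A"): 4.0, (2, "C"): 1.0, (2, "T"): 2.0,
}


def build_ebwm(consensus, mutation_energies):
    """Return one column per position mapping base -> energy relative to consensus."""
    ebwm = []
    for pos, cons_base in enumerate(consensus):
        column = {cons_base: 0.0}
        for base in BASES:
            if base != cons_base:
                column[base] = mutation_energies[(pos, base)]
        ebwm.append(column)
    return ebwm


def site_energy(ebwm, site):
    """Additive score: sum of the per-position contributions of a candidate site."""
    if len(site) != len(ebwm):
        raise ValueError("site length must match matrix width")
    return sum(column[base] for column, base in zip(ebwm, site))


def scan(ebwm, genome):
    """Score every window on the given strand; lowest (most favorable) energy first."""
    width = len(ebwm)
    hits = [
        (site_energy(ebwm, genome[i:i + width]), i, genome[i:i + width])
        for i in range(len(genome) - width + 1)
    ]
    return sorted(hits)


if __name__ == "__main__":
    matrix = build_ebwm(TOY_CONSENSUS, TOY_ENERGIES)
    print(site_energy(matrix, "GCG"))
    print(scan(matrix, "TTACGTT")[0])
```

Under the additivity assumption, only three single-base mutations per position need to be evaluated with the force field, after which any candidate site can be scored by table lookup rather than by a new minimization.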
For our study, we focus on the purine repressor, PurR, from Escherichia coli, a well-characterized TF with more than 20 known sites in the genome. The purine repressor is a member of the sizable LacI family, which is often regarded as a model system for transcription regulation. The abundance of both experimental (13) and bioinformatics (14) data makes this an ideal target for testing structure-based prediction techniques and for studying their strengths and weaknesses.
We demonstrate that generic MD tools predict favorable binding energies for known cognate sites. To quantify the power and limitations of this approach, we investigated the following: (i) can we recognize the cognate sites from a large set of decoys, and estimate the number of false positives? (ii) how does the performance in the above test compare with that of a motif obtained from the set of cognate sites by bioinformatic methods? By calculating binding energies we can also answer the following questions, which are not addressable by bioinformatic means: (i) what is the relative importance for recognition of direct binding energies versus indirect factors such as DNA bending? (ii) can the computed results for $\Delta G_{\text{binding}}$ of mutations in DNA, and more importantly in the protein, be compared to experiment? [Bioinformatics data can also be converted to compute $\Delta G_{\text{binding}}$ for DNA mutations, as in refs (1–4), but not for protein mutations.]
To test the ability of the force field to discriminate between cognate sites and random decoys, we developed a procedure to speed up calculations and the screening of many sites. We find that a single cognate site can be discriminated from about 7000 random decoy sites. While such performance is impressive, it is insufficient to detect sites from the whole bacterial genome. In the comparisons of our results with experimental binding free energies for DNA and amino acid point mutations, we obtain the correct order of binding free energies of the mutants.
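A rough back-of-the-envelope calculation shows why this level of discrimination falls short at the genome scale. The sketch below assumes that "one cognate site among about 7000 decoys" can be read as roughly one decoy in 7000 scoring as favorably as a cognate site; this reading, and the function name, are assumptions made for illustration.

```python
def expected_false_positives(n_decoy_sites, decoys_per_false_positive):
    """Expected number of decoys scoring at least as well as a cognate site."""
    return n_decoy_sites / decoys_per_false_positive


if __name__ == "__main__":
    # Roughly 10**6 decoy sites in a bacterial genome, one false positive per ~7000.
    print(round(expected_false_positives(1_000_000, 7000)))  # 143
```

Under this assumption, on the order of a hundred decoys would score as well as a true site, compared with the more than 20 known PurR sites in the genome.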
MATERIALS AND METHODS
The change in free energy due to protein–DNA interactions can be decomposed as
$$\Delta G_{\text{binding}} = G_{\text{protein-DNA complex}} - G_{\text{free DNA}} - G_{\text{free protein}}$$
Clearly, $\Delta G_{\text{binding}}$ depends on both the particular DNA sequence and the protein. In order to simplify the problem from a computational point of view, it is often assumed that the differences in $\Delta G_{\text{binding}}$ for two different DNA sequences are dominated by differences in enthalpy. Entropic contributions are usually ignored since the entropy losses upon binding for both the fragment of DNA and the protein are likely not to depend significantly on the DNA sequence; hence
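$$\begin{aligned}
\Delta G_{\text{binding}} &\approx \Delta E_{\text{binding}} \\
&= E_{\text{protein-DNA complex}} - E_{\text{free (straight) DNA}} - E_{\text{free (unbound) protein}}
\end{aligned}$$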
Furthermore, if DNA sequences bound by the same protein are compared and $\Delta E(\text{DNA}_1, \text{DNA}_2)_{\text{binding}} = E(\text{DNA}_1, \text{Protein}) - E(\text{DNA}_2, \text{Protein})$ is of interest, the term $E_{\text{free (unbound) protein}}$ cancels out.
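The cancellation of the free-protein term can be checked numerically with a minimal Python sketch. The function names and all energy values are hypothetical and serve only to illustrate the bookkeeping; here $E(\text{DNA}, \text{Protein})$ is taken to denote the binding energy of that pair.

```python
def binding_energy(e_complex, e_free_dna, e_free_protein):
    """Binding energy: complex minus free DNA minus free protein."""
    return e_complex - e_free_dna - e_free_protein


def relative_binding_energy(e_complex_1, e_free_dna_1, e_complex_2, e_free_dna_2):
    """Difference in binding energy between two DNA sequences bound by the
    same protein; the free-protein energy is not needed because it cancels."""
    return (e_complex_1 - e_free_dna_1) - (e_complex_2 - e_free_dna_2)


if __name__ == "__main__":
    e_protein = -1200.0                # hypothetical free-protein energy
    c1, d1 = -5000.0, -3700.0          # hypothetical complex / free DNA, sequence 1
    c2, d2 = -4990.0, -3695.0          # hypothetical complex / free DNA, sequence 2

    explicit = binding_energy(c1, d1, e_protein) - binding_energy(c2, d2, e_protein)
    shortcut = relative_binding_energy(c1, d1, c2, d2)
    print(explicit, shortcut)          # both give -5.0
```

The shortcut gives the same result for any value of the free-protein energy, so only the complex and the free DNA need to be minimized when comparing DNA sequences for a fixed protein.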
The energies of the molecules were measured after minimizing the energy of their structures using the AMBER software package, its force field and an implicit water model. The reference structure in this study is 1qpz (15), a wild-type PurR structure bound to DNA. The sequence of the DNA is also the consensus sequence obtained in the bioinformatics study of ref. (14) and we shall thus refer to it as the consensus sequence. The structure, depicted in Figure 1, was reduced to its 60 amino acid headpiece, and the DNA was trimmed to the 16-bp consensus sequence.