Predicting Transcription Factor Specificity with All-Atom Models
Sahand Jamal Rahi

Recognized HTML document

Nucleic Acids Research Advance Access published October 1, 2008

Nucleic Acids Research, 2008, 1–9 doi:10.1093/nar/gkn589

Predicting transcription factor specificity with

all-atom models

Sahand Jamal Rahi1, Peter Virnau2,*, Leonid A. Mirny1,3 and Mehran Kardar1

1Department of Physics, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA, 2Staudinger Weg 7, Institut fu¨r Physik, 55099 Mainz, Germany and 3Harvard-MIT Division of Health Sciences and Technology, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA

Received May 4, 2008; Revised August 29, 2008; Accepted September 2, 2008

ABSTRACT

The binding of a transcription factor (TF) to a DNA operator site can initiate or repress the expression of a gene. Computational prediction of sites recognized by a TF has traditionally relied upon knowledge of several cognate sites, rather than an ab initio approach. Here, we examine the possibility of using structure-based energy calculations that require no knowledge of bound sites but rather start with the structure of a protein–DNA complex. We study the PurR Escherichia coli TF, and explore to which extent atomistic models of protein–DNA complexes can be used to distinguish between cognate and noncognate DNA sites. Particular emphasis is placed on systematic evaluation of this approach by comparing its performance with bioinformatic methods, by testing it against random decoys and sites of homologous TFs. We also examine a set of experimental mutations in both DNA and the protein. Using our explicit estimates of energy, we show that the specificity for PurR is dominated by direct protein–DNA interactions, and weakly influenced by bending of DNA.

INTRODUCTION

Binding of cognate sites of DNA is central to many essential biological processes. Most of DNA-binding proteins have the ability to recognize and tightly bind cognate DNA sequences (sites). To find sites bound by a particular DNA-binding protein, one needs to calculate the free energy of binding for the protein and possible DNA sites and then select sites that provide sufficiently low-binding energy. A widely used approach to find sites for a DNA-binding protein is to assume a form of the energy function, infer its parameters and to then calculate the energy for all sites in a genome. To infer parameters one

needs to have a set of known sites bound by the protein. Given these known sites, the parameters are inferred using either a widely used Berg–von Hippel approximation (1), or by other recently proposed methods (2–4). This constitutes a physical basis for many widely used bioinformatics techniques that rely on a particular form of the energy function known as a position-specific weight matrix (PWM).

All these methods require a priori knowledge of the sites (or at least longer sequences containing these sites) bound by the protein. These data are available for only a small number of DNA-binding proteins. For many DNA-binding proteins, however, their sequence of amino acids is well known. Sufficiently high-evolutionary conservation of DNA-binding domains, and the availability of crystal structures for many of them, makes it possible to construct 3D models for a broad range of DNA-binding proteins. Can such protein structures be used to predict sites recognized by a DNA-binding protein? The basic procedure for structure-based methods is to compute the binding energy of the protein–DNA complex. The structure of the complex for an arbitrary DNA sequence can be modeled by replacing (‘mutating’) the DNA sequence in the protein structure containing its cognate site, followed by energy minimization and/or molecular dynamics (MD) to allow the protein–DNA complex to adjust to the new DNA sequence. After several minimization steps, the interaction energy can be calculated using either standard molecular mechanics force fields like AMBER (5) or CHARMM (6) with an implicit solvent, or a knowledge-based force field optimized for the particular complex (7).

Several recent studies have significantly elaborated upon the above procedure. Lafontaine and Lavery (8), for example, pioneered a very efficient process termed ADAPT in which they replace the DNA in the structure by a ‘multicopy’ or ‘average’ piece of DNA. The structure is only minimized once after which the energy of the complex is measured for all possible DNA sequences in place of the average piece. From this, only the energy of the unbound DNA must be subtracted. The unbound protein

*To whom correspondence should be addressed. Tel: +49 6131 392 3646; Fax: +49 6131 392 5441; Email: virnau@uni-mainz.de

© 2008 The Author(s)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

2 Nucleic Acids Research, 2008

energy is the same for all DNA sequences and hence irrelevant for comparisons. This approach is so efficient that all possible sequences (4N for N bases) for short DNA operator sites can be evaluated. Their results successfully identify the experimental consensus sequence for a variety of DNA-binding proteins (9,10), and the ordering of binding free energies for DNA point mutations in several complexes (9). In this context, it was also noted that the actual binding energy computed via minimizations is incorrect and cannot be compared to experiments quantitatively.

Endres et al. (11) allowed protein side chains to explore rotamer conformations in their study of Zif268. Interestingly, the agreement with experiments becomes worse when rotamers are considered, which points to a potential bias of the approach towards sequences similar to the one on which the underlying experimental structure is based. Morozov et al. (12) predict binding affinities using energy measurements as well, they keep their structures rigid or allow them to relax and compare the two approaches. However, instead of considering their binding energies to be approximately equal to free energies as we do, they fit their energies to a few experimentally known free energies. They assign different weights to the energies involved, e.g. the Lennard-Jones or the electrostatic energy, and optimize the weights so that the sum matches the free energy. They proceed to study several transcription factors (TFs) and even find consensus sequence logos for two TFs whose structures they construct by homology modeling. In recent work, Donald et al. (7) focus on direct protein–DNA interactions. They study and compare a number of potentials and propose some that outperform the standard Amber potential. All these efforts represent pioneering work in the emerging field of structure-based predictions of TF specificity.

Here, we explore whether widely available MD force fields can be used to calculate the binding free energy from all-atom models of the protein–DNA complex. In contrast to some of the previous studies, we: (i) assess the power and limitations of the method in dealing with the roughly 106 decoy sites of bacterial genomes (by computing binding energies for representative mutations and assembling an energy-based weight matrix (EBWM), which is then used for the task); (ii) explore whether energy-minimization methods utilizing MD force fields can predict protein–DNA binding when DNA sites, or the protein, are mutated.

For our study, we focus on the purine repressor, PurR, from Escherichia coli, a well-characterized TF with more than 20 known sites in the genome. The purine repressor is a member of the sizable LacI family, which is often regarded as a model system for transcription regulation. The abundance of both experimental (13) and bioinformatics (14) data makes this an ideal target for testing structure-based prediction techniques, and to study their assets and drawbacks.

We demonstrate that generic MD tools predict favorable binding energies for known cognate sites. To quantify the power and limitations of this approach, we investigated the following: (i) can we recognize the cognate sites from a large set of decoys, and estimate the number of false positives? (ii) How does the performance in the above test compare with that of a motif obtained from the set of cognate sites by bioinformatic methods? By calculating binding energies we can also answer the following questions which are not addressable by bioinformatic means: (i) what is the relative importance to recognition of direct binding energies to indirect factors such as DNA bending? (ii) can the computed results for &Gbinding of mutations in DNA, and more importantly in the protein, be compared to experiment? [Bioinformatics data can also be converted to compute AGbinding for DNA, but not protein mutations as in ref. (1–4)].

To test the ability of the force field to discriminate between cognate sites and random decoys, we developed a procedure to speed up calculations and the screening of many sites. We find that a single cognate site can be discriminated from about 7000 random decoy sites. While such performance is impressive, it is insufficient to detect sites from the whole bacterial genome. In the comparisons of our results with experimental binding free energies for DNA and amino acid point mutations, we obtain the correct order of binding free energies of the mutants.

MATERIALS AND METHODS

The change in free energy due to protein–DNA interactions can be decomposed as

Gbinding = Gprotein-DNA complex

Gfree DNA Gfree protein

Clearly, Gbinding depends on both the particular DNA sequence and the protein. In order to simplify the problem from a computational point of view, it is often assumed that the differences in Gbinding for two different DNA sequences are dominated by differences in enthalpy. Entropic contributions are usually ignored since the entropy losses upon binding for both the fragment of DNA and the protein are likely not to depend significantly on the DNA sequence; hence

Gbinding N Ebinding

= EproteinDNA complex Efree (straight) DNA

Efree (unbound) protein

Furthermore, if DNA sequences bound by the same protein are compared and DE(DNA1, DNA2)binding = E(DNA1, Protein) — E(DNA2, Protein) is of interest, the term Efree (unbound) protein cancels out.

The energies of the molecules were measured after minimizing the energy of their structures using the AMBER software package, its force field and an implicit water model. The reference structure in this study is 1qpz (15), a wild-type PurR structure bound to DNA. The sequence of the DNA is also the consensus sequence obtained in the bioinformatics study of ref. (14) and we shall thus refer to it as the consensus sequence. The structure, depicted in Figure 1, was reduced to its 60 amino acid headpiece, and the DNA was trimmed to the 16-bp consensus sequence.

 
Picture

1

2

Nucleic Acids Research, 2008 3

Picture

Figure 1. PurR protein headpiece bound to its consensus sequence DNA. This structure (15) serves as the basis of our study. The DNA base pairs or the protein amino acids in this structure are mutated on the computer and the effects on the binding energy measured. Blue and red: protein chains; orange and gray: DNA.

The first amino acid is missing and was not inserted artificially. The reference for straight DNA was taken and trimmed from the first model of the noncognate LacI– DNA binding complex 1osl (16). DNA sequences were exchanged with the 3DNA computer application (17). The experimental DNA backbone remained in place, only the base pairs were replaced. The free DNA molecule obtained in this manner deviates from a ‘perfect’ B-DNA molecule by about 1A˚ RMS. While the experimentally derived straight DNA molecule was preferred over one with average coordinates, this choice had no significant impact on our results. The SD of the energy difference between the canonical B-DNA structures and ours is merely 1.2 kcal/mol for the 50 random DNA sequences that were used (see subsequently). Also, for example, the linear correlation coefficient of 0.6 between bioinformatics scores and binding energies for the random sequences discussed subsequently does not change at all. Protein mutants were generated with the Mutator 1.0 plugin built into VMD (18). The software uses psfgen to build a new side chain from pre-defined parameters for the CHARMM force field; this structure is not relaxed further by VMD. But the mutated side chain assumes a low-energy conformation during energy minimization because—unlike the original residues—mutated residues were not constrained to remain close to the coordinates of the original structure (see subsequently). For the study of DNA point mutations, the respective structures 1qp0, 1qp4, 1bdh, 1qqb, 1qp7 and 1qqa from Ref. (15) were used in addition to 1qmz. We applied psfgen to combine and prepare structures for minimization.

It should be noted that the conformational energy of the free unbound protein structure [Equations (2) and (6)] was not considered in most cases because we were only interested in differences between complexes. For example, AEprotein deform [Equation (6)] is simply the energy difference between the two bound protein structures. In our investigation of amino acid mutations, we approximated Eunbound protein in [Equation (2)] with Ebound protein, i.e. the self-energy of the protein in the bound complex. Again, this approximation is reasonable as we were only interested in differences of the binding energy between mutant complexes.

For all computations, we used the Amber 9 program with the parm99 force field (5), and the second implicit water model from Ref. (19). No cut-off was applied. Hydrogen atoms and the nucleic bases, as well as substituted residues in our amino acid mutation study, were allowed to rearrange freely to eliminate steric clashes. The movement of the protein and the DNA backbone was restricted by springs with a spring constant of 1.0 kcal/(molA˚ 2). To each configuration, 2500 steepest-descent and 2500 conjugate-gradient minimization steps were applied before energies were calculated to ensure convergence. A typical minimization run for a protein– DNA complex took about 4 h on a 3 Ghz Pentium 4 desktop computer.

While the relaxation of the structures is an essential element of our method, we cannot allow energy minimization to proceed unhindered. This is because: (i) we do not fully trust the potentials and (ii) the finite temperature fluctuations (not included) may prevent the structure moving into certain energy wells. As mentioned in the Introduction Section, previous work has indicated that the more the complex is allowed to move away from the known experimental structure (11,12), the less reliable are the energy-based methods in predicting binding specificity. The springs introduced in the previous paragraph limit the drift of the structure, but their strength is an additional parameter of the problem. In practice, for the spring constant we employ, the RMSD of the protein backbone changes by about 0.4 A˚ from the native structure. Fortunately, we find that the relevant aspects of the binding, namely the relative preferences to different sequences, are independent of the choice of the spring constant as long as the structures’ integrity is preserved. This conclusion was reached after performing studies with spring constants of 1.0, 2.5, 5.0 and 7.5 kcal/(molA˚ 2).

RESULTS

Comparison with bioinformatics scores

In order to assess the quality of binding predictions based on the all-atom calculations, we compared them with predictions made using a bioinformatic technique. The PurR TF has been studied extensively and is therefore particularly well suited for this task. Mironov and co-workers (14) compiled a collection of 21 binding sites to which PurR is considered to bind in E. coli. Assuming independence of the influence of different base pairs on specificity, they set up a PWM that we use to calculate bioinformatics scores for various DNA sequences. Given a sufficient number of known sites, PWM scores provide a good approximation of experimentally measured binding energies (20–22) and have sufficient specificity to detect binding sites in bacterial genomes (23).

We challenged our structure-based approach, which uses only one known site that is a part of the crystal structure, to detect cognate sites among random ones using the binding energies after minimization. These energies were also compared to the PWM scores. In particular,

4 Nucleic Acids Research, 2008

we examined the consensus sequence, the 21 binding sequences, 50 random sequences and several binding sites of closely related TFs FruR, GalR/GalS and MalI. Sites of homologous TFs were chosen because they constitute particularly challenging sites that are similar to PurR cognate sites and share the same palindromic structure. As shown in Figure 2, bioinformatics scores and binding energies correlate well with a linear correlation coefficient of 0:6 for the random sequences and 0:8 for all sequences displayed. The bioinformatics consensus sequence has the second lowest energy and random non-cognate sequences are generally well separated from cognate sequences. While the separation between cognate sites and the 50 random decoys is reassuring, it is important to find out whether the procedure is able to find cognate sites among 106 other sites (decoys) on the bacterial genome.

Assuming a Gaussian distribution of the binding energies for random sites, we can estimate the number of decoys that have binding energies comparable to the cognate sites. The distance between the average of the random sequences (red line) and the third worst cognate site (black line) is 3.63 a. (We chose the third worst cognate site because the two next sequences are the PurA operator sites, see next paragraph.) This roughly amounts to one false positive hit in 7000 random sites. Note that 50 random sequences can only yield a rough estimate for this number. This number is quite encouraging although it should only be considered a rough estimate. For comparison, the corresponding PWM bioinformatics scores from Ref. (14) are separated by 4.55 a which would amount to one false positive hit in 370000 sequences. (The K12 E. coli genome (24) consists of 4.64 Mbps).

 

 

consensus sequence cognate sequences

 

PurA operator random nonspecific sequences

 

FruR

 

GalR/GalS

 

MalI

 

 

 

 

0   20   40   60   80

energy (kcal/mol)

Figure 2. Bioinformatics score versus energy. All binding energies are shown relative to the binding energy of the consensus sequence seqc (blue circle) at 0 kcal/mol. Black circles: 21 binding sequences selected by Mironov et al. (14), PurA sequences are unfilled. Red circles: random noncognate sequences selected from the E. coli genome. Green, indigo and orange triangles: FruR, GalR/GalS and MalI operator sites. The solid red lines indicate the average energy or average bioinformatics score for the random sequences; the dashed lines mark the first SD. The solid black line goes through the data point for the third worst cognate sequence (a black circle). The two cognate sequences with even worse binding energies (hollow black circle) are controversial binding sites. The linear correlation coefficient is 0.6 for the random sequences and 0.8 for all sequences displayed.

The two cognate sequences with the highest energies are the two PurA operators (unfilled black circles). Indeed, the suggestion that the PurA operon may be regulated by PurR is controversial (14,25,26). Although at the lower end of the spectrum, the bioinformatics scores for these sites are comparable with other cognate sites, while our computations give distinctly higher binding energies.

Testing the energy-based approach on operator sites regulated by other members of the LacI family is a particularly challenging task. Although the FruR and MalI binding sequences are energetically well separated from the PurR cognate sequences, GalR/GalS binding sequences are not. (The bioinformatics score appears to have less difficulty with these sites.)

Finally, we would like to point out that the absolute energy scale is incorrect, in line with the conclusions of Ref. (9). Excluding PurA, the range of binding energies for cognate sequences is 10 kcal/mol which is clearly too large. The underlying assumptions and approximations of the method are, however, quite considerable and quantitative agreement cannot really be expected.

Direct and indirect contributions to the binding energy, sequencelogos

The binding free energies can be subdivided into two parts: direct interactions between the TF and DNA, and indirect contributions due to sequence-specific DNA bending. In recent work, Paillard and Lavery (9) noted that the level of each contribution varies significantly from complex to complex. Their method is based on a careful analysis of a subset of sequences with particularly low-binding energies, after having computed the energies of all possible 4N sequences. Here, we propose a simple method which can distinguish between contributions of bending and protein–DNA interactions on the basis of a rather limited set of measurements.

To understand the source of the sequence specificity of PurR, we partitioned its binding energy as follows:

Ebinding = Einteraction

   + EDNA deform ;   3
+ Eprotein deform

where

Einteraction = EproteinDNA complex

   — Ebound protein   ;   4
Ebound DNA

EDNA deform = Ebound DNA   ;5

Estraight DNA

and

Eprotein deform = Ebound protein   :   6

Eunbound protein

6

4

2

bioinformatics score

0

2

Nucleic Acids Research, 2008 5

We next computed direct and indirect contributions for both cognate and random sequences and compare the differences: on average, Einteraction was lower by 34 kcal/mol for the cognate sites compared to random ones and EDNA deform lower by 7 kcal/mol. Assuming that the force field reproduces the correct ratios of direct and indirect contributions, specificity towards PurR is predominantly determined by protein–DNA interactions. It is interesting to note that Eprotein deform was slightly higher for cognate sites (2 kcal/mol) indicating that the interactions were strong enough to bend the protein towards a slightly unfavorable position.

To study the contribution of individual base pairs to specificity, and to significantly speed up computations, we used energy minimization to calculate a position-specific energy matrix, analogous to PWM (14). As specificity towards PurR is dominated by direct, pairwise interactions, we computed the change in Einteraction due to each possible single mutation of the consensus sequence and set up an EBWM (Table 1). (While it is in principle possible to construct a statistical weight matrix based on the top 21 sites identified by energy minimization, this would effectively reproduce the experimental PWM of Ref. (14). The interaction energy for an arbitrary DNA sequence can now be computed by adding the appropriate base pair energies. This requires only a limited number of computations at the cost of being less accurate.

Although computationally efficient, both EBWM and PWM methods are based on the assumption that the contributions of individual base pairs are independent from each other, neglecting many-body effects, such as due to solvation. EBWM calculated using Einteraction also ignores sequence-dependent contributions of DNA deformation

Table 1. Position-specific energy matrices based on direct interaction energies and interaction energies plus bending corrections

 

 

DEinteraction

 

 

DEinteraction + DEDNA deform

A

C

G

T

A

C

G

T

0.0

-0.5

-0.7

-0.7

0.0

0.3

-0.6

0.2

1.3

0.0

0.4

-0.6

1.9

0.0

0.0

0.4

4.7

14.3

0.0

8.4

7.6

19.6

0.0

15.1

-1.0

0.0

2.3

1.2

2.3

0.0

2.4

0.1

0.0

2.1

3.7

3.4

0.0

1.6

2.2

4.2

0.0

3.5

3.4

3.8

0.0

4.4

4.5

5.7

0.0

2.0

0.6

2.1

0.0

0.5

2.8

0.4

6.2

0.0

0.1

5.2

2.3

0.0

-0.6

3.3

5.3

0.1

0.0

6.0

3.8

-0.8

0.0

1.4

2.2

0.6

2.1

0.0

0.4

2.7

0.3

0.0

3.9

3.4

3.6

0.0

8.4

4.8

6.7

0.0

3.4

3.8

2.1

0.0

3.3

2.4

0.3

0.0

0.9

2.3

0.0

-0.9

-0.2

2.1

0.0

2.2

8.4

0.0

14.4

4.6

14.9

0.0

19.3

7.8

-0.6

0.6

0.0

1.4

-0.1

0.2

0.0

1.8

-0.7

-0.6

-0.6

0.0