We discuss the structural preferences of dinucleotides in contact and not in contact with proteins, as observed in the structures of DNA crystallized with regulatory and histone proteins. We define a “contact” as a distance between nucleotide and amino acid atoms shorter than 6.0 Å. The 6 Å limit is chosen so that dinucleotides contacting the protein via a water bridge are included in the group of interacting dinucleotides. Water-mediated contacts are frequent, and the nucleotides and amino acids involved have structural [9] and dynamic [16] characteristics similar to those of directly interacting residues. Data for dinucleotides directly interacting with proteins (interatomic distances ≤ 3.6 Å) can be found in Supplementary Table S3. Tables of the CANA/sequence incidences for dinucleotides closer than 3.6 Å to amino acids are not discussed further because their interpretation led to the same conclusions as the structurally and statistically more robust data based on the limiting interaction distance of 6 Å shown in Figure 1 and Supplementary Table S3.
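The two distance limits can be illustrated with a minimal sketch. It assumes that atoms are given as plain (x, y, z) coordinates in Å and that the shortest interatomic distance decides the class; the function names and the three labels are hypothetical and are not taken from the software used in this work.

```python
import math

CONTACT_CUTOFF = 6.0  # Å; the limit that also captures water-bridged contacts
DIRECT_CUTOFF = 3.6   # Å; the limit for direct contacts

def min_distance(dna_atoms, protein_atoms):
    """Shortest distance between any DNA atom and any protein atom."""
    return min(math.dist(a, b) for a in dna_atoms for b in protein_atoms)

def classify_contact(dna_atoms, protein_atoms):
    """Label a dinucleotide by its shortest distance to the protein.

    'direct' dinucleotides also belong to the wider 6 Å contact group;
    the separate label only marks the stricter 3.6 Å subset.
    """
    d = min_distance(dna_atoms, protein_atoms)
    if d <= DIRECT_CUTOFF:
        return "direct"
    if d < CONTACT_CUTOFF:
        return "contact"
    return "no contact"

# Hypothetical coordinates, for illustration only.
print(classify_contact([(0.0, 0.0, 0.0)], [(5.0, 0.0, 0.0)]))
```

In practice the coordinates would be read from the deposited structure files, and a spatial index would replace the all-pairs loop for large complexes.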
3.2. CANA—Sequence Associations
Combining the structural and sequence information in one matrix (Figure 1) provides a much richer but also more complex picture of the interplay between dinucleotide behavior and the interacting partners. To make the data in Figure 1 visually more intelligible, we highlighted the low and high values in color. For the numbers of occurrences (left column), matrix elements containing less than 15% of the average are marked in blue; those with more than twice the average are marked in red. For the matrices with SPR data, green indicates overpopulation and blue underpopulation with the corresponding probability less than 1.0 × 10⁻⁶ (probabilities are in the supplementary material). Although the signal levels used to highlight the data are subjective and arbitrary from the statistical point of view, varying them within fairly large limits does not change the observed patterns, as can be tested in Supplementary Table S3. The present highlights, therefore, show the characteristic and, we believe, the most important features of the data.
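The highlighting rules can be sketched as follows. The sketch uses the textbook definition of standardized Pearson residuals for a contingency table, (observed − expected) / sqrt(expected × (1 − row share) × (1 − column share)), and assumes that "the average" is taken over all elements of the matrix. Both choices are assumptions, since the exact formulas are not restated here, and the function names are hypothetical. The probability threshold for the SPR colors is not computed.

```python
import math

def standardized_pearson_residuals(table):
    """Standardized Pearson residuals of a CANA-by-sequence count table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    residuals = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            variance = expected * (1 - row_tot[i] / n) * (1 - col_tot[j] / n)
            # Cells in an empty row or column carry no information.
            out_row.append((observed - expected) / math.sqrt(variance)
                           if variance > 0 else 0.0)
        residuals.append(out_row)
    return residuals

def highlight_counts(table, low=0.15, high=2.0):
    """Mark cells below 15% of the mean count blue, above twice the mean red."""
    cells = [x for row in table for x in row]
    mean = sum(cells) / len(cells)
    return [["blue" if x < low * mean else "red" if x > high * mean else ""
             for x in row] for row in table]

# Hypothetical counts, for illustration only.
counts = [[1, 10], [10, 40]]
print(highlight_counts(counts))
print(standardized_pearson_residuals(counts))
```

Changing `low` and `high` gives a quick way to check how sensitive the highlighted pattern is to the chosen limits.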
Inspection of the matrices in Figure 1 reveals that, apart from the numerical prevalence of the canonical BI-DNA form in all four types of dinucleotides, the matrix for the dinucleotides in contact with regulatory proteins, Regulatory < 6 Å, has a different pattern from the other three. It has a fairly well populated AAA letter and low numbers of BB2 and miB. The AAA letter is induced at recognition sites of certain transcription factors, where its presence is usually connected to large local deformations resulting in a bent DNA duplex. Sharp bending of DNA is known to occur in DNA complexed with TATA box binding proteins, such as in the human TFIIB–TBP–DNA complex (PDB code 4roc [21], Figure 4b). Notably, the letters AAA, and to a lesser degree A-B, are present in Regulatory < 6 Å in all sequences, including A/T-rich sequences known not to adopt this DNA form readily.
The Regulatory < 6 Å group also differs from the rest when the standardized Pearson residuals (SPR) are inspected. It has a checkerboard pattern of over- and under-represented CANA/sequence matrix elements: e.g., AT and TT are over-represented in 2B1 and 3B1 and under-represented in B12 and BB2, whereas CA behaves inversely. The many significant values in the SPR matrix are caused by a large local variability within CANA rows and sequence columns. This means that interactions in this dinucleotide group, in which contacts are localized to short DNA segments, use a wide spectrum of structures, including the A and mixed B/A conformers, in a sequence-specific manner.
Structural variability of DNA bound to regulatory proteins is illustrated in Figure 4a, which shows the local helical bending of DNA bound to a few transcription factors. While the average bend of the dinucleotide group Regulatory < 6 Å is 2.2°, the actual values fluctuate widely between 0° and tens of degrees. In contrast, the DNA duplex spirals around the nucleosome core in almost two complete turns and needs to be bent repeatedly in small increments, sometimes described as “kink and slide states” [22]. The helical axis of 80 DNA steps forming a circle would bend by 4.5° per step on average (360° divided by 80). The actual average of the local helical bend in the 15 analyzed NCP structures, 4.3°, is close to this value, with fluctuations between 0° and 9° at individual dinucleotide steps (Figure 4a). The helical axis bend was calculated with the Curves+ program [23] as the per-step parameter Ax-bend.
DNA bending in complexes with regulatory proteins is in most cases realized by structures described by the CANA letters AAA and BB2; specifically, the A-DNA letter AAA is often found at the sites of severe DNA kinks bound to transcription factors (Figure 4b). DNA bending in NCP is not realized by the A-DNA letter AAA; instead, BB2 plays the essential role. However, the magnitude of the helical bend does not apparently correlate with the distribution of the CANA letters (Supplementary Table S4). An important class of regulatory proteins, pioneer transcription factors, binds to the major groove of DNA wrapped in NCP [24]. DNA complexes of some of these factors, e.g., structures with the PDB codes 1vtn [26], 1puf [27], and 4hje [28], were included in our ensemble of analyzed structures. DNA in contact with protein in these structures shares properties typical of DNA in complexes with most other regulatory proteins, i.e., an increased presence of dinucleotides with A-DNA-like features (in these cases the CANA letters A-B and B-A) and, most significantly, of dinucleotides in the BII-DNA form (BB2 letter). The relatively short stretches of DNA in these structures, however, do not allow any deeper analysis of the interplay between the geometry of their major groove and helical bending.
Structures of dinucleotides in contact with histones (Histones < 6 Å group) are characterized by the frequent occurrence of the letters miB and B12, but most of all by the letter BB2. None of these three letters has clear-cut sequence preferences, but the incidence of BB2 fluctuates between high (CA, TG, GA) and low (AC, AT, TC) values. High incidences of the sequences CA and TG and low incidences of GA and TC might point to a preference of BB2 for pyrimidine-purine over purine-purine and pyrimidine-pyrimidine sequences. The sequence fluctuation of BB2 incidences is even higher in the Regulatory < 6 Å group than in the Histones < 6 Å group, but it follows a different pattern that has no easily explainable sequence dependence. The specific structural role of BB2 in the histone-wrapped DNA is discussed in detail below. The SPR matrix for dinucleotides of the Histones < 6 Å group shows just a few significant values, three of them in BB2. It is important to compare the CANA/sequence distributions of the Regulatory < 6 Å and Histones < 6 Å groups by their inter-group matrices (Supplementary Table S1). They both show significant differences for the BB2 and miB letters, pointing again to the essential role of BB2 in DNA binding.
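The purine/pyrimidine classes invoked for the BB2 sequence preferences can be made explicit with a short sketch. It uses the standard assignment of A and G as purines (R) and C and T as pyrimidines (Y); the function name is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def step_class(dinucleotide):
    """Map a dinucleotide such as 'CA' to its purine/pyrimidine class, e.g. 'YR'."""
    bases = dinucleotide.upper()
    if len(bases) != 2 or any(b not in PURINES | PYRIMIDINES for b in bases):
        raise ValueError(f"not a DNA dinucleotide: {dinucleotide!r}")
    return "".join("R" if b in PURINES else "Y" for b in bases)

# The sequences named in the text, grouped by class.
for seq in ["CA", "TG", "GA", "TC", "AC", "AT"]:
    print(seq, step_class(seq))
```

Grouping the BB2 incidences by this class, rather than by the 16 individual sequences, would be one way to test the suggested pyrimidine-purine preference.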
Both groups of dinucleotides that are not in contact with protein share one feature: a high incidence of unassigned conformers NAN without any strong sequence preferences. Many of these dinucleotides are at the strand ends, with sufficient freedom to adopt less common structural features, which may sometimes also be induced by crystal packing forces. Especially in the histone structures, the distribution of the CANA letters does not represent the conformational preferences typical of a free DNA molecule, as corroborated by a high incidence of conformers with atypical or undefined features, miB and NAN, and a low fraction of the canonical BI-DNA, BBB. A question remains whether this is a real structural feature of DNA in NCP, a consequence of the still relatively small sample of available structures, or, as pointed out by a referee of this work, a consequence of poor electron density in the unbound regions of some of the NCP structures.
As the last explanation seems the most likely, we feel that the situation calls for the development of tools that direct the refinement protocols toward optimal agreement with the electron density while avoiding bias from incorrect or incomplete constraints on the DNA geometry. Such an effort is apparent in the recent development of PHENIX [38] and the CCP4 programs REFMAC [39] and EDSTATS [40], which take advantage of earlier tools for correlating experimental and model electron density, such as RSCC [41] and the Uppsala Electron Density Server [42]. The dinucleotide conformer classes (NtC) employed here can help to build more realistic geometric restraints and make the refinement of DNA structures more robust.
The problematic quality of the not-in-contact regions also limits the impact of the analysis of protein-DNA binding because it weakens the significance of correlations between the structural behavior of the bound and unbound DNA segments. Further understanding of these correlations would be extremely useful. It could explain the role of exocyclic groups in the major and minor grooves that determine the deformability of the DNA, as has been observed experimentally [43], as well as extend our insight into the expected facilitation of protein-DNA binding by sequence-specific deformability of the duplex [44].
3.3. Periodicity of the Structural Behavior of DNA in the Nucleosome Core Particle
Our ensemble of the 15 NCP structures contains 32 DNA strands, each about 145 nucleotides long. In all of these DNA strands, longer stretches of the same conformer (the same alphabet letter) are infrequent, and dinucleotide steps often alternate between one or two BI steps and another B-DNA type, most often BII-DNA, or an unassigned structure type. In terms of the CANA letters, BBB alternates with BB2, miB, and NAN. These alternations differ slightly from one NCP structure to another and suggest no obvious periodicity.
In an attempt to reveal a possible periodicity in the bending of the duplex wrapped around the histone proteins, we Fourier-transformed the presence of CANA letters, dinucleotide sequences, minor and major groove widths, and helical bend as a function of the position along the strands (Figure 5). We employed the discrete Fourier transform as implemented in the fft function of the Scilab package.
First, we Fourier-transformed each of the ten CANA letters against the remaining nine: the analyzed CANA letter was assigned the value of 1 and the other letters the value of 0, and so-called periodograms were calculated for all 32 DNA strands. In each NCP strand, we observed a strong signal with a periodicity of about 10 steps for the letter BB2, and the signal became prominent after averaging all 32 periodograms. No other CANA letter provided a signal of significant intensity. The exact periodicity of the BB2 signal depends slightly on details of the analysis; the average value is 10.3 steps. Because a B-DNA duplex makes one full turn every ~11 nucleotides (~10 steps), the BB2 conformation recurs once per duplex turn, which explains how the DNA wrapping is carried out by the backbone atoms.
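The indicator-and-periodogram procedure can be sketched in a few lines. This is a plain-Python direct discrete Fourier transform standing in for the Scilab fft function, not the script used in this work. The function names, the subtraction of the mean, the power normalization, and the synthetic strand are all assumptions made for illustration.

```python
import cmath

def indicator(letters, target):
    """1.0 where the step carries the target CANA letter, 0.0 elsewhere."""
    return [1.0 if letter == target else 0.0 for letter in letters]

def periodogram(signal):
    """Return (period_in_steps, power) pairs for frequencies k = 1 .. n // 2."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # removes the constant (k = 0) term
    points = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(centered))
        points.append((n / k, abs(coeff) ** 2 / n))
    return points

def average_power(periodograms):
    """Average periodograms of equal-length strands frequency by frequency."""
    count = len(periodograms)
    return [(pts[0][0], sum(p[1] for p in pts) / count)
            for pts in zip(*periodograms)]

def dominant_period(points):
    """Period (in steps) of the strongest periodogram peak."""
    return max(points, key=lambda p: p[1])[0]

# Synthetic strand, not real data: BB2 at two consecutive steps in every ten.
strand = ["BB2" if t % 10 < 2 else "BBB" for t in range(140)]
points = periodogram(indicator(strand, "BB2"))
print(dominant_period(points))
```

The averaging helper assumes strands of equal length; strands of slightly different lengths would first have to be trimmed or their periodograms interpolated onto a common set of periods.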
Further, we investigated whether any of the 16 dinucleotide sequences provides a periodic signal, but we obtained no significant response. This is surprising, especially for the TA sequence, because several previously published studies have indicated a certain ability of the TA sequence to potentiate DNA binding to NCP. A clear and strong sequence signal for the TA sequence was observed by Lowary and Widom [45]. In their thorough study, they subjected DNA oligonucleotides to SELEX directed evolution to identify sequences binding with the highest affinity to the histone core [45]. Their analysis, based on Fourier-transforming the resulting sequences, convincingly demonstrated that the optimal binding between the histone proteins and DNA is achieved for DNA with TA steps spaced regularly every 10 to 11 steps. An important independent confirmation of the preference for the TA periodicity came from a recent genome-wide study [46]. The sequence periodicity is accepted as an important factor in nucleosome positioning despite being weakly pronounced; several positioning patterns facilitating the bends have been suggested, including the 10–11 base pair periodicities of AA–TT–TA/GC dinucleotides [47] or the R5Y5 positioning motif [48].
The lack of evidence supporting the periodic presence of the TA sequence in NCP structures is even more puzzling because the BB2 letter, which does behave periodically, is overpopulated in the TA sequence, although only in the Regulatory < 6 Å group and not in NCP structures (Figure 1). The periodic placement of TA or any other sequence therefore seems to be not a condition for the binding of DNA in NCP but a preference that strengthens it.
Fourier transformation of the minor groove widths provides a strong signal with the same periodicity as BB2 (Figure 5). The groove width is, however, a consequence of the bending, not its structural carrier, unlike the periodicity of the backbone conformational behavior described above. Likewise, the previously reported periodic variation of twist, roll, and tilt [50] is a consequence and not the cause of the bending: “this is only an indirect description that does not address the underlying localized constraints on double helix structure, which moreover arise from a form of protein association that is unique to the nucleosome” [51]. Neither the width of the major groove nor the local helical bend provided a periodic signal, even though the values of the helical bend in particular oscillate; these oscillations are apparently not periodic.