Partial-occupancy binders identified by the Pan-Dataset Density Analysis method offer new chemical opportunities and reveal cryptic binding sites

Crystallographic fragment screening uses low molecular weight compounds to probe the protein surface and although individual protein-fragment interactions are high quality, fragments commonly bind at low occupancy, historically making identification difficult. However, our new Pan-Dataset Density Analysis method readily identifies binders missed by conventional analysis: for fragment screening data of lysine-specific demethylase 4D (KDM4D), the hit rate increased from 0.9% to 10.6%. Previously unidentified fragments reveal multiple binding sites and demonstrate: the versatility of crystallographic fragment screening; that surprisingly large conformational changes are possible in crystals; and that low crystallographic occupancy does not by itself reflect a protein-ligand complex's significance.


I. INTRODUCTION
Single-compound high-concentration crystallographic fragment screening experiments are now being increasingly applied to detect the binding of small molecules-"fragments"-to proteins. [1][2][3] These experiments entail soaking libraries of fragments with molecular weights of 150-250 Da into protein crystals and determine the resulting structures by X-ray crystallography. 4 The small size of fragments (10 heavy atoms) allows relatively small libraries to efficiently cover chemical space. 5 However, the small size of fragments means that any binding is inherently weak; 6 crystallographic occupancies of bound fragments are frequently well below 100%, even when single compounds are soaked at high concentrations. 2 The presence of crystallographic states may also be substantially affected by crystal effects, e.g., where a conformational change is frustrated by crystal packing constraints, or by crystal cryocooling, which may prevent the detection of "cryptic" binding sites. 7 Partial occupancy makes it difficult to objectively identify and model bound fragments, since the crystallographic density consists of an average over the bound and unbound states: superposed density corresponding to the unbound state of the crystal may completely obscure evidence for the bound state or merely impede interpretation. 8 If relying on conventional signal-identification and modelling methods, obtaining clear and interpretable density for the state of interest in general requires time-consuming optimization of crystal systems and/or experimental protocols, and even that may not help. 9 The problems associated with partial occupancy may be overcome by application of the Pan-Dataset Density Analysis (PanDDA) method, 2 which reveals clear and interpretable evidence for minor states of the crystal. Through a voxel-by-voxel analysis of the electron densities from multiple crystallographic datasets, binding ligands are identified by contrasting bound datasets against unbound datasets; this allows ligands to be identified with statistical confidence, analogously to isomorphous difference maps. 10 Once an interesting region has been identified in a particular dataset, the observed electron density at the site remains a superposition of the bound and unbound states of the crystal; subtraction of the superposed unbound density from the crystallographic density subsequently reveals clear density for only the bound state, ensuring that it can be modelled comparatively simply.
We obtain an accurate representation of the density for the unbound state of the crystal from the analysis of multiple unbound datasets; subtracting this density, suitably weighted by an appropriate Background Density Correction factor (BDC), removes the crystallographic superposition and creates a partial-difference map called an event map. The event map approximates the electron density for the bound state of the crystal and is thus used for modelling of the bound state.
The PanDDA methodology not only identifies the binding of partial-occupancy ligands but also reveals the ligand-associated conformation of the protein, regardless of how large a conformational change is exhibited upon binding. The PanDDA approach may be applied to any collection of crystallographic datasets of the same crystal form, where a series of unbound datasets are available to provide contrast and thereby identify signal in bound datasets.

II. DATA AND METHODS
Histone Lysine-specific demethylase 4D (KDM4D) is an epigenetic protein involved in oncology, inflammation, and drug metabolism, and whose down-regulation affects cell proliferation in colon cancer cells. 11,12 Diffraction data were collected and analyzed for 226 KDM4D crystals, each of which contained a different compound from the Zenobia fragment library (www.zenobiafragments.com). Manual inspection of the conventional difference (mF o -DF c ) maps after refinement with a reference model resulted in the identification of two binding compounds, which were subsequently deposited in the Protein Data Bank (PDB) 13 (PDB codes 4D6R and 4D6S).
A full description of the modelling and refinement of this dataset may be found as part of the original PanDDA manuscript; 2 however, a brief description of the process is as follows. After automated processing using the PanDDA method with the default parameters, ligandbound conformations in identified datasets were modelled into the event maps and then merged with a model of the unbound state of the crystal, using alternate conformers to label the different states of the crystal. This process creates an ensemble model of the crystal; the alternate states are crucial to stabilize the refinement of the ligand-bound state. 8 The resulting ensemble models were refined using typical resolution-dependent protocols, constraining the occupancies of the bound and unbound states to sum to unity where appropriate.
For consistency with the original PanDDA manuscript, the protein is herein referred to as JMJD2D.

III. RESULTS
Initially, only two ligands were identified to bind to the protein by conventional analysis (datasets x378 and x637), a hit rate of 0.9%. In contrast, analysis with the PanDDA method revealed many partial-occupancy binders that were not visible in the standard 2mF o -DF c and mF o -DF c crystallographic maps: 37 binding ligands were detected in 24 datasets, increasing the hit rate by tenfold to 10.6%. Crystallographic data are summarized in Table I; the modelled compounds in each crystal are listed in supplementary material, Table A1. The majority of ligands bind with less than 70% occupancy, but the quality of the ligand models remains high, with real-space correlation coefficients (RSCC) greater than 0.75 ( Figure 1). Individual validation plots for each crystallographic ligand, utilizing multiple validation metrics, 8 may be found in supplementary material, Table A2.
Fragments bind at many sites on the surface of the protein ( Figure 2), but most binders are singletons, form very few specific interactions with the protein and are likely of little biological relevance (grey ligands; Figure 2). However, there are also several sites that are multiply occupied, the three most significant of which are discussed in this work: the peptide binding region (site A); a putative allosteric pocket on the reverse of the protein (site B); and a cryptic binding site requiring a large conformational change in the terminal alpha helix of the protein (site C).

A. Binding in and around the orthosteric binding site
The orthosteric binding site of JMJD2D is occupied by a molecule of N-oxalylglycine (NOG) that, along with the binding site metal, is present at approximately 80%-90% occupancy; this is supported by occupancy refinement of the NOG molecule across the datasets (supplementary material, Figure A1). One fragment (dataset x401) binds in the space vacated by the NOG molecule, but in a perpendicular orientation, and further induces a conformational change in phenylalanine 189 (Figures 3(a)-3(c)). Histidines 192 and 280, which coordinate the binding site metal, become less ordered where the metal and the NOG are absent, but do not change conformation.
Two further fragments (datasets x443 and x620) also bind along the peptide binding groove of JMJD2D (Figures 3(d)-3(f), 3(g)-3(i)); the binding events are not at all evident in the original crystallographic maps but are made clear in the PanDDA maps. Both fragments interact with the protein chiefly via hydrophobic interactions, but also make several hydrogen bonds to surrounding water molecules and protein residues. All three compounds sample different parts of the binding groove, yet bind in close proximity (Figure 4), inviting the possibility of linking the fragments.

B. Detection of a putative allosteric binding site
Seven ligands were observed to bind at a previously unidentified putative allosteric site that is spatially close to the orthosteric site, but accessible from the reverse face of the protein (site B, Figure 2). The binding site consists of a small pocket and a shallow groove on the surface, binding five and two fragments, respectively ( Figure 5). The two fragments that bind in the shallow groove (x378 and x494) both contain an imidazole ring, which is present at the same site in both datasets: the conservation of the nitrogen atom positions in the ring indicates their importance to the binding of the two compounds.
Two examples of fragments bound in the small pocket are shown for datasets x555 and x376 (Figures 6(a)-6(f)), along with the unbound conformation of the binding site ( Figure  6(g)). These binding fragments perturb the sidechains of residues in and around the binding site, sampling conformations of arginine 228 and phenylalanine 241 not visible in the unbound dataset (Figures 6(b) and 6(e)). Other residues are conserved upon binding, except for serine 308, whose conformation is sub-selected from the two conformations observed in the unbound conformation.
The ability of the site to bind multiple ligands displaying similar chemotypes and the short distance to the orthosteric site together indicate that the site may offer an opportunity to allosterically regulate the functionality of the protein. Additionally, the site hosts two adjacent clusters of fragments, suggesting fragment linking as a possible approach for elaborating the bound molecules.

C. Large binding-stabilized conformational changes
Several fragments stabilize a large conformational change in the terminal helix of JMJD2D (Site C, Figure 2). All binding events are partial occupancy (refined occupancies 0.39-0.66) and can only be reliably identified by and modelled into PanDDA event maps (Figure 7). Structures were refined as an ensemble of the bound and unbound states except for dataset x395, which is modelled only in the bound conformation; this is due to instabilities in the B-factor and occupancy refinement of the ensemble model, which resulted in significant amounts of residual difference density and inflated model B-factors. Even so, the ligand-bound conformation for dataset x395 is clear in the event maps and does not move under refinement when refined as the sole conformation (refined occupancy 0.66). Furthermore, the compound in x395 differs from the ligand in dataset x393 only by one methyl group and adopts the same binding pose, increasing confidence that the model of the ligand is largely correct.
The bound conformations of the protein form two distinct subgroups, based on the pose of tryptophan 336 (Figure 8). In datasets x393 and x395 (Figures 8(b) and 8(c)), the rotamer of tryptophan 336 is unchanged from the unbound conformation, and only the backbone conformation of the helix changes. Conversely, in datasets x365, x402, and x623 (Figures 8(a), 8(d), 8(e)), tryptophan 336 adopts a rotamer that forms horizontal pi-stacking interactions.
The sidechain of arginine 316 is disordered in all bound structures other than dataset x623, where residues 316-322 adopt a novel conformation, allowing arginine 316 and a new rotamer of aspartate 322 to form a salt bridge (supplementary material Figure A2(e), Figure 8(e)). Arginine 332 is observed to adopt a new conformation in dataset x402, forming interactions with the exposed oxygen atom of the ligand (Figure 8(d)). Residues up to aspartate 341 were resolved in the unbound conformation, but only residues up to glycine 339 were resolved in the PanDDA event maps for the bound conformation for all structures.

IV. DISCUSSION
In this work, the analysis of one set of crystallographic fragment screening data of lysinespecific demethylase 4D has led to the discovery of a cryptic binding site, a possible new allosteric site, and a more complete mapping of the orthosteric site that shows an opportunity for fragment linking. The previously unknown sites are highly likely to be allosterically relevant, especially since binding occurs near to known functionally important sites. For fragment linking or chemical extension, any binders that probe the chemical potential of nearby regions add further information to guide exploratory efforts.
Crystallographic fragment screening combined with sensitive data-analysis methods has the power to generate large amounts of structural information and reveal exciting structural events for use in protein studies and medicinal chemistry efforts. The management and interpretation of large amounts of structural information poses its own problems; tools such as WONKA 14 can be used to condense structural information and aid interpretability.
In this work, we show that low occupancy of bound fragments is not an indication of the low value of the corresponding bound structures, but rather the un-optimized nature of fragment screening and restrictions of the crystalline state. Where binding causes energetically unfavorable FIG. 4. Even the small number of hits sample distinct regions of the orthosteric site. Overview of the binding groove (site A; Figure 2) and the associated ethylene glycol molecules (molecules colored by dataset). The ligand of dataset x401 samples the deep binding pocket, whilst the ligands in x443 and x620 sample sites in the peptide binding groove. The surface of the protein is shown in semi-transparent gray. Images: PyMOL.
FIG. 5. Weakly bound fragments identify a putative allosteric site, with repeated chemotypes. Fragment binding is observed at a putative allosteric site (site B; Figure 2) located on the reverse of the protein from, but still near to, the catalytic site. Five fragments bind in a small pocket, and two bind in a shallow groove on the protein surface. The binding fragments demonstrate repeated chemotypes, such as the superposed 5-membered imidazole rings in datasets x378 and x494. The binding fragment in x386 does not match the soaked fragment; several fragments were misdispensed, and the density is highly similar to that in dataset x376, which also contained a bound ligand. PyMOL colours of carbon atoms of fragments: x376, green; x378, cyan; x386, light magenta; x392, yellow; x396, salmon; x494, grey; x555, slate blue. All other atoms are colored by element. Images: PyMOL.
conformations that may be frustrated by crystal packing, such as the cryptic site discovered here, any binding must be inherently impeded, resulting in reduced occupancy.
Beyond fragment screening, improved data analysis methods such as PanDDA offer a range of new possibilities for probing protein structures and identifying minor states. Historically, signal detection has required large fractions of the crystal to adopt the state of interest; the PanDDA method makes this unnecessary and avoids time-consuming crystal optimization for observing structural events. Furthermore, we have demonstrated that large conformational  IDs: 5PH0, 5PH1, 5PH2, 5PH3, 5PH4, 5PH5, 5PH6, 5PH7, 5PH8,  5PH9, 5PHA, 5PHB, 5PHC, 5PHD, 5PHE, 5PHF, 5PHG, 5PHH, 5PHI, 5PHJ, 5PHK, 5PHL,  5PHM, 5PHN). An interactive summary of all the ligand-bound structures is available at https://zenodo.org/record/290220/files/0_index.html, which allows users to easily navigate and download the ligand-bound structures: the appropriate event maps are automatically loaded for each structure, which are crucial for making sense of the models. All crystallographic fragment screening datasets are also available on Zenodo (https://zenodo.org/record/48770), to allow reproduction of the PanDDA analysis.
Several conformer-labelling errors have been identified since the publication of the data on Zenodo-though the atomic positions are correct, the conformers corresponding to the bound ligand may not be assigned to the appropriate conformer ID. For example, for a bound ligand with conformer D, the associated conformation may be labelled A or B. These conformer assignments have been corrected in structures deposited in the PDB.
The PanDDA implementation is currently available in the developer version of CCP4 and will eventually be part of the official release version of CCP4; further information and updates will be made available at https://pandda.bitbucket.io.