Evolving data standards for cryo-EM structures

Electron cryo-microscopy (cryo-EM) is increasingly being used to determine 3D structures of a broad spectrum of biological specimens from molecules to cells. Anticipating this progress in the early 2000s, an international collaboration of scientists with expertise in both cryo-EM and structure data archiving was established (EMDataResource, previously known as EMDataBank). The major goals of the collaboration have been twofold: to develop the necessary infrastructure for archiving cryo-EM-derived density maps and models, and to promote development of cryo-EM structure validation standards. We describe how cryo-EM data archiving and validation have been developed and jointly coordinated for the Electron Microscopy Data Bank and Protein Data Bank archives over the past two decades, as well as the impact of evolving technology on data standards. Just as for X-ray crystallography and nuclear magnetic resonance, engaging the scientific community via workshops and challenge activities has played a central role in developing recommendations and requirements for the cryo-EM structure data archives.


INTRODUCTION
Electron cryo-microscopy (cryo-EM) has very recently become a mainstream area of structural biology and medicine, enabling 3D visualization of a wide variety of biologically important complexes that were previously inaccessible to science. Early cryo-EM 3D density maps typically lacked atomic detail, yielding only the overall molecular shape, but could still sometimes be interpreted at a "pseudo-atomic" level via fitting of previously known coordinates or homology models [Fig. 1(a)]. 1 Recent major technological advances now make it increasingly possible to directly visualize atomic details [Fig. 1(b)]. 2,3 These achievements were recognized by the award of the 2017 Chemistry Nobel Prize to cryo-EM pioneers Dubochet, Frank, and Henderson. 4 The development of cryo-EM is directly reflected by the growth of cryo-EM structure depositions contributed worldwide to public data archives [Fig. 2(a)]. The archiving systems and underlying data standards supporting deposition, annotation, release, and validation of cryo-EM structures and the associated metadata describing cryo-EM experiments have been developed over time to support this growth. 5 We outline here the history of these systems and describe the process by which data standards have been developed, highlighting the role of engaging the scientific community to develop recommendations and requirements. The archiving systems and standards continue to evolve as technology drives the need for new descriptors and validation metrics.

CRYO-EM STRUCTURE DATA ARCHIVING
The Protein Data Bank (PDB), established in 1971 as a public archive for atomic coordinates of biological structures derived from X-ray crystallography, 6 began accepting models derived from nuclear magnetic resonance spectroscopy (NMR) in 1988, 7 and from electron microscopy (EM) and electron crystallography (EC) in 1990. 8 In recognition of the fact that publicly available 3D density maps could accelerate discovery in structural biology and medicine, the Electron Microscopy Data Bank (EMDB) at the European Bioinformatics Institute (EBI) was launched in 2002 with support from the European Union. 9 EMDB's launch was quickly followed by a pair of editorials in Structure and Nature Structural Biology encouraging electron microscopists to deposit their density maps. 10,11 Similar to BioMagResBank (BMRB), which archives experimental data from NMR, 12 the EMDB accepts maps determined using any cryo-EM method, including single particle reconstruction with any symmetry, helical filament reconstruction, subtomogram averaging, tomography, and electron crystallography, along with metadata describing the full experimental workflow (Fig. 3).
In 2006, scientists in the UK (EMDB) and USA [Research Collaboratory for Structural Bioinformatics (RCSB) and the National Center for Macromolecular Imaging (NCMI)] initiated a collaboration, funded by the National Institutes of Health (NIH), to ensure that data archiving and validation standards for cryo-EM maps and models would be coordinated internationally. 13 EMDep, designed and implemented at EBI, was the first system to collect and annotate maps and associated metadata for EMDB. 9 In 2008, the EMDataResource (EMDR) team created a joint map+model deposition system for cryo-EM structures by connecting EMDep with AutoDep and ADIT (AutoDep Input Tool), the PDB data collection systems at the EBI and RCSB sites. 13 The resulting system provided a one-stop shop for cryo-EM model and map depositions. Joint curation ensured that maps and models were deposited at the same physical scale and in the same coordinate frame. Journals that publish cryo-EM structures began to require authors to deposit maps to EMDB and models to PDB. This system supported the processing and release of nearly 4000 maps and 1000 models over a nine-year period (2008-2015). 14
FIG. 4. Current systems for deposition, archiving, and accessing cryo-EM structures. Worldwide, every cryo-EM structure (map, experimental metadata, and optionally coordinate model) is deposited and processed through the wwPDB OneDep system (deposit.wwpdb.org), following the same annotation and validation workflow also used for X-ray crystallography and NMR structures. 17,18 Map-only depositions yield an EMDB entry, while joint map+model depositions yield both EMDB and PDB entries. Workflow metadata collected in OneDep are passed to both EMDB and PDB. EMDB holds all workflow metadata while PDB holds a subset of the metadata; see Table I. The PDB and EMDB archives are accessible by FTP and rsync at wwPDB mirror sites in the US, UK, and Japan. Released cryo-EM structure data from both archives can be accessed via EMDataResource, EMDB, and wwPDB partner websites.
In 2012, the Electron Microscopy Public Image Archive (EMPIAR) was established at EBI. 15 Supported by the UK Medical Research Council and UK Biotechnology & Biological Sciences Research Council, EMPIAR enables cryo-EM scientists to archive and share raw images and intermediate data files associated with their maps deposited to EMDB. Making recently collected image data broadly available has multiple benefits, including accelerating development of reconstruction software and enriching resources for cryo-EM scientists in training. EMPIAR has its own deposition and curation system, but accesses metadata from the related EMDB entry. Individual entry storage sizes can be up to 15 TB. Approximately 4% of EMDB entries deposited since 2012 have associated EMPIAR entries.
The Worldwide PDB (wwPDB) is the global organization that manages the PDB archive. 16 In 2016, deposition, annotation, and release of cryo-EM structure maps and models were migrated to the wwPDB OneDep system (Fig. 4), using requirements that were initiated and developed by EMDR. 17 At that time, it became mandatory to deposit maps to EMDB for all cryo-EM models deposited to PDB. In addition, structure validation reports, which can be provided by depositors in an official PDF format to journal editors and reviewers as part of manuscript review, began to be produced for all cryo-EM structures. 18

CREATING A DATA DICTIONARY
The foundation of any data repository is its data representation scheme. Based in part on the International Union of Crystallography's Crystallographic Information File (CIF) dictionary for small-molecule crystallography, 19 the Macromolecular Crystallographic Information File (mmCIF) was developed in the 1990s to support rich data content for the macromolecular crystallographic experiment and its results, with precise data type definitions, logical groupings for related data items, explicit parent-child relationships, enumerations for controlled vocabulary, extensibility, and many other features embedded in a computer-readable format. 20 This dictionary is now the Master Format for the PDB. Particularly relevant for cryo-EM, very large complexes are readily represented, since mmCIF has no limits on the number of atoms or polymer chains.
Following the lead of the crystallographic community, an mmCIF extension dictionary containing data terms for cryo-EM experiments was drafted jointly in the early 2000s based on requirements provided by the cryo-EM community. The dictionary was vetted and expanded by the scientific community via multiple workshops, and subsequently integrated by EMDR into the PDBx/mmCIF dictionary for use in the hybrid joint map+model deposition system. 13 In 2015, based on feedback from additional workshops, the EMDR team further modified and expanded the dictionary in several ways. Hierarchical descriptions of complex specimens were enabled, and experimental descriptions for each of the cryo-EM methods were extended. 5 The >500 term EM dictionary (Table I) is now the basis for cryo-EM depositions to both EMDB and PDB in the wwPDB OneDep system. The dictionary continues to be updated regularly to support the evolving needs of the scientific community.
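To make the notion of dictionary-defined data items more concrete, the sketch below reads a few single-valued `_category.item value` lines of the kind found in PDBx/mmCIF files and groups them by category. The item names are given from memory of the EM categories and should be checked against the current dictionary, the values are invented for illustration, and the parser is a deliberately minimal assumption that ignores `loop_` blocks, multi-line strings, and most quoting rules.

```python
import shlex

# Hypothetical excerpt; item names should be verified against the
# PDBx/mmCIF dictionary, and all values are invented for illustration.
EXAMPLE = """\
_em_experiment.reconstruction_method     'SINGLE PARTICLE'
_em_experiment.aggregation_state         PARTICLE
_em_3d_reconstruction.resolution         3.9
_em_3d_reconstruction.resolution_method  'FSC 0.143 CUT-OFF'
"""


def parse_simple_items(text):
    """Group single-valued `_category.item value` lines by category."""
    categories = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # skip blank lines, comments, loop_ and data_ lines
        tokens = shlex.split(line)  # handles simple quoted values
        if len(tokens) != 2:
            continue  # not a simple name/value pair
        name, value = tokens
        category, _, item = name[1:].partition(".")
        categories.setdefault(category, {})[item] = value
    return categories


items = parse_simple_items(EXAMPLE)
print(items["em_3d_reconstruction"]["resolution"])
```

Grouping items by category mirrors the "logical groupings for related data items" described above; a production workflow would rely on a full mmCIF parser rather than this simplification.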

GATHERING COMMUNITY REQUIREMENTS
Developing a trusted scientific data repository requires careful attention to the interplay among science, technology, and community interest. 21 Workshops and Challenges are two types of community outreach activities that are effective in bringing these three elements together; both have been employed multiple times to move EM data and validation standard development forward [Fig. 2(b)]. Workshops (typically 2-3 days) enable groups of experts to review current practices and develop recommendations, while Challenges (taking place over several months to a year) provide forums for experts to exercise and demonstrate current workflows and test novel procedures. Challenges can incorporate one or more workshops for planning or results review. Tables II and III list and summarize goals and outcomes of 18 international workshops and six challenges held over the past two decades. Below we provide additional descriptions of selected activities, as well as a montage of workshop photos (Fig. 5).

EM extension dictionary development
The main goal of the 2004 Cryo-EM Structure Deposition Workshop [Fig. 5(a)], attended by ~30 scientists including cryo-EM, image processing, crystallography, database, funding agency, and journal representatives, was to develop a global community consensus on data items for deposition of density maps and atomic models derived from cryo-EM studies. Terms were reviewed category-by-category in two focus groups, and recommendations for revisions and extensions were obtained (Fig. 6). Furthermore, participants unanimously requested a "one-stop shop" for deposition and retrieval of the cryo-EM map and model data. Following the workshop, the dictionary was further revised with follow-up input from attendees. The resulting dictionary was presented at the 2005 3DEM Gordon Research Conference, and EMDR's project website became the requested one-stop-shop portal.
The EM extension dictionary was next reviewed by software developers at the 2005 3DEM Developers Workshop to facilitate its integration with major 3DEM packages and electronic notebook systems. There were two important outcomes: (a) the draft dictionary was unanimously accepted by the participants and (b) a set of proposed conventions for describing EM micrographs and density maps was developed. 22 The conventions enable a standardized approach to image interpretation and presentation, with recommended units for common parameters, rotation and symmetry notations, and common-sense principles such as "objects should have overall positive density" (early image correction procedures sometimes generated objects darker than their background depending on image processing and display software). The conventions were subsequently incorporated into the EM extension dictionary to facilitate representation of map-related data items in PDB and EMDB.
24 Needs for hierarchical sample description as well as extensions to cryo-EM experimental submethod descriptions were recognized. A future archival segmentation file format, for which requirements were gathered at the 2015 meeting, will make use of the hierarchy, enabling map regions to be connected with biological annotations. 5,24

Developing validation standards
At the 2010 EM Validation Task Force (EM VTF) Workshop [Fig. 5(b)], an international group of experts explored how to assess cryo-EM maps, models, and other data deposited into EMDB and PDB. For maps, participants recognized a critical need to develop standards for assessing map resolution and accuracy. They recommended establishing two fully independent image datasets at the outset for evaluating resolution by Fourier Shell Correlation (FSC); at the time, this was not typically done, but it is now the standard procedure. However, they also advised that maps still be carefully inspected to ensure that the resolution estimate by FSC is in accordance with the map's visible features.
The EM VTF's 2012 white paper notably called for the scientific community to develop new criteria for the evaluation of maps and for the evaluation of fit of the model to the experimental map density. 25 In contrast, in 2011 the VTF for X-ray crystallography published a comprehensive and detailed set of recommendations to validate structures and experimental data determined using X-ray crystallography. 26 The difference reflects the fact that cryo-EM is still a rapidly evolving field.
Validation standards and raw image data archiving were additional topics of discussion at the 2011 Data Management Challenges in 3D Electron Microscopy Workshop. 23 Several services were developed and implemented at EBI in response to workshop recommendations. The EMPIAR raw data archive was created, 15 and stand-alone FSC and tilt-pair servers were developed for depositors to validate their cryo-EM maps. 5,27 In addition, Visual Analysis web pages were designed to display an informative series of images and plots for every EMDB entry, and to help users assess data quality of released cryo-EM maps and models. 28,29
Two EMDR-sponsored challenges subsequently aimed to address the 2010 EM VTF's call for improved metrics to evaluate both maps and fit of models to experimental data (2016 Map and Model Challenges). Following the 2017 Joint Challenges Workshop at Stanford, which had over 90 participants [Fig. 5(d)], key results and recommendations were collated into a virtual special issue of the Journal of Structural Biology published in December 2018. 30
The Map Challenge provided a unique forum for critically evaluating the standard method for estimating map resolution by FSC (Fig. 7, inset). A key observation was that, as currently practiced, the procedure is not sufficiently standardized: a number of different variables (e.g., map box size, voxel size, filtering and masking practice, and threshold value for interpretation) can substantially impact the outcome. 31 As a result, different expert practitioners can arrive at different resolution estimates for the same level of map detail. For example, two of the apoferritin maps submitted to the challenge had practitioner-estimated resolutions of 3.1 Å and 3.5 Å, respectively, though they were indistinguishable by eye. A direct conclusion is that any "reported-resolution"-based search or ranking for maps or associated models will have limited reliability. In follow-up discussions at the 2019 Frontiers in Cryo-EM Validation Workshop, one suggestion made was to have the archives independently estimate resolution by FSC from deposited unmasked, minimally filtered half-maps. This procedure would likely make comparisons between maps less susceptible (though not completely impervious) to variations in practitioner practice.
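The dependence of the reported value on the interpretation threshold, one of the variables noted above, can be illustrated with a short sketch. It assumes that an FSC curve is already available as paired lists of spatial frequencies (in 1/Å) and correlation values, finds the first pair of shells between which the curve drops below the threshold, and linearly interpolates the crossing. The function name, the interpolation scheme, and the sample curve are illustrative assumptions, not the procedure used by any archive or software package.

```python
def resolution_at_threshold(freqs, fsc, threshold=0.143):
    """Return the resolution (in Å) where the FSC first drops below threshold.

    freqs: spatial frequencies in 1/Å, strictly increasing.
    fsc:   FSC values for the same shells.
    Returns None if the curve never falls below the threshold.
    """
    for i in range(1, len(freqs)):
        above, below = fsc[i - 1], fsc[i]
        if above >= threshold > below:
            # Linear interpolation between the two bracketing shells.
            t = (above - threshold) / (above - below)
            crossing = freqs[i - 1] + t * (freqs[i] - freqs[i - 1])
            return 1.0 / crossing
    return None


# Made-up FSC curve for illustration only.
freqs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35]
fsc = [1.00, 0.98, 0.90, 0.70, 0.40, 0.10, 0.02]

res_0143 = resolution_at_threshold(freqs, fsc, 0.143)
res_05 = resolution_at_threshold(freqs, fsc, 0.5)
print(f"0.143 threshold: {res_0143:.2f} Å; 0.5 threshold: {res_05:.2f} Å")
```

For a curve that falls monotonically, a higher threshold is crossed at a lower spatial frequency and so yields a numerically larger (poorer) resolution value for the same map, which is one reason the threshold must be stated alongside any reported estimate.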
The 2017 Joint Challenges Workshop also sparked lively discussions about the potential for model-based metrics not only to estimate model quality but also to provide one or more independent measures of map resolvability. Several procedures of this type have been proposed and tested. EMRinger evaluates whether density peaks in the map fall within the possible rotameric configurations for the carbon-β atom in a side chain. 32 Other procedures have been developed to measure map quality. For example, Z-scores capture how much larger the cross-correlation score (CCS) is for atoms in such features at their placed location compared to the CCS at displaced positions. [33][34][35] Another recently devised experimental metric, Q-score, measures resolvability of the individual atom(s) in reference to the model. 36
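One plausible reading of the Z-score idea described above is sketched below: the CCS at the placed location is compared with the mean and spread of CCS values obtained at displaced positions. The function name, the use of the sample standard deviation, and the example scores are assumptions made for illustration; the cited methods may define and sample the displaced positions differently.

```python
from statistics import mean, stdev


def placement_z_score(ccs_placed, ccs_displaced):
    """Z-score of the placed-position CCS relative to displaced-position CCS values."""
    spread = stdev(ccs_displaced)  # sample standard deviation (an assumption)
    if spread == 0:
        raise ValueError("displaced scores have zero spread")
    return (ccs_placed - mean(ccs_displaced)) / spread


# Made-up cross-correlation scores for illustration only.
displaced = [0.40, 0.45, 0.50, 0.55, 0.60]
z = placement_z_score(0.80, displaced)
print(f"Z-score: {z:.2f}")
```

A larger Z-score indicates that the density supports the placed location more strongly than nearby alternatives.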

Changing validation goals
Looking at the distribution of reported resolution of maps released into EMDB annually over the past few years (Fig. 7), one can readily see a strikingly sharp recent increase in maps in the 2-4 Å range. This development is a direct result of recent technological improvements, and it changes the "goal-posts" for developing validation methods, adding urgency to the need for metrics to validate structures at near-atomic to atomic resolution.
The 2019 Model Metrics Challenge and associated 2019 Model Metrics Workshop were designed with the goal of evaluating metrics for map-model fit of moderately high-resolution maps (3.1-1.8 Å). A full write-up will be published elsewhere; we note two findings here. First, the new metrics that in some way combine both model and map quality (e.g., EMRinger and Q-score) appear to be quite useful for ranking sets of structures. Second, refined Atomic Displacement Parameters (ADPs), which were included in about half of the models submitted by challenge participants, could modestly improve fit of the model to the map, particularly for the highest resolution (1.8 Å) target map. The meaning of refined ADPs/B-factors in the context of a cryo-EM density map is less clear. Best practices (e.g., to avoid overfitting) will need to be investigated.
FIG. 7. Changing cryo-EM resolution landscape. Annual distribution of depositor-reported resolution for map entries released into EMDB. The sharp increase at 2-4 Å resolution is a direct consequence of the recent advances in image detection and processing. 3 Inset: example Fourier Shell Correlation (FSC) plot, which is the current standard for estimating map resolution. 25 The correlation between two independent half-map reconstructions (blue curve) falls with increasing spatial frequency; the resolution estimate (in this case 3.9 Å) is read at FSC = 0.143 (dash-dotted horizontal line). Plot source: emdataresource.org. Inset FSC plot source: EMDB visual analysis. 28

WHERE WE ARE, WHAT'S NEXT
The initial EM validation report format released in 2016 focused on assessment of model geometry for PDB entries. 18 As will be reported in more detail in a future publication, additional sections covering map analysis and visualization and map-model fit analysis and visualization will become available to EMDB and PDB depositors by early 2020. The Visual Analysis web pages hosted at EBI since 2012 28 have served as a test-bed for the development of the new features, which will include (a) several types of orthogonal images of the deposited map and map superimposed with model; (b) FSC curves to support depositor-reported map resolution; and (c) map-model fit statistics via "atom inclusion," the percentage of modeled atoms falling inside a map at its recommended contour level. The new features will enable scientists (depositors, annotators, journal editors, and manuscript reviewers) to make initial assessments of map features, map quality, and map-model fit, bypassing the need to first download/view files in a graphics program.
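As a rough illustration of the atom-inclusion statistic in (c), the sketch below counts the fraction of atoms whose nearest map voxel has density at or above the contour level. The grid layout (`density[z][y][x]`), the `origin` and `voxel_size` parameters, the nearest-voxel lookup, and the toy data are all assumptions made for this example; the implementation used in the validation reports may differ.

```python
def atom_inclusion(atoms, density, origin, voxel_size, contour):
    """Fraction of atoms whose nearest voxel has density >= contour.

    atoms:      iterable of (x, y, z) coordinates in Å.
    density:    nested lists indexed as density[z][y][x].
    origin:     (x, y, z) position in Å of voxel [0][0][0].
    voxel_size: edge length of a cubic voxel in Å.
    Atoms falling outside the grid count as not included.
    """
    nz, ny, nx = len(density), len(density[0]), len(density[0][0])
    inside = total = 0
    for x, y, z in atoms:
        total += 1
        ix = round((x - origin[0]) / voxel_size)
        iy = round((y - origin[1]) / voxel_size)
        iz = round((z - origin[2]) / voxel_size)
        if 0 <= ix < nx and 0 <= iy < ny and 0 <= iz < nz:
            if density[iz][iy][ix] >= contour:
                inside += 1
    return inside / total if total else 0.0


# Toy 3x3x3 map with a single dense voxel at the centre.
density = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
density[1][1][1] = 1.0
atoms = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (1.2, 0.9, 1.1), (9.0, 9.0, 9.0)]
fraction = atom_inclusion(
    atoms, density, origin=(0.0, 0.0, 0.0), voxel_size=1.0, contour=0.5
)
print(f"atom inclusion: {100 * fraction:.0f}%")
```

Because the result depends directly on the contour level, the statistic is only meaningful when quoted together with the recommended contour at which it was computed.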
A planned meeting in January 2020 at EBI organized by wwPDB will bring together cryo-EM and data archiving experts to discuss the current state of data archiving for cryo-EM structures derived from the single-particle reconstruction method, and to solicit recommendations on what data should be included and/or made mandatory in depositions and associated validation reports. The following points might be considered as part of the deliberations:
• Can estimation of map resolution be better standardized across the community? This would enable fairer comparisons among maps determined in different laboratories and using different software packages.
• Additional metrics (beyond atom inclusion) are available that describe map-model fit, including several novel procedures that effectively yield a joint assessment of map and model quality in a broad resolution range. How should map-model fit be reported as part of a structure determination and in a joint map+model deposition?
• What best practice recommendations can be made for refinement of ADPs in cryo-EM models at different resolutions?
• How should we evaluate multiple structures determined from a single specimen that may have variable quality and resolution?

ACKNOWLEDGMENTS
EMDataResource is funded by the U.S. National Institutes of Health/National Institute of General Medical Sciences, No. R01GM079429-12. We thank current and past EMDR colleagues for their contributions to data standards development, with special recognition to former and current directors of the EMDB archive including Kim Henrick, Gerard Kleywegt, and Ardan Patwardhan. We are also tremendously grateful to the cryo-EM community for their enthusiastic participation and support for data standards development.