Background & Summary

Accurately determining three-dimensional (3D) structure of proteins is critical for unlocking key insights into their molecular functions1 and interactions with other proteins as well as small molecules like ions, ligands, and therapeutic drugs2. In recent years, cryo-EM3 has emerged as the most important technology for experimentally determining the structures of large protein complexes that are difficult or impossible for other experimental techniques such as X-ray crystallography or Nuclear Magnetic Resonance (NMR) to solve. The field of cryo-EM is advancing at a rapid pace with improvements in image data collection and processing techniques4. This progress has resulted in large amount of high-quality cryo-EM images of proteins and their complexes, and high-resolution 3D density maps reconstructed from two-dimensional (2D) images. EMPIAR, the Electron Microscopy Public Image Archive5, is an archive for raw images, such as micrographs and particle stacks6, that is used in the construction of cryo-EM density maps. Subsequently, these reconstructed 3D cryo-EM density maps are deposited into EMDB, the Electron Microscopy Data Bank7.

However, the building of atomic 3D models of large protein complexes from cryo-EM density maps is still a very challenging problem that often requires extensive manual intervention. The task is particularly difficult if the structures of the individual chains (units) in large protein complexes are not available or cannot be predicted from sequences by cutting-edge protein structure prediction methods such as AlphaFold28 which serve as templates to fit into cryo-EM density maps for building the models of the complexes.

To address this problem, significant efforts have been put into developing machine learning, particularly deep learning (DL) methods9, to directly build atomic models from cryo-EM density maps as the density in each voxel (i.e., 3D pixel) of the high-resolution density map contains sufficient information about protein structures in most cases. Therefore, sophisticated deep learning models trained with density maps as input and their corresponding structures as labels can potentially infer protein models that fit the cryo-EM density maps correctly and accurately10,11.

However, designing and implementing deep learning-based methods for building atomic models from cryo-EM density maps requires a large labeled cryo-EM dataset. Different methods have been trained previously to predict different aspects of protein structures from cryo-EM density maps, nevertheless, up to now, as shown in Table 1, there is still no work on generating a publicly available, large, labeled dataset to push the frontier of the model building from cryo-EM density maps. Recently, the exponential growth of cryo-EM has led to a surge in the deposition of high-resolution cryo-EM density maps in the EMDB7 and their corresponding protein models in the Protein Data Bank (PDB)12. Leveraging these precious resources, we created Cryo2StructData13, a large labeled cryo-EM density map dataset for developing and testing AI-based methods to build atomic models from cryo-EM density maps. We also trained and tested deep learning models on the dataset to rigorously validate its quality. Cryo2StructData is the first, large, publicly available cryo-EM dataset with standardized input features and well-curated output labels that is fully ready for the development of AI-based atomic structure modeling tools in the field.

Table 1 Summary of the currently used number of experimental cryo-EM density maps in the datasets for various AI-based methods and their public availability.

Methods

Related works

Protein structure modeling from simulated density maps

Early methods to build atomic models from cryo-EM density maps utilized protein structures in the Protein Data Bank12 to generate theoretical density maps at different resolution, usually referred to as simulated density maps for training and testing. For instance, Cascaded-CNN14 utilized pdb2mrc from the EMAN2 package15, and VESPER16 utilized pdb2vol from the Situs package17 to generate simulated density maps. However, simulated density maps lack the complexity of real-world density maps such as high noise, missing density values, and experimental artifacts that arise from protein particle picking18 errors, or atom movement during image capturing in the cryo-EM data collection. As such, deep learning models trained on simulated density maps may not perform well on noisy experimental density maps. Therefore, real-world cryo-EM density maps with labels need to be created to further advance the field.

Protein structure modeling from experimental density maps

As more and more experimentally determined density maps became readily available in the EMDB, different methods have utilized these maps to predict atoms in cryo-EM density maps, as shown in Table 1. A recent method, DeepTracer19, offers users the ability to build models through a web interface. It utilized 1,800 experimental maps to predict the positions and amino acid types of Cα atoms of proteins, which were then aligned with protein sequences to model protein structures. ModelAngelo20 used 3,715 experimental maps for training and a test set of 177 maps for testing. Both DeepTracer and ModelAngelo automatically generate the atomic models of the protein, but they have not released their processed data. Haruspex21 used 293 experimental maps for training and an independent test set of 122 experimental maps for testing, and EMNUSS22 used 120 and 43 experimental density maps for training and testing respectively. CR-I-TASSER23, EMNUSS22, and Emap2sec24 employed a hybrid approach that combined both simulated maps and experimental maps in their training and validation processes. Haruspex, EMNUSS, and Emap2sec are designed for secondary structure prediction, but they have not made their processed data available.

Furthermore, despite the significant progress, the datasets used in these works only account for a small portion of high-resolution density maps available in the EMDB and they are not publicly available for the AI and machine learning community to use. Significantly, Giri et al.9 emphasize the critical importance of developing high-quality cryo-EM datasets, specifically for training and evaluating deep learning methods. In this context, we aim to prepare, preprocess, label, validate, and consolidate complementary information that can be utilized for accurate atomic model building through AI-based methods from experimental cryo-EM density maps.

Cryo2StructData preparation

The Cryo2StructData was prepared by a data processing pipeline shown in Fig. 1. The data generated from each stage of the pipeline was verified to ensure the details of original experimental density maps were preserved (see the Technical Validation section 2 for details). The source code of the data preparation pipeline is released at the GitHub repository of Cryo2StructData for users to reproduce the process. The details of each data preparation stage are described in the following subsections.

Fig. 1
figure 1

The data preparation and evaluation pipeline for Cryo2StructData.

Data extraction

Cryo2StructData was curated from the experimental cryo-EM density maps released till 27 March 2023. We downloaded relatively high-resolution cryo-EM density maps for proteins and their complexes with resolutions between 1 and 4 Angstrom (Å). In total 9,500 cryo-EM density maps were collected from the EMDB7. Similarly, we downloaded the atomic models (i.e., PDB files) corresponding to the density maps from the PDB12. The downloaded PDB files were used to label voxels of density maps. The associated PDB ID for each cryo-EM density maps was extracted from the metadata downloaded from Electron Microscopy Data Bank (EMDB).

We further filtered out the density maps that do not have structures in the PDB. After the filtering, 7,600 cryo-EM density maps were left for further processing. A metadata document containing the meta information such as the EMD ID and corresponding PDB entry for each density map, resolution of the density map, structure determination method, software used to determine the density map, the title and the journal of the article describing the density map is provided in the Cryo2StructData Dataverse25.

Preprocessing

The experimental cryo-EM density maps are stored in MRC26 format. The MRC format contains the data in a 3-dimensional (3D) array, often referred to as 3D grid. Each voxel also known as 3D pixel contains a value indicating how strong an atom’s signal is present in the voxel. We refer to these density maps as original raw density maps, which need to be further preprocessed so they can be used to train AI methods. Building upon established practices and techniques14,19, the preprocessing is implemented in two steps: (a) cryo-EM density map resampling and (b) cryo-EM density map normalization, as follows:

Resample cryo-EM density maps

Different raw cryo-EM density maps usually have different voxel sizes, which need to be standardized. We resampled the density maps to a uniform voxel size of 1 Angstrom (Å), using vol resample command within UCSF ChimeraX27 in the non-interactive mode. The idea of resampling density map is illustrated in Fig. 2a,b. All the resampled density maps have uniform, the same voxel size and are used for the following normalization step.

Fig. 2
figure 2

An example of density map grid resampling and division of a cryo-EM density map. (a) Density map (EMD-22898) in the original grid. (b) Resampled density map to the uniform grid size of 1 Å. (c) Grid division of the density map where each cube is sub-grid with specific size. In this illustration, it is 1 Å × 1 Å × 1 Å. A coarser grid is shown for the purpose of illustration.

Normalize cryo-EM density maps

We applied scaling and clipping to normalize the density values of the resampled cryo-EM density maps. In the cryo-EM density maps, positive density values represent regions (voxels) where the protein is likely present. The range of these values can differ from maps to maps with some maps containing values in one range (e.g., [−2.32, 3.91]) and others in another range (e.g., [−0.553, 0.762]). To make the density values of different density maps comparable, we perform the percentile normalization by first calculating 95th percentile of the positive density values in a density map and then dividing the density values in the map by this value. To deal with the extreme outliers in the density values, we set all values below 0 to 0 and all values above 1 to 1 after the division. The normalization removes the cross-map difference caused by differences in experimental conditions, and software used to process the maps, allowing AI methods to learn patterns across different cryo-EM density maps. Finally, the resampled and normalized map is saved in the MRC format28 with the filename as emd_normalized_map.mrc. This file is available inside the directory named after the EMDB entry ID and is used as an input for AI (e.g., deep learning) models and for the label generation process.

Labeling density maps

Each voxel of a cryo-EM density map has a density value which positively correlates with the possibility of the presence of an atom’s signal in the voxel, which can be used as input features to predict positions and types of atoms. After a broad review9 of the structural properties of current machine learning methods to build atomic models from density maps, we created the following labels for voxels in a density map: atom labels, amino acid labels, and secondary structure labels.

For atom labeling, we created an empty (with values of 0) mask with the same dimension as the emd_normalized_map.mrc, indicating that the mask was empty. We then utilized the corresponding PDB file for the density map to label each voxel containing backbone carbon-α atom (Cα) as 1, backbone nitrogen atom (N) as 2, and the carbonyl backbone carbon atom (C) as 3. The mask created from the emd_normalized_map.mrc map is a 3D grid, where the location of each voxel is determined by indices (i, j, k). But the corresponding PDB file used to label the voxels are in another 3D coordinate system (x, y, z). Therefore, we calculated the corresponding indices of each backbone atom (Cα, N and C) in the mask from it’s atomic coordinates using the Formula 1, where i, j, k are the grid indices of the atom in the mask, x, y, z are the coordinates in the PDB file, originx,originy,originz are the origin of x, y, z axis respectively found in the emd_normalized_map.mrc map, and voxelx,voxely,voxelz are the voxel size of x, y, z axis respectively found in emd_normalized_map.mrc map.

Similar to the atoms labeling, we created a second mask containing amino acid type labels. We labeled each voxel containing a Cα atom with one of 20 different types of standard amino acids, represented by 20 numbers from 1 to 20, while 0 denotes the absence of Cα atom or unknown amino acid type. Moreover, following the same approach used for atoms and amino acid types labeling, we created a third mask for secondary structure labels. We used UCSF ChimeraX27 to identify and extract the secondary structure type (coil, β-strand, and α-helix) for each Cα atom from the PDB file. The extracted secondary structures were then used to label each voxel of the mask that contains a Cα atom. We used 1, 2, and 3 to represent coil, α-helix, and β-strand, respectively. Figure 3 shows an example of density map labels.

$$i=\lceil \left(\frac{\lfloor (z-origi{n}_{z})}{voxe{l}_{z}}\right);j=\lceil \left(\frac{\lfloor (y-origi{n}_{y})}{voxe{l}_{y}}\right);k=\lceil \left(\frac{\lfloor (x-origi{n}_{x})}{voxe{l}_{x}}\right)$$
(1)
Fig. 3
figure 3

An example of labeling a cryo-EM density map. (a) The density map of SARS-CoV spike gycoprotein, EMD-22898 visualized at recommended contour level of 0.7. (b) Three different types of protein backbone atoms (Cα, N, C) labeled in different colors. (c) Three different secondary structure elements labeled in different colors. (d) Twenty different amino acid labeled in different colors. The image are generated using UCSF ChimeraX’s27 surface color by volume data value.

Data splits

In the phase of training deep learning models, the training dataset is used to optimize the parameters (i.e., weights) of the models, and the validation dataset is used to evaluate their performance on the data not used to train the model, which can be used to reduce/avoid the overfitting and select a final trained model for test. Subsequently, the selected model is blindly tested on the test dataset to provide an unbiased evaluation of the model’s capability of generalizing to new cryo-EM density maps not used in both the model training and selection. We selected 208 cryo-EM density maps, which had previously been used as test data in the previous works19,29, from our dataset to create the test dataset30. The remaining 7,392 density maps were split into the training dataset consisting of 6,652 maps and the validation dataset consisting of 740 maps according to 90% and 10% ratio. Table 2 reports the number/percentage of maps in the training and validation datsets for each resolution range, indicating the statistics for the two datasets is largely consistent. The resolution value of a cryo-EM maps refers to the level of detail at which the structural features of the biological molecules can be visualized within the 3D map. A smaller resolution value indicates higher resolution and better definition of structural features, while a larger resolution value corresponds to lower resolution and less distinct features. As such, we used resolution-based splitting of the cryo-EM density maps for training and validation of our deep learning model. Additionally, the ID-based splits for the full dataset are provided in the Cryo2StructData Dataverse13 because the unique IDs assigned to each cryo-EM density map can be influenced by factors such as the type of entry, metadata, and curation criteria. If necessary, users may choose to split Cryo2StructData into training, validation and test datasets differently.

Table 2 The resolution distribution of the maps in the train and validation datasets.

Grid division

The dimensions of density maps of different proteins vary and are usually too large to fit into the memory of a standard GPU for training deep learning models. Similar to the approach employed in DeepTracer19, Haruspex21, CR-I-TASSER23, Emap2sec24, and Cascaded-CNN14, we performed grid division to divide the density maps into 3D subgrids with dimension of 32 × 32 × 32 overlapped by 6 voxels on each face of the subgrid to train deep learning methods. We choose the dimension 32 as it is big enough to capture the patterns (e.g. six turns of alpha helix) in the data and is small enough to be used with GPUs effectively. In the inference stage, the predictions for the sub-grids can be stitched back to obtain the prediction for the full density map by concatenating the predictions for the central 20 × 20 × 20 core voxels of the sub-grids. This approach allowed us to preserve the spatial information of the density maps and overcome the abnormal predictions for the voxels being cut off at boundaries of sub-grids during the grid division. Figure 2c shows an example of grid division. The sub-grids of each density map along with their corresponding labeled mask maps are saved as a single entity in numpy array for deep learning training and test. In total, there are ~39 million training sub-grids and ~4 million validation sub-grids. Moreover, the scripts of dividing density maps are provided at the Cryo2StructData’s GitHub repository for users to create sub-grids of their defined dimensions if necessary.

Data Records

Cryo2StructData Dataverse13,31 contains the necessary metadata, 3D density map, and mask files to enable AI experts without much domain knowledge to develop AI methods for modeling protein structures from cryo-EM density maps.

The following data files for each cryo-EM density map are provided in its individual directory:

  • The original cryo-EM density map with its corresponding protein model in the PDB format, the original protein sequence in the FASTA format32, and the sequence extracted from the PDB file.

  • The Clustal Omega33 alignment between the primary sequence listed in the metadata of the PDB entry and the sequence extracted from the 3D coordinates in the PDB file. The two sequences are usually highly similar but not identical because some residues in the original sequence may be disordered and has not been modeled (i.e, no x, y, z coordinates) in the PDB file.

  • The resampled and normalized cryo-EM density map generated from the original density map. The values of the voxels in this density map were normalized into the range [0, 1] from their values in the original density map and are ready for being used as input for AI models.

  • The labeled masks in which the voxels containing the key backbone atoms (carbon-alpha (Cα), nitrogen (N) and carbonyl backbone carbon (C)), the Cα atoms only, the twenty different types of amino acids that Cα atoms belong to, and the three different types of secondary structures (helix, strand, coil) of the Cα atoms are labeled. The labeled masks can be used as targeted outputs for AI and machine learning models to predict from the resampled and normalized density maps (input).

A comprehensive documentation of the Cryo2StructData is available within the Cryo2StructData Dataverse13,31,34,35. This documentation provides in-depth insights into the dataset, elucidating the composition of data files, the structure of directories, and the overall organization of the dataset.

Technical Validation

We validated every step of the data processing pipeline as shown in Fig. 1 and described in Cryo2StructData Preparation section. The source code used to validate data preparation pipeline is released at the GitHub repository of Cryo2StructData for users to reproduce the process.

Validation of preprocessing step

The preprocessing phase of Cryo2StructData preparation performs two steps: (a) cryo-EM density map resampling and (b) cryo-EM density map normalization. We verified both steps for all the cryo-EM density maps present in Cryo2StructData. Additionally, since each cryo-EM density map contains a unique EMDB-ID, we verified the availability of duplicate density maps and found none.

Resampling validation

During the initial stage of data preparation, the density maps were resampled to a uniform voxel grid with a voxel size of 1 Å. To assess the accuracy of this resampling process, we performed validation checks to confirm the voxel size of all density maps within the Cryo2StructData. The validation procedure utilized the mrcfile package28, a Python implementation of the MRC2014 file format26, widely utilized in structural biology for storing cryo-EM density maps. The results of the validation phase clearly demonstrated that all cryo-EM density maps in the Cryo2StructData exhibited a voxel size of 1 Å. This validation, in turn, confirms the precision and consistency of the resampling step, substantiating the attainment of the desired voxel size of 1 Å.

Normalization validation

To assess the validity of the normalization step in the data preparation process, we performed checks to identify any outliers and ensure that all density values were confined within the range of [0, 1] for each density map. The validation check did not find any outliers, and all density values were found to lie within the specified range [0, 1]. This validation further corroborates the consistent implementation of the preprocessing step, as all density maps in the Cryo2StructData exhibited values within the defined range [0, 1].

Figure 4 displays the statistical characteristics of a density map, both before and after the preprocessing step, as visualized using UCSF Chimera27. By analyzing Fig. 4, we can further confirm that the voxel size is 1 Å, and the density values fall within the range of [0, 1], as expected. This provides additional evidence of the correct implementation of the resampling and normalization steps.

Fig. 4
figure 4

The cryo-EM density map of human p97 bound to UPCDC30245 inhibitor (EMD-3295). (a) Density values in raw cryo-EM density map. (b) Density values after preprocessing (resampled and normalized). The voxel size is 1 Å, and the density values are between[0–1]. The image is extracted from volume viewer of UCSF Chimera27.

Validation of labeling step

We validated the labeling step by leveraging the mask map of labeled Cα atoms and their corresponding known atomic backbone structure. To obtain approximate 3D coordinates (x, y, z) from their respective indices (i, j, k), we applied Formula 2. The grid indices, (i, j, k), were assigned values of either 0 (indicating the absence of a Cα atom) or 1 (indicating the presence of a Cα atom) by the data preparation pipeline. After converting the indices to their associated coordinates for each Cα atom, we observed, as expected, that the protein backbone structure was not aligned. Consequently, a direct structure-to-structure comparison for evaluation was not feasible. To assess the accuracy of the conversion step, we performed a comprehensive search for the converted coordinates within the known Cα backbone structure. During the search process, we kept track of the number of precise Cα atoms matching with the known backbone structure. We randomly selected approximately 4,450 (>55%) of the Cryo2StructData to verify the conversion process. We found that around 4,185 (94%) of the maps had exact matching Cα positions for all the voxels. For the remaining 6% of the maps, more than 93% of the voxels had exact match, while less than 7% of the voxels deviated slightly from the exact match. The loss of precision in a small portion of Cα positions in a small portion of density maps is expected due to the interchange between floating-point numbers and integers during the labeling and evaluation steps. Therefore, this validation confirm the accuracy of the labeling step, as smaller than 0.5 angstrom (Å) deviations of a small portion of Cα (about 6% × 7%) positions is expected.

$$x=\left(k\times voxe{l}_{x}\right)+origi{n}_{x};y=\left(j\times voxe{l}_{y}\right)+origi{n}_{y};z=\left(i\times voxe{l}_{z}\right)+origi{n}_{z}$$
(2)

Validation of MRC files

We conducted a comprehensive validation of the entire Cryo2StructData by executing a series of tests to verify its compliance with the MRC2014 format specification26. These tests were performed on the data listed below, which pertains to each density map within the Cryo2StructData:

  1. 1.

    emd_normalized_map.mrc: This file is used as an input to the deep learning-based model for training, validation and inference step. As such, this is an important file that needs to pass all the tests.

  2. 2.

    atom_ca_emd_normalized_map.mrc: This is the labeled Cα-only mask map used as labels for training the deep learning-based model, which identifies the Cα atoms in the input density map.

  3. 3.

    atom_emd_normalized_map.mrc: This is the labeled backbone atom (Cα, C, and N) mask map used as labels for training the deep learning-based model, which identifies the backbone atoms in the input density map.

  4. 4.

    amino_emd_normalized_map.mrc: This is the labeled amino acid type mask map used as labels for training the deep learning-based model, which identifies the amino acid residue types of Cα atoms in the input density map.

  5. 5.

    sec_struc_emd_normalized_map.mrc: This is the labeled secondary structure (coil, α-helix, and β-strand) mask map used as labels for training the deep learning-based model, which identifies the secondary structure types of Cα atoms in the input density map.

Each of the above files went through a series of tests and checks to identify any potential problems. These files undergoes validation using the mrcfile package, specifically the mrcfile.validate function. The mrcfile Python library is maintained by the Collaborative Computational Project for Electron cryo-Microscopy (CCP-EM)28 and serves for reading, writing, and validating MRC2014 files26. The following tests were conducted to validate the Cryo2StructData. We provide a brief description of these tests below and refer users to the MRC2014 paper26 for detailed explanations of these tests.

  1. 1.

    MRC format ID string: The map field in the header should contain “MAP”. The character string ‘MAP’ is used to identify file type.

  2. 2.

    Machine stamp: The machine stamp should contain one of 0 × 44 0 × 44 0 × 00 0 × 00, 0 × 44 0 × 41 0 × 00 0 × 00 or 0 × 11 0 × 11 0 × 00 0 × 00.

  3. 3.

    MRC mode: The mode field should be one of the supported mode numbers:

    1. (a)

      0 8-bit signed integer

    2. (b)

      1 16-bit signed integer

    3. (c)

      2 32-bit signed real

    4. (d)

      4 transform: complex 32-bit reals

    5. (e)

      6 16-bit unsigned integer

    6. (f)

      12 16-bit float

  4. 4.

    Map and cell dimensions: The header fields nx, ny, nz, mx, my, mz, cella.x, cella.y, and cella.z must all be positive numbers.

  5. 5.

    Axis mapping: The header fields mapc, mapr, and maps must contain the values 1, 2, and 3 - in any order.

  6. 6.

    Volume stack dimensions: If the spacegroup is in the range 401–630, representing a volume stack, the nz field should be exactly divisible by mz to represent the number of volumes in the stack.

  7. 7.

    Header labels: The nlabl field should be set to indicate the number of labels in use, and the labels in use should appear first in the label array.

  8. 8.

    MRC format version: The nversion field should be 20140 or 20141 for compliance with the MRC2014 standard.

  9. 9.

    Extended header type: If an extended header is present, the exttyp field should be set to indicate the type of extended header.

  10. 10.

    Data statistics: The statistics in the header should be correct for the actual data in the file, or marked as undetermined.

  11. 11.

    File size: The size of the file on disk should match the expected size calculated from the MRC header.

During the checks, if the file is completely valid, the mrc.validate function returns True; otherwise, it returns False. Seriously invalid files trigger a RuntimeWarning. All data items listed above successfully passed the validation checks for the entire collection of cryo-EM density maps in the Cryo2StructData. This affirms the dataset’s full compliance with the MRC2014 format, confirming its validity as a set of density maps and masks suitable for application in AI-based methodologies.

Validation using deep learning

To further validate the utility and quality of Cryo2StructData, we trained and test two deep transformer-based models36 on Cryo2StructData to predict backbone atoms and amino acid types from density maps respectively. Moreover, we used predicted Cα atom positions and probabilities of 20 amino acid types to construct a Hidden Markov Model (HMM) model36,37 whose hidden states represent predicted Cα atoms. The HMM model was used to align protein sequences with predicted Cα positions to build the backbone structures of proteins via a customized Viterbi algorithm36,38. In the subsequent sections, we present a concise overview of the deep learning model and the HMM-guided alignment technique applied to predicted Cα voxels. For a comprehensive understanding of implementation details and in-depth analysis involving additional atomic structures, we direct readers to the Cryo2Struct paper36.

Training and testing deep transformer models on Cryo2StructData

One transformer model was trained to classify each voxel of a sub-grid of a density map into one of four different classes representing three backbone atoms (Cα, C and N) and absence of any backbone atoms. Another model was trained to classify each voxel of the sub-grid into one of twenty-one different amino acid classes representing twenty different amino acids and unknown or absence of amino acid. Additionally, we also trained a transformer model to predict the secondary structure of voxels, which was not used by build the HMM. In the inference stage, the predictions for all the sub-grids of a full density map are combined to generate a prediction for the entire map. We trained different sets of models with different parameters (measured in millions), dataset sizes, and extraction layers, as shown in Table 3.

Table 3 Training on the Cryo2StructData with different deep learning models and data sizes.

For the Small Train, we used 25% of the Cryo2StructData (small subsample31), with 4 transformer layers and 6 attention heads. For Full Train v1, we utilized all the training and validation data from the Cryo2StructData (full data34), with 12 transformer layers and attention heads. We used the full trained models34 to construct HMMs (Section 16). The training and validation macro F1 score plots generated for Full Train v1 and Small Train are shown in Fig. 5. These plots illustrate the increase in F1 scores for both full and small training, providing further evidence that models trained on Cryo2StructData data can successfully learn and identify patterns in the data.

Fig. 5
figure 5

The training and validation F1 scores for transformer-model trained on Cryo2StructData. (a) Full Train v1. (b) Small Train.

For Full Train v2 + Seq, we utilized all the training and validation data from the Cryo2StructData (full data34), with 24 transformer layers and 12 attention heads. In addition, we integrated a protein language model (ESM-239) generated sequence features into each protein density map, enhancing the sub-grid input with sequence information before passing it into the transformer layer. The per-residue representation is extracted using the ESM-2 pretrained model and then averaged to generate the per-sequence representation. We employed the ESM-2 pretrained model, which comprises 33 layers and 650 million parameters, to generate embedding dimensions of 1,280. This computation was performed on the CPU of the Andes supercomputer40, utilizing its 256 GB RAM. To accommodate memory constraints, we utilized a sequence cutoff of 2,750 residues for each protein sequence. We extracted the encoded sequence representation from the 3rd, 6th, 12th, and the final 24th layers. We noticed an increase in both validation recall and precision for the amino and atom type prediction tasks. A comparison of recall scores between Full Train v1 and Full Train v2 + Seq is presented in Fig. 6. By incorporating sequence information and enhancing the model’s complexity, the transformer-based model trained on the Cryo2StructData demonstrates improved learning. The sequence for each density map is made available within the Cryo2StructData, providing a convenient resource for AI-based researchers to utilize this complementary data and enhance the prediction capabilities of deep learning models.

Fig. 6
figure 6

The validation recall scores for amino acid type prediction during training. (a) Full Train v1. (b) Full Train v2 + Seq. Incorporating sequence information to train the model improves the recall.

HMM-Guided Alignment of Cα Atoms and Protein Sequences

Connecting the predicted carbon-α atoms into protein chains and determining their amino acid types accurately is one of the most challenging aspects of building protein model from a cryo-EM density map. To address this challenge, we constructed a HMM36 from Cα atoms and their amino acid types predicted by the deep transformer models. This model aligns the known protein sequences with the predicted hidden states denoting Cα atoms in the HMM, allowing us to link Cα atoms into protein chains and assign amino acid types to the atoms in a single step.

HMMs are generative models described by transition probability between hidden states and emission probabilities of generating observed symbols from hidden states. Specifically, the predicted Cα atoms (voxels) from the atom type prediction are used as hidden states in the HMM, which are fully connected to generate the sequence of a protein. The amino acid emission probabilities of each Cα hidden state are the normalized geometric mean of the probabilities of twenty different amino acid types predicted by the amino acid type prediction model and their background probabilities precomputed from the protein sequences in the training data. The geometric mean is computed as \(\sqrt{a\times b}\), where a and b represent the predicted probability for an amino acid type and its background frequency, respectively. The transition probability between any two Cα hidden states are calculated according to the euclidean distance (x) between two Cα atoms based on the Gaussian distribution with a mean (μ) of 3.8047 and a standard deviation (σ) of 0.036 times a fine-tune able scaling factor (Λ) (Λ = 10).

Because it is not known which Cα state generates the first amino acid of a protein chain, the HMM allows every Cα hidden state to be the initial state. The probability for a Cα state to be the initial state is equal to the probability of it emitting the first amino acid divided by the sum of the probability of every state emitting the first amino acid. The constructed HMM is then used by a customized Viterbi algorithm36 to compute the most likely path of aligning protein sequences with the Cα hidden states, resulting a determined protein backbone structure. The customized Viterbi algorithm allows any Cα state to occur at most once in the path because one Cα position can be occupied by only one amino acid of a protein.

Evaluation results on SARS COVID-19 proteins

We tested the trained deep learning models on three SARS-CoV-2 proteins41,42,43 and seven other proteins. These proteins were not part of training dataset. The voxel-wise predictions of Cα atom were evaluated using F1-score (i.e., geometric mean of precision and recall of Cα predictions). We used F1-score as it is a more balanced metric than the accuracy when there is a significant imbalance in the class distribution (i.e., the portion of voxels containing Cα atoms is very small in this case). We compare the F1 scores of our predictions with those of the random predictions in Fig. 7. The former is much better the latter.

Fig. 7
figure 7

F1 scores of Cα atom predictions for the density maps of two SARS COVID-19 (EMD-2289841, EMD-2585543) and a human p97 (EMD-329544) test proteins. The curve in blue denotes how F1 score of the deep transformer trained on Cryo2StructData changes with respect to the threshold on predicted Cα atom probabilities and the curve in orange the random predictions.

We further evaluated the Cα backbones aligned by the HMM for the two SARS COVID-19 proteins41,43 and the human p97 protein44 against the known protein structures as shown in Figs. 810. To adhere to the commonly practiced approach in the literature19,29, we used phenix.chain_comparison tool to compute the root mean squared distance (RMSD), the percentage of matching Cα atoms, and the percentage of sequence identity. Phenix’s chain_comparison tool compares two structures to identify the number of matching Cα atoms. Using this approach, it calculates the matching percentage, which represents the proportion of residues in the known structure that have corresponding residues in the modeled backbone structure. Similarly, it reports the sequence matching percentage indicating the percentage of matched residues that have the same amino acid type. The results in Table 4 shows that using the deep learning predictions generated from the model trained on Cryo2StructData, combined with HMM alignment strategy, we were able to determine the backbone structures of the 10 proteins having varying resolutions and number of residues with the good accuracy on average. The average RMSD, where lower values are preferable, is 1.51 Å. The matching percentage and sequence identity (ID) percentage, where higher values are better, is 60.4% and 37.94%, respectively. The average resolution and the number of residues modeled are 2.77 Å and 2222.3, respectively. Furthermore, the Cryo2Struct paper36 presents a thorough examination of atomic structures generated by a deep learning model trained on Cryo2StructData. Figures 813 shown in the Several examples of predicting Cα atoms and building protein backbone structures section illustrate several good, detailed examples of predicting Cα atoms and modeling protein backbone structures from density maps.

Table 4 Evaluation scores for predicted backbone structures of proteins with varying resolutions and numbers of residues.

Several examples of predicting Cα atoms and building protein backbone structures

Figures 1113 visualize and analyze the predicted Cα backbone structures of three proteins with respect to the true Cα atoms of the density maps. The predicted backbone models were built by the deep transformer trained on Cryo2StructData and Hidden Markov Model. Figures 810 compare the modeled backbone structures of the same three proteins with their true backbone structures extracted from the known protein structures. We used UCSF ChimeraX27 for the visualization of the cryo-EM density maps and the protein structures. Generally, the modeled backbone structures match the true Cα atoms in the density maps and the true backbone structures reasonably well.

Fig. 8
figure 8

(a) The known protein structure corresponding to the density map EMD-22898 (PDB code 7KJR). The known structure has 448 residues. (b) The true Cα backbone structure extracted from the known PDB structure. (c) The superimposition of the predicted backbone (blue) structure with the known backbone structure (gold).

Fig. 9
figure 9

(a) The known protein structure of the density map EMD-25855 (PDB code 7TEY). The known structure has 2,703 residues. (b) The true Cα backbone structure extracted from the known protein structure. (c) The superimposition of the predicted backbone structure (blue) with the known backbone structure (gold).

Fig. 10
figure 10

(a) The known protein structure of the density map EMD-3295 (PDB code 5FTJ). The structure has 4,338 residues. (b) The true Cα atom backbone structure extracted from the known protein structure. (c) The superimposition of the predicted backbone structure (blue) with the known backbone structure (gold).

Fig. 11
figure 11

(a) The cryo-EM density map of SARS-CoV-2 ORF3a (EMD-22898) visualized at the recommended contour level of 0.7 (10.3 σ). The map dimension is 300 × 300 × 300 and has density values between −2.319−3.909. The voxel dimensions is 0.727 × 0.727 α 0.727 (Å). (b) The true Cα atom voxels (mask) extracted from the density map. (c) The predicted Cα model is overlaid with true Cα atom voxels.

Fig. 12
figure 12

(a) The cryo-EM map of SARS-CoV-2 Delta (B.1.617.2) spike protein (EMD-25855) visualized at recommended contour level of 0.121 (5.0 σ). The map dimension is 400 × 400 × 400 and has density values between −0.467−1.315. The voxel dimensions is 1 × 1 × 1 (Å). (b) The true Cα atom voxels (mask) extracted from the density map. (c) The predicted Cα model is overlaid with true Cα atom voxels.

Fig. 13
figure 13

(a) The cryo-EM density map of human p97 bound to UPCDC30245 inhibitor (EMD-3295) visualized at recommended contour level of 0.0375 (3.4 σ). The map dimension is 312 × 312 × 312 and has density values between 0.085−0.118. The voxel dimensions is 0.637 × 0.637 × 0.637 (Å). (b) The true Cα atom voxels (mask) extracted from the deposited density map. (c)The predicted Cα model is overlaid with true Cα atom voxels.

Usage Notes

Cryo2StructData is specifically designed for the task of building protein models from cryo-EM density maps. The prepared cryo-EM density maps available in Cryo2StructData Dataverse13,31 can be visualized using UCSF Chimera27. For the analysis of these maps, it is recommended to use the mrcfile package28 in Python.

The data preprocessing scripts in the Cryo2StructData GitHub repository can be modified or utilized to generate custom data tailored to user’s specific needs or reproduce the data generation process. The label generation process scripts are also included in the GitHub repository.

Cryo2StructData contains two types of data splits. The first type is resolution-based splits, and the second type is ID-based splits. We trained our deep learning model on the resolution-based splits dataset, as described in Table 2. Users have the option to use the splits we have created or can create their own splits based on their needs. Additionally, we provide a smaller downsampled version31 of the entire Cryo2StructData for users who do not have access to large computational resources. The split-based information is available in Cryo2StructData Dataverse34,35.

The provided trained deep learning models34,35 for atom and amino acid-type prediction, are essential for predicting voxels within the input cryo-EM density maps. The Cryo2StructData GitHub repository contains a deep learning inference program that utilizes these trained models for voxel-wise prediction. Our novel Hidden Markov Model as detailed in HMM-Guided Alignment of Cα Atoms and Protein Sequences (Section 16), a fast and efficient method for aligning predicted voxels to form a protein backbone structure in a single step, is written in C++ and is available in the Cryo2StructData GitHub repository. Additionally, both the Cryo2StructData Dataverse and the Cryo2StructData GitHub repository contain detailed information about the dataset and instructions for building the protein backbone model from cryo-EM density map, enabling users to seamlessly utilize the data and code. Additionally, we direct readers to the Cryo2Struct paper36, which utilizes Cryo2StructData for training its deep learning model. Cryo2Struct36 also provides comprehensive analyses of structures modeled from cryo-EM density maps.

The integration of ESM-239 generated sequence features into the protein density maps has yielded promising results, notably improving validation recall and precision for amino acid type prediction tasks, as shown in Fig. 6. This plot demonstrates the impact of incorporating sequence information into the density map. The sequence for each cryo-EM density map is made available within the Cryo2StructData Dataverse13,31, providing a convenient resource for researchers to utilize this complementary data and enhance the prediction capabilities of deep learning models.

The cryo-EM density map data structure is prepared as a 3D (three dimension) grid of values organized into layers of rows and columns. The voxel grids can be inferred as a regularly spaced point cloud where each point is a voxel. Other approaches, such as modeling input cryo-EM density maps as 3D point clouds and leveraging 3D Graph Neural Networks45, can be considered.

The dataset of this scale allows researchers to train and test robust and powerful deep learning models to predict the positions of protein backbone atoms (e.g., Cα atoms) and their amino acid types in 3D cryo-EM density maps, which can be linked together to build 3D protein models from scratch without using any known structural information as templates. The AI-powered de novo modeling of protein structures from cryo-EM density maps will significantly extend the capability and efficiency of cryo-EM techniques to solve the structures of large protein complexes and assemblies.