Biomolecular machines have evolved to perform specific tasks through a concerted sequence of conformational motions. There is growing recognition that such motions involve continuous conformational changes, rather than jumps between a small number of discrete states1,2. Apart from disordered proteins, conformational continua span a spectrum of different energies. In thermal equilibrium, the probability of a conformational state being occupied is determined by the Boltzmann factor, which drops exponentially with increasing energy.
Conformational motions of proteins can therefore be represented as low-lying (and hence strongly occupied) pathways on one or more energy landscapes (EL)3,4. In principle, an unlimited number of conformational paths connect a “start” conformation A to an “end” conformation B. However, most such paths include high-energy states, which are sparsely populated under biologically relevant conditions. Because occupation probability falls off exponentially with energy according to the Boltzmann relationship5, the lowest-energy conformational paths contribute most to function.
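The exponential dependence invoked above can be written out explicitly. The following is the standard Boltzmann relationship for thermal equilibrium; the symbols (state energies $E_i$, partition function $Z$, Boltzmann constant $k_B$, temperature $T$) are introduced here for illustration and are not notation taken from the original text.

```latex
p_i = \frac{e^{-E_i / k_B T}}{Z}, \qquad
Z = \sum_j e^{-E_j / k_B T}, \qquad
\frac{p_A}{p_B} = e^{-(E_A - E_B) / k_B T}
```

As an illustrative example, a state lying $5\,k_B T$ above another is occupied only $e^{-5} \approx 6.7 \times 10^{-3}$ times as often, which is why paths passing through high-energy states are so sparsely populated.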
The growing recognition of the importance of energy landscapes for discerning function has spawned an increasing number of sophisticated algorithms capable of mapping continuous conformational motions. Using a synthetic dataset of cryo-EM snapshots with a known ground-truth energy landscape, we compare the performance of four leading algorithms, specifically Relion Multibody6, CryoSPARC 3DVA7, Manifold-EM8, and CryoDRGN VAE9, in faithfully extracting the energy landscape from snapshots. We benchmark each algorithmic approach in terms of the accuracy with which the correct energy landscape is recovered from the data.
To date, no comparative benchmarking study of the strengths and weaknesses of different data-analytical approaches for analyzing continuous conformations has been reported. The lack of comparative benchmarks hampers the assessment of the usefulness and reliability of different algorithmic tools in extracting information from experimental data.
In this paper, we use a synthetic dataset generated with an a priori known “ground-truth” energy landscape to benchmark, in silico, the four leading algorithms named above for conformational analysis of cryo-EM data. Specifically, we quantify each method’s ability to: (a) recover the correct energy landscape from synthetic cryo-EM datasets; (b) reveal the functionally important conformational degrees of freedom; and (c) identify the functionally relevant conformational paths on these landscapes. Although the nature and number of potentially useful algorithms are still evolving, the four selected approaches represent the state of the art in mapping continuous conformational motions from ensembles of single-particle cryo-EM snapshots.
The primary goals of this paper are thus twofold:
(a) Benchmark the performance of the four leading algorithms listed above in faithfully extracting conformational energy landscapes from synthetic cryo-EM snapshots; and
(b) Provide a well-characterized synthetic cryo-EM dataset suitable for comparative benchmarking, in order to facilitate the development of more effective data-analytical tools capable of identifying functionally relevant conformational landscapes and motions.
The synthetic dataset of three million cryo-EM snapshots (signal-to-noise ratio, SNR = 1) stems from a ribosome-like object with two conformational degrees of freedom, whose underlying energy landscape contains 12 energy minima of various depths arranged on a 3 × 4 grid (Fig. 1a). The distribution of points (each representing a single snapshot) is determined by this energy landscape. The two conformational degrees of freedom are rotations of the small subunit (SSU) about two mutually orthogonal axes, referred to as conformational coordinates 1 and 2. The SSU rotates about these axes in a ratchet-like manner, while the large subunit (LSU) remains fixed (Fig. 1b).
Fig. 1 (a) The conformational landscape of the synthetic ribosome model along two conformational coordinates, containing twelve wells (labelled 1 to 12) of uneven depths. The depths are reflected in the histogram along the two axes. (b) Real-space representation of the cryo-EM density of the synthetic ribosome model, indicating the two conformational directions.
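To make the link between an energy landscape and the distribution of snapshots concrete, the sketch below builds a toy two-coordinate landscape with 12 wells on a 3 × 4 grid, draws conformational coordinates with Boltzmann-weighted probabilities, and estimates the energy back from the occupancy histogram. It is a minimal illustration and not the procedure used to generate the dataset: the Gaussian well shape, the well depths and width, the unit coordinate range, the choice of kT = 1, and all function names are assumptions made here.

```python
import numpy as np


def make_landscape(n_grid=120, depths=None, width=0.08, seed=0):
    """Toy landscape: 12 Gaussian wells on a 3 x 4 grid (all values hypothetical)."""
    rng = np.random.default_rng(seed)
    cx = (np.arange(4) + 0.5) / 4  # 4 well positions along coordinate 1
    cy = (np.arange(3) + 0.5) / 3  # 3 well positions along coordinate 2
    centers = np.array([(x, y) for y in cy for x in cx])  # shape (12, 2)
    if depths is None:
        # Uneven well depths in units of kT (assumed range, for illustration).
        depths = rng.uniform(1.0, 4.0, size=len(centers))
    axis = np.linspace(0.0, 1.0, n_grid)
    X, Y = np.meshgrid(axis, axis, indexing="xy")  # rows index Y, columns index X
    E = np.zeros_like(X)
    for (x0, y0), d in zip(centers, depths):
        E -= d * np.exp(-((X - x0) ** 2 + (Y - y0) ** 2) / (2 * width ** 2))
    return axis, E, centers, depths


def sample_snapshots(E, n_samples, kT=1.0, seed=0):
    """Draw grid cells with probability proportional to exp(-E / kT)."""
    rng = np.random.default_rng(seed)
    p = np.exp(-(E - E.min()) / kT)
    p /= p.sum()
    flat = rng.choice(E.size, size=n_samples, p=p.ravel())
    rows, cols = np.unravel_index(flat, E.shape)
    return rows, cols


def recover_energy(rows, cols, shape, kT=1.0):
    """Estimate energy from occupancy: E = -kT * ln(n / n_max); empty cells give inf."""
    counts = np.zeros(shape)
    np.add.at(counts, (rows, cols), 1)
    with np.errstate(divide="ignore"):
        return -kT * np.log(counts / counts.max())


if __name__ == "__main__":
    axis, E, centers, depths = make_landscape()
    rows, cols = sample_snapshots(E, n_samples=300_000)
    E_hat = recover_energy(rows, cols, E.shape)
    finite = np.isfinite(E_hat)
    r = np.corrcoef(E[finite], E_hat[finite])[0, 1]
    print(f"wells: {len(centers)}, correlation of true vs. estimated energy: {r:.2f}")
```

The estimated energy is defined only up to an additive constant, so the sketch sets the most populated cell to zero. In this toy setting the coordinates are sampled directly; in the benchmark each algorithm must first infer the conformational coordinates from noisy projection images before any occupancy can be computed.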
The performance of each of the four data-analytical approaches is quantified in terms of the fidelity of the energy landscape recovered from the synthetic snapshots. This fidelity is measured by “Recall” and “Accuracy”, both defined in terms of intrinsic distances between snapshots calculated using the Euclidean distance metric. (For details, see the “Accuracy metric” subsection of “Methods”.)