### Inclusion & ethics

We declare that our research complies with all relevant ethical regulations.

### Particle annotation using simplified labels

The supervised training of the DNN model of DeepETPicker requires pairs of subtomograms and their corresponding voxel-level masks/labels (Fig. 1a, c). Owing to the low SNRs and reconstruction distortions of tomograms, manual voxel-level annotation of macromolecular particles is challenging and time-consuming. In this study, our goal is to identify particles rather than to obtain their full masks. To this end, we simplify the manual annotation process to labeling only the centres of particles, which is simple and efficient. Based on these annotations, three types of simplified masks centred on the labeled particles are generated as replacements for the real full masks: cubic masks (Cubic-M), ball masks (Ball-M) and truncated-ball masks (TBall-M). Specifically, taking the centre of each particle as the origin, the corresponding simplified masks of size $[2r+1, 2r+1, 2r+1]$ are generated as follows:

$$M=\left\{(x,\,y,\,z)\;|\;x,\,y,\,z\in[-r,\,r]\cap\mathbb{Z}\right\}$$

(1)

$$\mathrm{mask}_{\mathrm{cubic}}(x,y,z)\big|_{(x,y,z)\in M}=c$$

(2)

$$\mathrm{mask}_{\mathrm{ball}}(x,y,z)\big|_{(x,y,z)\in M}=\begin{cases}c, & \text{if}\ \sqrt{x^{2}+y^{2}+z^{2}}<r\\ 0, & \text{otherwise}\end{cases}$$

(3)

$$\mathrm{mask}_{\mathrm{tball}}(x,y,z)\big|_{(x,y,z)\in M}=\begin{cases}c, & \text{if}\ \sqrt{x^{2}+y^{2}+z^{2}}<\sqrt{2}\,r\\ 0, & \text{otherwise}\end{cases}$$

(4)

where $c$ is the class index. The truncation radius $\sqrt{2}\,r$ of TBall-M ensures that it is sufficiently different from both Ball-M and Cubic-M (see Supplementary Methods A.3 for further information). The diameter of each generated mask is denoted as $d=2r+1$, which should be no larger than the particle diameter. To ensure good particle-picking performance, the diameter of the particles in the given tomogram should preferably be between 7 and 25 voxels. If the particle diameter is much larger than 25 voxels, a proper binning operation can be used to keep it within the recommended range. Examples of the three types of masks are shown in Fig. 1b and Supplementary Fig. 2a.
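As an illustration, the three simplified masks of Eqs. (1)-(4) can be generated with a few lines of NumPy. This is a sketch, not the DeepETPicker implementation; the function and parameter names are our own.

```python
import numpy as np

def make_mask(r, kind="tball", c=1):
    """Generate a simplified (2r+1)^3 mask centred on a particle.

    Illustrative sketch of Eqs. (1)-(4); `kind` selects Cubic-M,
    Ball-M, or TBall-M, and `c` is the class index.
    """
    ax = np.arange(-r, r + 1)                      # [-r, r] ∩ Z, Eq. (1)
    x, y, z = np.meshgrid(ax, ax, ax, indexing="ij")
    dist = np.sqrt(x**2 + y**2 + z**2)
    if kind == "cubic":
        keep = np.ones_like(dist, dtype=bool)      # whole cube, Eq. (2)
    elif kind == "ball":
        keep = dist < r                            # inscribed ball, Eq. (3)
    elif kind == "tball":
        keep = dist < np.sqrt(2) * r               # ball of radius √2·r truncated by the cube, Eq. (4)
    else:
        raise ValueError(kind)
    return np.where(keep, c, 0).astype(np.int64)
```

Because the cube's corners lie at distance $\sqrt{3}\,r > \sqrt{2}\,r$ from the origin, the three masks have strictly increasing voxel counts: Ball-M < TBall-M < Cubic-M.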

Compared to the real full masks of biological particles, these simplified masks can be seen as a class of weak-supervision labels^{21,22}. Subsequent experiments show that DNN segmentation models trained with these simplified labels can effectively segment/detect particles of interest in tomograms. See Supplementary Methods A.3 for a more detailed discussion of the rationale behind the selection of the simplified masks.

### Architecture of the 3D segmentation model

The DNN segmentation model of DeepETPicker, called 3D-ResUNet, adopts an encoder-decoder architecture (Supplementary Fig. 2b). Specifically, the residual-connection idea from 2D-ResNet^{23} is incorporated into 3D-UNet^{24} to better extract features from tomograms. The 3D-ResUNet architecture has 3 downsampling layers in its encoder and 3 upsampling layers in its decoder. Three-dimensional transposed convolution is used in the decoder to upsample feature maps. An ELU^{25} is used as the activation function to accelerate the convergence of the training process. To improve the localization of particles, coordinated convolution^{26} and image pyramid inputs^{27} are incorporated into 3D-ResUNet, which takes the voxels of each subtomogram as input and, for each voxel, outputs $n$ probability scores corresponding to the $n-1$ classes of structures of interest and the background. Coordinated convolution incorporates the spatial context of input images into the convolutional filters, while image pyramid inputs preserve features of input images at different resolution levels.
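The coordinated-convolution idea can be illustrated by appending normalized coordinate channels to the input volume before an ordinary convolution, so the filters can condition on where each voxel lies. The following NumPy sketch shows only this channel construction, not the full 3D-ResUNet; the function name and the (C, D, H, W) layout are our own assumptions.

```python
import numpy as np

def add_coord_channels(vol):
    """Append three normalized coordinate channels to a (C, D, H, W) volume.

    Sketch of the coordinated-convolution input augmentation: a subsequent
    convolution sees the spatial position of each voxel, which aids localization.
    """
    _, d, h, w = vol.shape
    zz, yy, xx = np.meshgrid(
        np.linspace(-1, 1, d), np.linspace(-1, 1, h), np.linspace(-1, 1, w),
        indexing="ij",
    )
    coords = np.stack([zz, yy, xx]).astype(vol.dtype)
    return np.concatenate([vol, coords], axis=0)   # (C+3, D, H, W)
```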

### Configuration for model training and validation

To improve the generalization capability of the segmentation model, data augmentation is used in the training stage. Specifically, the following transformations are performed on the training datasets: random cropping, mirror transformation, elastic deformation of less than 5%, scaling in the range of [0.95, 1.05], and random rotation at angles within [−15°, +15°]. Training is performed using the AdamW optimizer^{28} with an initial learning rate of 10^{−3} and a weight decay of 0.01. A multi-class extension of the Dice loss^{29} is used to calculate the difference between the predicted labels and the ground-truth labels:

$$L_{\mathrm{Dice}}=1-\frac{2\sum_{i=0}^{N^{3}}p_{i}\,g_{i}+\varepsilon}{\sum_{i=0}^{N^{3}}p_{i}^{2}+\sum_{i=0}^{N^{3}}g_{i}^{2}+\varepsilon}$$

(5)

where $p_i$ denotes the label predicted by the segmentation model for voxel $i$ of the $N\times N\times N$ subtomogram, $g_i$ denotes the corresponding ground truth, and $\varepsilon=10^{-8}$ is a small value added for numerical stability.
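For a single class, Eq. (5) can be written directly in NumPy. This is an illustrative sketch under our own naming, not the DeepETPicker training code, which extends the loss to multiple classes.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-8):
    """Soft Dice loss of Eq. (5) for one class over an N^3 subtomogram.

    `pred` holds per-voxel probabilities, `target` the binary ground-truth
    labels; `eps` is the small constant for numerical stability.
    """
    p, g = pred.ravel(), target.ravel()
    return 1.0 - (2.0 * np.sum(p * g) + eps) / (np.sum(p**2) + np.sum(g**2) + eps)
```

A perfect prediction yields a loss near 0, while a completely disjoint prediction yields a loss near 1.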

A generalized form of the F1-score, $F1_{\alpha}=\frac{2P\cdot R^{\alpha}}{P+R^{\alpha}}$, is used as the metric for model validation to place greater emphasis on model recall, where $\alpha$ is a hyperparameter. When $\alpha=1$, $F1_{\alpha}$ reduces to the F1-score. When $\alpha>1$, a model with higher recall $R$ obtains a higher $F1_{\alpha}$. In this study, the hyperparameter $\alpha=3$ is used.
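The recall-weighted score is a one-liner; the sketch below (our own function name) makes the two properties above concrete: at $\alpha=1$ it equals the ordinary F1-score, and at $\alpha=3$ the higher-recall model wins even when precision and recall are swapped.

```python
def f1_alpha(precision, recall, alpha=3):
    """Recall-weighted validation metric F1_alpha = 2*P*R^alpha / (P + R^alpha)."""
    return 2 * precision * recall**alpha / (precision + recall**alpha)
```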

### Postprocessing using mean-pooling non-maximum suppression (MP-NMS) and overlap-tile (OT)

The value of each voxel in the score map generated by 3D-ResUNet denotes its probability of belonging to a certain class, which is in the range of [0, 1]. A threshold $t_{seg}$ is selected to transform a score map into a binary map: a voxel whose value is below the threshold is labeled as 0, and otherwise as 1. Then, the proposed MP-NMS operation, consisting of multiple iterations of mean pooling (MP) followed by one iteration of non-maximum suppression, is performed with the binary map as the initial input. An example of MP-NMS applied to a 2D binary image of size $40\times 40$ pixels is shown in Fig. 1h. The first row shows the outputs of different iterations of MP operations performed on the binary image. After each MP operation, the voxels at mask edges are pulled closer to the voxel value of the background. As the number of MP iterations increases, all voxels of the mask are updated, and the binary mask is eventually converted into a soft mask in which a voxel's value grows with its distance from the background. Each local maximum can be considered a candidate particle centre, and the larger the local maximum, the higher the probability that it is a particle centre. MP-NMS can distinguish the centres of multiple partially overlapping particles, as long as they have distinguishable features (Fig. 1h). Compared to clustering algorithms such as the mean-shift used in DeepFinder^{13}, the MP-NMS operation is substantially faster when accelerated using a GPU (Supplementary Table 6).

For an MP operation with a kernel size of $k\times k\times k$ and a stride of 1, the receptive field of each voxel after the $i$-th iteration of MP operations is

$${RF}_{i}=1+(k-1)\cdot i$$

(6)

To obtain the centroid of a particle with a diameter of $2r+1$, the receptive field ${RF}_{i}$ should be no smaller than the particle diameter. Thus, the minimum number of iterations of MP operations is $\lceil \frac{2r}{k-1}\rceil$, where $\lceil\cdot\rceil$ denotes the round-up (ceiling) operation.
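A minimal 2D sketch of MP-NMS, mirroring the Fig. 1h example: repeated stride-1 mean pooling turns the binary mask into a soft mask, and non-maximum suppression keeps voxels that equal the maximum of their $k\times k$ neighbourhood. This is an illustration under our own naming and a zero-padding assumption, not the GPU-accelerated 3D implementation of DeepETPicker.

```python
import numpy as np

def mean_pool(a, k=3):
    """One k x k mean-pooling pass (stride 1, zero padding) on a 2D map."""
    p = np.pad(a.astype(float), k // 2)
    out = np.zeros(a.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (k * k)

def mp_nms(binary, r, k=3):
    """MP-NMS sketch: ceil(2r/(k-1)) mean-pooling passes per Eq. (6),
    then keep voxels equal to their neighbourhood maximum (candidate centres)."""
    iters = int(np.ceil(2 * r / (k - 1)))       # receptive field >= particle diameter
    soft = binary.astype(float)
    for _ in range(iters):
        soft = mean_pool(soft, k)
    p = np.pad(soft, k // 2, constant_values=-1.0)
    neigh_max = np.max(
        [p[dy:dy + soft.shape[0], dx:dx + soft.shape[1]]
         for dy in range(k) for dx in range(k)],
        axis=0,
    )
    peaks = (soft == neigh_max) & (soft > 0)
    return peaks, soft
```

For a single $5\times 5$ square mask (particle radius $r=2$), two pooling passes suffice and the only surviving local maximum is the square's centre.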

To eliminate the negative influence of the poor segmentation accuracy achieved for edge voxels of subtomograms, an OT strategy is used in the inference stage. Taking the 2D segmentation case in Fig. 1h as an example and assuming that the image marked by the blue box is the output of the 3D-ResUNet model, only the centre region marked by the yellow box is considered during the inference stage to eliminate the poor segmentation of edge pixels. The width of the discarded border is determined by a hyperparameter termed 'pad_size'. Each tomogram is scanned with a specific stride $s$ and a subtomogram size of $N\times N\times N$ in the inference stage, where $s=N-2\cdot pad\_size$. Only the local maxima in the region $[pad\_size:N-pad\_size,\, pad\_size:N-pad\_size,\, pad\_size:N-pad\_size]$ are retained.
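Along one axis, the overlap-tile scan reduces to computing tile start offsets with stride $s=N-2\cdot pad\_size$. The sketch below uses our own function name, and the clamping of a final tile to cover the tomogram edge is our assumption, not a detail stated in the text.

```python
def tile_origins(dim, n, pad):
    """Start offsets for overlap-tile scanning along one axis of length `dim`
    with subtomogram size `n` and border width `pad`: stride s = n - 2*pad.
    A final tile is clamped so the whole axis is covered (our assumption)."""
    s = n - 2 * pad
    starts = list(range(0, max(dim - n, 0) + 1, s))
    if starts[-1] != dim - n:
        starts.append(dim - n)          # clamp last tile to the edge
    return starts
```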

To reduce background interference and avoid duplicates during particle detection, two further postprocessing operations are performed. First, local maxima below a threshold $t_{lm}$ are removed. Second, if the minimal Euclidean distance between two local maxima is lower than a specific threshold $t_{dist}$, the smaller local maximum is discarded.

### Metrics for picked particles

To compare the performance of DeepETPicker with that of other competing state-of-the-art methods, three performance metrics are used: precision $P$, recall $R$, and the $F_1$-score^{8,30}, which are defined as follows:

$$P=\frac{TP}{TP+FP}$$

(7)

$$R=\frac{TP}{TP+FN}$$

(8)

$$F_{1}=\frac{2\cdot P\cdot R}{P+R}$$

(9)

where $TP$, $FP$ and $FN$ stand for true positives, false positives, and false negatives, respectively. For a particle with a radius of $r$, its predicted label is considered positive if the Euclidean distance from its predicted centre to the ground truth is less than $r$; otherwise, it is considered negative. To measure the localization accuracies of particle-picking algorithms, the average Euclidean distance (AD) from the predicted particle centre to the ground truth is calculated in voxels.
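Eqs. (7)-(9) with the radius-based matching rule can be sketched as follows. The greedy one-to-one matching of predictions to ground-truth centres is our assumption (the text does not specify how double matches are resolved), and the function name is our own.

```python
import numpy as np

def picking_metrics(pred, gt, r):
    """Precision, recall, and F1 of Eqs. (7)-(9): a prediction is a true
    positive if it lies within radius r of a not-yet-matched ground-truth
    centre (greedy matching is our assumption)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    matched = np.zeros(len(gt), dtype=bool)
    tp = 0
    for p in pred:
        if not len(gt):
            break
        d = np.linalg.norm(gt - p, axis=1)
        d[matched] = np.inf                 # each GT centre matches once
        j = int(np.argmin(d))
        if d[j] < r:
            matched[j] = True
            tp += 1
    fp, fn = len(pred) - tp, len(gt) - tp
    P = tp / (tp + fp) if tp + fp else 0.0
    R = tp / (tp + fn) if tp + fn else 0.0
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F1
```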

For real experimental datasets without ground truths, to compare the authenticity and coordinate accuracy of the particles picked by DeepETPicker and other competing state-of-the-art pickers, we used the B-factor, global resolution, local resolution, log-likelihood distribution, and maximum value probability for evaluation.

The B-factor of a set of particles is computed from the Rosenthal-Henderson plot (RH plot)^{19}, which plots the inverse of the squared resolution against the logarithm of the number of particles. A higher B-factor means a lower slope of this plot, i.e., a larger number of particles is needed to reach the same reconstruction resolution.
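The fit itself is a one-line linear regression: with $1/d^2$ plotted against $\ln N$, the slope of the RH plot is $2/B$, so $B$ follows from its inverse. The sketch below is illustrative and uses our own function name and input conventions.

```python
import numpy as np

def b_factor(n_particles, resolutions_A):
    """Estimate the B-factor (in A^2) from a Rosenthal-Henderson plot:
    fit 1/d^2 against ln(N); the slope is 2/B, so B = 2/slope.

    `n_particles` are the particle counts, `resolutions_A` the global
    resolutions (in Angstrom) reached with each count.
    """
    x = np.log(np.asarray(n_particles, float))
    y = 1.0 / np.asarray(resolutions_A, float) ** 2
    slope, _ = np.polyfit(x, y, 1)          # highest-degree coefficient first
    return 2.0 / slope
```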

Another metric is the global 3D reconstruction resolution. By refining two models independently (one for each half of the data), the gold-standard Fourier shell correlation (FSC) curve is calculated^{30,31,32,33,34} using the following formula:

$$\mathrm{FSC}(k,\Delta k)=\frac{\mathrm{Real}\left(\sum_{(k,\Delta k)}F_{1}(\mathbf{K})\,F_{2}^{*}(\mathbf{K})\right)}{\left(\sum_{(k,\Delta k)}\left|F_{1}(\mathbf{K})\right|^{2}\sum_{(k,\Delta k)}\left|F_{2}(\mathbf{K})\right|^{2}\right)^{\frac{1}{2}}},\quad k=\left|\mathbf{K}\right|$$

(10)

where $\mathbf{K}$ is the spatial frequency vector and $k$ is its magnitude. $F_1(\mathbf{K})$ and $F_2(\mathbf{K})$ are the Fourier transforms of the reconstructions from the two independent halves of the dataset, and $F_2^{*}$ denotes the complex conjugate. The FSC$_{0.143}$ cut-off criterion^{31} is used to calculate the global resolution.
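Eq. (10) can be computed directly with NumPy FFTs by binning frequency vectors into integer shells. This is a minimal sketch (our own function name and shell-binning convention), not the gold-standard pipeline of the refinement software; identical half-maps give FSC = 1 on every shell.

```python
import numpy as np

def fsc_curve(vol1, vol2):
    """Fourier shell correlation of Eq. (10) between two half-map
    reconstructions of equal cubic shape, over integer-frequency shells."""
    f1, f2 = np.fft.fftn(vol1), np.fft.fftn(vol2)
    n = vol1.shape[0]
    freq = np.fft.fftfreq(n) * n                # integer frequency indices
    kx, ky, kz = np.meshgrid(freq, freq, freq, indexing="ij")
    k = np.rint(np.sqrt(kx**2 + ky**2 + kz**2)).astype(int)
    fsc = []
    for shell in range(n // 2):
        m = k == shell
        num = np.real(np.sum(f1[m] * np.conj(f2[m])))
        den = np.sqrt(np.sum(np.abs(f1[m])**2) * np.sum(np.abs(f2[m])**2))
        fsc.append(num / den if den else 0.0)
    return np.array(fsc)
```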

In addition to the global resolution, the local resolution is another commonly used metric for evaluating the reconstruction map^{35}, which can be calculated in different ways by ResMap^{35}, MonoRes^{36}, DeepRes^{37}, etc. In this study, we use the ResMap algorithm implemented in RELION^{17} to analyse the local resolution.

Furthermore, we propose two new metrics based on the Bayesian theory of subtomogram averaging implemented in RELION^{38}. The approach of RELION aims to find the model that has the highest probability of being correct given both the observed data and the available prior information. The optimization of this *a posteriori* distribution is called maximum *a posteriori* or regularized-likelihood optimization. For a given dataset of picked particles, after *a posteriori* maximization, each particle is assigned two estimated parameters: the log-likelihood, which quantifies its contribution weight to the final model, and the maximum value probability, which quantifies the accuracy of the particle parameter estimates (i.e., the orientation and the shift). The distribution of the number of particles versus the log-likelihood and the cumulative statistics of the number of particles versus the maximum value probability are used in this study to evaluate and compare the particles picked by different pickers.

If the authenticity of the picked particles is poor, i.e., more false-positive junk particles are picked, the SNR of the set of picked particles decreases, and worse subtomogram averaging with reduced local and global resolutions is expected. Furthermore, a larger number of particles would be needed to reach the same reconstruction resolution, so a higher B-factor would be expected even in cases where the local and global reconstruction resolutions are not sensitive enough to reveal the difference. It should be noted that if there is conformational heterogeneity in the specimen, the reconstruction resolutions, either local or global, may not be good indicators for evaluating different pickers; more rigorous investigations using, e.g., map inspection and 3D classification are needed.

More importantly, the authenticity and the coordinate accuracy of the picked particles can be assessed from the distributions of the particle log-likelihood and maximum value probability. For a picked particle that is a false positive or that deviates substantially from its true centre, a lower log-likelihood and a lower maximum value probability are calculated, down-weighting its contribution to the final subtomogram averaging. Therefore, a larger number of particles with high log-likelihood and maximum value probability indicates better coordinate accuracy and better authenticity of the picked particle set.

### Comparison among the particles picked by different methods

To compare the sets of particles picked by two different methods, a duplication-removal operation is performed to calculate their intersection and difference sets. Specifically, if the minimal Euclidean distance between two particles is lower than a specific threshold $t_{dist}$, which is normally set to half of the particle diameter, the two particles are considered the same. The intersection set contains particles picked by both methods, whereas the two difference sets contain particles picked by one method but not the other. For example, if we denote the set of particles picked by method *A* simply as *A* and the set of particles picked by method *B* simply as *B*, the particles in the intersection set $A\cap B$ are picked by both method *A* and method *B*. The particles in the difference set $A-B$ are picked by *A* but not by *B*, whereas the particles in the difference set $B-A$ are picked by *B* but not by *A*. Further explanations and illustrations of the intersection and difference sets are given in Supplementary Fig. 3.
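The set comparison can be sketched with a pairwise distance matrix. The function name and the convention of assessing each pick against its nearest neighbour in the other set are our own; the threshold `t_dist` plays the role described above.

```python
import numpy as np

def split_sets(a, b, t_dist):
    """Partition picked particles into A∩B, A−B, and B−A: two picks closer
    than t_dist (normally half the particle diameter) count as the same."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    in_both = d.min(axis=1) < t_dist if len(b) else np.zeros(len(a), bool)
    b_only = d.min(axis=0) >= t_dist if len(a) else np.ones(len(b), bool)
    return a[in_both], a[~in_both], b[b_only]   # A∩B (A's copies), A−B, B−A
```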

### Datasets used for performance benchmarking

The performance of DeepETPicker is benchmarked on both simulated and real cryo-ET tomograms from six datasets: SHREC2020, SHREC2021, EMPIAR-10045, EMPIAR-10651, EMPIAR-10499, and EMPIAR-11125. The DeepETPicker hyperparameters used for these datasets are summarized in Supplementary Table 1. For each of the four experimental EMPIAR datasets, the overall workflow is to manually label the selected particles, use the labeled particles for model training, and, finally, use the trained model to pick particles from all testing tomograms. Detailed information on how each dataset is partitioned for training, validation, and testing is provided in the Supplementary Methods.

SHREC2020 is a dataset of simulated cryo-ET tomograms^{8}. It consists of 10 tomograms of cell-scale volumes. Each tomogram contains 12 classes of protein particles that vary in size, structure, and function. Ranked by their molecular weights from small to large, the Protein Data Bank (PDB) codes of the 12 classes of protein particles are 1s3x, 3qm1, 3gl1, 3h84, 2cg9, 3d2f, 1u6g, 3cf3, 1bxn, 1qvr, 4cr2 and 4d8q. Tomograms 0 to 7 are used for training, tomogram 8 is used for validation and hyperparameter optimization, and tomogram 9 is used for testing. DeepETPicker takes tomogram voxels as its inputs. For each voxel, it outputs 13 probability scores that correspond to the 12 protein classes and the background, respectively.

SHREC2021 is another dataset of simulated cryo-ET tomograms^{15}. Compared to SHREC2020, some major updates were made to the simulation process. Gold fiducial markers and vesicles were added to provide realistic additional challenges. SHREC2021 consists of 10 tomograms of cell-scale volumes. Each tomogram contains 12 classes of protein particles that vary in size, structure, and function. Ranked by their molecular weights from small to large, the PDB codes of the 12 classes of protein particles are 1s3x, 3qm1, 3gl1, 3h84, 2cg9, 3d2f, 1u6g, 3cf3, 1bxn, 1qvr, 4cr2 and 5mrc. Tomograms 0 to 7 are used for training, tomogram 8 is used for validation and hyperparameter optimization, and tomogram 9 is used for testing. DeepETPicker takes tomogram voxels as its inputs. For each voxel, it outputs 15 probability scores that correspond to the 12 protein classes plus vesicles, gold fiducial markers, and the background, respectively.

EMPIAR-10045 is a real experimental cryo-ET dataset. It contains 7 tomograms of purified *S. cerevisiae* 80S ribosomes^{17}. Each tomogram contains an average of 445 manually picked particles. The original tomograms and manually picked particle coordinates are contained in the subdirectory of the EMPIAR entry. Based on the aligned tilt series, ICON^{39} is used to reconstruct tomograms with better contrast for particle picking (Supplementary Fig. 4a). To reduce the computational cost and to increase the SNR, the tilt series are downsampled 4× before performing ICON reconstruction so that the diameter of the 80S ribosome in the final tomogram is ~23-24 voxels. For particle picking and performance comparisons, four different methods are chosen: DeepETPicker, crYOLO^{18}, DeepFinder^{13}, and TM^{9}. TM is performed by Dynamo^{40} with a reference map from EMDB entry EMD-0732 low-pass filtered to 60 Å (see the tutorial http://wiki.dynamo.biozentrum.unibas.ch/w/index.php/Walkthrough_for_template_matching). A total of 150 manually labeled particles are used for training and validation of DeepETPicker, crYOLO and DeepFinder (see Supplementary Methods A.8). The tutorials (http://cryolo.readthedocs.io/en/stable/tutorials/tutorial_overview.html and https://deepfinder.readthedocs.io/en/latest/tutorial.html) provided for crYOLO and DeepFinder are followed for model training and particle picking. Based on the obtained coordinates of ribosome particles, subtomograms are directly extracted from the original tomograms. Subtomogram averaging is performed (Supplementary Fig. 4a) by following the reported protocol with the same parameters^{17}, including CTF estimation, particle extraction, 3D classification (with one class only) and 3D auto-refinement. The CTF model of each particle is generated using RELION scripts.

EMPIAR-10651 is a real experimental cryo-ET dataset of cylindrical T20S proteasomes from *Thermoplasma acidophilum*^{41}. It contains 3 tomograms of purified T20S proteasomes. Based on the aligned tilt series contained in the subdirectory of the EMPIAR entry, tomo3d is used to reconstruct the tomograms (Supplementary Fig. 4b). To reduce the computational cost and increase the SNR, the tilt series are downsampled 4× before performing tomo3d reconstruction so that the diameter of the T20S proteasome in the final tomogram is ~21 voxels. Similar to EMPIAR-10045, DeepETPicker, crYOLO^{18}, DeepFinder^{13}, and TM^{9} are chosen for particle picking and performance comparisons. TM is performed by Dynamo^{40} with a reference map from EMDB entry EMD-12531 low-pass filtered to 60 Å. A total of 142 manually labeled particles are used for training and validation of DeepETPicker, crYOLO and DeepFinder (see Supplementary Methods A.8). Similar to EMPIAR-10045, the model training and particle picking processes of crYOLO and DeepFinder are performed following the respective tutorials provided. Based on the obtained coordinates, subtomograms are extracted from the original tomograms. Then, subtomogram averaging is performed in RELION 2.1.0 (Supplementary Fig. 4b), including CTF estimation, particle extraction, 3D classification (with one class only) and 3D auto-refinement. The CTF model of each particle is generated using RELION scripts.

EMPIAR-10499 is a real experimental cryo-ET dataset of native *M. pneumoniae* cells treated with chloramphenicol^{42}. In this study, we focus on picking 70S ribosome particles from these in situ tomograms. Ten tomograms (TS_77, TS_78, TS_79, TS_80, TS_81, TS_82, TS_84, TS_85, TS_87 and TS_88) from this dataset are selected for particle picking and verification purposes (Supplementary Fig. 4c). CTF estimation and motion correction are performed on the original movie stacks using Warp 1.0.9^{43}, and the tilt series, as well as the tilt angle files, are imported into IMOD 4.9.12^{44} for tilt alignment and tomogram reconstruction using the weighted back-projection algorithm with a radial filter cut-off of 0.35 and a fall-off of 0.05. To reduce the computational cost and increase the SNR, the reconstructions are downsampled 4× so that the diameter of the 70S ribosome in the final tomogram is ~23-24 voxels. Again, DeepETPicker, crYOLO^{18}, DeepFinder^{13}, and TM^{9} are chosen for particle picking and performance comparisons. TM is performed by Dynamo with a reference map from EMDB entry EMD-21562 low-pass filtered to 60 Å. A total of 117 manually labeled particles are used for training and validation of crYOLO and DeepETPicker, and 703 particles are used for training and validation of DeepFinder (see Supplementary Methods A.8). Finally, RELION 2.1.0 (Supplementary Fig. 4c) is used to perform subtomogram averaging, including CTF estimation, particle extraction, 3D classification (with one class only) and 3D auto-refinement. The CTF model of each particle is generated using RELION scripts. The local resolution is directly calculated using RELION 2.1.0.

EMPIAR-11125 is an experimental cryo-ET dataset of *H. neapolitanus* alpha-carboxysomes^{20}. Three stacks (CB_02, CB_29, CB_59) are available from its EMPIAR entry for particle picking and verification purposes (Supplementary Fig. 4d). CTF estimation and motion correction are performed on the original movie stacks using Warp 1.0.9^{43}. Tilt alignment is performed using Dynamo (https://github.com/alisterburt/autoalign_dynamo). To reduce the computational cost and increase the SNR, the reconstructions produced by Warp are downsampled 8× and then used for particle picking so that the diameter of the alpha-carboxysome in the final tomogram is ~13 voxels. Again, DeepETPicker, crYOLO^{18}, DeepFinder^{13}, and TM^{9} are chosen for particle picking and performance comparison purposes. TM is performed by Dynamo with a reference map from EMDB entry EMD-27654 low-pass filtered to 60 Å. A total of 571 manually labeled particles are used for training and validation of crYOLO, DeepETPicker and DeepFinder (See Supplementary Methods A.8). Due to memory constraints, the final reconstructions are performed using 2× downsampled data in Warp. Then, RELION 3.1 beta is used for the subsequent subtomogram averaging step, including 3D classification (with one class only) and auto-refinement (Supplementary Fig. 4d).

### Statistics & reproducibility

In this study, 10 independent random replicates were produced on the SHREC2021 dataset. A randomized two-sample t-test (rndttest2) and a nonparametric rank-sum test were performed for statistical comparisons. For real tomograms, sample sizes were determined through experimental validation; no statistical method was used to predetermine sample sizes. DeepETPicker was evaluated across 4 experimental datasets; see detailed information in Supplementary Table 19. No data were excluded from the analysis. The material for reproducing the results in the figures and Supplementary Figs. is available in the Source Data file and the tutorials at https://github.com/cbmi-group/DeepETPicker.

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

- Source: https://www.nature.com/articles/s41467-024-46041-0