As depicted in Fig. 2, this paper focuses on three key biosecurity problem statements involving engineered DNA: (1) detecting the presence of unknown engineering in sampled DNA, (2) characterizing the function and purpose of engineered DNA, and (3) attributing engineered DNA to its origin. Analysis methods that solve these problems will be essential to formulating effective countermeasures to maliciously bioengineered agents, as the unique customization possible within the vast potential threat landscape necessitates equally tailor-made responses. In reviewing relevant previous work, the most critical observation is that adaptability is necessary for biosecurity to keep pace with new scientific developments and fit real-world needs.
Detection
In right-of-boom scenarios, being able to identify that engineered organisms or viruses exist in key clinical or ecological environments is an essential first line of defense27. This need arises in both reactive and proactive cases where sequencing can already be employed to characterize natural DNA. Reactive cases involve DNA sequencing in response to an observable change, such as in diagnosing visibly sick patients28 or troubleshooting low crop yields. Proactive cases, on the other hand, rely on routine screening events such as water or food supply inspection29. Proactive scenarios pose a somewhat more difficult detection problem, since signs of DNA engineering cannot yet be tied to any noticeable problems. Still, any situation where the presence of unknown engineering is detected raises immediate red flags and both prompts and enables further analysis. The detection problem statement in a more general form predates the modern synthetic biology landscape: concerns about engineered bacteria can be found in the literature dating back to the 1980s30, with similar questions asked in the intervening decades31. It has also been tackled by researchers interested in detecting genetically modified organisms (GMOs) in agricultural yields for the benefit of consumers wary of such products32,33,34. The problem is nevertheless very much unsolved due to challenges faced at several stages. First, accurate sequencing reads must be captured from noisy environments; then, abnormal reads associated with potentially engineered genomes must be filtered out of the larger dataset; and finally, the abnormal reads must be analyzed to determine whether they contain engineered sequences. Moreover, the vast variety of potential environments to survey includes several types of human tissue, wastewater, soil, food supplies, and more35. Each unique environment features different physical challenges and organisms, further complicating the problem and hindering a universal solution.
A recent Intelligence Advanced Research Projects Activity (IARPA) program named Finding Engineering-Linked Indicators (FELIX) focused heavily on the detection problem, with methods varying between groups36. Some work funded by FELIX aimed to improve the first step by prototyping an advanced portable sequencing kit that can amplify markers of genetic engineering37. However, this work is currently unpublished, and this step of the problem overall needs much more focus. The second step, identifying rare abnormal reads in an enormous dataset, is tied to metagenomics, an adjacent but highly relevant field of research. Metagenomics focuses on accurately distinguishing between the sources of different DNA reads and sequences when sampling a broad environment rather than a controlled subject38, forming a critical first step in identifying potentially engineered DNA. Most metagenomic studies focus on ecological diversity and microbiome composition, but a significant portion of this work is nevertheless closely linked and could be adapted to biosecurity purposes39,40. For example, several instances of clinically inspired metagenomics work involve detecting pathogens from patient tissue samples using next-generation sequencing (NGS) data, in place of the more traditional biopsies or chemical tests used to diagnose certain pathogen-caused illnesses28,41,42,43. Metagenomics is thus capable of identifying unusual DNA sequences with high sensitivity and specificity within clinically relevant samples43. In a future where such a flagged sequence may signify an engineered agent of biological attack, this is extremely valuable to biosecurity. However, more investment is needed to enable accurate metagenomic analyses in as many unique sampling environments as possible where threats could be found35.
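The read-binning step that metagenomics contributes can be illustrated with a minimal sketch. This is not any FELIX or published pipeline; the k-mer size, the cosine-style similarity, and the nearest-profile assignment rule are all arbitrary illustrative choices. The idea is simply that each read is assigned to whichever reference organism has the most similar k-mer frequency profile, and reads that match nothing well could be set aside as candidates for engineering analysis.

```python
from collections import Counter

def kmer_profile(seq, k=4):
    """Normalized k-mer frequency profile of a DNA sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def similarity(p, q):
    """Cosine similarity between two k-mer profiles (0 when no overlap)."""
    shared = set(p) & set(q)
    num = sum(p[m] * q[m] for m in shared)
    norm = (sum(v * v for v in p.values()) ** 0.5) * \
           (sum(v * v for v in q.values()) ** 0.5)
    return num / norm if norm else 0.0

def assign_read(read, references):
    """Assign a read to the reference with the closest k-mer profile."""
    profile = kmer_profile(read)
    return max(references, key=lambda name: similarity(profile, references[name]))
```

In practice one would also record the winning similarity score, so that reads resembling no reference at all can be flagged rather than force-assigned.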
Fine-tuning of the thresholds used to flag suspicious reads is also necessary to avoid both an unmanageable number of false positives to review and false negatives that are silently ignored, in what will always be an extremely noisy sample44. When fully developed to be adaptable to numerous contexts, metagenomic methods can serve as a key initializing step in a flagging protocol that triggers more detailed examination and whole-genome sequencing and analysis of suspicious samples45. This produces an ideal starting dataset of suspicious reads with maximum context and minimum contamination for downstream computational approaches46,47.
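The threshold tradeoff can be made concrete with a toy sketch, assuming each read has already been given a numeric anomaly score by some hypothetical upstream classifier (an assumption for illustration, not drawn from the cited work). Raising the threshold suppresses false positives at the cost of false negatives, and vice versa:

```python
def flag_reads(scores, threshold):
    """Indices of reads whose anomaly score meets the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

def tradeoff(scores, labels, threshold):
    """Count false positives and false negatives at a given threshold.

    labels[i] is True if read i is genuinely engineered (ground truth,
    available only in evaluation settings).
    """
    flagged = set(flag_reads(scores, threshold))
    fp = sum(1 for i, eng in enumerate(labels) if i in flagged and not eng)
    fn = sum(1 for i, eng in enumerate(labels) if i not in flagged and eng)
    return fp, fn
```

Sweeping the threshold over a labeled benchmark and plotting the resulting (fp, fn) pairs is the standard way to pick an operating point suited to the review capacity available.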
Other FELIX work focused on these downstream computational approaches in a variety of contexts and organisms. One group demonstrated the use of simulation-trained neural networks to detect genetic engineering in the relatively malleable genomes of model prokaryotes48. This approach targets prokaryotes because they have less complex epigenetic and regulatory interactions than multicellular eukaryotes, such as humans, making a large quantity of possible edits feasible. The authors chose a neural network trained to scan for numerous unique genetic patterns at once and confidently identify the presence of specific types of edits. These edits are representative of what various labs might do, such as inversion of genetic regions, full and partial gene deletions, and insertion of fluorescent protein genes48. Other work focused specifically on yeasts, important model organisms in genomic work and biomanufacturing applications49. One research group built a bioinformatics pipeline known as Prymetime that can assemble yeast genomes from whole-genome sequencing (WGS) reads and simultaneously detect and annotate signs of genetic engineering. This pipeline allows a user to evaluate whether a given sample may have been engineered by an unknown party to either test some form of eukaryotic genetic design or produce a desired substance49. While recent work in the detection of engineered DNA is very promising, some shared limitations highlight the most pressing needs in this space. These limitations broadly manifest as over-specificity: rigid methods typically work only for specific organisms, require specific assumptions about the sample to return accurate results, and, perhaps most importantly, can only detect specific kinds of edits with known key features. The next big advances in this challenge should manifest in the form of more flexible computational tools that can be continuously updated and expanded.
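The simulation-trained strategy described above depends on generating labeled training examples of representative edits. A minimal sketch of such a generator, treating genomes as plain strings and using arbitrary size parameters (this is an assumption for illustration, not the actual data-generation method of ref. 48), might look like:

```python
import random

def revcomp(seq):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[b] for b in reversed(seq))

def invert_region(genome, start, end):
    """Invert (reverse-complement) a genomic region in place."""
    return genome[:start] + revcomp(genome[start:end]) + genome[end:]

def delete_region(genome, start, end):
    """Remove a region entirely (a full or partial gene deletion)."""
    return genome[:start] + genome[end:]

def insert_payload(genome, pos, payload):
    """Insert a foreign cassette, e.g. a fluorescent-protein gene."""
    return genome[:pos] + payload + genome[pos:]

def simulate_edit(genome, rng=random):
    """Apply one random edit and return (edited_genome, label)."""
    kind = rng.choice(["inversion", "deletion", "insertion"])
    a = rng.randrange(len(genome) // 2)
    b = a + rng.randrange(10, 50)  # arbitrary edit-size range
    if kind == "inversion":
        return invert_region(genome, a, b), kind
    if kind == "deletion":
        return delete_region(genome, a, b), kind
    return insert_payload(genome, a, "ATG" + "GCA" * 10 + "TAA"), kind
```

Pairs of (edited genome, label) produced this way are the kind of supervised data a detection network can be trained on before being applied to real sequencing reads.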
An example of such an approach is the tool GUARDIAN, created by several of the same researchers behind Prymetime50. GUARDIAN incorporates a collection of detection methods and combines their results to reach a more reliable consensus on whether a sample is engineered50. Although this approach still has some limitations, such as only being able to reliably detect genetic inserts, the modularization of the individual components in GUARDIAN and the standardization of how their results are communicated enable further extension, modification, and incorporation of other methods going forward50. This is an example of the kind of adaptable approach necessary to match the rapid progress in biotechnology and in biosecurity policy going forward.
Characterization
Determining the purpose or function of engineered DNA detected in the environment, or of any DNA being ordered for synthesis, is necessary for accurate threat assessment. Current DNA screening methods focus on identifying hazardous or pathogenic sequences, the most immediate and dangerous threats. However, some engineered DNA may not immediately appear to encode any destructive or adversarial function. Characterization also includes a more advanced understanding of the genetic context in which the DNA operates. These subtler aspects of characterization are important when considering that some biological threats, such as a pathogen with an asymptomatic phase, could propagate without causing significant immediate impact. The simplest way to understand any whole is to understand each of its parts; applying this logic to engineered biology, identifying specific genetic parts in an engineered sequence is an essential starting point. This facet of the characterization problem is manifest in DNA screening: sequences submitted for synthesis must be rigorously examined to determine whether they encode anything particularly dangerous. Detecting the parts used in the construction of engineered DNA can critically highlight dangerous functions. This can be accomplished in several ways, the most basic being to simply run a full-scale best-match search using algorithms such as BLAST against a list of pathogen genomes18. However, it is important to have more precise and extendable methods, which can identify specific genetic parts associated with virulence or pathogenicity while filtering out large, mostly harmless sequences, and then utilize these annotated parts in downstream analysis. The foundations for such methods are likely to come from the synthetic biology community itself, because designers already have a vested interest in convenient tools for identification and annotation of specific genetic parts51.
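The basic best-match idea can be sketched with exact window matching against a hazard list. Real screening pipelines use local alignment tools such as BLAST rather than this toy exact-match index, and the 20-bp window here is an arbitrary illustrative choice, but the sketch shows the shape of the lookup: precompute an index of hazardous subsequences, then scan each synthesis order against it.

```python
def build_hazard_index(hazard_seqs, window=20):
    """Index every window-length substring of known hazardous sequences."""
    index = set()
    for seq in hazard_seqs:
        for i in range(len(seq) - window + 1):
            index.add(seq[i:i + window])
    return index

def screen_order(order_seq, hazard_index, window=20):
    """Positions in an ordered sequence that match a hazard window."""
    return [i for i in range(len(order_seq) - window + 1)
            if order_seq[i:i + window] in hazard_index]
```

An empty result means no exact hazard window was found; a real screen must additionally tolerate mismatches and synonymous recoding, which is exactly where exact matching breaks down and alignment or variant-prediction methods become necessary.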
One notable open-source example of such a tool is PlasMapper52. Originally published in 2004, it has undergone significant revisions, with the most recent version, 3.0, published in 2023. Focused on plasmid design, its goals are oriented around clear web-based annotation and visualization of plasmid sequences to help users understand and identify key elements in their cloning vectors52. Other tools built around slightly different goals can produce annotated files in the GenBank or Synthetic Biology Open Language (SBOL) formats, which are more robust at capturing key sequence information51,53. These tools can be adapted to focus specifically on identifying potentially dangerous genetic parts and form a framework for detailed computational analysis of the offending sequence.
As the primary focus of current screening approaches, the characterization problem also needs the most work dedicated to preventing deliberate subversion, such as through de novo proteins or the masking of malicious coding sequences. One novel and efficient DNA screening method that is open-source and explicitly tackles the risk of malicious actors deliberately obfuscating the dangerous elements of their sequences is random adversarial threshold (RAT) screening54. RAT screening demonstrated effectiveness at stopping theoretical malicious synthetic biologists from sneaking past screening in a “red team” simulation. Notably, this method relies on pregenerated predictions of potential variants and subsets of dangerous coding sequences. A key limitation is that if the capability of the “red team” to accurately modify sequences without losing functionality significantly exceeds the capability of the biosecurity protocol they are trying to breach, then their chances of success improve significantly54. This observation further underscores that biosecurity methods need to be built on a foundation that is continuously adaptable to the latest discoveries and approaches, but it also reinforces the positive point that knowledge is power, and advances in biodesign naturally lead to potential advances in biosecurity.
In the particular case of a dangerous de novo protein design, characterization analysis bears the burden of not only identifying the coding sequence within the DNA but also attempting the extraordinarily difficult task of determining what the protein actually does. Sequence-to-function prediction faces many challenges even when only looking at natural proteins, as the most straightforward approach of homology-based methods encounters numerous difficulties due to small sequence or structure differences yielding tremendously different functions55. Characterizing de novo designed proteins adds yet another layer of complexity, especially as they can now be generated by AI approaches11,20. One proposed way to simplify this problem enormously is to ensure that all de novo protein design is monitored and cataloged, with screening databases continuously updated accordingly56. However, as with screening itself, achieving this level of consensus regulation could prove highly challenging. When directly faced with the problem itself, some neural network approaches, particularly convolutional neural networks (CNNs), have shown promise in predicting de novo structure and function with reasonable accuracy57. One such tool is DeepGOPlus, which notably can be tweaked to search based only on motifs, enhancing its potential for analyzing relatively unknown or novel sequences58. However, it does not account for protein interaction networks, which could limit its ability to identify a protein engineered to target a specific pathway or binding site58. Once again, as AI methods evolve in general to enable superior threat design by generating sequence from function, they should also enable superior threat analysis by generating function from sequence. However, this field still has a long way to go, as existing methods have limitations and are not optimized for the specific task of detecting potential engineered threats.
In some cases, direct characterization of the resulting protein products, such as by examination of a patient infected by a biological threat, will also be necessary.
Beyond recognizing coding sequences of concern, another helpful part of characterizing suspicious engineered DNA is to understand its functional details. A useful analogy here is to improvised explosive devices (IEDs), contemporary security threats made prevalent by the abundance of chemistry and instrumentation knowledge and materials. IEDs can be built using electrical circuits that trigger under certain criteria. In the future, a biological threat could similarly be configured using increasingly advanced synthetic biology methods, but simply identifying the circuit in question is significantly more difficult than the IED equivalent of physical examination. Computational methods exist to predict genetic circuit structure from a sequence51, and these can be further sharpened with a focus on dangerous sequences. However, in the absence of rigid test parameters representing fully characterized phenomena that anchor simulations in other disciplines, simulation of the phenotype associated with unknown genetic engineering can be unreliable in even the highest-quality in silico methods. As with the analysis of protein outputs, experimentally evaluating the properties of unknown DNA can elucidate genetic circuit mechanisms. The ideal approach here is akin to testing an electrical circuit with specific inputs and outputs in order to collect practical data on function59. However, it is very difficult to design a workflow or platform for this that is applicable in multiple contexts, especially unknown contexts, because divergent evolution has led to an enormous number of incompatibilities in the interactions between different biomolecules. Furthermore, it is difficult simply to control the exact inputs of such an experiment precisely while minimizing confounding variables60.
One notable example of carefully controlling inputs and measuring outputs in a genetic circuit involved adding light-controlled promoters to a genetic circuit and characterizing the resulting optogenetic circuit behavior based on analog light input signal strength59. In an analogy to electrical circuits, light-controlled oscillation generated waveform outputs, mirroring the use of a function generator and oscilloscope59. Other work implemented a cell-free system to induce circuit behavior independent of confounding factors from a live cell60. This significantly reduces the complexity of the experimental model for the circuit, while preserving the ability to test how varying certain key parameters influences gene expression output. The cell-free platform was emphasized as being a biological equivalent to a breadboard, as opposed to a function generator and oscilloscope, elements that are often used in tandem to test electrical circuit designs60. Each of these two platforms is focused on facilitating design, but the concepts used also have the potential to be applied to reverse engineering such designs. Both papers were published in 2014, before numerous recent advances in synthetic biology and design. Although both have been highly cited in optogenetic and cell-free research, respectively, the feasibility of applying these concepts to characterize an initially unknown genetic circuit, rather than testing and iterating on a purpose-built circuit, remains unexplored.
Finally, a greater understanding of the context surrounding suspicious DNA sequences could plausibly be elucidated by identifying the methods by which it was designed and assembled. This is, however, an extremely difficult task, as increasingly popular and accessible methods like Gibson assembly tend not to leave noticeable scars61, and the best that can be done is to try to identify certain areas that are associated with enabling certain kinds of edits; for example, the PAM sites specific to various Cas9-based platforms62. Furthermore, there is a lack of existing work focused on reliably demonstrating the ability to identify methods of assembly. This is a risky avenue of work, as advances made in it are more likely to be overly specific and swiftly obsoleted by newer methods. Biosecurity approaches deriving from synthetic biology design tools, as discussed above, are more promising, as they can be more easily extended alongside developments in the original tools.
Attribution
Determining individual and sometimes vague characterizations of engineered DNA does not necessarily inform biosecurity experts on how to counteract a possible threat. Instead, detective work can potentially yield more conclusive results by discovering the origin of a suspicious sequence. By scanning not only for overall build approaches but also for smaller details like promoter choice that are often innocuous in isolation, these small associations and clues can collectively form a best-guess picture of who engineered the sequence in question63. This again can be compared to the case of IED threats, where certain patterns in the construction of devices can be considered hallmarks of a particular individual or organization. This problem is not necessarily a follow-up to the characterization problem, but rather one that can be tackled in tandem. Improvements in and insights gained from characterization can narrow down some of the detective work involved in attribution. At the same time, attribution can indirectly lead to better characterization: correctly identifying a creator can immediately deepen understanding of the nature of engineered DNA when considered against the history of the creator’s work. However, attribution comes with extreme sensitivity risks, as false accusations can inflame tensions and increase mistrust. Specific controls in DNA sequences explicitly designed to validate the identity of the creator can be extremely useful for avoiding these situations.
Recent work has demonstrated an important foundational step in detecting the lab-of-origin of an unknown sample63. This approach involves training a deep neural network to categorize engineered sequences. The training and validation datasets were taken from Addgene sequence databases, beginning by selecting labs with a significant number of publicly available sequences and then randomly selecting sequences from each lab for either training or validation. The authors demonstrated that their trained neural network could include the true lab-of-origin in its top 10 predictions more than half of the time, a reasonable accuracy standard that, with improvement, could be of great help in biosecurity63. Other work has sought to advance neural network approaches by incorporating additional features of sequences to categorize, like phenotypic metadata64,65. These methods are effective and possibly highly future-proof as machine learning develops in general, but they could run into application issues because it may be difficult to determine exactly why a given lab-of-origin is highly predicted. An alternative approach proposed an algorithmic solution to the lab-of-origin problem in place of neural networks66. This tool, dubbed PlasmidHawk, also utilized the portfolios of labs with a high number of publicly available Addgene sequences. However, PlasmidHawk focuses on aligning a test sequence to a highly expansive pan-genome assembled from all synthetic sequences in the training dataset, and then identifying the most likely lab-of-origin based on the greatest number of successful alignments of significant sub-sequences of the test sequence. The authors reported greater overall prediction accuracy than neural network approaches and were able to expand upon their analysis of predictions instead of having to deal with an intrinsic black box66.
However, their work may also be more susceptible to becoming outdated as engineering methods evolve and neural network approaches are able to more effectively compensate by utilizing additional parameters64.
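The alignment-counting strategy behind PlasmidHawk can be caricatured with a toy sketch that scores each candidate lab by how many windows of a query sequence appear verbatim in that lab's published plasmids. The real tool aligns against an assembled pan-genome rather than scanning raw sequences, and the window size and scoring here are illustrative assumptions, but the sketch captures why the approach is interpretable: every point in a lab's score corresponds to a concrete shared subsequence that can be inspected.

```python
from collections import Counter

def shared_windows(query, lab_seq, window=12):
    """Count windows of the query that appear verbatim in a lab sequence."""
    lab_windows = {lab_seq[i:i + window]
                   for i in range(len(lab_seq) - window + 1)}
    return sum(1 for i in range(len(query) - window + 1)
               if query[i:i + window] in lab_windows)

def rank_labs(query, lab_portfolios, window=12):
    """Rank candidate labs by their best-matching portfolio sequence."""
    scores = Counter()
    for lab, seqs in lab_portfolios.items():
        scores[lab] = max(shared_windows(query, s, window) for s in seqs)
    return scores.most_common()  # [(lab, score), ...] best first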
As multiple approaches to the lab-of-origin problem with different strengths and weaknesses continue to be developed, it is highly plausible that, as with traditional forensic work, the results of multiple analysis tools and tests can be used together to determine a most likely culprit, much like how the detection problem can be tackled using a combination of tools. However, fundamental limitations to all of these methods can still impede an investigation. For example, relying on sources like Addgene for designer patterns is a flawed assumption in the real world, as malicious actors can reasonably be expected not to publish their work in such public resources. Moreover, the true creator of a sequence can easily be masked by methods such as swapping a less important genetic part for a largely equivalent part primarily used by, and thus strongly associated with, another lab63. This would be an easy way to create a false red flag and frame others. It is precisely for this reason that the attribution problem includes not only the tracking of malicious actors, but also the verification of proper ones67. There are existing standards and methods for ensuring that labs working with particularly hazardous biological materials are trustworthy68, and these could be extended to aid the logistical side of biosecurity attribution.
User verification methods analogous to those of other security fields are enabled by the ability to apply cryptography and its methods to DNA data69. In particular, digital signatures can be implemented in non-coding DNA for the purposes of proper attribution; a specific section of sequence can be used to validate authorship by a particular person or organization70. This can lead to two distinct advantageous flagging scenarios in biosecurity. First, if a DNA signature appears in a sequence not claimed by the author of the signature, then the sequence may have been stolen or otherwise misused. Second, if a trusted author submits a sequence that does not contain their signature or contains a corrupted signature, this indicates that their computer system may have been compromised by an outside cybercriminal using their credentials to synthesize a threat, a notable novel angle of biosecurity attack26. This can help resolve and expedite security concerns associated with researchers conducting properly supervised and safe research on dangerous biomaterials. It also simultaneously provides a quick way to identify engineered sequences of very serious concern, as the stealing or spoofing of a sequence is behavior likely to be associated with a malicious actor. There exist, however, some technical barriers to the technology, such as the potential loss of function associated with inserting signatures that are necessarily hundreds of base pairs long, and the risk of mutations compromising the integrity of the signature. Additional work has sought to create DNA signature methodology that reduces these limitations, but it still experienced some signature validation failures in experiments due to factors such as low sequencing quality leading to failed assembly70. As there can be serious consequences to getting even a single detail wrong in cryptography, more work to expand upon these methods and ensure near-100 percent reliability could increase the viability of this approach.
However, there are also fundamental external barriers against cryptographic signature verification to consider, such as arguments over copyright and IP protection of genetic parts versus open source and the reproducibility of work that could benefit science as a whole71.
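The core mechanics of a DNA-embedded signature can be sketched by encoding a keyed hash into the four bases and appending it to a designated non-coding region. Note that this is an illustrative simplification: published schemes70 use asymmetric digital signatures rather than the shared-secret hash assumed here, and real implementations must also survive sequencing error, which this sketch deliberately does not address.

```python
import hashlib

BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}

def sign(sequence, secret):
    """Derive a 128-base DNA tag from a keyed SHA-256 hash (toy scheme)."""
    digest = hashlib.sha256((secret + sequence).encode()).digest()
    bits = "".join(f"{byte:08b}" for byte in digest)  # 256 bits
    return "".join(BITS_TO_BASE[bits[i:i + 2]]
                   for i in range(0, len(bits), 2))   # 2 bits per base

def embed(sequence, secret):
    """Append the signature tag, standing in for a non-coding region."""
    return sequence + sign(sequence, secret)

def verify(tagged, secret, tag_len=128):
    """Recompute the tag over the body and compare to the embedded tag."""
    body, tag = tagged[:-tag_len], tagged[-tag_len:]
    return sign(body, secret) == tag
```

Verification fails for a wrong key or for any single-base mutation in the tag, which illustrates both the attribution value and the fragility discussed above: without error-tolerant encoding, natural mutation alone can void an otherwise legitimate signature.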
The issue of author verification can also be tackled from a more traditional cybersecurity standpoint. RAT screening is part of the initiative behind, and has been incorporated into, SecureDNA, a platform designed to facilitate universal, efficient, and effective biosecurity screening72. SecureDNA includes significant consideration for the privacy of users, employing cryptographic techniques to minimize the risk of potential trade secrets being leaked while still ensuring that thorough hazard screening is conducted. It also critically contains provisions for users with verified credentials and authorization to work with hazardous biomaterials to efficiently bypass the flags that their sequences will raise when screened72. This is an example of a useful early step in securing the biodesign process from a baseline computational level, but there is still a serious lack of widespread adoption of such methods.
A final concern about the creator of a sequence involves the use of AI by a designer without significant biological expertise. Currently, AI in biodesign is primarily used by experts in areas like de novo protein design11,20 and metabolic engineering73,74,75,76. However, AI use could in the future reach a point where an individual could gain significant knowledge about biosecurity weaknesses from LLMs19 and generate complete genetic designs from vague initial specifications77,78. This dramatically lowers the barrier to entry and raises the risk of an uninformed individual creating and ordering something dangerous, perhaps without them even realizing it. If AI is relied upon to generate entire circuits, it may also plausibly do so by referencing machine-accessible data from literature and public databases, thus producing attribution patterns that resemble those of existing, legitimate researchers. Tools specifically oriented toward detecting that AI was involved in the design of a DNA sequence could inform regulators about the risks of AI use in biodesign, as well as help determine the capabilities and motivations of the human behind the design. In other fields where the use of AI has led to controversy, including education and the arts, it has been found that separate neural networks trained to detect work produced by AI can be fairly accurate at doing so, though this could easily change79. Should AI tools evolve to the point that even an uninformed individual could use them to engineer dangerous DNA sequences, biosecurity researchers should probe whether such classification results also hold when the work manifests as DNA sequences, and do so continuously to remain thoroughly aware of current capabilities.
- Source: https://www.nature.com/articles/s41467-024-55436-y