Paralog meta-analysis
To reanalyze the data from the 5 paralog screens, raw read counts were downloaded, and the same pipeline was applied to all of them. A pseudocount of 5 reads was added to each construct in each replicate, and total read counts were normalized to 500 reads per construct. Log2 fold change (LFC) for each guide at late time point was calculated relative to the plasmid sequence counts.
The data from each study (except Thompson) were divided into three groups; the constructs that target single genes paired with non-essential/non-targeting gRNAs (N) in the first position (gene_N), in the second position (N_gene) and constructs that target gene pairs (A_B). LFC values of each group were scaled individually so that the mode of each group was set to zero. Next, all three groups were merged in one table. Before dividing Ito’s dataset into three groups, LFC values were scaled such that the mode of negative controls (non-essential_AAVS1) would be zero and also TRIM family was removed from this dataset to avoid false paralog pair discovery13. Since in Thompson’s study there was just one position for singleton constructs, LFC values were scaled so that the mode of negative controls (non-essential_Fluc) was set to zero. In the next step, LFC of each construct was calculated by the mean of LFC across different replicates.
To calculate genetic interaction, single gene mutant fitness (SMF) was calculated as the mean construct log fold change of gene-control constructs for each gene. The control was either non-essential genes or non-targeting gRNAs. For each gene pair, the expected double mutant fitness (DMF) of genes 1 and 2 was calculated as the sum of SMF of gene 1 and SMF of gene 2. The difference between expected and observed DMF, the mean LFC of all constructs targeting genes 1 and 2, was called dLFC.
Next step was calculating a modified Cohen’s D between observed and expected distribution of LFC of gRNAs targeting genes. Expected distribution of gRNAs targeting a gene pair, was calculated using expected mean and expected standard deviation (std).
$${expected},{mean}=mu 1+mu 2$$
(1)
$${expected},{stand},{deviat}=sqrt{{({std}1)}^{2}+{({std}2)}^{2}}$$
(2)
$${S}_{{pooled}}=frac{sqrt{{({expected} , {std})}^{2}+{({observed} , {std})}^{2}}}{2}$$
(3)
$${Cohe}{n}^{{prime} }{sD}=frac{{expected},{mean}-{observed},{mean}}{{S}_{{pooled}}}$$
(4)
Where μ1 = mean LFC of gene1 constructs, μ2 = mean LFC of gene2 constructs, std1 = standard deviation of LFC of gene1 constructs, and std2 = standard deviation of LFC of gene2 constructs.
In each cell line, the paralog pairs with dLFC < −1 and Cohen’s D > 0.8 were selected as hits. Cohen’s D > 0.8 indicates large effect size between two groups, meaning that our expected and observed distribution of gRNAs are meaningfully separated. In total 388 paralog pairs were identified as hits across all the studies.
To identify the most consistent method in terms of hit identification, the Jaccard similarity coefficient of every pair of cell lines in each study was calculated by taking the ratio of intersection of hits over union of hits. For the studies that screened more than two cell lines, the final platform weight was the median of the calculated Jaccard coefficients of all pairs of cell lines.
$$Jleft(A,, Bright)=frac{left|Acap Bright|}{left|Acup Bright|}=frac{left|Acap Bright|}{left|A{{{{{rm{|}}}}}}+left|Bright|-{{{{{rm{|}}}}}}Acap Bright|}$$
(5)
To score paralog pairs, each hit was scored based on the cell lines in which it was identified as a hit; cell lines were weighted based on the platform weight described above. We defined the “paralog score” as the sum of platform weights of cell lines in which the paralog pair was identified as a hit minus the sum of platform weights of cell lines in which the paralog pair was assayed but not identified as a hit (a “miss”). The distribution of scores is shown in Fig. 1. Gene pairs with paralog score > 0.25 and were identified as a hit in two or more studies were listed as candidate gold standard paralog synthetic lethals.
One-component CRISPR/enCas12a vector
To construct an all-in-one vector for expression of both Cas12a and a guide array, we first swapped in puromycin resistance in place of blasticidin resistance from pRDA_174 (Addgene #136476). We then tested four locations for the insertion of a U6-guide expression cassette; notably this cassette needs to be oriented in the opposite direction of the primary lentiviral transcript to prevent Cas12a-mediated processing during viral packaging in 293 T cells. The construct with the best-performing location, between the cPPT and the EF-1α promoter, was designed pRDA_550 (Addgene #203398). Synthesis of DNA and custom cloning was performed by Genscript.
7mer library production
An oligonucleotide pool consisting of 7 Essential and 7 Non-Essential gene crRNAs with their nearby DR, BsmBI recognition as well as overhang sequence was synthesized by Integrated DNA Technologies. The pool was amplified by asymmetric PCR followed by being assembled into PRDA_550 vector to acquire the designed library through NEBridge® Golden Gate Assembly Kit (BsmBI-v2) (New England Biolabs). The assembled product was transformed into NEB® Stable Competent E. coli (High Efficiency) cells and the plasmid DNA was purified using the PureLink™ Plasmid Purification Kit (Invitrogen). Three oligonucleotide pools were cloned separately and pooled together to acquire the final 7mer library. The library was sequenced to confirm uniform and complete library representation.
Paralog selection for In4mer/Inzolia
Human paralogs and percent identity data were imported from BioMart, which reports both AB and BA percent identity (these can differ if the two genes encode proteins of different lengths) Mean percent identity ((AB + BA)/2)and delta percent identity (|AB−BA|) between paralogs were then calculated, and for the prototype library, paralogs with mean percent identity between 30% and 99% and delta percent identity <10% were selected (Supplementary Fig. 5). Next, CCLE expression data was downloaded, and the mean and standard deviation of expression across all CCLE samples was calculated for each gene. Paralogs where both genes had mean expression > 2 and stdev < 1.5 were selected (i.e. constitutively expressed genes).
Finally, to identify and include paralog families of size > 2, we applied a “difference from top paralog” filter. For each gene A in the pool, we identified its top paralog B by max sequence identity. Then for each other candidate paralog C, we calculated the drop in sequence identity, AB–AC (see distribution of drop % in Supplementary Fig. 5). For the prototype library, we defined A,B,C as being in the same family if AB−AC < 10%.
For the final Inzolia library, we relaxed several of these filters. The delta percent identity filter and the expression variance filter were removed entirely, and the difference from top paralog filter was expanded to 20%. The mean expression filter was retained. These three filtering steps resulted in a total of 4435 paralog pairs included in the Inzolia pool library.
In4mer prototype library production
Oligonucleotide pools consisting of designed four-plex guide arrays were synthesized by Twist Bioscience. The prototype pool consists of 43,972 arrays targeting 19,687 single genes, 2082 paralog pairs, 167 paralog triples, and 48 paralog quads.
5’-AATGATACGGCGACCACCGAcgtctcgAGATnnnnnnnnnnnnnnnnnnnnTAATTTCTACTATTGTAGATnnnnnnnnnnnnnnnnnnnnAAATTTCTACTCTAGTAGATnnnnnnnnnnnnnnnnnnnnTAATTTCTACTGTCGTAGATnnnnnnnnnnnnnnnnnnnnTTTTTTGAATggagacgATCTCGTATGCCGTCTTCTGCTTG-3’.
Italic: primer sequence Bold: BsmBI restriction sequence. Overhang in CAPS. nnnnn: guide sequence Underlined: DR sequence.
The pool of guide arrays was PCR amplified using KAPA HiFi 2X HotStart ReadyMix (Roche) using 20 ng of starting template per 25 μL reaction (primers are listed in Supplementary Data 10) and the following conditions: denaturation at 95 °C for 3 min, followed by 12 cycles of 20 s at 98 °C, 30 s at 60 °C, and 30 s at 72 °C, followed by a final extension of 1 min at 72 °C. The resulting amplicon was purified by the Monarch PCR & DNA Cleanup Kit (New England Biolabs) and cloned into the pRDA-550 vector by NEBridge® Golden Gate Assembly Kit (BsmBI-v2) The product from assembly reaction was purified and electroporated into Endura Electrocompetent cells (Lucigen). Transformed bacteria were diluted 1:100 in 2xYT medium containing 100 μg/mL carbenicillin (Sigma) and grown at 30 °C for 16 h. The plasmid DNA was extracted by PureLink™ Plasmid Purification Kit (Invitrogen). The library was sequenced to confirm uniform and complete library representation. The library was prepared in MD Anderson Cancer Center.
Inzolia library production
The final Inzolia pool consists of arrays targeting 19,687 single genes, 4435 paralog pairs, 376 paralog triples, and 100 paralog quads, plus 20 arrays targeting EGFP, 500 targeting intergenic loci, and 50 encoding non-targeting guides. Each array in the oligonucleotide pools is constructed as follows:
5’-AGGCACTTGCTCGTACGACGcgtctcgAGATnnnnnnnnnnnnnnnnnnnnTAATTTCTACTATTGTAGATnnnnnnnnnnnnnnnnnnnnAAATTTCTACTCTAGTAGATnnnnnnnnnnnnnnnnnnnnTAATTTCTACTGTCGTAGATnnnnnnnnnnnnnnnnnnnnTTTTTTGAATggagacgTTAAGGTGCCGGGCCCACAT-3’.
Italic: primer sequence Bold: BsmBI restriction sequence. Overhang in CAPS. nnnnn: guide sequence Underlined: DR sequence.
The pool of guide arrays was PCR amplified using NEBNext® High-Fidelity 2X PCR Master Mix (NEB) using 196 ng of starting template per 50 μL reaction (primers are listed in Supplementary Data 10) and the following conditions: denaturation at 98 °C for 1 min, followed by 7 cycles of 30 s at 98 °C, 30 s at 53 °C, and 30 s at 72 °C, followed by a final extension of 5 min at 72 °C. The resulting amplicon was purified by the Qiaquick PCR Purification Kit (Qiagen) and cloned into the pRDA-550 and pRDA-052 via Golden Gate cloning with Esp3I (Fisher Scientific) and T7 ligase (Epizyme). The assembly product was purified by isopropanol precipitation, electroporated into Stbl4 electrocompetent cells (Life Technologies) and grown at 37 °C for 16 h on agar with 100 ug/mL carbenicillin. Colonies were scraped and plasmid DNA (pDNA) was extracted via HiSpeed Plasmid Maxi (Qiagen). The library was sequenced to confirm uniform and complete library representation. The library was prepared in Broad institute.
Cell culture
K-562 and A549 cells were a gift from Tim Heffernan. A375 and MELJUSO were obtained from the Cancer Cell Line Encyclopedia. Cell line identities were confirmed by STR fingerprinting by M.D. Anderson Cancer Center’s Cytogenetic and Cell Authentication Core. All cell lines were routinely tested for mycoplasma contamination using cells cultured in non-antibiotic medium (PlasmoTest Mycoplasma Detection Assay, InvivoGen).
All cell lines were grown at 37 °C in humidified incubators at 5.0% CO2 and passaged to maintain exponential growth. For each cell line, the following medium and concentration of polybrene (EMD Millipore) and puromycin (Gibco) were used:
K-562: RPMI + 10% FBS, 8 μg/mL, 2 μg/mL
A549: DMEM + 10%FBS, 8 μg/mL, 2 μg/mL
A375: RPMI + 10% FBS, 1 μg/mL, 1 μg/mL
MELJUSO: RPMI + 10% FBS, 4 μg/mL, 1 μg/mL.
Cas12a screens
Lentivirus was produced by the University of Michigan Vector Core (prototype) or the Broad GPP (Inzolia). Virus stocks were not titered in advance. Transduction of the cells was performed at 1X concentration of virus with corresponding polybrene. Non-transduced cells were eliminated via selection puromycin dihydrochloride. The selection was maintained until all non-transduced control cells reached 0% viability. Once selection with puromycin was complete, surviving cells were pooled and 500x coverage cells were harvested for a T0 sample. After T0, cells were harvested at 500X coverage on corresponding days. The prototype In4mer screens were performed in MD Anderson Cancer Center. The Inzolia screens were performed in Broad Institute.
Prototype In4mer library genomic DNA preparation and sequencing
Genomic DNA (gDNA) was extracted using the Mag-Bind® Blood & Tissue DNA HDQ 96 Kit (Omega Bio-tek) and quantified by the Qubit™ dsDNA Quantification Assay Kits (ThermoFisher). Illumina-compatible guide array amplicons were generated by amplification of the gDNA in a one-step PCR. Indexed PCR primers were synthesized by Integrated DNA Technologies using the standard 8nt indexes from Illumina (D501-D508 and D701-D712) (Supplementary Data 10).
At least ~200X coverage gDNA per replicate across multiple reactions were amplified. Each gDNA sample was first divided into multiple 50 μL reactions with most 2.5ug gDNA per reaction. Each reaction contained 1ul each primer (10 μM), 1 μL 50X dNTPs, 5% DMSO, 5 μL 10X Titanium Taq Buffer, and 1 μL 10X Titanium Taq DNA Polymerase (Takara). The PCR conditions were: denaturation at 95 °C for 60 s, followed by 25 cycles of 30 s at 95 °C and 1 min at 68 °C, followed by a final extension at 68 °C for 3 min. After the PCR, all reactions from the same sample were pooled and then purified by E-Gel™ SizeSelect™ II Agarose Gels, 2% (ThermoFisher). Purified amplicons were quantified by Qubit™ dsDNA Quantification Assay Kits (ThermoFisher) and validated by D1000 ScreenTape Assay for TapeStation Systems (Agilent) (360 bp for in4mer, 501 bp for 7Mer). Purified amplicons were then pooled (with 30% customized random library to increase the diversity) and sequencing was performed by NextSeq 500 sequencing platform (Illumina) with custom primers (Integrated DNA Technologies) (Supplementary Data 10). The In4mer library was sequenced by read format of 151-8-8, single-end and the 7Mer library was sequenced by read format of 151-8-8-151, paired-end.
Inzolia genomic DNA preparation and sequencing
Genomic DNA (gDNA) was extracted using the Mag-Bind® Blood & Tissue DNA HDQ 96 Kit (Omega Bio-tek) and quantified by the Qubit™ dsDNA Quantification Assay Kits (ThermoFisher). Illumina-compatible guide array amplicons were generated by amplification of the gDNA in a one-step PCR. Indexed PCR primers were synthesized by Integrated DNA Technologies using the standard 8nt indexes from Illumina (D501-D508 and D701-D712). The sequences for the primer sets were listed in Supplementary Data 10.
At least ~200X coverage gDNA per replicate across multiple reactions were amplified. Each gDNA sample was first divided into multiple 100 μL reactions with most 10 μg gDNA per reaction. Each reaction contained 0.5 μL forward primer (100 μM), 10uL reverse primer (5 uM) 8 μL dNTPs, 5 μL DMSO, 10 μL 10X Titanium Taq Buffer, and 1.5 μL Titanium Taq DNA Polymerase (Takara). The PCR conditions were: An initial denaturation at 95 °C for 60 s, followed by 28 cycles of 30 s at 94 °C, 30 s at 52 °C, and 30 s at 72 °C followed by a final extension at 72 °C for 10 min. After the PCR, all reactions from the same sample were pooled and purified with Agencourt AMPure XP SPRI beads according to the manufacturer protocol (Beckman Coulter). Purified amplicons were quantified by Qubit™ dsDNA Quantification Assay Kits (ThermoFisher) and sequenced on a HiSeq2500 with a Rapid Run (200 cycle) kit (Illumina).
7mer screen data analysis
Reads for each reagent were counted using only exact matches to the entire 281 nucleotide 7mer sequence, excluding the leading DR (7 23mer spacer sequences + 6 20mer DR sequences). Fold changes were calculated relative to the mean of the T0 samples, and averaged across replicates. For each sample (T7/14/21), fold changes were normalized by subtracting the mean fold change of arrays with 7 nonessentials; i.e. setting no-essentials guides to zero.
We expected that the selected essential genes would not show any pairwise or higher order interactions, and thus should be governed by the multiplicative model of genetic interaction. To evaluate this model, we fit a regression model:
$$y , sim Abeta$$
(6)
where A is a binary matrix of 7mer guide arrays (rows, k = 384) by positions (columns, n = 7), with Ai,j = 1 if guide array i targets an essential gene at position j and 0 if not. (y) is the vector of normalized observed fold changes, and the n-length vector (beta) coefficients represent the single gene knockout phenotype learned from the model. We filtered this construct for reagents that encoded two or fewer essential genes (k = 87 rows). After linear fit, we compared the predicted zero, one, and two gene knockout fitness profiles (by summing the (beta) coefficients for each gene) to the mean observed knockout fitness. R2 values for each pool ranged from 0.78 to 0.91, and the overall quality of the linear fit supports the multiplicative model for non-interacting genes as assayed by combinatorial CRISPR knockouts of up to two genes. An accurate null model for noninteraction is critical for detecting and classifying deviations from this model that reflect positive or negative genetic interactions.
In4mer/Inzolia screen data analysis
In4mer library sequencing reads were mapped to the library using only perfect matches. BAGEL2 was used to normalize sample level read counts and to calculate fold changes relative to the T0 reference using the BAGEL2.py fc option with default parameters44. Essential and non-essential genes were defined using the Hart reference sets from refs. 39,41. Since the library targets both individual genes and specific gene sets (paralogs), we calculated the average gene/gene set (hereafter ‘gene’) log fold change as the mean of the clone-level fold changes across two replicates. All fold changes are calculated in log2 space. Cohen’s D statistics were calculated in Python as described in Paralog meta-analysis above. Data for recall-precision curves were calculated using BAGEL2. We set an arbitrary threshold of fc < −1 for essential genes.
For genetic interaction analysis, the expected fold change was calculated as the sum of the gene-level fold changes for each individual gene in the gene set. Expected fc was subtracted from observed fc to calculate delta log fold change, dLFC, where negative dLFC indicates synthetic/synergistic interactions with more severe negative phenotype, and positive dLFC indicates positive/suppressor/masking interactions with less severe negative or more positive phenotype than expected. We set an arbitrary threshold of dLFC < −1 for synthetic lethality, and >+1 for masking/suppressor interactions.
RAS synthetic lethal validation
An arrayed knockout apoptosis assay approach was adopted to validate RAS synthetic lethality in K-562. Two guides were selected for each of the three RAS genes, and two clones were designed for each target/gene combination. Guide RNAs were selected through CRISPick and gblocks (same construct as Inzolia library) were synthesized by Integrated DNA Technologies. The arrays were individually cloned into the pRDA_550 backbone and plasmids were validated by Sanger sequencing. The plasmids were then individually transfected to K-562 cells via the Neon Transfection System (Invitrogen). Each group was transfected with 2 μg of DNA per 2 × 106 cells, using the recommended setting for K-562 electroporation with one pulse at 1000 v, 50 ms. Non-transfected cells were eliminated through puromycin selection, which was maintained until non-transfected control cells reached 0% viability. Triplicate wells were maintained after selection until the end of the experiment. Cell viability, total cell numbers, live cell size and dead cell size data were collected through reading Trypan Blue (Gibco) stained cells via Countess II FL (Thermo Fisher) at each passage until 9 days after puromycin selection, in line with Inzolia screen end point of 8 days in K-562 cells. Percent dead cells were normalized to negative control and one-way ANOVA was conducted to compare experimental groups against the negative control for statistical significance.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
- SEO Powered Content & PR Distribution. Get Amplified Today.
- PlatoData.Network Vertical Generative Ai. Empower Yourself. Access Here.
- PlatoAiStream. Web3 Intelligence. Knowledge Amplified. Access Here.
- PlatoESG. Carbon, CleanTech, Energy, Environment, Solar, Waste Management. Access Here.
- PlatoHealth. Biotech and Clinical Trials Intelligence. Access Here.
- Source: https://www.nature.com/articles/s41467-024-47795-3