Proteogenomic analysis reveals non-small cell lung cancer subtypes predicting chromosome instability, and tumor microenvironment

Human subjects with investigation of clinicopathologic features

A total of 229 samples, self-reported as Korean ethnicity, were histologically defined as NSCLC and selected for this study. Tumors and normal tissues adjacent to the tumor (NAT) were harvested under the Clinical Proteomic Tumor Analysis Consortium (CPTAC) guidelines from Asan Medical Center, Korea, with the approval of the Institutional Review Board of Asan Medical Center (Approval number: 2019-1210). In brief, we retrieved the 408 NSCLC cases whose cold ischemic time of fresh frozen tumor and NAT tissue sample was less than 15 minutes from the NSCLC cases of the bioresource center of Asan Medical Center deposited from January 2010 to March 2019. To investigate the impact of proteogenomic findings on post-operative therapy and metastasis, we preferentially selected 137 NSCLC patients showing lymph node metastasis at the pathologic examination of resection specimens. We further included 113 cases with absent lymph node metastasis at the time of surgical resection. With the further exclusion of 21 cases showing inadequate nucleic acid quality metrics, 229 patients were finally enrolled in the study. The clinical information of the enrolled patients, including age, gender, smoking history, adjuvant therapy, presence of post-operative recurrence, and treatment histories, was reviewed and documented by the thoracic oncologists (W.J.J, C.M.C., and J.C.L.). All the histopathologic slides of the resection specimens removed from the enrolled patients were independently reviewed by two thoracic pathologists (H.S.H. and S.J.J.) and documented for pathologic findings such as pathologic diagnosis, grade, tumor size, pleural invasion, histologic pattern (for non-mucinous adenocarcinoma, lymphovascular invasion, spreading through alveolar space, and lymph node metastasis. The cases with discrepancy between the independent assessments were discussed in consensus meetings. Based on these findings, all enrolled patients were re-staged according to the 8th lung cancer TNM stage67. In addition, pattern of tumor-infiltrating lymphocytes (TIL) was reviewed by the pathologists and classified according to the classification by Saltz et al.68.

Whole exome sequencing

Genomic DNA was isolated from a FFPE (formalin-fixed paraffin embedded) tumor and normal samples (NATs or blood buffy coat69 if NATs are not available). We generated whole exome sequencing (WES) libraries using SureSelect V6-Post (Agilent, CA, USA). Pooled libraries were run on the Illumina NovaSeq to obtain an average of 300x depth per tumor library and 100x depth per NAT and blood buffy coat library. The raw Illumina BCL (base calls) binary files were converted into Fastq files using the Illumina package bcl2fastq. Sequencing read data was checked for quality and adaptor/overrepresented sequence (ORS) by FastQC (v0.11.9). There was no ORS that could be a quality problem, and since the adapter sequence was less than 1% in all samples, the additional trimming process was skipped. Fastq files were aligned to the human reference GENCODE GRCh38.p13 v32 primary assembly genome by BWA (v0.7.17). Among the three algorithms of BWA, the latest BWA-MEM was selected as it is suitable for Illumina sequences ranging from 70 bp to 1 mbp with fast and accurate performance. After alignment, we sorted the BWA process output by coordinates via Picard SortSam and checked duplicate reads using Picard MarkDuplicates (v2.23.1). Two types of duplicated reads, optical duplicates (incorrectly detected as multiple clusters by the optical sensor of the sequencing machine) and library duplicates (created during PCR library preparation), were annotated for downstream analyses. For all tools requiring interval files, we used V6_S07604514_hs_hg38_S07604514_Covered.bed provided by Agilent. Because of systematic errors from the sequencing machine, base quality scores were recalibrated by the GATK4 BQSR algorithm (v4.1.8.0). The algorithm calculates the average Phred score by assuming the mismatch base is an error by matching known variants with the data base. We used the known variants VCF from dbSNP (broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf, broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf.gz) and the Mills and 1000 G project (broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz) from the GATK4 bucket (https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/). Then, the empirical Phred score is calculated using the average Phred score and then recalibration is performed based on the empirical Phred score.

Variant calling

We called somatic variants from the recalibrated BAMs using GATK4 Mutect2 (v4.1.8.0). In the Mutect2 pipeline, somatic-hg38_1000g_pon.hg38.vcf.gz data was used as panel-of-normals and somatic-hg38_af-only-gnomad.hg38.vcf.gz data was used for germline variants removal. We performed tumor with paired normal mode to exclude germline variants. As one tumor sample had no matching normal sample, the tumor only mode was performed. After calling somatic variants from GATK4 Mutect2, a read support reference for the well-known variant sites was created from the tumor recalibrated BAMs using GATK4 GetPileupSummaries. The tool requires a common somatic mutation sites VCF, so we used the somatic-hg38_small_exac_common_3.hg38.vcf.gz file created from the gnomAD resource from the GATK4 bucket (https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/). The GetPileupSummaries table was used in the GATK4 CalculateContamination to calculate the fraction of reads deriving from cross-sample contamination. The calculated cross-sample contamination data were saved as a table for each sample, and tumor segmentation data by minor allele fraction were additionally collected. The raw output of Mutect2 was converted into filtered VCF using cross-sample contamination table and tumor segmentation data by GATK4 FilterMutectCalls. The VCFs generated from GATK4 FilterMutectCalls were converted to annotated VCF and MAF files via GATK4 funcotator. We used GATK4 4.2.0.0 version solely for the annotation step to use the latest annotation data source for GATK4 funcotator. The latest pre-packaged data sources (funcotator_dataSources.v1.7.20200521 s) was downloaded from the GATK4 bucket (gs://broad-public-datasets/funcotator/). All annotations were performed in CANONICAL transcript-selection-mode after choosing a custom transcript list. Finally, the annotated MAF files were merged into one using maftools v2.8.0570.

Germline variants were called from the recalibrated BAMs using GATK4 HaplotypeCaller (v4.1.8.0). The reference confidence scores were confirmed in GVCF mode using the option -ERC GVCF, which was a reference model emitted with condensed non-variant blocks. For variant calls, -G StandardAnnotation and -G AS_StandardAnnotation annotation options were applied. All single sample GVCFs from GATK4 HaplotypeCaller were imported into GenomicsDB by GATK4 GenomicsDBImport (v4.1.8.0) for joint genotyping. Joint genotyping was performed using GATK4 GenotypeGVCFs (v4.1.8.0) from the constructed GenomicsDB. The produced joint VCF was recalibrated in variant quality by two steps. The first step was the recalibration of the SNPs. After calculating the exome specific recalibration score through GATK4 VariantRecalibrator (v.4.1.8.0), it was applied to the joint VCF using GATK4 ApplyVQSR (v.4.1.8.0). We used the training sets from the HapMap project (broad_hg38_v0_hapmap_3.3.hg38.vcf.gz), Mills and 1000 G project (broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz, broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf.gz) and dbSNP (broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf) from GATK4 bucket (https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/). The second step was the recalibration of indels. The same tools in the first step were used and the training sets from Mills and 1000 Genome Project (broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz) and the dbSNP genotyping calls (broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf). The recalibrated joint VCF was converted to annotated VCF via the GATK4 funcotator. The 4.2.0.0 version of GATK4 was again used at the annotation step to utilize the latest annotation data source for GATK4 funcotator. The latest pre-packaged data sources (funcotator_dataSources.v1.7.20200521 g) was downloaded from the GATK4 bucket (gs://broad-public-datasets/funcotator/). All annotations were performed in CANONICAL transcript-selection-mode.

Identification of copy number alterations

DNA somatic copy number variations (CNVs) were detected using CNVkit v0.9.871. We labeled the copy number status with the following criteria: genes with an absolute copy number greater than or equal to 4 were labeled as “amplification”, and genes greater than or equal to 2.5 and less than 4 were labeled as “gain”. Similarly, genes with an absolute copy number from 0.5 to 1.5 were labeled “heterozygous deletion”, and those less than 0.5 were labeled “homozygous deletion”.

Copy number signature analysis and WGD detection

For copy number signature analysis and WGD detection, we used FACETS v0.16.072 to identify allele-specific copy number information. Preprocessed paired tumor-normal BAM files and a VCF file of common and germline polymorphic sites downloaded from https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/ were used as the input for FACETS. Two samples were dropped out due to quality problems; therefore, 228 samples in total were used for copy number signature analysis and WGD detection. We defined samples as ‘WGD-positive’ if greater than 50% of their autosomal genome had an absolute copy number greater than or equal to two.

Mutational signature analysis was performed using the COSMIC signature database v373 and R package Sigminer v2.1.574, as previously described75. Briefly, three copy number features, including total copy number, segment length, and loss-of-heterozygosity state were extracted and classified into 48 components that categorize the continuous values regarding the value range and biological significance. To decide the number of signature groups (or a factorization rank), non-negative matrix factorization (NMF) was performed using a tumor-by-component matrix with 50 runs checking 2 to 12 ranks. Based on the cophenetic score plot, we determined to use rank seven for copy number variants. Each signature was nominated by the COSMIC signature of the highest cosine similarity. Then, by performing hierarchical clustering, we assigned the samples to one of the signatures based on the consensus matrix.

RNA-seq data generation and preprocessing

Total mRNA-seq libraries were prepared using Library-TruSeq Stranded Total RNA with Ribo-Zero H/M/R_Gold. Pooled libraries were run on the Illumina NovaSeq6000 to generate an average of 200 million reads per library with 100-bp pair-end. The raw Illumina BCL binary files were converted into Fastq files using the Illumina package bcl2fastq. The sequenced read data was checked for quality and adaptor/overrepresented sequence (ORS) by FastQC (v0.11.9). There was no ORS that could be a quality problem, and since the adapter sequence was less than 1% in all samples, the additional trimming process was skipped. Fastq files were aligned to the human reference GENCODE GRCh38.p13 v32 primary assembly genome using STAR (v2.7.3a). After alignment, we sorted the output by coordinates via Picard SortSam and checked duplicate reads using Picard MarkDuplicates (v2.23.1). The reads that contain Ns in the cigar string were split using GATK4 SplitNCigarReads. Regarding systematic errors from the sequencing machine, base quality scores were recalibrated by the GATK4 BQSR algorithm (v4.1.8.0). The aligned BAMs from STAR were sorted by coordinates using Samtools Sort (v1.10). The Salmon algorithm (v1.4.0) was applied to quantify transcript-level expression, and raw read counts were produced. The output files (quant.sf) produced by Salmon, quantified transcript-level estimates, were imported to R using tximport and converted to a gene-level expression matrix. Then, we estimated the size factors using the “median ratio method” with the “estimateSizeFactors” function supported by DESeq2 R packages. With estimated size factors, we normalized the gene-level read count matrix using the “counts” function in DESeq2 R packages76, which divides the read counts by the size factors.

Isoform expression analysis

Isoform scale expression was quantified from transcriptome data using a pipeline including StringTie77. STAR-aligned BAM without going through GATK4 SplitNCigarReads in RNA preprocessing was used. Transcripts of each sample were assembled using StringTie v2.1.778. The first analyzed outputs were combined into a single assembly using merge command of StringTie (stringtie –merge), creating a single merged gtf containing the transcript of the entire sample. After that, a secondary analysis was conducted to calculate the expression in isoform level of each sample based on the merged transcript assembly using the StringTie Ballgown command (stringtie -eB). The novel isoform in the StringTie outputs were compared and annotated using gffcompare v0.11.679. Processed isoform gtfs are made into a single count matrix using predDE.py, a python script provided by StringTie.

The abundance or exon usage change between normal and cancer tissue was analyzed and visualized using IsoformSwitchAnalyzeR pipeline. IsoformSwitchAnalyzeR v1.17.480 pipeline includes DRIMseq81 to find isoform switch by condition and extract the sequence of mRNA or amino acid produced from the isoform. We analyzed predicted isoforms or ORF lists with additional tools as CPAT82, PFAM83,84, SignalP85 and IUPred2A86.

Fusion gene analysis

To detect the fusion genes, RNA-seq fastq was aligned, and fusion calling and filtering were performed. First, RNA-seq reads were mapped to the GENCODE GRCh38.p13 v32 primary assembly genome using STAR aligner v2.7.3a87. Unlike previous STAR mapping, ‘–chimOutType WithinBAM’ was included in the STAR output to include chimeric read to be suitable for use in arriba fusion calling. To increase sensitivity, the parameters were adjusted as recommended by the author as follows:

  • –chimSegmentMin 10

  • –chimOutType WithinBAM SoftClip

  • –chimJunctionOverhangMin 10

  • –chimScoreMin 1

  • –chimScoreDropMax 30

  • –chimScoreJunctionNonGTAG 0

  • –chimScoreSeparation 1

  • –alignSJstitchMismatchNmax 5 -1 5 5

  • –chimSegmentReadGapMax 3

We found fusion gene by applying Arriba v1.2.088 to the STAR mapped BAM file. Mismatches were discovered by comparing chimeric reads with reference genome assembly fasta and annotation gtf. Additionally, fusion genes had bad quality of frequently found healthy tissues were excluded from the analysis using the hg38 fusion blacklist provided by Arriba. Arriba outputs were labeled with sample name and tissue and concatenated with one table.

From the Arriba output, fusion genes with confidence of “low” was removed. At this time, the fusion genes appearing in normal tissues were collected and used as a blacklist for tumor tissue fusion genes. If the fusion genes found in tumor tissue were included in this blacklist, they were excluded from the analysis. Furthermore, fusion genes including known cancer-related genes (ALK, ROS1, RET, and PTK2)89 were annotated as “known fusion” and fusion genes found across several tumor samples as “recurrent fusion”. They were used for downstream analysis. After filtering, we visualized the structure and protein domain of fusion genes using the R script (”draw_fusion.R”) provided by arriba.

Protein extraction and tryptic digestion

For in-depth proteomic experiments, fifty milligrams of cryopulverized human NSCLC and NAT samples were homogenized in lysis buffer at a ratio of about 300 μL lysis buffer for every tissue. The lysis buffer consisted of 5% SDS, 50 mM TEAB (pH 7.55, Thermo Fisher Scientific, USA, 90114), protease inhibitor cocktail (1:100; Thermo Fisher Scientific, USA, 78430), 1 EA of PhosSTOP (Roche, Swiss, 4906845001), and 20 μM PUGNAc (Sigma-Aldrich, USA, A7229). Tissue lysis was performed with a couple of probe sonications using the Digital Sonifier SFX 550(Branson, USA). Sonication time was set at 30 s with a cycle of on-time 5 s and an interval of 3 s at 28%. Lysates were centrifuged at 14,500 g at 4 °C for 5 min, followed by measuring protein concentration using the BCA assay (Thermo Fisher Scientific, USA, 23225), and kept at -80 °C until used for analysis.

Protein lysates were reduced at 95 °C for 10 minutes with 20 mM 1,4-dithiothreitol (Roche, Swiss, 10708984001), followed by alkylation with 40 mM iodoacetamide (Sigma-Aldrich, USA, I6125) in the dark at RT for 30 minutes. 12% phosphoric acid (Sigma-Aldrich, USA, 695017) of 1/10 of the sample volume and 7 multiple volumes of S-trap binding buffer (100 mM TEAB in 90% MeOH) were added sequentially for formation of colloidal status. The solution was loaded on an S-Trap™ spin column (Protifi, USA) and centrifuged to 4000 g for 30 s. The captured protein was washed three times at 4000 g for 30 s with 400 μL of binding buffer. For tryptic digestion, 125 μL of digestion buffer (50 mM TEAB) containing Pierce Trypsin/Lys-C Mix (Thermo Fisher Scientific, USA, A41007) (w/w ratio of 1:25) was added and incubated at 37 °C for 6 h. The digested sample was sequentially centrifuged with 80 μL of digestion buffer, 80 μL of 0.2% formic acid (Honeywell, USA, 94318) at 1000 g for 60 s, and 80 μL of 0.2% formic acid in 50% acetonitrile (ACN) at 4000 g for 60 s. Eluted peptides were dried and stored at −80 °C until the next process. Samples were desalted using a C18 spin column (Harvard Apparatus, USA) and dried. A205 application of NanoDrop One (Thermo Fisher Scientific, USA) was employed for peptide concentration measurements. A baseline was established using 1 µL of HPLC water, and dried samples were each diluted with HPLC water and measured at 1 µL. For the common reference sample, 10 μg aliquots were taken from the sample being analyzed, pooled together, and used for each TMT batch.

Construction of the common reference pool

In-depth proteomic analyses for this study were organized as TMTpro 16 plex experiments. For comparative quantification between all samples across experiments, a common reference (CR) sample was involved at the 134 N channel in each 16-plex. The CR is a protein mixture that contains all samples analyzed in the TMT experiments. All subsequent procedures, including digestion, were performed along with other individual samples to minimize variation.

TMTpro 16-plex labeling

Peptides, 250 μg per sample (based on peptide level quantification with NanoDrop One), were labeled with 16-plex TMT reagents (Thermo Fisher Scientific, USA, A44520) according to the manufacturer’s protocol. For each peptide aliquot of an individual sample, 1 mg of labeling reagent was used. Dissolve each sample in 50 μL of 100 mM TEAB (pH 8.5), add 20 μL of labeling reagent dissolved in anhydrous acetonitrile, and incubate for 1 h with shaking. 1.8 μg of labeled peptides from each channel were taken out and subjected to LC-MS/MS analysis to confirm labeling efficiency before pooling. Label efficiency criteria were set as having a minimum of 95% fully labeled MS/MS spectra in each sample. To quench the reaction, 5 μL of 5% hydroxylamine was added and incubated for 15 minutes. Samples were pooled in each 16-plex experiment and sequentially desalted with Sep-Pak C18 3 cc Vac Cartridge (Waters, USA) and dried.

Mid pH reverse-phase liquid chromatography fractionation

To minimize sample complexity, samples were fractionated by mid-pH reversed phase (RP) separation using Shimadzu LC20A (Shimadzu, Japan) with an analytical column (XBridge Peptide BEH C18 Column; 300 Å, 5 µm, 4.6 × 250 mm, Waters, USA) and a guard column (SecurityGuard cartridge C18, 4 ×3.0 mm, Phenomenex, USA). We performed the peptide fractionation at a flow rate of 0.5 mL/min. Mobile phases A and B were 10 mM TEAB and 10 mM TEAB in 90% ACN, and the sample was dissolved in 0.1% formic acid in 90 μL. The LC gradient of mobile phase B was 5% in 8 min, 40% in 65 min, 44% in 69 min, 60% from 74 min to 88 min, and 5% in 90 min. We collected a total of 96 fractions from 8 min, pooling them into 24 non-consecutive fractions every 0.91 min. We then concatenated 95% of each fraction into 12 non-consecutive fractions for post-translational modifications (PTMs) analysis.

Phosphopeptide enrichment

For the phosphopeptide enrichment, immobilized metal affinity chromatography (IMAC) was used. 300 μL of Ni-NTA agarose bead slurry (QIAGEN, Germany, 30410) was resuspended in the tube. The slurry was spun down for 1 minute, and the supernatant was removed. The bead was washed three times with 1 mL of HPLC water and then incubated with 1.2 mL of 100 mM EDTA (Sigma-Aldrich, USA, E7889) for 30 minutes by end-over-end turning at RT. After three washes with 1 mL of HPLC water, the beads were incubated with 1.2 mL of 10 mM FeCl3 for 30 minutes with end-over-end turning at room temperature. 12 fractions for PTM analysis are reconstituted in 113 μL of 0.1% TFA in 50% ACN, and then 337 μL of 0.1% TFA in 95% ACN is added sequentially. Beads were resuspended in 140 μL of 1:1:1 ACN:MeOH:0.01% acetic acid solution and 10 μL of beads were aliquoted into 12 tubes for each fraction. We resuspended the beads in the sample solution to bind the phosphopeptides, and then gently mixed them for 30 minutes at room temperature. The supernatant was collected separately for the next acetyl peptide enrichment step. Beads coupled with phosphopeptides were dissolved in 180 μL of 0.1% TFA in 80% ACN and desalted using a C18 stage-tip to elute for LC-MS/MS.

Acetylpeptide enrichment

For the acetylpeptide enrichment, the PTMScan® Acetyl-Lysine Motif [Ac-K] Kit (Cell Signaling Technology, USA, 13416) was used. IMAC eluents were concatenated into four fractions and dried. Peptides were dissolved in 1.4 mL of 1 x IAP buffer, in which 10 x IAP buffer (5.78 g MOPS-NaOH, 1.461 NaCl, 0.8 g dibasic sodium phosphate, and 0.08 g monobasic sodium phosphate) were adjusted to pH with acetic acid ( ~ 1.4 to 1.6 mL). Agarose beads bound to the acetyl-lysine motif antibody were pre-washed a total of 4 times using 1 x IAP buffer under ice and split into 2 tubes with equal volume. The peptides were transferred to a tube and incubated at 4 °C for 3 hours with end-over-end turning. Each tube was centrifuged (2,000 g, 30 sec, 4 °C), and the supernatant was separated, followed by washing the beads twice with 1 mL of ice-cold PBS and three times with chilled HPLC water. 100 μL of 0.15% TFA was added to beads coupled with acetyl peptides and then incubated at RT for 10 minutes, gently mixed every 2-3 minutes, and then centrifuged to elute. We desalted the eluted acetyl peptides using a C18 stage-tip (IMAC procedure) and dried them after a total of two elution steps.

LC-MS/MS for proteomics analyses

For global proteomic analyses, the Ultimate 3000 RSLC nano system (Thermo Fisher Scientific, USA) coupled with the Q Exactive HF-X hybrid quadrupole-Orbitrap mass spectrometer (Thermo Fisher Scientific, USA) was used. Trap column (Acclaim™ PepMap™ 100 C18 LC Column, C18, 100 um x 2 cm, 5 μm, Thermo Fisher Scientific, USA) and analytical column (EASY-Spray™ LC Columns, C18, 75 um x 50 cm, 2 μm, Thermo Fisher Scientific, USA), which were heated to 50 °C to prevent over-pressuring, were equipped with UHPLC separation. Mobile phase flow rate was 0.3 μL/min, and solvents A and B were 97% water, 3% ACN, 0.1% formic acid, and 90% ACN, 0.1% formic acid, respectively. The gradient profile of solvent B was 2% in 12 min, 8% in 16 min, 25% in 140 min, 35% in 150 min, 85% from 155 min to 163 min, and 2% from 165 min to 185 min. Data-dependent acquisition was performed at a spray voltage of 1.5 kV. MS1 spectra were measured with a resolution of 120,000 and a mass range of 350 to 1500 m/z. The AGC target of 3e6, maximum injection time of 50 ms, and isolation window of 0.7 m/z were set. Top 15 most abundant precursors per cycle were selected to trigger MS/MS. MS2 spectra were measured with a resolution of 45,000, 1e5 of the AGC target, 120 ms of maximum injection time, 34% of collision energy, and 110 of the fixed first mass (m/z). The precursor charge state was 2-5, the intensity threshold was 1e4 and the dynamic exclusion time was 45 s.

For PTM analysis, the Ultimate 3000 RSLC nano system (Thermo Fisher Scientific, USA) coupled with the Exploris 480 orbitrap mass spectrometer (Thermo Fisher Scientific, USA) was used. Trap column (Acclaim™ PepMap™ 100 C18 LC Column, C18, 100 um x 2 cm, 5 μm, Thermo Fisher Scientific, USA) and analytical column (EASY-Spray™ LC Columns, C18, 75 um x 50 cm, 2 μm, Thermo Fisher Scientific, USA) were equipped with UHPLC separation. The mobile phase flow rate was 0.3 μL/min, and solvents A and B were 0.1 % formic acid in water and 0.1 % formic acid in ACN, respectively. The LC gradient of solvent B was 2% in 14 min, 4% in 17 min, 16% in 120 min, 25% in 145 min, 85% from 150 min to 158 min, and 4% from 160 min to 185 min. Data-dependent acquisition was performed at a spray voltage of 2.1 kV and the cycle time was set to 3 sec. MS1 spectra were measured with a resolution of 120,000 and a mass range of 350 to 1500 m/z. The normalized AGC target (%) was 300, and the maximum injection time was 50 ms. MS2 spectra were measured with a resolution of 45,000, 7.5 e4 of AGC target, 120 ms of maximum injection time, 38% of collision energy, and 0.7 m/z of isolation window. Peptides were selected for a 2-6 precursor charge state with an intensity threshold of 1e4. Peptides that triggered MS/MS scans were dynamically excluded from further MS/MS scans for 45 sec, with a +/- 10 ppm mass tolerance.

The database search strategy

In addition to canonical peptides, to reliably identify variant, post-translational modified, and novel peptides in global proteomics, we used a multi-stage search strategy: 1) the identification of canonical peptides (primary database search), 2) the identification of variant and modified peptides by considering 2355 modifications in Unimod, and 3) the identification of novel peptides. For each stage, the FDR was calculated separately using a target-decoy strategy, and identifications were obtained at 1% FDR. Only unidentified spectra from the previous stage were subjected to the subsequent stage. Details of each stage are described in the following sections: 1) “The Primary Database Search”, 2) “Identification of Variant and Modified Peptides”, and 3) “Identification of Novel Peptides”.

The primary database search

All MS raw files were analyzed using Proteome Discoverer v2.4 with the SequestHT search engine90 against the UniProt human protein database v2021.01 (97,795 entries), combined with 179 common contaminant proteins. The search parameters were set to 10 ppm for precursor mass tolerance, 0.02 Da for fragment mass tolerance, fully-tryptic enzyme specificity allowing for up to 2 missed cleavages, carbamidomethylation ( + 57.021 Da) on Cys residues and TMT modification ( + 304.207 Da) on the peptide N terminus and Lys residues for static modifications, and oxidation ( + 15.995 Da) on Met residues, Met-loss (-131.040 Da) on protein N-terminal Met residues and deamidation ( + 0.984 Da) on Asn and Gln residues for dynamic modifications. The minimum length of a peptide was set at six residues.

For phosphoproteome and acetylproteome, the search parameters additionally included phosphorylation ( + 79.966 Da) on Ser, Thr, and Tyr residues and acetylation ( + 42.016 Da) on the protein N terminus and Lys residues for dynamic modifications. The TMT modifications were set as dynamic modifications for acetylproteome, to prevent co-assignment of TMT and acetylation. All peptide spectrum matches (PSMs) were subsequently rescored by Percolator91 and validated at an estimated false discovery rate (FDR) of 1%. For phosphosite and acetylsite localization, ptmRS92 was used, and modified sites with ptmRS probability greater than 0.95 were regarded as confident.

Identification of variant and modified peptides

Using CustomProDB93, each patient’s germline and somatic mutations were used to generate variant protein sequences. The variant protein sequences were merged for each batch and combined with 16,130 UniProt human proteins identified from the primary database search and 179 common contaminant proteins to generate combined customized databases.

All MS raw files were converted to MGF files using msconvert v3.0.1, and the precursor m/z values were replaced with the recalibrated values by Proteome Discoverer in the primary database search (this recalibrated MS2 spectra were used in all subsequent searches). The unidentified MS2 spectra from the primary database search were searched against the combined customized databases using MODplus v1.0294. The search parameters were set to 10 ppm for precursor mass tolerance, 0.01 Da for-fragment mass tolerance, semi-tryptic enzyme specificity allowing for up to 2 missed cleavages, -1 to +2 for 13 C isotope error, carbamidomethylation ( + 57.021 Da) on Cys residues and TMT modification ( + 304.207 Da) on the peptide N terminus and Lys residues for static modifications. All modifications in Unimod database v2020.10 (2355 entries) including all amino acid substitutions were considered for dynamic modification (multiply modified peptides were allowed within the modified mass range of -150 to +350 Da). All PSMs were validated at an estimated FDR of 1% using MODplus FDR toolkit.

To identify neoantigen candidates, the proteomics-supported somatic mutations were filtered by examining the intensity of TMT reporter ion matched to the patient for whom the somatic mutation was called (the corresponding reporter ion intensity must account for at least 20% of the total intensity of all reporter ions). The immunogenicity of filtered somatic mutations was also evaluated by predicting binding affinity to human leukocyte antigen (HLA)-I molecules.

Identification of novel peptides

All unidentified MS2 spectra from the primary and subsequent variant database searches were analyzed to identify novel peptides originating from pseudogenes, long noncoding RNAs (lncRNAs), untranslated regions (UTRs), and novel isoforms (including fusion genes). For pseudogenes, lncRNAs, and UTRs whose RNA transcripts’ FPKMs are greater than 0, the RNA sequences were selected from the GENCODE transcript fasta files v32, and their three frame translations were generated for each patient, except for UTRs, whose translations were generated by the UTR sequence database construction method55, which assumes that alternative cognate and near-cognate start codons and translational readthrough in the stop codons can result in abnormal translation. These noncoding peptide sequences were merged for each batch and matched to MS2 spectra by MS-GF+ v2021.0995 with the following search parameters: 10 ppm for precursor mass tolerance, tryptic enzyme specificity allowing up to 2 missed cleavages (also permitting non-enzymatic terminals), no 13 C isotope error allowed, carbamidomethylation ( + 57.021 Da) on Cys residues and TMT modification ( + 304.207 Da) on the peptide N terminus and Lys residues for static modifications, oxidation ( + 15.995 Da) on Met residues and deamidation ( + 0.984 Da) on Asn and Gln residues for dynamic modifications, and TMT protocol. The minimum length of a peptide was set at 8 residues. All PSMs were validated at an estimated FDR of 1%.

The novel transcripts, including fusion genes, were predicted by StringTie v2.1.7 (class code =, c, k, m, n, and j)78,96 and Arriba v1.2.088, respectively, and their three frame translations were generated to compose the novel protein isoform sequences for each patient. The novel isoform sequences were merged for each batch and matched to MS2 spectra using Comet v2021.02.097 with the following parameters: 10 ppm for precursor mass tolerance, semi-tryptic enzyme specificity allowing up to two missed cleavages, no isotope error allowed, carbamidomethylation ( + 57.021 Da) on Cys residues and TMT modification ( + 304.207 Da) on the peptide N terminus and Lys residues for static modifications, and oxidation ( + 15.995 Da) on Met residues and deamidation ( + 0.984 Da) on Asn and Gln residues for dynamic modifications. The minimum length of a peptide was set at 8 residues. All PSMs were subsequently rescored by Percolator and validated at an estimated FDR of 1%.

We rejected PSMs conflicting with both MS-GF+ and Comet (i.e., identical spectra but different peptides assigned). The remaining novel peptides were searched using BLAST98 and filtered out if there were peptide sequence matches in the Uniprot (v2022.03, 226,999 entries), RefSeq (v2022.09, 130,184 entries) and/or Gencode (v41, 110,224 entries) human protein sequences, allowing no more than a single amino acid substitution. We also removed the frameshift peptides to concentrate solely on novel peptides derived from noncoding regions and novel isoforms. Finally, we used peptides supported by at least one RNA-Seq read and a quantifiable PSM to retrieve unique peptides at the gene level. In total, we obtained and used 1045 novel sequences for further analysis.

Quantification of Protein, Phosphosite and Acetylsite

Peptide abundances were normalized to have the same total amount of peptide abundances between TMT channel values in the same batch by Proteome Discoverer v2.4 with the SequestHT search engine90 against the UniProt human protein database v2021.01 (97,795 entries), combined with 179 common contaminant proteins. The co-isolation threshold 50% and the average reporter S/N threshold 10% were used to filter out quantification values of low quality. For global proteome quantification, proteins with at least one unique and/or razor peptides were exported, and normalization between batches was achieved by dividing each TMT channel value by that of the common reference (CR) channel in the same batch. The log2 ratio value was used for the subsequent quantification analysis.

The quantification of phosphosites and acetylsites was performed using the same method: the sum of peptide abundances containing each modification site was calculated to represent modification site abundance, and each TMT channel value was normalized using the same method as the global proteome data. After applying two-component normalization of TMT ratios for each proteomics dataset99, the log2 ratio value was used for the subsequent quantification analysis.

We selected features that have less than 30% missing values across all tumors for the global proteomics, phosphoproteomics, and acetylproteomics dataset using preprocessed normalized log2 ratio values. After then, we performed k-nearest neighbor imputation (k = 5) for missing values using the impute R package (https://doi.org/10.18129/B9.bioc.impute) for the subsequent quantification analysis.

NMF clustering analysis for multiomics subtypes

To identify patient subtypes, we performed non-negative matrix factorization (NMF) analysis on global proteins and PTM sites as instructed in previous CPTAC studies10,13. We concatenated preprocessed imputated global proteomics, phosphoproteomics, and acetylproteomics dataset into a single matrix. Then, we excluded features with the lowest standard deviation (bottom 5th percentile), followed by scale and z-score transformation. Using the NMF R-package100, we performed the NMF analysis for factorization ranks from 2 to 10 using the standard NMF algorithm from Brunet et al.101 and 2000 max iterations for convergence and repeated 50 times to compute clustering statistics. We determined the optimal factorization rank k, representing the number of clusters, from the rank with the maximal cophenetic correlation coefficient value and its drastic decrease. With the optimal factorization rank k, we sought to robust clusters from the NMF analysis using 200 runs and 5000 max iterations. From decomposed NMF matrices, we assigned samples to NMF clusters corresponding to the optimal factorization rank k and obtained features specific to each cluster according to the row-wise feature score102. We defined a “core sample” for those having a cluster membership score ≥ 0.5 as described previously10,13.

Combined NMF and Comparison of NMF subtypes with previous NSCLC subtypes

To assess the concurrence between our subtypes and lung cancer subtypes defined through recent multi-omics studies, we conducted NMF analysis by integrating our study cohort with previous CPTAC study cohort, which contains LUAD10 and LSCC13. Prior to running NMF, each study was subjected to pre-processing using the methodology outlined in the “NMF clustering analysis for multiomics subtypes” section. Features pertaining to proteomics, phosphoproteomics, and acetylproteomics datasets were examined, and only those identified as commonly present in all three studies were selected and concatenated into a single matrix for NMF clustering. Using the NMF R-package, we sought to robust cluster from the NMF analysis using 200 runs and 5000 max iterations with rank 5. We analyzed the subtypes derived from NMF decomposition matrices for each individual study within the study cohort samples of LUAD and LSCC and compared them with the subtypes identified in the combined NMF results.

To validate the robustness of our subtype classification in a recent NSCLC study10,11,13,14 comprising 462 patients, we conducted a comparative analysis of subtype features across multiomics datasets included in each individual study. To identify the features associated with the subtypes of each study, we performed re-clustering of NMF features using the omics datasets and parameters employed in prior studies and selected features utilizing the “Max” method102. Gillette study was conducted with RNA, global-, phospho- and acetyl-proteomics datasets for NMF, and used 4 ranks, 200 runs and 5000 max iterations. Satpathy study was conducted with CNV, RNA, global-, phospho-, and acetyl-proteomics dataset for NMF and used 5 ranks, 500 runs, and 5000 max iterations. Also, Lehtio study was conducted with only global proteomics dataset for NMF, and used 6 ranks, 100 runs, and 5000 max iterations. For Xu’s (2020) study, the up-regulated protein list for each subtype as presented was utilized as a subtype-specific features. To confirm a statistically significant relationship, an odds ratio and p value were calculated by conducting a Fisher’s exact test in R between each study cluster’s features.

Gene set enrichment analysis

To obtain characteristic analysis information corresponding to NMF subtype based on the expression data of multi-omics of tumor samples, we performed PTM signature Enrichment Analysis (PTM-SEA103) for phosphoprotoemics dataset and R package GSVA104 for global-, phospho- and acetyl protoemic datasets.

For GSVA, pathways from hallmark, KEGG, Reactome, GO and Wikipathway databases which were downloaded from MSigDB were considered and we used only pathways which contains more than 200 genes and lower than 1000 genes in each pathway.

database: “h.all.v2023.1.Hs.symbols.gmt.txt”, “c2.cp.kegg.v2023.1.Hs.symbols.gmt.txt”, “c2.cp.reactome.v2023.1.Hs.symbols.gmt.txt”, “c2.cp.wikipathways.v2023.1.Hs.symbols.gmt.txt”, “c5.go.v2023.1.Hs.symbols.gmt.txt”

For PTM-SEA, flank amino acid sequence (+/- 7 aa) was added to PTM data as primary identifier and used. We computed normalized enrichment scores (NES) of gene sets. We used the implementation which contains PTM-SEA available on GitHub (https://github.com/broadinstitue/ssGSEA2.0) using the command interface R-script (ssgsea-cli.R) using the following parameters:

  • database:”ptm.sig.db.all.flanking.human.v1.9.0.gmt”

  • sample. norm. Type: “rank”

  • output. score. Type = “NES”

  • nperm = 100

For identifying NMF subtype feature pathway, we select Hallmark pathway gene sets (MSigDB), which summarize specific cancer-related biological states or processes for ssGSEA and PTM signatures database (PTMsigDB) containing a list of modification site-specific signatures.

Identification of differentially enriched copy number variations

Gene-level copy numbers detected by CNVkit were converted to cytoband-level by calculating the mean copy numbers in each cytoband. Then we used a linear regression model to identify differentially enriched copy number variations in each NMF Subtype. To control the effect of the pathological diagnosis classification (DX), we set DX as a covariate.

$${Cytoband}-{level; copy; numbers}={beta }_{0}+{beta }_{1} cdot left({Subtype}right)+{beta }_{2} cdot left({DX}right)+varepsilon$$

β0 and β1 indicate the intercept and the log2FC value, respectively. Cytobands with an adjusted p-value (calculated using Benjamini–Hochberg method) less than 0.05 were selected as significant CNVs.

Identification of differentially expressed genes (DEGs)

To identify differentially expressed genes between conditions, we used “DESeq” function provided by DESeq2 R packages76. Quantified transcript-level estimates from Salmon were imported to R using “tximport” and were converted to a gene-level expression matrix. With this gene-level expression matrix, DESeqDataSet was generated using “DESeqDataSetFromTximport” function. Then, differential expression analysis based on the negative binomial distribution was performed using “DESeq” function. Notably, we conducted the DEG analysis using the formula below.

$${design},=, sim {Group},+, {DX},+, {rna}.{batch}$$

In this model, Group represent the conditions under comparison. DX is the name of the pathological diagnosis classification, including LUAD or LSCC. Batch information for RNA (rna.batch) was also used as a covariate in DEG analysis to remove batch effects from DEG analysis. The adjusted p-value was calculated using the Benjamini-Hochberg method, and an adjusted p-value less than 0.05 was used as a criterion for differentially expressed genes.

Identification of differentially expressed proteins, phosphorylation, and acetylation (DEP)

We identified differentially expressed proteins, phosphorylation, and acetylation using linear regression models as below.

$$Y ,=, {beta }_{0}+{beta }_{1}, cdot ({Group})+{beta }_{2} cdot left({DX}right)+{beta }_{2} cdot left({Batch}right)+varepsilon$$

yi is a vector of dependent variables that represents the expression values of each proteomic feature. Group is a vector of independent variables that represents the state we are trying to compare. DX is the name of the pathological diagnosis classification, and Batch is a TMT channel number that represents batch information. β0 is intercept, β1 is beta coefficient for Group variables, and represents the log2FC value. The linear regression’s calculated p-value tests the null hypothesis that the coefficient is equal to zero, indicating no difference between groups. The adjusted p-value was calculated using Benjamini–Hochberg method, and an adjusted p-value less than 0.05 was used as a criterion for differentially expressed proteins, phosphorylation, and acetylation.

Survival analysis

To compare the survival probability across our five NMF subtypes, we measured overall survival (OS) length (Supplementary Data 1a), the time from the date of tumor resection surgery to the time of NSCLC-induced death, for 229 patients. With NSCLC-death, we performed a survival analysis based on Kaplan–Meier estimation. In the Kaplan–Meier estimation model, NSCLC-death was used as an indicator of the ending time of the OS length. It is labeled as a right-censored observation in the case of patients with no deaths, which causes NSCLC at the end of the study. For patients who have died but whose cause of death is not NSCLC, we can use partial information that they survived beyond a certain point, but the exact date of NSCLC-death is uncertain.

The Kaplan-Meier estimation was used to measure survival curves for each subtype. The Kaplan-Meier survival curve is defined as the probability of surviving for each length of time after tumor resection surgery while considering time at many small intervals. For each time interval, survival probability is calculated as the number of patients who survived divided by the number of patients at risk. Patients who have died, dropped out, or moved out are not counted as “patients at risk”. The total probability of survival until that time interval is a cumulative probability, calculated based on the law of multiplication of probability by multiplying all the probabilities of survival at all time intervals until a certain point105. We used the log-rank test to statistically test the null hypothesis that there is no difference between the survival curves of each subtype. For the log-rank test, the total number of observed events in each subtype, i.e., O1 and O2 was used, and the expected number of events in each group, i.e., E1 and E2, was calculated. The total number of expected events is calculated as the sum of the expected number of events at the time of each event in any of the subtype, bringing all subtypes together. The expected number of events at the time of each event is the result of multiplying the total number of patients surviving at the time of events in all subtypes by the risk of events at time105. Log-rank test statistic is calculated as below.

$${Log}-{rank; test; statistic}=frac{{left({O}_{1}-{E}_{1}right)}^{2}}{{E}_{1}}+frac{{left({O}_{2}-{E}_{2}right)}^{2}}{{E}_{2}}$$

Since there are five subtypes in total, the pooled p value is calculated and shown in Fig. 2g. We perform pairwise comparison between Subtypes to decide which Subtypes represent poor prognosis subtype of NSCLC and adjusted p-value was calculated using the Benjamini–Hochberg procedure.

To identify the molecular features that result in survival probability differences, we also performed feature-wise survival analyses comparing survival curves between group of patients with top 50% expression and bottom 50% expression for each protein and PTM features based on Kaplan-Meier estimation. The prognostic direction, which shows whether the prognosis is favorable or unfavorable, was determined by the sign of coefficient from Cox proportional-hazard (CPH) model not from Kaplan-Meier model. Since in many cases, the survival probability does not reach 0.5, it was difficult to compare the median survival time that can be used as an indicator of prognostic direction in the Kaplan–Meier model. CPH model was used for univariate analysis and categorical analysis. h(t) is hazard function determined by the univariate, high- or low expression group. h0 is called the baseline hazard. It corresponds to the value of the hazard if x1 is equal to zero. The coefficient b1 measure the impact of univariate and tells us prognostic direction by sign. x1 is categorical variable, low- or high expression group. When sign of the coefficient b1 for high expression group is positive, its feature classified as unfavorable prognostic features and vice versa.

$$hleft(tright)={h}_{0}left(tright) times exp ({b}_{1}{x}_{1})$$

For significance, a log-rank test was used, and the nominal p-value < 0.001 was used as a criterion for survival-associated features. The results of the feature-wise survival analysis are listed in Supplementary Data 2d. We selected the top 1% of features by statistical significance for survival-associated features.

Kinase activity estimation based on phosphoproteomic data

To understand the Subtype-specific kinase activities, we used decoupleR R package to infer kinase activity from the results of differentially expressed phosphorylation (DEPP) using prior knowledge network, such as databases for kinase/phosphatase and substrate interactions106. We used prior knowledge network provided by PHONEMeS107. Also, we set minimum size of regulons to 5, and provided 1000 deregulated phosphorylation (each 500 phosphorylation for up-, down-regulated). There are a variety of methods to infer kinase activities and we used multivariate linear model (MLM)-based method as below:

$$Y={{beta }}{{X}}+{{rm{psi }}}$$

where the dependent variable Y represents the t-statistic measurements of phosphorylation from the results of the differentially expressed phosphorylation analysis, and the independent variable X is the connectivity matrix representing the associations with kinases. Xij equals 1 when phosphosite i is a known substrate of the kinase j, 0 otherwise. Ψ represents the normally distributed error of the fit and β represents the scores of the kinase activity108. In decoupleR packages in R, we used “run_mlm” function to estimate kinase activity. P-value was obtained from a multivariate linear model and the adjusted p-value was calculated using the Benjamini–Hochberg adjustment.

Single-cell specific subtype distribution

To assess the tumor microenvironment in five NSCLC subtypes by comparing subtype-specific genes with a diverse set of cell type-specific genes, using the dataset ‘local_extended.rds’ obtained from https://luca.icbi.at. Subtype-specific gene sets were generated based on the results of differential gene expression analysis (DEGs), including only genes with significantly increased fold change and a significant adjusted p-value (<0.05).

For the tumor versus tumor comparison, we considered only the cell type-specific genes originating from ‘tumor_primary’ cells. Conversely, for the tumor versus normal adjacent tissue (NAT) comparison, we used a subset of cell type-specific genes from both ‘tumor_primary’ and ‘normal_adjacent’ cells.

Clustering analysis for immune microenvironment

We used transcriptome data of NSCLC patients to define the subtypes of tumor immune microenvironment. Two clustering approaches were used, which were based on the enrichment score of cell types41 and immune-related pathways42. The cell type-based clustering was performed with 205 tumors and 85 normal samples. The enrichment scores of 64 cell types were inferred by Xcell41, which performed gene set enrichment analysis based on the curated gene signatures for each cell type. The inferred cell enrichment scores were then normalized as z-score and consensus clustering was performed using R package CancerSubtypes109. After the consensus clustering, the partitioning around medoids (PAM) algorithm was used to divide the three clusters. We first assigned a NAT-enriched cluster among the three clusters when it was mainly matched to normal samples. After that, the immune score calculated by Xcell was used to determine the HTE and CTE immune clusters, so that the cluster with the higher immune score was defined as having an HTE tumor. We also performed pathway-based clustering with 205 tumor samples. For that, GSVA was performed across the patients based on seven curated immune-related pathways. The enrichment scores of the pathways were normalized as z-scores, and k-means clustering was performed based on the z-scores to identify two immune clusters. The cluster having a higher pathway enrichment score was defined as HTE tumor.

Survival and regression analysis with immune cluster and cell types

The enrichment score of cell type and the status of the immune cluster across the patients were tested for their correlation with a set of clinical and molecular features. The pattern of TILs was examined by regression analysis using MASS package and GLM package of R. Overall survival and relapse-free survival were tested for their correlations with cell types and immune clusters by Cox proportional hazards regression analysis using CoxPHFitter package of Python. All analyses were performed with clinical histology and sample batches as covariates.

Identifying putative regulators associated with immune landscape

To identify putative regulators associated with immune landscape, we first inferred protein activity using transcriptome data of the patients. For the protein activity, we obtained regulon networks of lung cancer from ARACNe package110 of R and calculated protein activity of each patient using viper tool44 that inferred how each protein regulates its target genes based the regulon network. The positive value of protein activity indicates that the protein positively regulates its target genes to be overexpressed, and the negative value indicates vice versa. We then calculated the correlation between the protein activity of immunomodulators and the enrichment score of cells, or the status of the immune cluster. The RNA and protein expression of the same immunomodulators was also analyzed for their associations with the immune landscapes. To corroborate the putative regulation of the immune landscape, we further analyzed whether a specific driver mutation was involved in the association between immunomodulators and immune cells or immune clusters. The curated driver mutations were derived from Oncovar111, DriverDBv3112, Intogen113, and mutation catalogue from Martínez-Jiménez et al.114.

Immune landscape analysis with the integrated cohort

To test the reproducibility of our analysis about the immune landscape across NSCLC patients, we used two independent NSCLC cohorts of Satpathy et al.13 and Gillette et al.10. The z-score normalization was performed for the enrichment scores of 64 cell types in each cohort that had 202 (Satpathy et al.), 211 (Gillette et al.), and 290 (our cohort) patients. The normalized enrichment scores of cell types were then integrated into a single cohort and performed consensus clustering. PAM algorithm was used to make three clusters and they were mapped to NAT, HTE, and CTE, respectively, according to the method used in our original cohort.

Neoantigen and cryptic peptide prediction

We predicted different types of neoantigens and cryptic peptides according to their origin and validation (Supplementary Fig. 6a). Firstly, canonical neoantigens were predicted from the mutated peptides that had a length between 9mer and 12mer amino acids including somatic mutations. The binding affinity of the mutated peptides was predicted using MHCflurry115 and NetMHCpan116 for the patient HLAs that were identified by OptiType117. Only the mutated peptides to be predicted as binding to the patient HLAs were defined as the potential neoantigen. Among the neoantigen candidates, we further defined “confirmed neoantigens” when they had evidence of MS experiments (see the section “Identification of variant and modified peptides”). The remaining two types were defined by non-canonical peptides that originated from unconventional translation of pseudogenes, lncRNAs, UTRs, or novel isoform transcripts (Refer to section “Identification of novel peptides”). We identified novel isoform transcripts that were expressed only in tumor samples when they matched normal samples. The mean expression of normal samples was used when the tumor sample did not match normal. We treated a degree of FPKM below 1 as indicating no expression. Only the novel peptides that occurred in more than two tumor samples and not in at least one normal sample were predicted as “cryptic peptides” using the same methodology for neoantigen candidates in the prediction of binding to HLAs. We further determined “confirmed cryptic peptide” when the cryptic peptides also showed substantial expression (FPKM > = 1) in the tumor samples as described in the previous study118.

Analysis of multi-omics subtype distribution according to the cryptic peptide load and immune landscape status

We analyzed the distribution of the patients of each multi-omics subtype across the status of cryptic peptide load and immune landscape. Two features were chosen as the surrogates of the immune landscape, which were the status of the immune cluster (HTE/CTE) and the antigen processing and presentation machinery (APM). The patients were first separated into two groups that had a high and low number of cryptic peptides. They were also divided into two other groups: HTE and CTE, or high APM and low APM. The status of APM was determined based on the pathway enrichment score of “Antigen_Processing_and_Presentation” in KEGG pathway119. We created four categories based on the status of cryptic peptide load and immune landscape, and the degree of over-representation for each multi-omics subtype was calculated by a two-sided Fisher’s exact test.

Patient-derived lung tumoroid culture

Patient-derived lung tumoroid culture and drug screening were performed by SG Medical, Inc. (Seoul, Korea) and conducted as previously described120. Briefly, tissues were kept in a cold Hank’s balanced salt solution (HBSS) with antibiotics (Gibco, OK, USA) and on ice after dissection. Samples were washed three times with cold HBSS with antibiotics and sectioned with sterile blades. Sectioned samples were incubated with 0.001% DNase (Sigma-Aldrich, MO, USA), 1 mg/ml collagenase (Roche, IN, USA), 200 U/ml penicillin, and 200 mg/ml streptomycin in DMEM/F12 medium (Gibco, OK, USA) at 37 °C for 1 h with intermittent agitation. After incubation, the suspensions were repeatedly triturated by pipetting and passed through 40-μm cell strainers (BD Falcon, CA, USA). The strained cells were centrifuged at 112 × g for 3 min, and the pellet was resuspended in lung tumoroid culture media (DMEM/F12 supplemented with 20 ng/ml of bFGF (Invitrogen, CA, USA), 50 ng/ml human EGF (Invitrogen), N2 (Invitrogen), B27 (Invitrogen), 10 μM ROCK inhibitor (Enzo Life Sciences, NY, USA), and 1% penicillin/streptomycin (Gibco, OK, USA). The suspension was mixed with matrigel and seeded onto 6-well plates. Culture media were replaced every 4 days.

Drug screening

Lung tumoroids cultured were harvested and dissociated using TrypLE Express (Gibco, OK, USA). The dissociated lung tumoroids were diluted in a lung tumoroid culture media-matrigel mixture and seeded onto 384-well plates (250 cells per well). After lung tumoroid generation, 8 concentrations of Selinexor (Selleckchem, TX, USA) and vehicles (DMSO) as a negative control were added in triplicate. After 6 days, quantification of cell viability was done by adding 10 μl of CellTiter-Glo 3D (Promega) to each well according to the manufacturer’s instructions on a Varioskan LUX Multimode Microplate Reader (Thermo Fisher Scientific, MA, USA). The determination of IC50 values was conducted using GraphPad Prism.

CPTAC NSCLC data download and preprocessing

Clinical data, MAF file, and gene-level CNV data are downloaded from the supplementary tables provided by previous studies10,13. Segment-level CNV data is downloaded from the GDC data portal (https://portal.gdc.cancer.gov), and global proteome, phosphoproteome, acetylproteome, and clinical data, including survival information, were downloaded from LinkedOmics (http://www.linkedomics.org). For global, phospho-, and acetyl-proteomics data, we performed the imputation in the same way as in our cohort. Briefly, features with more than 30% missing values across all tumors were discarded, and k-NN imputation using k = 5 was performed. We conducted all downstream analyses using these data in the same manner as we did with our cohort.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.