Computerized tumor multinucleation index (MuNI) is prognostic in p16+ oropharyngeal carcinoma

Patients with p16+ oropharyngeal squamous cell carcinoma (OPSCC) are potentially cured with definitive treatment. However, there are currently no reliable biomarkers of treatment failure for p16+ OPSCC. Pathologist-based visual assessment of tumor cell multinucleation (MN) has been shown to be independently prognostic of disease-free survival (DFS) in p16+ OPSCC. However, its quantification is time intensive, subjective, and at risk of interobserver variability.

validation cohorts and very small sample sizes. In addition, these strategies rely on tissue-destructive sampling or require expensive methods for gene expression profiling. To date, however, no single signature has been validated as a practical predictor of DFS, OS, or distant metastasis-free survival (DMFS) in multiple cohorts, a critical step toward translating the signature into a clinical test. Reliable and sensitive prognostic tools for p16 + OPSCC could support clinical decision making. For instance, in high-risk patients, a safe de-escalation of treatment may not be possible, whereas, conversely, very low-risk patients could benefit from de-escalation and thus avoid adverse effects related to unnecessary intensive therapy (5).
Traditionally, tissue morphology and architecture within the tumor microenvironment (TME) have been shown to be reflective of tumor characteristics and to carry rich prognostic and predictive information across myriad histologic types (15)(16)(17). For head and neck squamous cell carcinoma (HNSCC), Hartman et al. reported that high numbers of tumor-infiltrating CD8 + T cells in the TME are associated with oropharyngeal localization and limited tumor growth and that the patients with high infiltration of CD8 + T cells also have significantly better outcomes (18). Lewis et al. previously visually identified anaplasia and multinucleation (MN) in the TME as novel prognostic and independently associated features in patients with p16 + OPSCC (Figure 1 and ref. 19). However, the recognition and quantification of morphologic features is time intensive and requires human interpretation, leading to interobserver variability and bias.
In this work, we present a computerized metric called the multinucleation index (MuNI), a quantification of MN density in epithelial (EP) regions, to risk-stratify patients with p16 + OPSCC for DFS, OS, and DMFS. Two machine-learning networks, specifically, conditional generative adversarial networks (cGANs) (20), were used for the MuNI calculations: (a) GAN MN to segment MN events and (b) GAN EP to segment cancer nuclei in EP regions in digitized H&E-stained whole-slide images (WSIs). By calculating the ratio of the MN events to the total number of EP nuclei identified on the slide image, we calculated a MuNI for every slide and every patient with p16 + OPSCC ( Figure 2). A large cohort of 1094 previously untreated patients with p16 + OPSCC obtained from 6 institutions was used to validate the prognostic ability of the MuNI. We performed univariate and multivariable analyses using DFS, OS, and DMFS as the clinical endpoints. The prognostic ability of the MuNI to predict the same endpoints was also evaluated within the individual stage I, II, and III groups, as defined by the AJCC's 8th edition.

Results
Patient demographics. Details on patient demographics for all cohorts from the individual sites are provided in Supplemental Table 1 (supplemental material available online with this article;

Introduction
The continued increase in the incidence of oropharyngeal squamous cell carcinoma (OPSCC) in the setting of declining rates of tobacco use has been attributed to HPV, with almost 70% of all OPSCCs being HPV + (1-3). HPV-related OPSCC has now overtaken cervical cancer as the most common HPV-related malignancy in the US (1,2). HPV positivity has been well demonstrated to confer favorable survival for patients with OPSCC (4,5). This has led to separate tumor, node, metastasis (TNM) staging systems for HPV-related (p16 + ) and HPV-unrelated OPSCC patients in the new 8th edition of the American Joint Commission on Cancer/ Union for International Cancer Control (AJCC/UICC) guidelines (3), with p16 IHC currently recognized as a suitable surrogate marker for HPV in these patients (4)(5)(6).
Clinically, tumor and lymph node (T/N) status along with smoking have previously been shown to influence the risk of recurrence and/or death for patients with p16 + OPSCC (4,5). However, these clinical parameters fail to capture intrinsic biological characteristics of p16 + OPSCC that may be relevant to treatment sensitivity and thus a clinical response. Machczynski et al. nicely discuss the limitations of even the new 8th edition of the AJCC guidelines regarding staging for p16 + OPSCC (7). Wuerdemann et al. found that the 8th edition's staging system did not discriminate well between patients with HPV + stage II OPSCC and those with HPV + stage III OPSCC (8). Similarly, no significant differences in OS have been found between the 8th edition's guidelines for (9) stage I versus stage II and (10) stage II versus stage III OPSCC.
Currently, the biomarker landscape for HPV-related OPSCC has largely focused on finding single-or multigene prognostic signatures. Verma et al. showed that the lack of STAT3 expression along with the increased expression of AP1 and NF-κB were associated with a better prognosis in p16 + OPSCC (11). Similarly, Balermpas et al. demonstrated that the presence of CD8 + and FOXP3 + T cells was prognostic of overall survival (OS) and disease-free survival (DFS) in p16 + OPSCC (n = 130; ref. 12). Consequently, some investigators have used genomic, epigenetic, and gene expression data to generate complex signatures of aggressive p16 + OPSCC biology (6,13,14). Unfortunately, the potential for clinical adoption of these studies is limited because of the lack of independent   The same deep-learning architecture, cGAN, was used to build the segregators. Each block in the cGAN models consisted of convolution, BatchNorm, and ReLU layers. The MuNI calculation phase started with the extraction of tiles from tissue regions. The tiles were then input into the MN and EP models separately. Finally, the MuNI was calculated automatically, which was the ratio of MNs within EP regions to EP cells. stantial toxicity and morbidity from these therapies. Consequently, there is a clear need to develop a quantifiable and reproducible biomarker to stratify high-and low-risk patients with p16 + OPSCC (23). Low-risk patients might then potentially benefit from therapy de-intensification, whereas high-risk patients would continue standard or intensified management.
Lewis et al. previously identified anaplasia and MN as novel prognostic features in patients with p16 + OPSCC (19). These were strongly and independently associated with disease recurrence and death from the disease and also correlated with DFS in a cohort of surgically treated patients with OPSCC (n = 149). However, identification of the above-mentioned morphologic features is pathologist dependent, and, although no specific study in the literature documents it, implies subjectivity and potential bias (24). We found specific examples difficult to discern and quantify, such as overlapping nuclei from separate cells that were not truly multinucleated and large, anaplastic, irregular nuclei that were also not truly multinucleated. Additionally, the study was performed at a single institution in a cohort of only surgically treated patients, for whom all slides of resected tumor, including lymph node metastases, were reviewed. Interestingly, in this study, the pathologist's quantification of MNs on the single H&E-stained slides for 478 patients with p16 + OPSCC was not found to be prognostic for the other institutions' cohorts (Supplemental Method 3), probably because of undersampling of the phenomenon on just a representative tumor slide.
The computerized MuNI presented in this work focused on addressing the issues related to tumor sampling and to subjectivity and inter-reader variability in MN interpretation. More critically, though, the MuNI is an independent prognostic marker of major clinical outcomes, OS, DFS, and DMFS. We validated the MuNI in a set of 1094 patients from 6 different institutions and found it to be strongly associated with DFS, OS, and DMFS. We identified a strong association between the predictions of the MuNI and OS, so was trained on S TR using age, sex, smoking status, TNM stage, and the MuNI to predict OS. We calculated the risk score for each patient for risk stratification. The median of the risk scores in the S TR group was defined as the cutoff to dichotomize low-and highrisk patients, and the same cutoff was used in the S VA group. The HR value for predicting OS on univariate analysis was 2.42 (95% CI: 1.86-3.15, P < 0.001) for S VA ( Figure 5). We conducted the same experiments using the clinical variables only, without the MuNI, and the corresponding HR was 1.43 (CI: 1.09-1.87, P < 0.01).
Experiment 4: evaluating the resilience of the MuNI against batch effects to account for site-specific preanalytic variations. We performed qualitative analysis of the MuNI resilience against batch effects. As a baseline, for each WSI coming from any of the 6 sites, image metrics related to image brightness and contrast were calculated using HistoQC. The features were then embedded into 2D space for visualization using the t-distributed stochastic neighbor embedding (t-SNE) algorithm ( Figure 6A). Likewise, we performed t-SNE mapping to calculate MuNI-specific statistics ( Figure 6, B and C). We assessed a total of 8 different statistics (MuNI of the entire WSI, mean, median, SD, minimum, maximum, and 33rd-66th percentiles) for the MuNI from across every tile of the WSI.

Discussion
Given the improved treatment response of patients with p16 + OPSCC, concerted efforts have been directed toward developing precision oncology approaches that include targeted de-intensification of radiation and chemotherapy doses and regimens (21,22). However, the unpredictable clinical behavior of p16 + OPSCC results in a significant risk that some patients will be over-or undertreated. The majority of patients with p16 + OPSCC are cured with current treatments, which include primary radiation, primary chemoradiation, or primary surgery with or without adjuvant radiation or chemoradiation. However, these patients experience sub- to exclude those who might have a high chance of treatment failure resulting in recurrence (25). Although multiple clinical trials are currently exploring therapeutic de-intensification strategies, they are limited by a dependence on clinical parameters to identify appropriate patients at low risk of disease recurrence (12). The identification of biologically meaningful markers of a good prog-DFS, as well as DMFS among the AJCC 8th edition's defined stage I and stage III patients in both univariate and multivariable analyses. If confirmed in a prospective clinical trial setting, we believe this finding could have major implications for clinical practice. Patients with stage I disease, currently the target of de-escalation treatment strategies, could be further stratified using the MuNI  nosis is of critical importance. Similarly, patients with stage III disease who are further categorized as high risk by the MuNI may merit the maintenance of treatment intensity by incorporating surgical resection, consistently utilizing concurrent chemotherapy, or intensifying chemoradiotherapy. Taken together, this would represent a novel, viable precision oncology approach to treating patients with p16 + OPSCC in the modern era.
In spite of the differences in clinical and pathological data between the sites (Supplemental Table 1), the MuNI was prognostic across the different sites using a single threshold cutoff learned from a single site, although with modest HRs for death. In Figure 6, the t-SNE of low-level image features such as color and texture extracted via HistoQC shows that each site clustered separately, indicating a large batch effect. On the other hand, the t-SNE using MuNI-specific statistics showed that slides from different sites were interspersed with one another, reflecting the resilience of the MuNI against site-specific preanalytic variations and batch effects. Additionally, as illustrated in Figure 6C, the MuNI was also able to enrich for patients who would develop tumor progression (progressors) versus those who would not (nonprogressors). MuNIs for the progressor group were also found to be statistically and significantly larger than those for the nonprogressor group (Supplemental Figure 8).
Quantitative histomorphometric (QH) approaches for the prognostication of disease outcomes have been previously proposed for many cancers. These approaches fall into 2 major categories: hand-crafted (or domain-inspired) and deep-learning-or neural network-based approaches. We have previously introduced 2 hand-crafted-based approaches, OHbIC (26) and QuHbIC (27), to stratify the risk of patients with head and neck carcinomas using H&E-stained tumor microarrays (TMAs). The first study showed the independent prognostic value of OHbIC, which utilizes nuclear shape and texture features for predicting disease-specific survival, in a cohort (n = 115) of patients with oral cavity squamous cell carcinoma (SCC). The latter showed that QuHbIC could predict the risk of recurrence in a cohort (n = 160) of patients with p16 + OPSCC by quantizing the spatial distribution of cell clusters. A second class of approaches of neural network-based deep-learning classifiers have become popular for cancer detection (28), diagnosis (29,30), and prognosis (31). Bulten et al. presented a grading system for prostate biopsies using deep-learning models and evaluated its performance in a set of 550 biopsies (29). Skrede et al. trained multiple deep-learning models at different magnifications and fused their output to predict the prognosis for colorectal cancer (n = 2042) (31). These approaches utilize deep networks to learn best representations for predicting prognosis categories of interest without requiring a pathologist's input. However, because of the multilayered, nonlinear structure of deep-learning models, they are considered black boxes, and their output is not interpretable by pathologists or translatable into any directly visual form. Interestingly, unlike these models, the method presented here utilizes the power of deep learning with the interpretability of hand-crafted (i.e., visually identified) features. In other words, deep learning was used not to make a direct prognostic prediction but rather to quantify MNs in WSIs and derive a prognostic metric based on the number of MNs identified on the WSIs. As demonstrated in experiment 4, the hybrid approach to identify a computational pathology-based biomarker was found to be resilient against batch effects.
Our study has some limitations. The cohorts from different institutions were found to have significant differences in their MuNIs, as well as in certain clinical and pathologic parameters (Supplemental Table 1 and Supplemental Figure 9). Nonetheless, we found that the MuNI was prognostic for the entire set of validation cohorts in both univariate and multivariable analyses, although it was not consistently prognostic for each of the separate cohorts. A possible explanation could be related to MN segmentation performance resulting in variants in the generated MuNIs across the different cohorts. Further evaluation of the sensitivity of the MuNI segmentation across sites is necessary. The MuNI was most strongly prognostic of outcomes in patients with stage III disease. This could be related to the number of patients within each stage. Since stage I and II patients had much better survival outcomes, irrespective of MN, it was more challenging to identify a difference between the high-and low-risk patients in these groups. A modest difference between the high-and low-risk groups could be observed among patients with stage I disease, since there were 471 patients, whereas in the group with stage II disease, which included 245 patients, that difference was not apparent. In the group of patients with stage III disease, whose overall prognosis was much worse, we could detect the difference, even though there were only 169 patients. Finally, this study was based on retrospectively collected data. Analyses of slides from completed multi-institutional, prospective clinical trials, or better yet, a prospective clinical trial with the MuNI embedded within it, are required to validate the findings, minimize the potential for bias, and determine whether the MuNI can specifically predict a patient's response to treatment.
In conclusion, the MuNI is a tissue-nondestructive, reproducible, rapid, and cost-efficient artificial intelligence-enabled (AI-enabled) biomarker with the potential to risk-stratify patients with p16 + OPSCC. The MuNI only relies on the quantitative measurement of MN tumor cells in digitized H&E-stained tissue from primary tumors, without the need for visual or manual segmentation of tumor versus nontumor regions. These specimens are already consistently obtained from patients with OPSCC in routine practice. This makes the MuNI potentially widely introducible into clinical practice at US institutions and useful in low-and middle-income countries, where the costs associated with genomics-based tests make them difficult to adopt and implement. Given that the MuNI is tissue nondestructive, further validation in retrospective clinical trials or prospective validation could make it a useful biomarker to guide the treatment of p16 + OPSCC.

Methods
Patient selection H&E-stained slides from oropharyngeal primary tumors were reviewed by pathologists at the respective institutions to confirm the sent to Case Western Reserve University, where they underwent WSI scanning at ×40 resolution (0.25 μm/pixel resolution) using a Ventana iScan HT slide scanner. WSI quality checking was performed using HistoQC (33), an automatic, rapid, and quantifiable quality control tool for computational pathology that excluded an additional 61 patients' specimens because of poor image quality. Finally, 1094 specimens remained for the analysis. The patients were divided into training and validation sets, S TR and S VA , respectively. D 1 comprised S TR (n = 171), which was used to build the segmentation models and to define the cutoff threshold for risk stratification. S VA (n = 923) was used for independent validation of the prognostic ability of the MuNI (D 2 = 106, D 3 = 121, D 4 = 97, D 5 = 322, D 6 = 277).

Tissue morphology analysis
The first step in calculating the MuNI was to automatically segment the WSI into EP and stromal regions, since MN is normalized by the total number of cancer cells identified within the EP. The segmentation was performed by means of a cGAN (20). The EP segmentation model (GAN EP ) was built and evaluated using a set diagnosis of OPSCC. Immunochemistry for p16 had been performed at the respective institutions in routine clinical practice, and only those tumor specimens that were classified as p16 + by nationally accepted pathologic standards (extensive nuclear and cytoplasmic staining present in 70% or more of the tumor specimen with at least moderate to strong intensity) were included (32). H&E-stained glass slides from the primary tumors of each patient were re-reviewed by the collaborating pathologists for the selection of the most representative tumor slide. If the patient was treated with primary surgery, this slide was from the resection specimen or, if treated with primary (chemo)radiation, then the best representative biopsy slides were selected.

Data set preparation
Data on a total of 1485 patients with OPSCC were collected from the 6 institutions. For simplicity, each institution was labeled D i , where i corresponds to the index of the site of origin. After reviewing the clinical data and p16 status, 391 of the patients were excluded from the study, 330 of whom were excluded because of their negative or equivocal p16 status or missing follow-up data (Figure 7). Slides were to identify and eliminate false-positive MN exemplars, given that MNs are typically larger compared with EP cells and lymphocytes. A detailed description of the network architecture and validation of GAN MN is provided in the supplemental material (Supplemental Method 1).

MuNI
Utilizing GAN EP and GAN MN , EP and MN masks were extracted for each WSI. For a WSI, m was used to denote the number of tiles extracted from the WSI and and corresponded to the number of detected MNs and EP cells in tile i extracted from the WSI, respectively. The normalized MuNI for the WSI was then defined as the ratio of total MNs to EP cells.

Equation 1
Additionally, different variants of the MuNI were also analyzed in terms of their prognostic ability. Further details are provided in the supplemental material (Supplemental Method 2).
Learning cutoff for the MuNI for the stratification of patients with high-or low-risk p16 + OPSCC. The MuNI is a continuous variable, necessitating a cutoff to stratify patients into low-and high-risk categories. A cutoff was determined as the mean value of MuNIs within the modeling set S TR and then applied to S VA to obtain dichotomized MuNIs. of 6 cases from S TR . A total of 153 image patches, each corresponding to 512 × 512 pixels, were cropped at ×10 magnification and then annotated by a pathologist. Of these, 102 were used for training GAN EP . Its performance was then evaluated quantitatively on the remaining 51 images and yielded a pixel-level F1 score of 0.88.

Automated detection and segmentation of MN
MNs were segmented using another cGAN model (GAN MN ) that involved 2 steps. Training of the GAN MN for nuclei segmentation was done using 30 images from a public data set corresponding to multiple organs (34). Patches of 256 × 256 pixels were extracted from these images at ×40 magnification and fed into the model during training. Segmentation performance was quantitatively and qualitatively verified to be suitable for MN segmentation. For independent validation of GAN MN , another publicly available data set corresponding to patients with triple-negative breast cancer was used (35). The pixel-level F1 score of GAN MN was 0.93 for this public data set.
In the second phase, GAN MN was trained to detect any cells, independent of the cell type, and was subsequently fine-tuned for the differentiation of MNs from other cell types such as EP cells and lymphocytes. MNs were annotated by a collaborating pathologist using 12 WSIs from S TR , which resulted in 1002 annotations. Nine WSIs with a total of 668 MNs were used for model training and the remaining 334 MNs for validation. The MN segmentation model yielded a pixel-level F1 score of 0.76 for the validation images. An empirically defined size threshold was used