


Pan-cancer integrative histology-genomic analysis via multimodal deep learning









The rapidly emerging field of computational pathology has demonstrated promise in developing objective prognostic models from histology images. However, most prognostic models are either based on histology or genomics alone and do not address how these data sources can be integrated to develop joint image-omic prognostic models. Additionally, identifying explainable morphological and molecular descriptors from these models that govern such prognosis is of interest. We use multimodal deep learning to jointly examine pathology whole-slide images and molecular profile data from 14 cancer types. Our weakly supervised, multimodal deep learning algorithm is able to fuse these heterogeneous modalities to predict outcomes and discover prognostic features that correlate with poor and favorable outcomes. We present all analyses for morphological and molecular correlates of patient prognosis across the 14 cancer types at both a disease and a patient level in an interactive open-access database to allow for further exploration, biomarker discovery, and feature assessment.




Deep-learning-based multimodal integration

To address the challenges in developing joint image-omic biomarkers that can be used for cancer prognosis, we propose a deep-learning-based multimodal fusion (MMF) algorithm that uses both H&E whole-slide images (WSIs) and molecular profile features (mutation status, copy-number variation, RNA sequencing [RNA-seq] expression) to measure and explain the relative risk of cancer death (Figure 1A). Our multimodal network is capable not only of integrating these two modalities in weakly supervised learning tasks such as survival-outcome prediction but also of explaining how histopathology features, molecular features, and their interactions correlate with low- and high-risk patients (Figures 1B–1E). After risk assessment within a patient cohort, our network uses both attention- and attribution-based interpretability as an untargeted approach for estimating prognostic markers across all patients (Figures 1B–1F). Our study uses 6,592 gigapixel WSIs from 5,720 patient samples across 14 cancer types from the TCGA (Table S1). For each cancer type, we trained our multimodal model in a 5-fold cross-validation using our weakly supervised paradigm and conducted ablation analyses comparing the performance of unimodal and multimodal prognostic models. Following training and model evaluation, we conducted extensive analyses on the interpretability of our networks, investigating local- and global-level image-omic explanations for each cancer type, quantifying the tissue microarchitecture corresponding to relevant morphology, and investigating shifts in feature importance when comparing unimodal interpretability with multimodal interpretability.
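The 5-fold cross-validation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `k_fold_indices`, the shuffling seed, and the round-robin fold assignment are assumptions, and the cohort size of 345 is borrowed from the KIRC cohort in the Figure 3 caption purely as an example.

```python
import random


def k_fold_indices(n_patients, k=5, seed=0):
    """Return k (train, validation) index splits over n_patients patients.

    Hypothetical helper: the source only states that a 5-fold
    cross-validation was used, not how the folds were drawn.
    """
    idx = list(range(n_patients))
    random.Random(seed).shuffle(idx)          # reproducible shuffle
    folds = [idx[i::k] for i in range(k)]     # round-robin fold assignment
    splits = []
    for i in range(k):
        val = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        splits.append((train, val))
    return splits


# Example cohort size (KIRC, n = 345, taken from the Figure 3 caption).
splits = k_fold_indices(345)
for fold_id, (train, val) in enumerate(splits):
    print(fold_id, len(train), len(val))
```

Each patient appears in exactly one validation fold, so risk predictions from the five validation folds can be aggregated into one per-patient prediction for the whole cohort.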





Figure 1. Pathology-Omic Research Platform for Integrative Survival Estimation (PORPOISE) workflow
(A) Patient data in the form of digitized high-resolution formalin-fixed paraffin-embedded (FFPE) H&E histology glass slides (known as WSIs) with corresponding molecular data are used as input to our algorithm. Our multimodal algorithm combines three neural network modules: (1) an attention-based multiple-instance learning (AMIL) network for processing WSIs, (2) a self-normalizing network (SNN) for processing molecular data features, and (3) a multimodal fusion layer that computes the Kronecker product to model pairwise feature interactions between histology and molecular features.
(B) For WSIs, per-patient local explanations are visualized as high-resolution attention heatmaps using attention-based interpretability, in which high-attention regions (red) in the heatmap correspond to morphological features that contribute to the model's predicted risk score.
(C) Global morphological patterns are extracted via cell quantification of high-attention regions in low- and high-risk patient cohorts.
(D) For molecular features, per-patient local explanations are visualized using attribution-based interpretability with integrated gradients.
(E) Global interpretability for molecular features is performed by analyzing the directionality, feature value, and magnitude of gene attributions across all patients.
(F) Kaplan-Meier analysis is performed to visualize patient stratification of low- and high-risk patients for individual cancer types.
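To make the attention pooling in (A)(1) and the Kronecker-product fusion in (A)(3) concrete, here is a small sketch in plain Python. It is an illustration under stated assumptions rather than the PORPOISE implementation: the function names, the toy embeddings, and the attention scores are hypothetical (in a real AMIL network the scores come from a learned layer), and appending a constant 1 to each embedding before fusion is an assumption made so that the fused vector also keeps the unimodal terms.

```python
import math


def attention_pool(patch_feats, scores):
    """Softmax the per-patch attention scores and return the weighted sum.

    patch_feats: list of equal-length feature vectors, one per image patch.
    scores: one raw attention score per patch (hypothetical inputs here).
    """
    m = max(scores)                               # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(patch_feats[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, patch_feats))
              for d in range(dim)]
    return pooled, weights


def kronecker_fusion(h, g):
    """Flattened Kronecker (outer) product of two embeddings.

    Assumption: a constant 1 is appended to each vector so the result
    contains pairwise products h_i * g_j as well as h_i and g_j alone.
    """
    h1 = list(h) + [1.0]
    g1 = list(g) + [1.0]
    return [hi * gj for hi in h1 for gj in g1]


# Toy example: two patches with 2-d features and equal attention scores.
patches = [[1.0, 0.0], [0.0, 1.0]]
h_wsi, attn = attention_pool(patches, scores=[0.0, 0.0])
g_omic = [2.0, 0.0, 1.0]                          # hypothetical omics embedding
fused = kronecker_fusion(h_wsi, g_omic)
print(attn, h_wsi, len(fused))
```

The attention weights double as the per-patch values that a heatmap such as the one in (B) would display, and the fused vector has (2 + 1) × (3 + 1) = 12 entries, one for every pairwise interaction plus the unimodal terms.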








Figure 2. Model performance of PORPOISE and the impact of multimodal training
(A) Kaplan-Meier analysis of patient stratification of low- and high-risk patients via MMF across all 14 cancer types. Low and high risk are defined by the median (50th percentile) of hazard predictions via MMF. A log-rank test was used to test for statistical significance in survival distributions between low- and high-risk patients (*p < 0.05).
(B) c-Index performance of SNN, AMIL, and MMF in each cancer type in a 5-fold cross-validation (n = 5,720). The horizontal line for each model shows average c-Index performance across all cancer types. Boxplots correspond to c-Indices of 1,000 bootstrap replicates on the aggregated risk predictions.
(C) Distribution of WSI attribution across 14 cancer types. Each dot represents the proportion of feature attribution given to the WSI modality input compared with the molecular feature input. Attributions were computed on the aggregated risk predictions in each disease model. Boxes indicate quartile values and whiskers extend to data points within 1.5× the interquartile range.
See also Figures S1–S3, S11, S12 and Tables S1, S2, and S3.
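The two quantities in this caption, the median split in (A) and the c-Index in (B), can be illustrated with a short sketch. The code below is a generic textbook formulation (Harrell's concordance index), not code from the paper; the survival times, event flags, and risk scores are made-up values, and assigning patients exactly at the median to the low-risk group is an assumption.

```python
from statistics import median


def concordance_index(times, events, risks):
    """Harrell's c-index: fraction of comparable pairs ordered correctly.

    A pair (i, j) is comparable when patient i had an observed event and
    a shorter time than patient j. It is concordant when i also has the
    higher predicted risk; ties in risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


def median_split(risks):
    """Label each patient high or low risk relative to the cohort median."""
    cut = median(risks)
    return ["high" if r > cut else "low" for r in risks]


# Made-up cohort: months of follow-up, event flag (1 = death), predicted risk.
times = [5, 10, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]

c_index = concordance_index(times, events, risks)
groups = median_split(risks)
print(c_index, groups)
```

In this toy cohort five pairs are comparable and four of them are ordered correctly, giving a c-Index of 0.8. The bootstrap boxplots in (B) would repeat this computation on resampled patients.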




Figure 3. Quantitative performance, local model explanation, and global interpretability analyses of PORPOISE on clear cell renal cell carcinoma (KIRC)
(A) For KIRC (n = 345), high attention for low-risk cases (top, n = 80) tends to focus on classic clear cell morphology, while in high-risk cases (bottom, n = 80), high attention often corresponds to areas with decreased cytoplasm or an increased nuclear-to-cytoplasmic ratio.
(B) Local gene attributions for the corresponding low- (top) and high-risk (bottom) cases.
(C) Kaplan-Meier curves for omics only (left, "SNN"), histology only (center, "AMIL"), and multimodal fusion (right, "MMF"), showing improved separation using MMF. A log-rank test was used to test for statistical significance in survival distributions between low- and high-risk patients (marked with * if p-value < 0.05).
(D) Global gene attributions across patient cohorts according to unimodal interpretability (left, "SNN") and multimodal interpretability (right, "MMF"). SNN and MMF were both able to identify immune-related and prognostic markers such as CDKN2C and VHL in KIRC. MMF additionally attributes to other immune-related/prognostic genes such as RUNX1 and NFIB in KIRC.
(E) Exemplar high-attention patches from low- (top) and high-risk (bottom) cases with corresponding cell labels.
(F) Quantification of cell types in high-attention patches for each disease overall, showing increased tumor and TIL presence. Boxes indicate quartile values and whiskers extend to data points within 1.5× the interquartile range.
See also Figures S2–S11 and Table S4.






Figure 4. Quantitative performance, local model explanation, and global interpretability analyses of PORPOISE in papillary renal cell carcinoma (KIRP)
(A) For KIRP (n = 253), low-risk cases (top, n = 36) often have high attention paid to complex and curving papillary architecture, while for high-risk cases (bottom, n = 63), high attention is paid to denser areas of tumor cells.
(B) Local gene attributions for the corresponding low- (top) and high-risk (bottom) cases.
(C) Kaplan-Meier curves for omics only (left, "SNN"), histology only (center, "AMIL"), and multimodal fusion (right, "MMF"), showing improved separation using MMF. A log-rank test was used to test for statistical significance in survival distributions between low- and high-risk patients (marked with * if p-value < 0.05).
(D) Global gene attributions across patient cohorts according to unimodal interpretability (left, "SNN") and multimodal interpretability (right, "MMF"). SNN and MMF were both able to identify prognostic markers such as BAP1 in KIRP. MMF additionally attributes to other immune-related/prognostic genes such as PROCR and RIOK1 in KIRP.
(E) Exemplar high-attention patches from low- (top) and high-risk (bottom) cases with corresponding cell labels.
(F) Quantification of cell types in high-attention patches for each disease overall, showing increased epithelial cell and TIL presence. Boxes indicate quartile values and whiskers extend to data points within 1.5× the interquartile range.
See also Figures S2–S11 and Table S4.






Figure 5. Quantitative performance, local model explanation, and global interpretability analyses of PORPOISE on lower-grade gliomas (LGGs)
(A) For LGGs (n = 404), high attention for low-risk cases (top, n = 133) tends to focus on dense regions of tumor cells, while in high-risk cases (bottom, n = 68), high attention focuses on both dense regions of tumor cells and areas of vascular proliferation.
(B) Local gene attributions for the corresponding low- (top) and high-risk (bottom) cases.
(C) Kaplan-Meier curves for omics only (left, "SNN"), histology only (center, "AMIL"), and multimodal fusion (right, "MMF"), demonstrating improved patient stratification with MMF. A log-rank test was used to test for statistical significance in survival distributions between low- and high-risk patients (marked with * if p-value < 0.05).
(D) Global gene attributions across patient cohorts according to unimodal interpretability (left, "SNN") and multimodal interpretability (right, "MMF"). SNN and MMF were both able to identify immune-related and prognostic markers such as IDH1, ATRX, EGFR, and CDKN2B in LGGs.
(E) High-attention patches from low- (top) and high-risk (bottom) cases with corresponding cell labels, showing oligodendroglioma and astrocytoma subtypes, respectively.
(F) Quantification of cell types in high-attention patches for each disease overall, with statistical significance for increased necrosis in high-risk patients. Boxes indicate quartile values and whiskers extend to data points within 1.5× the interquartile range.






Figure 6. Quantitative performance, local model explanation, and global interpretability analyses of PORPOISE on pancreatic adenocarcinoma (PAAD)
(A) For PAAD (n = 160), high attention for low-risk cases (top, n = 40) tends to focus on stroma-contained dispersed glands and aggregates of lymphocytes, while in high-risk cases (bottom, n = 40), high attention focuses on tumor-associated and myxoid stroma.
(B) Local gene attributions for the corresponding low- (top) and high-risk (bottom) cases from (A) and (G).
(C) Kaplan-Meier curves for omics only (left, "SNN"), histology only (center, "AMIL"), and multimodal fusion (right, "MMF"). SNN and AMIL show poor separation of patients with low survival, with better stratification following multimodal integration. A log-rank test was used to test for statistical significance in survival distributions between low- and high-risk patients (marked with * if p-value < 0.05).
(D) Global gene attributions across patient cohorts according to unimodal interpretability (left, "SNN") and multimodal interpretability (right, "MMF"). SNN and MMF were both able to identify immune-related and prognostic markers such as IL8, EGFR, and MET in PAAD. MMF additionally shifts attribution to other immune-related/prognostic genes such as CD81, CDK1, and IL9.
(E) High-attention patches from low- (top) and high-risk (bottom) cases with corresponding cell labels.
(F) Quantification of cell types in high-attention patches for each disease overall, showing increased lymphocyte and TIL presence in low-risk patients, as well as increased necrosis presence in PAAD. Boxes indicate quartile values and whiskers extend to data points within 1.5× the interquartile range.






Figure 7. TIL quantification in patient risk groups
TIL quantification in high-attention regions of predicted low-risk (BLCA n = 90, BRCA n = 220, COADREAD n = 74, HNSC n = 96, KIRC n = 80, KIRP n = 36, LGG n = 133, LIHC n = 85, LUAD n = 105, LUSC n = 97, PAAD n = 40, SKCM n = 29, STAD n = 53, UCEC n = 104) and high-risk patient cases (BLCA n = 93, BRCA n = 223, COADREAD n = 80, HNSC n = 103, KIRC n = 80, KIRP n = 63, LGG n = 68, LIHC n = 84, LUAD n = 89, LUSC n = 103, PAAD n = 40, SKCM n = 55, STAD n = 78, UCEC n = 125) across 14 cancer types. For each patient, the top 1% of scored high-attention regions (512 × 512 40× image patches) were segmented and analyzed for tumor and immune cell presence. Image patches with high tumor-immune co-localization were indicated as positive for TIL presence (and negative otherwise). Across all patients, the fraction of high-attention patches containing TIL presence was computed and visualized in the boxplots. A two-sample t test was computed for each cancer type to test whether the means of the TIL fraction distributions of low- and high-risk patients had a statistically significant difference (*p < 0.05). Boxes indicate quartile values and whiskers extend to data points within 1.5× the interquartile range.
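The per-patient TIL fraction and the group comparison described in this caption can be sketched as below. This is an illustration on made-up data, not the paper's pipeline: the patch labels are hypothetical, and because the caption does not say which form of the two-sample t test was used, the Welch (unequal-variance) statistic shown here is an assumption. Only the test statistic is computed; a p-value would additionally need the t distribution, for example via `scipy.stats.ttest_ind`.

```python
import math
from statistics import mean, variance


def til_fraction(patch_is_til_positive):
    """Fraction of a patient's high-attention patches flagged TIL-positive."""
    return sum(patch_is_til_positive) / len(patch_is_til_positive)


def welch_t(a, b):
    """Welch two-sample t statistic (assumed variant, see lead-in)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Made-up patch labels: one list per patient, True = TIL-positive patch.
low_risk_patients = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
high_risk_patients = [
    [False, False, True, False],
    [False, False, False, False],
    [True, False, True, False],
]

low_fracs = [til_fraction(p) for p in low_risk_patients]
high_fracs = [til_fraction(p) for p in high_risk_patients]
t_stat = welch_t(low_fracs, high_fracs)
print(low_fracs, high_fracs, t_stat)
```

A positive statistic here means the low-risk group has the higher mean TIL fraction, which is the direction each boxplot pair in the figure is testing.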




