2025 , Vol. 22 >Issue 11: 1055 - 1061
DOI: https://doi.org/10.3877/cma.j.issn.1672-6448.2025.11.009
基于DeepSeek大语言模型的胃癌和直肠癌超声报告结构化及T分期自动评估研究
通信作者:
王璐,Email:wanglu1@scszlyy.org.cnCopy editor: 汪荣
收稿日期: 2025-08-25
网络出版日期: 2026-02-12
基金资助
国家重点研发计划(2019YFE0196700)
国家自然科学基金(82272015)
四川省区域创新合作项目(2024YFHZ0140)
版权
Utility of DeepSeek large language models for structured ultrasound reporting and automated tumor staging in gastric and rectal cancer
Corresponding author:
Wang Lu, Email: wanglu1@scszlyy.org.cnReceived date: 2025-08-25
Online published: 2026-02-12
Copyright
探讨DeepSeek大语言模型在胃癌和直肠癌超声报告结构化及T分期自动评估中的应用价值。
本研究纳入四川省肿瘤医院2023年1月至2024年12月进行的胃癌和直肠癌超声检查报告共121份。由资深超声医师团队制定胃癌和直肠癌超声报告结构化模板,使用DeepSeek R1和V3模型进行结构化信息提取和T分期评估。采用召回率、精确率和F1分数评估结构化报告生成的性能,并以准确性评估T分期性能。邀请3位医师对比评估DeepSeek生成的报告与原始报告,评价其在审阅效率和临床易用性方面的表现。
DeepSeek R1与V3模型在结构化信息提取方面召回率、精确率和F1分数均高于0.9,二者差异均无统计学意义(P均>0.05)。在T分期评估中,采用推理模式的DeepSeek R1模型准确性最高,达到76.86%,显著优于DeepSeek V3模型的59.50%,二者差异具有统计学意义(χ2=8.51,P<0.05)。与审阅原始报告所需的平均时间[(60.96±6.11)s/份]相比,审阅DeepSeek R1[(18.12±4.52)s/份](t=60.38;P<0.001)和DeepSeek V3[(17.15±2.60)s/份](t=71.98;P<0.001)生成的结构化报告所需时间缩短。5分李克特量表评分结果显示,原始报告的评分为3(3,3)分,而DeepSeek R1和V3报告的评分分别为1(1,2)分(Z=-9.72;P<0.001)和1(1,2)分(Z=-9.95;P<0.001),差异具有统计学意义。
DeepSeek大语言模型,特别是R1版本,可有效从胃癌和直肠癌超声报告中提取结构化信息,并在T分期评估方面展现出较高的准确性,其生成的报告有助于提高审阅效率,并具有辅助临床决策的潜力。
张振奇 , 卢漫 , 齐艺涵 , 庄敏 , 胡紫玥 , 王璐 . 基于DeepSeek大语言模型的胃癌和直肠癌超声报告结构化及T分期自动评估研究[J]. 中华医学超声杂志(电子版), 2025 , 22(11) : 1055 -1061 . DOI: 10.3877/cma.j.issn.1672-6448.2025.11.009
To investigate the utility of the DeepSeek large language model (LLM) in the structured generation of ultrasound reports and the automatic assessment of T-staging for gastric and rectal cancer.
A total of 121 ultrasound examination reports for gastric and rectal cancer, collected from Sichuan Cancer Hospital between January 2023 and December 2024, were included in this study. A structured template for gastric and rectal cancer ultrasound reports was developed by a team of senior sonographers. The DeepSeek R1 and V3 models were employed to extract structured information and assess T-staging. The performance of structured report generation was evaluated using recall, precision, and F1 score, while T-staging performance was assessed based on accuracy. Three physicians were invited to compare the reports generated by DeepSeek with the original reports to evaluate review efficiency and clinical usability.
Regarding structured information extraction, both DeepSeek R1 and V3 models achieved recall, precision, and F1 scores exceeding 0.9, with no statistically significant differences between the two (P>0.05). In T-staging assessment, the DeepSeek R1 model (utilizing reasoning mode) achieved the highest accuracy of 76.86%, which was significantly superior to the 59.50% achieved by the DeepSeek V3 model (χ2=8.51, P<0.05). Compared to the average time required to review original reports [(60.96±6.11) s/report], the review time for structured reports generated by DeepSeek R1 [(18.12±4.52) s/report] (t=60.38; P<0.001) and DeepSeek V3 [(17.15±2.60) s/report] (t=71.98; P<0.001) was significantly shortened. The 5-point Likert scale evaluation showed that the score for the original reports was 3 (3, 3), while the scores for the DeepSeek R1 and V3 reports were 1 (1, 2) (Z=-9.72; P<0.001) and 1 (1, 2) (Z=-9.95; P<0.001), respectively, indicating a statistically significant difference.
The DeepSeek large language models, particularly the R1 version, can effectively extract structured information from gastric and rectal cancer ultrasound reports and demonstrates high accuracy in T-staging assessment. The generated reports contribute to improved review efficiency and possess the potential to assist in clinical decision-making.

表1 胃肠超声报告的一般情况 [n=121,例(%)] |
| 项目 | 数值 | 项目 | 数值 |
|---|---|---|---|
| 报告类别 | 71~80岁 | 25(20.7) | |
| 胃癌超声报告 | 73(60.3) | ≥81岁 | 3(2.5) |
| 直肠癌超声报告 | 48(39.7) | T分期 | |
| 性别 | T0 | 3(2.5) | |
| 男 | 89(73.6) | T1a | 2(1.6) |
| 女 | 32(26.4) | T1b | 1(0.8) |
| 年龄 | T2 | 31(25.6) | |
| ≤40岁 | 4(3.3) | T3 | 13(10.7) |
| 41~50岁 | 11(9.1) | T4a | 65(53.7) |
| 51~60岁 | 38(31.4) | T4b | 6(5.0) |
| 61~70岁 | 40(33.1) |
表2 DeepSeek R1与V3模型生成的结构化报告信息提取性能评估四格表(份) |
| 指标 | TP | FP | FN | TN | 总数 |
|---|---|---|---|---|---|
| 检查部位(R1模型) | 121 | 0 | 0 | 0 | 121 |
| 检查部位(V3模型) | 121 | 0 | 0 | 0 | 121 |
| 黏膜层情况(R1模型) | 121 | 0 | 0 | 0 | 121 |
| 黏膜层情况(V3模型) | 121 | 0 | 0 | 0 | 121 |
| 肌层情况(R1模型) | 118 | 0 | 3 | 0 | 121 |
| 肌层情况(V3模型) | 120 | 0 | 1 | 0 | 121 |
| 浆膜层情况(R1模型) | 118 | 0 | 3 | 0 | 121 |
| 浆膜层情况(V3模型) | 118 | 0 | 3 | 0 | 121 |
| 肠壁或胃壁厚度(R1模型) | 121 | 0 | 0 | 0 | 121 |
| 肠壁或胃壁厚度(V3模型) | 121 | 0 | 0 | 0 | 121 |
| 蠕动情况(R1模型) | 121 | 0 | 0 | 0 | 121 |
| 蠕动情况(V3模型) | 121 | 0 | 0 | 0 | 121 |
| 占位性病变(R1模型) | 121 | 0 | 0 | 0 | 121 |
| 占位性病变(V3模型) | 120 | 0 | 1 | 0 | 121 |
| 淋巴结(R1模型) | 121 | 0 | 0 | 0 | 121 |
| 淋巴结(V3模型) | 121 | 0 | 0 | 0 | 121 |
| 腹水情况(R1模型) | 121 | 0 | 0 | 0 | 121 |
| 腹水情况(V3模型) | 121 | 0 | 0 | 0 | 121 |
| 其他发现(R1模型) | 121 | 0 | 0 | 0 | 121 |
| 其他发现(V3模型) | 121 | 0 | 0 | 0 | 121 |
注:TP为真阳性;FP为假阳性;FN为假阴性;TN为真阴性 |
表3 DeepSeek R1与V3模型结构化信息提取的性能比较 |
| 指标 | 召回率 | 精确率 | F1分数 | 真阳性提取准确性 | χ2值 | P值 | ||||
|---|---|---|---|---|---|---|---|---|---|---|
| R1模型 | V3模型 | R1模型 | V3模型 | R1模型 | V3模型 | R1模型 | V3模型 | |||
| 检查部位 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | N/A | >0.05 |
| 黏膜层情况 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | N/A | >0.05 |
| 肌层情况 | 0.975 | 0.992 | 1.000 | 1.000 | 0.987 | 0.996 | 1.000 | 1.000 | 0.25 | >0.05 |
| 浆膜层情况 | 0.975 | 0.975 | 1.000 | 1.000 | 0.987 | 0.987 | 1.000 | 1.000 | 0.25 | >0.05 |
| 肠壁或胃壁厚度 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | N/A | >0.05 |
| 蠕动情况 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | N/A | >0.05 |
| 占位性病变 | 1.000 | 0.992 | 1.000 | 1.000 | 1.000 | 0.996 | 0.936 | 0.940 | 0.00 | >0.05 |
| 淋巴结 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | N/A | >0.05 |
| 腹水 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | N/A | >0.05 |
| 其他发现 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | N/A | >0.05 |
注:P值为 R1 与 V3 在该要素逐例正确/错误(配对二分类)差异的检验结果,采用 McNemar 检验,χ2为连续性校正McNemar统计量;当不一致对数较少时,P值采用精确McNemar检验(双侧);N/A表示2组模型在该指标上提取结果完全一致(如均为1),无差异,故无法计算统计量 |
张振奇, 卢漫, 齐艺涵, 等. 基于DeepSeek大语言模型的胃癌和直肠癌超声报告结构化及T分期自动评估研究[J/OL]. 中华医学超声杂志(电子版), 2025, 22(11): 1055-1061.
| 1 |
袁坤山, 王如蒙, 张淑欣, 等.口服胃肠超声助显剂的研究进展[J/OL].中华医学超声杂志(电子版), 2020, 17(6): 587-590.
|
| 2 |
宋勇, 张伟, 李锐, 等.PDCA循环法在降低超声测量数值错误报告中的应用价值[J].临床超声医学杂志, 2023, 25(8): 650-653.
|
| 3 |
|
| 4 |
|
| 5 |
|
| 6 |
|
| 7 |
秦赛梅, 文琼, 段依恋, 等.对比通义千问2.5与GPT-4o模型生成的甲状腺超声结构化报告[J].中国医学影像技术, 2025, 41(3): 409-413.
|
| 8 |
|
| 9 |
|
| 10 |
|
| 11 |
|
| 12 |
张梅芳, 谭莹, 朱巧珍, 等.早孕期胎儿头臀长正中矢状切面超声图像的人工智能质控研究[J/OL].中华医学超声杂志(电子版), 2023, 20(9): 945-950.
|
| 13 |
朱巧珍, 谭莹, 张梅芳, 等.妊娠早期胎儿心脏人工智能质控模型的研究与应用[J].中华超声影像学杂志, 2023, 32(11): 952-958.
|
| 14 |
孙舒涵, 陈雅静, 宗晴晴, 等.基于超声的深度学习列线图预测乳腺癌新辅助化疗后腋窝淋巴结状态的研究[J/OL].中华医学超声杂志(电子版), 2025, 22(2): 97-105.
|
| 15 |
|
| 16 |
|
| 17 |
|
| 18 |
谭浩, 王力, 王军永, 等.技术与社会的视角探析ChatGPT对医学的影响[J].医学与哲学, 2024, 45(5): 15-20.
|
| 19 |
闫温馨, 刘珏, 梁万年.DeepSeek赋能全科医学: 潜在应用与展望[J].中国全科医学, 2025, 28(17): 2065-2069.
|
| 20 |
刘泽垣, 王鹏江, 宋晓斌, 等.大型语言模型的幻觉问题研究综述[J].软件学报, 2025, 36(3): 1152-1185.
|
/
| 〈 |
|
〉 |