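Chen Jinhua, Huang Xiaoting, Jin Jiaying, Ding Boyang, Zhu Ran, Li Wenyan, Liu Fenju, Yu Jiahua. Construction and application of a random forest-based classification model for DNA double-strand breaks induced by ionizing radiation[J]. Chinese Journal of Radiological Medicine and Protection, 2021, 41(6): 413-417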
Construction and application of a random forest-based classification model for DNA double-strand breaks induced by ionizing radiation
Received: 2020-10-28
DOI:10.3760/cma.j.issn.0254-5098.2021.06.003
Keywords: Ionizing radiation; DNA double-strand break; Random forest; Classification model; Epigenetics
Funding: National Natural Science Foundation of China (81872548)
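Author | Affiliation | E-mail
Chen Jinhua | School of Radiation Medicine and Protection, Soochow University; State Key Laboratory of Radiation Medicine and Protection; Collaborative Innovation Center of Radiation Medicine of Jiangsu Higher Education Institutions, 215123 |
Huang Xiaoting | School of Radiation Medicine and Protection, Soochow University; State Key Laboratory of Radiation Medicine and Protection; Collaborative Innovation Center of Radiation Medicine of Jiangsu Higher Education Institutions, 215123 |
Jin Jiaying | School of Radiation Medicine and Protection, Soochow University; State Key Laboratory of Radiation Medicine and Protection; Collaborative Innovation Center of Radiation Medicine of Jiangsu Higher Education Institutions, 215123 |
Ding Boyang | School of Radiation Medicine and Protection, Soochow University; State Key Laboratory of Radiation Medicine and Protection; Collaborative Innovation Center of Radiation Medicine of Jiangsu Higher Education Institutions, 215123 |
Zhu Ran | School of Radiation Medicine and Protection, Soochow University; State Key Laboratory of Radiation Medicine and Protection; Collaborative Innovation Center of Radiation Medicine of Jiangsu Higher Education Institutions, 215123 |
Li Wenyan | School of Radiation Medicine and Protection, Soochow University; State Key Laboratory of Radiation Medicine and Protection; Collaborative Innovation Center of Radiation Medicine of Jiangsu Higher Education Institutions, 215123 |
Liu Fenju | School of Radiation Medicine and Protection, Soochow University; State Key Laboratory of Radiation Medicine and Protection; Collaborative Innovation Center of Radiation Medicine of Jiangsu Higher Education Institutions, 215123 |
Yu Jiahua | School of Radiation Medicine and Protection, Soochow University; State Key Laboratory of Radiation Medicine and Protection; Collaborative Innovation Center of Radiation Medicine of Jiangsu Higher Education Institutions, 215123 | yujiahua@suda.edu.cn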
Abstract:
      Objective To construct a random forest classification model that predicts the level of DNA double-strand breaks (DSBs) induced by ionizing radiation, and to make a preliminary study of how these DSBs are distributed across the genome. Methods The GRCh38 reference genome was divided into 50 kb fragments, and each fragment was labelled as a low-level or high-level region of ionizing radiation-induced DSBs according to sequencing data from MCF-7 cells. Eight epigenetic features were used as input. Two thirds of the data set was randomly assigned to the training set and the remaining third to the test set, and a random forest classification model with 100 decision trees was constructed. The importance of each epigenetic feature in the model was analyzed, and the enrichment of these markers was compared between regions with different DSB levels. Results On the test set, the random forest classification model achieved an accuracy of 99.4%, a precision of 98.9%, and a recall of 99.9%, with an area under the receiver operating characteristic curve of 0.994. Among the eight features, the H3K36me3 and DNase markers had the highest importance, and enrichment analysis showed that both markers were markedly higher in DSB high-level regions than in DSB low-level regions. Conclusions With epigenetic data as input features, the random forest classification model can accurately predict the level of ionizing radiation-induced DSBs in 50 kb genomic regions. The analysis suggests that these DSBs may be located mainly in transcriptionally active parts of the genome.
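The workflow described in the Methods (a random 2/3 to 1/3 split, a 100-tree random forest, test-set metrics, and a feature-importance ranking) might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the software used, so scikit-learn is an assumption; the data are synthetic rather than the MCF-7 measurements; only H3K36me3 and DNase are named in the abstract, so the other six feature names are placeholders; and the labelling rule is hypothetical. The numbers it produces have no relation to the reported results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Eight epigenetic features per 50 kb window. Only the first two names come
# from the abstract; the remaining six are placeholders (assumption).
FEATURES = ["H3K36me3", "DNase"] + [f"feature_{i}" for i in range(3, 9)]

# Synthetic stand-in data, NOT the study's data: one row per genomic window.
rng = np.random.default_rng(0)
n_windows = 3000
X = rng.normal(size=(n_windows, len(FEATURES)))

# Hypothetical labelling rule so that the toy problem is learnable:
# 1 = DSB high-level window, 0 = DSB low-level window.
noise = 0.5 * rng.normal(size=n_windows)
y = (X[:, 0] + X[:, 1] + noise > 0).astype(int)

# Randomly assign two thirds of the windows to training, one third to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0, stratify=y
)

# Random forest with 100 decision trees, as stated in the abstract.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Hard predictions for accuracy/precision/recall; class-1 probabilities for AUC.
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}

# Impurity-based importances; whether the study used this measure is an assumption.
importances = dict(zip(FEATURES, model.feature_importances_))
ranked = sorted(importances, key=importances.get, reverse=True)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print("features by importance:", ranked)
```

In this toy setup the two features that drive the synthetic labels should rank first, which mirrors the kind of importance ranking the abstract reports; with real data, a permutation-based importance measure could be compared against the impurity-based one.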