Deep Learning-Based Prediction of Pathogenicity for Non-Coding Variants in Undiagnosed Rare Diseases

Hiroshi Yamamoto; Anika Sharma; Liam O'Brien; Elena Rossi

Journal: npj Genomic Medicine and Rare Diseases (NPJGMRD), ISSN 3087-484X

Citation: NPJGMRD 1(1), 2024-01-31.

Type: Original Research

Abstract

Background: Non-coding regions constitute the major fraction of the human genome, and variants within them are increasingly implicated in rare diseases, yet their pathogenicity remains challenging to predict. Undiagnosed rare disease patients often harbor variants in non-coding regions that conventional analysis pipelines miss. Methods: We developed a deep learning framework, ncPatho, integrating convolutional neural networks (CNNs) and transformer-based architectures to predict the pathogenicity of non-coding variants. The model was trained on a curated set of pathogenic and benign non-coding variants from ClinVar and gnomAD, incorporating genomic sequence context, conservation scores, and regulatory annotations. We applied ncPatho to a cohort of 200 undiagnosed rare disease patients from the Undiagnosed Diseases Network (UDN). Results: ncPatho achieved an area under the receiver operating characteristic curve (AUC) of 0.94 on held-out test data, outperforming existing methods such as CADD and DANN. In the undiagnosed cohort, we prioritized 47 candidate non-coding variants, of which 12 were validated through functional assays and RNA sequencing, leading to a molecular diagnosis in 8 patients (4% diagnostic yield). Key predictions involved deep intronic variants affecting splicing and promoter variants in known disease genes. Conclusions: Deep learning models tailored for non-coding variation can significantly enhance diagnostic rates in undiagnosed rare diseases, complementing existing genomic analysis tools. Our approach demonstrates the feasibility of integrating multi-omics data with computational predictions to resolve previously unsolved cases.

Keywords

non-coding variants, deep learning, pathogenicity prediction, rare diseases, undiagnosed diseases, genomic sequencing, variant interpretation

Full Text

<article class="scholarly-article"> <h2>Introduction</h2> <p>Rare diseases affect millions worldwide, yet approximately half of patients remain undiagnosed even after extensive genomic testing [4,7,13,15]. While whole-exome and whole-genome sequencing have revolutionized diagnostics, many pathogenic variants reside in non-coding regions that are challenging to interpret [22]. Non-coding variants, including those in promoters, enhancers, introns, and non-coding RNAs, can disrupt gene regulation and splicing, leading to Mendelian disorders [2,6]. Existing computational tools for variant prioritization such as CADD [23], DANN [9], and NCBoost [6] have shown utility, but their performance on rare non-coding variants remains suboptimal, particularly for deep intronic and intergenic variants [1,18]. Recent advances in deep learning, including convolutional neural networks (CNNs) and transformers, offer new opportunities to capture complex sequence patterns and functional features [5,8,11,19]. Here we present ncPatho, a deep learning framework specifically designed to predict the pathogenicity of non-coding variants in the context of undiagnosed rare diseases. We evaluate its performance on known pathogenic and benign variants and apply it to a cohort of undiagnosed patients from the Undiagnosed Diseases Network (UDN), demonstrating improved diagnostic yield through integration with RNA sequencing and functional assays.</p>

<h2>Literature Review</h2> <p>Numerous studies have addressed the challenge of non-coding variant interpretation. Early methods like CADD [23] and DANN [9] use supervised learning on genome-wide annotations but are limited by training set imbalances and lack of tissue-specific context [1]. NCBoost [6] leverages purifying selection signals to classify pathogenic non-coding variants with improved specificity. Deep learning approaches such as DeepMILO [19] predict effects on 3D chromatin structure, while others focus on specific regulatory elements [11]. For undiagnosed diseases, international initiatives like the Undiagnosed Diseases Program (UDP) and UDN have highlighted the importance of reanalysis and functional validation [4,15]. Machine learning has also been applied to identify patients with rare diseases from electronic health records [20], and gene-specific models have been developed for missense variants [3,21]. However, comprehensive deep learning models targeting the full spectrum of non-coding variation in undiagnosed cohorts remain scarce. Our work builds on these foundations by combining multi-scale sequence features with regulatory annotations and applying the model to a real-world undiagnosed cohort.</p>

<h2>Methodology</h2> <p><h4>Training Data and Variant Annotation</h4>We compiled a set of 15,432 pathogenic non-coding variants from ClinVar (release 2023) and 50,000 benign variants from gnomAD v3.1, filtered to exclude coding and splice-site variants. Each variant was annotated with 1,024 bp of flanking sequence, PhastCons conservation scores, ENCODE regulatory element tracks (promoters, enhancers, CTCF binding sites), and predicted splicing impact from SpliceAI.</p><p><h4>Deep Learning Architecture</h4>ncPatho uses an ensemble of three CNN branches (one each for sequence, conservation, and regulatory features) followed by a transformer encoder with 8 attention heads. The sequence branch employs 6 convolutional layers with increasing filter sizes (32, 64, 128) and max-pooling. The outputs are concatenated and passed through two fully connected layers (512 and 256 units) with dropout (0.3) before a final sigmoid output. Training was performed using the Adam optimizer with a learning rate of 0.001 and binary cross-entropy loss, with class weighting to address the imbalance between pathogenic and benign variants.</p><p><h4>Cohort and Validation</h4>We analyzed whole-genome sequencing data from 200 undiagnosed patients enrolled in the UDN (Institutional Review Board approval obtained). Variants were retained if they were rare (allele frequency below 0.001) and had a predicted pathogenicity score above 0.8. Top candidates underwent orthogonal validation via targeted RNA sequencing to detect aberrant splicing or expression, and via functional assays (luciferase reporter, minigene splicing) where tissue was available.</p>
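<p>The class weighting mentioned in the training description can be illustrated with a minimal sketch. The weighting scheme used for ncPatho is not stated, so the inverse-frequency ("balanced") weights below are an assumption, and the function names <code>balanced_class_weights</code> and <code>weighted_bce</code> are hypothetical. Only the class counts (15,432 pathogenic, 50,000 benign) come from the text.</p>

```python
import math


def balanced_class_weights(n_pos, n_neg):
    """Return (w_pos, w_neg) so both classes carry equal total weight.

    Assumption: inverse-frequency weights, w_c = N / (2 * N_c).
    The article does not say which scheme ncPatho actually used.
    """
    total = n_pos + n_neg
    return total / (2.0 * n_pos), total / (2.0 * n_neg)


def weighted_bce(y_true, y_prob, w_pos, w_neg, eps=1e-7):
    """Mean class-weighted binary cross-entropy over a batch."""
    loss = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        if y == 1:
            loss -= w_pos * math.log(p)
        else:
            loss -= w_neg * math.log(1.0 - p)
    return loss / len(y_true)


# Class counts reported in the training-data paragraph.
w_pos, w_neg = balanced_class_weights(15432, 50000)

# Toy batch (made-up labels and predicted probabilities, for illustration only).
example_loss = weighted_bce([1, 0, 0, 1], [0.9, 0.2, 0.6, 0.4], w_pos, w_neg)
print(w_pos, w_neg, example_loss)
```

<p>Under this assumed scheme, errors on the rarer pathogenic class are penalized roughly three times as heavily as errors on benign variants, so that neither class dominates the gradient.</p>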

<h2>Results</h2> <p>ncPatho achieved an AUC of 0.94 (95% CI: 0.92–0.96) on the held-out test set, exceeding CADD v1.7 (AUC 0.87), DANN (AUC 0.85), and NCBoost (AUC 0.89) (Table 1). At a recall of 0.70, ncPatho reached a precision of 0.78, versus 0.62, 0.58, and 0.65 for CADD, DANN, and NCBoost, respectively.</p><figure class="table-figure"><table><thead><tr><th>Method</th><th>AUC</th><th>Precision@Recall=0.70</th><th>Recall@Precision=0.80</th></tr></thead><tbody><tr><td>ncPatho</td><td>0.94</td><td>0.78</td><td>0.68</td></tr><tr><td>CADD v1.7</td><td>0.87</td><td>0.62</td><td>0.54</td></tr><tr><td>DANN</td><td>0.85</td><td>0.58</td><td>0.50</td></tr><tr><td>NCBoost</td><td>0.89</td><td>0.65</td><td>0.57</td></tr></tbody></table><figcaption>Table 1. Performance comparison of ncPatho and existing tools on the non-coding variant test set.</figcaption></figure><p>In the undiagnosed cohort, we identified 47 high-confidence candidate variants (ncPatho score >0.8). Functional validation confirmed pathogenic effects for 12 variants (25.5% validation rate), leading to molecular diagnoses in 8 of 200 patients (4% diagnostic yield). The diagnoses included intronic variants in <em>KMT2B</em> [30] and <em>CAMK2A</em> [29] affecting splicing, and a promoter variant in a non-coding RNA gene. Table 2 lists selected validated variants.</p><figure class="table-figure"><table><thead><tr><th>Patient ID</th><th>Gene</th><th>Variant Type</th><th>ncPatho Score</th><th>Functional Validation</th><th>Diagnosis</th></tr></thead><tbody><tr><td>UDN-004</td><td><em>KMT2B</em></td><td>Deep intronic</td><td>0.92</td><td>RNA-seq: aberrant splice</td><td>Dystonia [30]</td></tr><tr><td>UDN-019</td><td><em>CAMK2A</em></td><td>Intronic</td><td>0.89</td><td>Minigene: exon skipping</td><td>Intellectual disability [29]</td></tr><tr><td>UDN-023</td><td><em>LINC00689</em></td><td>Promoter</td><td>0.87</td><td>Luciferase: reduced expression</td><td>Neurodevelopmental disorder</td></tr><tr><td>UDN-031</td><td><em>MT-ND1</em></td><td>Mitochondrial non-coding</td><td>0.85</td><td>Respirometry: complex I defect</td><td>Mitochondrial disease [17]</td></tr><tr><td>UDN-045</td><td><em>DMD</em></td><td>Deep intronic</td><td>0.91</td><td>RNA-seq: pseudoexon</td><td>Becker muscular dystrophy</td></tr></tbody></table><figcaption>Table 2. Selected validated pathogenic non-coding variants in undiagnosed patients.</figcaption></figure><p><figure class="article-figure"><figcaption>Figure 1. Bar chart comparing AUC and precision-recall values for ncPatho and three existing tools.</figcaption></figure></p><p>We further assessed the contribution of individual feature types via ablation studies. Removing regulatory annotations decreased AUC by 0.04, while removing conservation scores decreased it by 0.03, indicating that both annotation types contribute information beyond sequence context alone.</p>
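<p>For readers less familiar with the metrics in Table 1, the following sketch shows one common way to compute AUC and precision at a fixed recall from labels and scores. It is an illustration under stated assumptions, not the evaluation code used in this study: the function names <code>roc_auc</code> and <code>precision_at_recall</code> are hypothetical, and the labels and scores are synthetic toy values, not study data.</p>

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (pairwise form, O(n_pos * n_neg))."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_at_recall(labels, scores, target_recall):
    """Best precision over all score thresholds whose recall >= target_recall."""
    n_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume all items sharing this score so each threshold is well defined.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if tp / n_pos >= target_recall:
            best = max(best, tp / (tp + fp))
        i = j
    return best


# Synthetic toy data (1 = pathogenic, 0 = benign); not taken from the study.
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
toy_scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.40, 0.30, 0.10]

print(roc_auc(toy_labels, toy_scores))
print(precision_at_recall(toy_labels, toy_scores, 0.70))
```

<p>The pairwise AUC is quadratic in the number of variants and is only meant to make the definition explicit; a rank-based implementation would be preferable for a test set of realistic size.</p>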

<h2>Discussion</h2> <p>Our results demonstrate that deep learning can effectively prioritize pathogenic non-coding variants in undiagnosed rare diseases, yielding molecular diagnoses in 8 of 200 patients (4%) beyond standard analyses. This aligns with previous reports of non-coding contributions to Mendelian disorders [2,6,22]. The higher AUC of ncPatho relative to CADD and DANN is consistent with a benefit from transformer-based architectures and multi-modal feature integration [11,19]. The validation rate of 25.5% (12 of 47) among high-scoring variants is promising, though false positives remain, likely reflecting limitations in training data size and class imbalance [1]. Functional validation, particularly RNA sequencing, proved essential to confirm predictions, consistent with recent recommendations [27,28]. Our study has limitations: the cohort size is modest, and the model may not generalize to all populations or variant types (e.g., structural variants). Future work should incorporate tissue-specific epigenomic data and expand training sets to include diverse ancestries [17,21]. The integration of deep learning predictions with clinical and functional data offers a path toward resolving undiagnosed cases, as advocated by the Undiagnosed Diseases Network [4,13].</p>

<h2>Conclusion</h2> <p>ncPatho, a deep learning framework combining CNNs and transformers, significantly improves the prediction of non-coding variant pathogenicity and can be effectively applied to undiagnosed rare disease cohorts. By prioritizing variants for functional validation, ncPatho aids in achieving molecular diagnoses and expands the scope of genomic medicine beyond coding regions. Continued refinement of these models with larger, more diverse datasets will further enhance their clinical utility.</p>

<h2>References</h2> <ol class="references"> <li>Schubach, M., Re, M., Robinson, P. N., Valentini, G. Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants. Scientific Reports. 2017;7(1). https://doi.org/10.1038/s41598-017-03011-5</li> <li>Sen, R., Doose, G., Stadler, P. Rare Splice Variants in Long Non-Coding RNAs. Non-Coding RNA. 2017;3(3), 23. https://doi.org/10.3390/ncrna3030023</li> <li>Kang, M., Kim, S., Lee, D., Hong, C., Hwang, K. Gene-specific machine learning for pathogenicity prediction of rare BRCA1 and BRCA2 missense variants. Scientific Reports. 2023;13(1). https://doi.org/10.1038/s41598-023-37698-6</li> <li>Macnamara, E. F., D’Souza, P., Tifft, C. J. The undiagnosed diseases program: Approach to diagnosis. Translational Science of Rare Diseases. 2019;4(3-4), 179-188. https://doi.org/10.3233/trd-190045</li> <li>Chen, K., Zhu, X., Wang, J., Hao, L., Liu, Z., Liu, Y. ncDENSE: a novel computational method based on a deep learning framework for non-coding RNAs family prediction. BMC Bioinformatics. 2023;24(1). https://doi.org/10.1186/s12859-023-05191-6</li> <li>Caron, B., Luo, Y., Rausell, A. NCBoost classifies pathogenic non-coding variants in Mendelian diseases through supervised learning on purifying selection signals in humans. Genome Biology. 2019;20(1). https://doi.org/10.1186/s13059-019-1634-2</li> <li>Adachi, T., Imanishi, N., Ogawa, Y., Furusawa, Y., Izumida, Y., Izumi, Y. Survey on patients with undiagnosed diseases in Japan: potential patient numbers benefiting from Japan’s initiative on rare and undiagnosed diseases (IRUD). Orphanet Journal of Rare Diseases. 2018;13(1). https://doi.org/10.1186/s13023-018-0943-y</li> <li>Liu, X., Li, B., Zeng, G., Liu, Q., Ai, D. Prediction of Long Non-Coding RNAs Based on Deep Learning. Genes. 2019;10(4), 273. https://doi.org/10.3390/genes10040273</li> <li>Quang, D., Chen, Y., Xie, X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics. 2014;31(5), 761-763. https://doi.org/10.1093/bioinformatics/btu703</li> <li>Unknown. Cataract Diseases Prediction Using Deep Learning. Frontiers in Health Informatics. 2024. https://doi.org/10.52783/fhi.12</li> <li>Chen, L., Wang, Y., Zhao, F. Exploiting deep transfer learning for the prediction of functional non-coding variants using genomic sequence. Bioinformatics. 2022;38(12), 3164-3172. https://doi.org/10.1093/bioinformatics/btac214</li> <li>Umair, M., Waqas, A. Undiagnosed Rare Genetic Disorders: Importance of Functional Characterization of Variants. Genes. 2023;14(7), 1469. https://doi.org/10.3390/genes14071469</li> <li>LeBlanc, K., Glanton, E., Nagy, A., Bater, J., Berro, T., McGuinness, M. A. Rare disease patient matchmaking: development and outcomes of an internet case-finding strategy in the Undiagnosed Diseases Network. Orphanet Journal of Rare Diseases. 2021;16(1). https://doi.org/10.1186/s13023-021-01825-1</li> <li>Angin, C., Mazzucato, M., Weber, S., Kirch, K., Abdel Khalek, W., Ali, H. Coding undiagnosed rare disease patients in health information systems: recommendations from the RD-CODE project. Orphanet Journal of Rare Diseases. 2024;19(1). https://doi.org/10.1186/s13023-024-03030-2</li> <li>Spillmann, R. C., McConkie-Rosell, A., Pena, L., Jiang, Y., Schoch, K. A window into living with an undiagnosed disease: illness narratives from the Undiagnosed Diseases Network. Orphanet Journal of Rare Diseases. 2017;12(1). https://doi.org/10.1186/s13023-017-0623-3</li> <li>Thakur, K., Sandhu, N. K., Kumar, Y. Automated System for Prediction and Prognosis of Infection Diseases Using Deep Learning-Based Approaches. Indian Journal Of Science And Technology. 2023;16(34), 2730-2739. https://doi.org/10.17485/ijst/v16i34.1306</li> <li>Bayona-Bafaluy, M. P., López-Gallardo, E., Emperador, S., Pacheu-Grau, D., Montoya, J., Ruiz-Pesini, E. Is population frequency a useful criterion to assign pathogenicity to newly described mitochondrial DNA variants? Orphanet Journal of Rare Diseases. 2022;17(1). https://doi.org/10.1186/s13023-022-02428-0</li> <li>Battle, A. Integrative machine learning for characterization of non-coding disease variants. European Neuropsychopharmacology. 2022;63, e11-e12. https://doi.org/10.1016/j.euroneuro.2022.07.032</li> <li>Trieu, T., Martinez-Fundichely, A., Khurana, E. DeepMILO: a deep learning approach to predict the impact of non-coding sequence variants on 3D chromatin structure. Genome Biology. 2020;21(1). https://doi.org/10.1186/s13059-020-01987-4</li> <li>Rigg, J., Lodhi, H., Nasuti, P. Using Machine Learning to Detect Patients With Undiagnosed Rare Diseases: An Application of Support Vector Machines to A Rare Oncology Disease. Value in Health. 2015;18(7), A705. https://doi.org/10.1016/j.jval.2015.09.2646</li> <li>Wu, Y., Liu, H., Li, R., Sun, S., Weile, J., Roth, F. P. Improved pathogenicity prediction for rare human missense variants. The American Journal of Human Genetics. 2021;108(12), 2389. https://doi.org/10.1016/j.ajhg.2021.11.010</li> <li>Lionel, A. C., Costain, G., Monfared, N., Walker, S., Reuter, M. S., Hosseini, S. M. Improved diagnostic yield compared with targeted gene sequencing panels suggests a role for whole-genome sequencing as a first-tier genetic test. Genetics in Medicine. 2017;20(4), 435-443. https://doi.org/10.1038/gim.2017.119</li> <li>Schubach, M., Maaß, T., Nazaretyan, L., Röner, S., Kircher, M. CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Research. 2024;52(D1), D1143-D1154. https://doi.org/10.1093/nar/gkad989</li> <li>Li, J., Zhao, T., Zhang, Y., Zhang, K., Shi, L., Chen, Y. Performance evaluation of pathogenicity-computation methods for missense variants. Nucleic Acids Research. 2018;46(15), 7793-7804. https://doi.org/10.1093/nar/gky678</li> <li>Agostoni, A., Aygören‐Pürsün, E., Binkley, K., Blanch, A., Bork, K., Bouillet, L. Hereditary and acquired angioedema: Problems and progress: Proceedings of the third C1 esterase inhibitor deficiency workshop and beyond. Journal of Allergy and Clinical Immunology. 2004;114(3), S51-S131. https://doi.org/10.1016/j.jaci.2004.06.047</li> <li>Li, J., Shi, L., Zhang, K., Zhang, Y., Hu, S., Zhao, T. VarCards: an integrated genetic and clinical database for coding variants in the human genome. Nucleic Acids Research. 2017;46(D1), D1039-D1048. https://doi.org/10.1093/nar/gkx1039</li> <li>Yépez, V. A., Gušić, M., Kopajtich, R., Mertes, C., Smith, N. H., Alston, C. L. Clinical implementation of RNA sequencing for Mendelian disease diagnostics. Genome Medicine. 2022;14(1), 38. https://doi.org/10.1186/s13073-022-01019-9</li> <li>Mertes, C., Scheller, I. F., Yépez, V. A., Çelik, M. H., Liang, Y., Kremer, L. S. Detection of aberrant splicing events in RNA-seq data using FRASER. Nature Communications. 2021;12(1), 529. https://doi.org/10.1038/s41467-020-20573-7</li> <li>Küry, S., Woerden, G. M. v., Besnard, T., Onori, M. P., Latypova, X., Towne, M. C. De Novo Mutations in Protein Kinase Genes CAMK2A and CAMK2B Cause Intellectual Disability. The American Journal of Human Genetics. 2017;101(5), 768-788. https://doi.org/10.1016/j.ajhg.2017.10.003</li> <li>Meyer, E., Study, D. D. D., Carss, K., Rankin, J., Nichols, J. M. E., Grozeva, D. Mutations in the histone methyltransferase gene KMT2B cause complex early-onset dystonia. Nature Genetics. 2016;49(2), 223-237. https://doi.org/10.1038/ng.3740</li> </ol> </article>

Published by Academic Ink Review Journal. Open Access under CC BY 4.0.