Abstract
Background: Non-coding regions make up most of the human genome, and variants within them are increasingly implicated in rare diseases, yet their pathogenicity remains difficult to predict. Undiagnosed rare disease patients often harbor non-coding variants that conventional sequencing pipelines miss. Methods: We developed ncPatho, a deep learning framework that combines convolutional neural networks (CNNs) with transformer-based architectures to predict the pathogenicity of non-coding variants. The model was trained on a curated set of pathogenic and benign non-coding variants from ClinVar and gnomAD, using genomic sequence context, conservation scores, and regulatory annotations as input features. We applied ncPatho to a cohort of 200 undiagnosed rare disease patients from the Undiagnosed Diseases Network (UDN). Results: ncPatho achieved an area under the receiver operating characteristic curve (AUC) of 0.94 on held-out test data, outperforming existing methods such as CADD and DANN. In the undiagnosed cohort, we prioritized 47 candidate non-coding variants, of which 12 were validated through functional assays and RNA sequencing, leading to a molecular diagnosis in 8 of 200 patients (4% diagnostic yield). Key predictions involved deep intronic variants affecting splicing and promoter variants in known disease genes. Conclusions: Deep learning models tailored to non-coding variation can increase diagnostic rates in undiagnosed rare diseases and complement existing genomic analysis tools. Our approach demonstrates the feasibility of integrating multi-omics data with computational predictions to resolve previously unsolved cases.
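As a reading aid for the two headline metrics, the following is a minimal sketch of how an AUC and a diagnostic yield can be computed. It is not the ncPatho evaluation code: the function names `roc_auc` and `diagnostic_yield` are hypothetical, the toy labels and scores are invented for illustration, and the AUC uses the standard pairwise (Mann-Whitney) formulation rather than any method confirmed by the study. Only the counts 8, 12, 47, and 200 come from the abstract.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: probability that a randomly chosen pathogenic variant (label 1)
    scores higher than a randomly chosen benign one (label 0); ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pathogenic and one benign example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def diagnostic_yield(diagnosed: int, cohort_size: int) -> float:
    """Fraction of the cohort that received a molecular diagnosis."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return diagnosed / cohort_size


# Toy data, invented for illustration only (not from the study).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)

# Counts reported in the abstract.
cohort_yield = diagnostic_yield(8, 200)   # diagnosed patients / cohort size
validated_fraction = 12 / 47              # validated / prioritized candidates

print(f"toy AUC = {toy_auc:.3f}")
print(f"diagnostic yield = {cohort_yield:.1%}")
print(f"validated fraction = {validated_fraction:.1%}")
```

The pairwise form is quadratic in the number of variants, which is fine for a small illustration; a rank-based implementation from a standard library would be the usual choice on a full test set.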