De novo pattern discovery enables robust assessment of functional consequences of non-coding variants.

Title: De novo pattern discovery enables robust assessment of functional consequences of non-coding variants.
Publication Type: Journal Article
Year of Publication: 2019
Authors: Yang, H, Chen, R, Wang, Q, Wei, Q, Ji, Y, Zheng, G, Zhong, X, Cox, NJ, Li, B
Journal: Bioinformatics
Volume: 35
Issue: 9
Pagination: 1453-1460
Date Published: 2019 May 01
ISSN: 1367-4811
Abstract

MOTIVATION: Given the complexity of genomic regions, prioritizing the functional effects of non-coding variants remains a challenge. Although several frameworks have been proposed for evaluating the functionality of non-coding variants, most of them use 'black box' methods that reduce the task to a pathogenic/benign classification problem, which ignores the distinct regulatory mechanisms of variants and leads to less desirable performance. In this study, we developed DVAR, an unsupervised framework that leverages various biochemical and evolutionary evidence to distinguish the gene regulatory categories of variants and simultaneously assess their comprehensive functional impact.

RESULTS: DVAR performed de novo pattern discovery in high-dimensional data and identified five regulatory clusters of non-coding variants. Leveraging these new insights into the multiple functional patterns, it measures both the between-class and the within-class functional implications of the variants to achieve accurate prioritization. Compared to other two-class learning methods, it showed improved performance in identifying clinically significant variants, fine-mapped GWAS variants, eQTLs and expression-modulating variants. Moreover, it showed superior performance on disease-causal variants verified by genome editing (such as CRISPR-Cas9), which could provide a pre-selection strategy for genome-editing technologies across the whole genome. Finally, evaluated in BioVU and UK Biobank, two large-scale DNA biobanks linked to complete electronic health records, DVAR demonstrated its effectiveness in prioritizing non-coding variants associated with medical phenotypes.

AVAILABILITY AND IMPLEMENTATION: The C++ and Python source code, the pre-computed DVAR-cluster labels and the DVAR-scores across the whole genome are available at https://www.vumc.org/cgg/dvar.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI: 10.1093/bioinformatics/bty826
Alternate Journal: Bioinformatics
PubMed ID: 30256891
PubMed Central ID: PMC6499232
Grant List: U01 HG009086 / HG / NHGRI NIH HHS / United States