Powerful gene-based testing by integrating long-range chromatin interactions and knockoff

Significance

Gene-based tests are important tools for elucidating the genetic basis of complex traits. Despite substantial recent efforts in this direction, the existing tests are still limited, owing to low power and detection of false-positive signals due to the confounding effects of linkage disequilibrium. In this paper, we describe a gene-based test that attempts to address these limitations by incorporating data on long-range chromatin interactions, several recent technical advances for region-based testing, and the knockoff framework for synthetic genotype generation. Through extensive simulations and applications to multiple diseases and traits, we show that the proposed test increases the power over state-of-the-art gene-based tests and provides a narrower focus on the possible causal genes involved at a locus.

Abstract

Gene-based tests are valuable techniques for identifying genetic factors in complex traits. Here, we propose a gene-based testing framework that incorporates data on long-range chromatin interactions and several recent technical advances for region-based tests, and that leverages the knockoff framework for synthetic genotype generation for improved gene discovery. Through simulations and applications to genome-wide association studies (GWAS) and whole-genome sequencing data for multiple diseases and traits, we show that the proposed test increases the power over state-of-the-art gene-based tests in the literature, identifies genes that replicate in larger studies, and can provide a narrower focus on the possible causal genes at a locus by reducing the confounding effect of linkage disequilibrium. Furthermore, our results show that incorporating genetic variation in distal regulatory elements tends to improve power over conventional tests. Results for UK Biobank and BioBank Japan traits are also available in a publicly accessible database that allows researchers to query gene-based results easily.

Gene-based association tests are commonly used to identify genetic factors in complex traits. Relative to individual variant or window-based tests, they have appealing features, including improved functional interpretation and potentially higher power due to lower penalty for multiple testing. Due to the recent advances in massively parallel sequencing technologies, a large number of gene-based tests have been proposed in the literature to test for association with genetic variation identified in sequencing studies (1–6). One important limitation of the current gene-based tests is that they often fail to incorporate the epigenetic context in noncoding regions. Moreover, how to best analyze the noncoding part of the genome to increase power remains unclear. Recently, several sliding window approaches have been proposed to scan the genome with flexible window sizes and appropriate adjustments for multiple testing, while accounting for correlations among test statistics (7, 8). However, these approaches are essentially scanning the genome in a one-dimensional (1D) fashion and fail to take into account the three-dimensional (3D) structure of the genome (9). Furthermore, because they scan the genome agnostically, the burden of multiple testing is high, which may lead to low power to identify true associations. These 1D approaches also suffer from interpretability issues similar to genome-wide association studies (GWAS) and therefore require follow-up analyses to be performed in order to identify the target genes. Several existing tests, such as multimarker analysis of genomic annotation (MAGMA), high-throughput chromosome conformation capture (Hi-C)-coupled MAGMA (H-MAGMA), and an omnibus test in the variant-set test for association using annotation information framework (STAAR-O) (10–12), attempt to link variants to their cognate genes based on physical proximity or chromatin-interaction data.
We will compare our proposed tests to these existing approaches both conceptually and empirically, and we will show that our tests are more flexible and powerful than these existing tests. Furthermore, when individual-level data are available, the proposed tests can produce a narrower list of associated genes at a locus by reducing the confounding effect of linkage disequilibrium (LD), a unique aspect of our gene-based test.

A related and popular gene-based strategy is the transcriptome-wide association study (TWAS) (13, 14), which combines GWAS data for a specific trait with repositories linking genetic variation to gene expression, such as GTEx (15), to perform gene-based association tests. However, TWASs are limited to expression quantitative trait loci (eQTLs) present in the reference datasets, and the majority of genetic associations cannot be clearly assigned to existing eQTLs (16, 17). Therefore, they may have reduced power to identify the relevant genes for the trait of interest.

Regulatory elements, including enhancers and promoters, play an important role in controlling when, where, and to what degree genes will be expressed. Most of the disease-associated variants in GWAS lie in noncoding regions of the genome, and it is believed that a majority of causal noncoding variants reside in enhancers (18). However, identifying enhancers and linking them to the genes they regulate is challenging. A number of methods have emerged in recent years to identify promoter–enhancer interactions. These techniques range from chromatin conformation capture (3C), which is limited to the detection of a single interaction, to circular chromosome conformation capture (which can detect all loci that interact with a single locus), to many-to-many mapping technologies possible using targeted enrichment. Hi-C maps the complete DNA interactome and elucidates the spatial organization of the human genome (19–21). Hi-C provides direct physical evidence of interactions that may mediate gene-regulatory relationships and can aid in identifying putative regulatory elements for a gene of interest. However, due to the prohibitive sequencing costs of the Hi-C experimental technique, it is challenging to obtain high-resolution (e.g., 1 Kb) Hi-C data in a large number of cell types and tissues at multiple developmental times.

We propose here comprehensive gene-based association tests for common and rare genetic variation in both coding and noncoding regions, including putative regulatory elements. These tests incorporate several recent advances for region-based tests, including 1) scanning the genic and regulatory regions with varied window sizes; 2) the aggregated Cauchy association test (ACAT) to combine P values from single-variant, burden, and dispersion (sequence kernel association test [SKAT]) tests; 3) incorporation of multiple functional annotations; and 4) the saddlepoint approximation for unbalanced case-control data (22–25). To further improve the power and the ability to prioritize putative causal genes at significant loci when individual-level data are available, we leverage a recent development in statistics, namely, the knockoff framework for knockoff genotype generation (26) that helps control the false discovery rate (FDR) under arbitrary correlation structure and attenuates the confounding effect of LD. One can think of the knockoff genotypes as synthetic, noisy copies of the original genotypes, which resemble the original data in terms of LD structure, but are conditionally independent of the trait of interest, given the original genotypes. Although conventional methods, such as the Benjamini–Hochberg (BH) procedure, are also designed to control the FDR (27), they cannot guarantee FDR control at the target level with arbitrarily correlated P values. Furthermore, unlike the knockoff framework implemented here, the conventional methods do not naturally account for correlations due to LD. The proposed gene-based test is related to a recently proposed window-based test, KnockoffScreen (8). Specifically, we employ the knockoff generation algorithm for genotype data that we have introduced in KnockoffScreen (8) and develop knockoff-based inference for gene-based tests.
We demonstrate below that the proposed test has important advantages compared with the window-based test KnockoffScreen in terms of controlling the FDR at gene level. While KnockoffScreen can identify significant windows with valid FDR control at window level, functional interpretation of significant windows is still needed, which means that post hoc analyses need to be done to link those windows to relevant genes. However, as we show in simulations, this procedure can lead to highly inflated FDR at gene level.
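
As an illustration of the P value combination step listed above (item 2), the following is a minimal Python sketch of the Cauchy combination rule that underlies ACAT. It is not the authors' implementation: the function name `acat_combine` and the default equal weights are assumptions made for this example, and production code typically adds a numerical approximation for extremely small P values, which is omitted here.

```python
import math


def acat_combine(p_values, weights=None):
    """Combine P values with the Cauchy combination rule (illustrative sketch).

    Each P value is mapped to a standard Cauchy variate; the weighted average
    of these variates is again approximately standard Cauchy under the null,
    even when the underlying tests are correlated.
    """
    if not p_values:
        raise ValueError("at least one P value is required")
    if any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("P values must lie strictly between 0 and 1")
    if weights is None:
        # Assumption for this sketch: equal weights for all tests.
        weights = [1.0] * len(p_values)
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")

    total_weight = sum(weights)
    # Weighted average of Cauchy-transformed P values.
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    ) / total_weight
    # Upper tail probability of the standard Cauchy distribution.
    return 0.5 - math.atan(statistic) / math.pi
```

Because the combined statistic is dominated by the smallest P values, a single strong signal survives combination with null tests: for instance, `acat_combine([0.001, 0.8])` gives a combined P value of roughly 0.002.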

We evaluate the performance relative to existing methods using simulations and applications to multiple studies, including GWAS for neuropsychiatric and neurodegenerative diseases, whole-genome sequencing studies for Alzheimer’s disease (AD) from the Alzheimer’s Disease Sequencing Project (ADSP), and for lung function from the National Heart, Lung, and Blood Institute Trans-Omics for Precision Medicine (TOPMed) Program. We also provide results of applications to UK Biobank and BioBank Japan binary and continuous traits.

Results

Overview of the Proposed Gene-Based Association Tests

We provide here a brief overview of the proposed gene-based tests that aim to comprehensively evaluate the effects of common and rare, coding, and proximal and distal regulatory variation on a trait of interest. A workflow depicting the overall gene-based testing approach proposed here is shown in Fig. 1. Briefly, we build our final test, GeneScan3DKnock, progressively, starting with a test focused on scanning the gene body region (i.e., the interval between the transcription start site [TSS] and the end of the 3′ untranslated region [UTR]) with varied window sizes. We refer to this test as GeneScan1D. We extend this test by incorporating genetic variants residing in putative regulatory elements, such as promoters and enhancers. In particular, we use chromatin immunoprecipitation sequencing (ChIP-seq) peak data extracted from the ChIP-Atlas database to identify promoter regions and data from the GeneHancer database to link enhancers to their target genes (28). We also use the activity-by-contact (ABC) model to predict functional enhancer–gene connections for five cell types and tissues (29). This is the GeneScan3D test. Finally, when individual-level data are available, we implement the knockoff framework for a more powerful gene-discovery and fine-mapping approach and refer to this test as GeneScan3DKnock.
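
To make the knockoff-based selection step more concrete, the sketch below shows a generic knockoff filter in Python. It assumes a single knockoff copy per gene and a feature statistic defined as the difference in -log10 P values between the original and the knockoff data. These choices, the function names, and the toy numbers are assumptions for illustration only; the actual GeneScan3DKnock statistic and thresholding may differ (for example, by using multiple knockoffs).

```python
import math


def feature_statistics(p_original, p_knockoff):
    """W_j = -log10(p_j) - (-log10(p_knockoff_j)); large positive W favors the gene."""
    return [
        -math.log10(p) + math.log10(pk) for p, pk in zip(p_original, p_knockoff)
    ]


def knockoff_plus_threshold(w, q):
    """Smallest t > 0 whose estimated false discovery proportion is at most q.

    The estimate is (1 + #{W_j <= -t}) / max(1, #{W_j >= t}), the standard
    knockoff+ rule. Returns infinity when no threshold qualifies.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        n_negative = sum(1 for x in w if x <= -t)
        n_positive = sum(1 for x in w if x >= t)
        if (1 + n_negative) / max(n_positive, 1) <= q:
            return t
    return math.inf


def select_genes(w, q):
    """Indices of genes whose statistic meets the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Toy example with hypothetical statistics for four genes.
toy_w = [3.0, 2.0, -1.0, 0.5]
selected = select_genes(toy_w, q=0.5)
```

In the toy example, the thresholds 0.5 and 1.0 give estimated false discovery proportions above 0.5, so the filter settles on a threshold of 2.0 and selects the first two genes. Negative statistics act as a built-in estimate of how many positive statistics of the same size arise by chance, which is what lets the procedure account for correlation among genes without modeling it explicitly.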

Fig. 1

Workflow of the proposed gene-based tests. (A) GeneScan1D, a 1D scan of the gene and buffer region. (B) GeneScan3D, a 3D scan of the gene and regulatory elements linked to it. (C) GeneScan3DKnock, the knockoff-enhanced test, implementing a knockoff-based version of GeneScan3D.

We take advantage of recent advances in region-based tests for sequencing data (4, 22, 24) to perform computationally efficient and comprehensive tests with genetic variation in a gene (including variants located in proximal and distal regulatory elements), while scanning the gene with a range of window sizes for improved power. The framework allows for the incorporation of a variety of functional genomics annotations as weights for individual variants included in the tests. Furthermore, an aspect of our testing framework is the derivation of knockoff statistics based on the generation of knockoff (synthetic) genetic data that resemble the original genotypes in terms of correlation structure, but are conditionally independent of the outcome variable given the true genotypes (8, 26, 30). The…
