Main

GWASs in CD, and inflammatory bowel disease (IBD) more generally, have successfully identified more than 200 loci contributing to risk of disease1,2,3,4. While most GWAS hits do not immediately implicate an obvious functional variant or gene, a subset have been directly mapped to coding variants (for example, NOD2, IL23R, ATG16L1, SLC39A8, FUT2, TYK2, IFIH1, SLAMF8, PLCG2)5, providing more direct clues to pathogenesis. Further, targeted and genome-wide sequencing approaches have revealed additional, lower-frequency, disease-associated coding variants (for example, CARD9, RNF186, ADCY7, INAVA/C1orf106, SLC39A8, NOD2)6,7,8,9 originally undetected by GWASs. Such coding variants, common and rare, have led to functional follow-up experiments demonstrating causal mechanisms for at least ten genes and have provided the most direct biological insights to emerge from genetic studies of IBD10,11,12,13.

Results

To further advance the interpretation of GWAS loci, and to define novel CD-associated genes using variation rarer than that routinely detected by GWASs, we pursued large-scale exome sequencing using CD case and control collections from more than 35 centers in the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC). The primary analysis consisted of exome sequencing of 18,816 CD cases across 35 IBD studies and 13,412 non-IBD control samples from the same studies. These samples were all sequenced at the Broad Institute and were supplemented with 22,536 population controls from approved non-IBD studies sequenced contemporaneously at the Broad Institute and accessed from dbGaP (Supplementary Table 1 and Extended Data Fig. 1). Two different exome capture platforms were employed during the course of the study (referred to hereafter as Nextera (Illumina) and Twist (Twist Biosciences)). Details of capture and sequencing of these cohorts (and those subsequently used in follow-up) are provided in the Methods and Supplementary Information.

Calling and quality control (QC) of data from the two exome capture platforms were conducted in parallel (Table 1, Extended Data Fig. 2 and Methods). Sensitivity to detect low-frequency coding variants was evaluated in each callset post-QC by comparison with passing sites in gnomAD v.2.1 that had 0.0001 < non-Finnish European (NFE) minor allele frequency (MAF) < 0.1 (Methods). We observed that 84% of all exonic SNPs in this frequency range were detected in both CD datasets with sufficiently high quality to enter meta-analysis. Analysis of each dataset was conducted in SAIGE using a logistic mixed-model14, and a standard inverse-variance weighted (IVW) meta-analysis was conducted across 164,149 nonsynonymous variants with MAF (gnomAD NFE) between 0.0001 and 0.1. A study-wide significance threshold of 3 × 10−7 was applied (Supplementary Table 2). As a control for the entire study, we also analyzed 96,326 synonymous variants. Forty-three sites (Supplementary Table 3) failed a heterogeneity-of-effect test between the Nextera and Twist discovery cohorts (IVW PHET < 0.0001) and were eliminated from further analysis. We did not observe an inflation in the exome-wide distribution of test statistics.
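The sensitivity check described above can be illustrated with a minimal Python sketch. It assumes that sensitivity is the fraction of reference (gnomAD-passing) variants inside the MAF window that are present in both callsets; the function name, variant identifiers and frequencies are hypothetical toy values, not data from the study.

```python
def sensitivity_in_both(reference_maf, callset_a, callset_b, lo=0.0001, hi=0.1):
    """Fraction of reference variants with lo < MAF < hi found in both callsets."""
    # Restrict the reference panel to the frequency window of interest.
    reference = {v for v, maf in reference_maf.items() if lo < maf < hi}
    if not reference:
        raise ValueError("no reference variants in the requested MAF window")
    # A variant counts as detected only if it passed QC in both callsets.
    detected = reference & callset_a & callset_b
    return len(detected) / len(reference)


# Toy data: hypothetical variant IDs with reference-panel MAFs.
reference_maf = {"v1": 0.00005, "v2": 0.0005, "v3": 0.01, "v4": 0.05, "v5": 0.2}
nextera_calls = {"v2", "v3", "v4", "v5"}
twist_calls = {"v1", "v2", "v4"}

sens = sensitivity_in_both(reference_maf, nextera_calls, twist_calls)
print(f"sensitivity = {sens:.3f}")
```

In this toy example, three reference variants fall inside the window and two of them are present in both callsets.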

Table 1 Sample characteristics

Exonic variants significantly associated with IBD

The most significantly associated variants (P < 10−10) in the Discovery stage were previously known CD variants (or variants in linkage disequilibrium (LD) with them), indicating that the QC and analysis pipeline removed highly-associated false positives. Twenty-eight variants achieved study-wide significance, including known variants within CD genes established in previous GWASs and sequencing efforts: NOD2, IL23R, LRRK2, TYK2, SLC39A8, IRGM and CARD9. Excluding synonymous variants in LD with these known associated nonsynonymous variants, synonymous variants showed little deviation from the null and none reached study-wide significance (Extended Data Fig. 3). Encouraged by this, we then nominated a list of 116 variants (including known variants) with P < 0.0002 for further evaluation in three follow-up cohorts (Supplementary Table 2).

Additional exome and genome sequencing was undertaken at the Sanger Institute on an independent cohort of 9,731 CD cases ascertained by the UK IBD Genetics Consortium and IBD BioResource. Genome sequencing with a target depth of 15× was performed on 6,000 patients with CD. Whole-genome sequences from 11,852 individuals from the INTERVAL blood donor cohort were used as population controls. Another 3,731 patients with CD were exome sequenced using the Agilent SureSelect Human All Exon V5 capture. In total, 33,704 individuals without IBD or other related diseases from the UK Biobank were used as controls for the Sanger whole-exome sequencing (WES) cases. These UK Biobank samples were sequenced by Regeneron using the IDT xGen Exome Research Panel v1.0 (including supplemental probes), and thus QC and subsequent analyses were restricted to the intersection of the Agilent and the IDT capture regions. Exome and genome datasets were processed in parallel with similar QC parameters (Methods). Association analyses were performed using a logistic mixed-effects model implemented in the REGENIE v.1.0.6.7 and v.2.0.2 software, correcting for the case–control imbalance using the Firth correction. Of 116 variants, 28 were associated (P < 4.3 × 10−4 (0.05/116)) with CD in the meta-analysis of the two Sanger cohorts and 94 replicated the direction of effect seen in the discovery cohort (P = 3 × 10−12, binomial test). Summary statistics from a German dataset of 4,071 CD cases and 4,223 controls exome sequenced at Regeneron (Methods) were also obtained and a meta-analysis was carried out across all five cohorts (Table 1). Of the 116 variants, 45 exceeded the study-wide significance threshold, P < 3 × 10−7 (Supplementary Table 4 and Supplementary Data 1). Of note, all 28 study-wide significant sites in the discovery stage exome-wide scan showed stronger evidence of association in the meta-analysis, and none showed significant evidence of heterogeneity of effect across studies. The ‘scan’ exome-wide and the combined meta-analyses have similar power to detect the same true associations at their respective significance thresholds (Extended Data Fig. 4).
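The replication arithmetic above can be sketched in a few lines of Python. The sketch assumes a one-sided exact binomial (sign) test with a null probability of 0.5 for direction concordance; the text does not state whether the reported binomial P value is one- or two-sided, so the computed value need not match the reported figure exactly. The function name is illustrative.

```python
from math import comb


def sign_test_p(n_concordant: int, n_total: int) -> float:
    """One-sided exact binomial P value, P(X >= n_concordant), under p = 0.5."""
    # Sum the upper tail of the Binomial(n_total, 0.5) distribution exactly.
    tail = sum(comb(n_total, k) for k in range(n_concordant, n_total + 1))
    return tail / 2 ** n_total


n_variants = 116
# Bonferroni-corrected replication threshold over the nominated variants.
replication_threshold = 0.05 / n_variants
# 94 of the 116 variants share the direction of effect seen in discovery.
p_direction = sign_test_p(94, n_variants)

print(f"replication threshold = {replication_threshold:.1e}")
print(f"direction-of-effect sign test P = {p_direction:.1e}")
```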

Among the 164,149 low-frequency nonsynonymous variants tested for association in this study, 14 were mapped to a credible set with posterior inclusion probability (PIP) > 5% from a previous IBD fine-mapping study5. Eight of the 14 variants reached exome-wide significance (in NOD2, IL23R, TYK2, PTPN22 and CARD9). The remaining six variants (in GPR65, MST1, NOD2 and SMAD3) have genetic effects consistent with those previously reported, with P values ranging from 0.025 to 8 × 10−5 in the WES discovery cohort. Together, these results demonstrate the accuracy of our exome sequencing study in the lowest frequency range covered by previous GWAS approaches.

To identify new loci not yet implicated in CD and independent exonic association signals at known loci, we accounted for the LD between the 45 exome-wide significant variants and previously reported IBD GWAS hits, as well as previous rare-variant discoveries (Methods). We identified five coding variants in genes not previously implicated in IBD, even by their proximity to previous GWAS signals. We also discovered six independently associated novel exonic variants in genes previously known to harbor coding mutations underpinning CD or IBD risk, two of which are in NOD2 (‘New locus’ and ‘New variant in known locus’ in Fig. 1, Supplementary Table 4 and Extended Data Fig. 5). Fourteen significant variants recaptured known IBD causal candidates from fine-mapping, including variants in CARD9, IL23R and NOD2, and the remaining 20 variants either tag the known causal variants through LD or have very small PIP from fine-mapping and thus are highly unlikely to be CD causal variants (‘Known causal candidate’ and ‘Unlikely causal’ in Fig. 1, Supplementary Table 4 and Extended Data Fig. 5). A harmonized summary with findings from this study and the fine-mapping study5 for genes implicated by the 45 exome-wide significant variants is available in Supplementary Table 5. Of note, evidence from earlier GWASs and this study cannot be considered independent and trivially combined since there is considerable sample overlap—while 46% of scan samples come from clinical sites or national cohorts not previously involved in the largest GWAS4, the remainder are from sites which also contributed earlier recruited samples (generally 10 or more years ago) to previously reported GWASs and have been exome sequenced in this study. Similarly, at least 10% of the Sanger whole-genome sequencing (WGS) samples used for meta-analysis were previously included in past genotyping studies.

Fig. 1: Odds ratio and MAF for exome-wide significant findings that are not tagging stronger, established noncoding association signals.

Known causal candidate: in a credible set from a fine-mapping study5 with PIP > 5% or reported in previous studies6,8 (Methods). New locus: in a locus not yet implicated by GWASs. New variant in known locus: in a known GWAS locus, but represents an association independent from previously reported IBD putative causal variants (Methods).

Some of the newly implicated CD genes (Table 2 and Box 1) contribute to biological pathways previously implicated via GWASs, such as autophagy (ATG4C), or Mendelian forms of IBD, such as the IL-10 signaling pathway15, or by extensive functional studies of inflammatory response in IBD, such as the NF-κB family of transcriptional regulators16. In contrast, many of the newly associated genes appear to be linked to the roles of mesenchymal cells (MCs) in intestinal homeostasis, a pathway not previously implicated by genetic studies. The mesenchyme is composed of nonhematopoietic, nonendothelial and nonepithelial cells such as fibroblasts, myofibroblasts (stromal cells) and pericytes17. In the intestine, mucosal MCs act as a second barrier through their interactions with both immune and epithelial cells18. Under physiological conditions, MCs regulate the maturation, migration and recruitment of immune cells19 as well as maintenance of the stem cell niche in the intestinal crypt and mucosal repair through epithelial-to-mesenchymal transition (EMT)17,18,20. MCs are highly activated by proinflammatory signals during chronic inflammation, resulting in subepithelial myofibroblast proliferation and extracellular matrix production, and, not surprisingly, are involved in the development of fibrotic disorders21.

Table 2 Novel variants achieving study-wide significance that implicate genes directly in general onset CD

Among the newly discovered IBD genes, PDLIM5 is highly expressed in subepithelial myofibroblasts22. PDLIM5 is a cytoskeleton-associated protein well known as a regulator of EMT through TGF-β1/SMAD3 signaling23. Knocking out its expression also leads to an alteration in extracellular matrix assembly, specifically by decreasing collagen IV network density23,24,25,26 (Fig. 2). Interestingly, SDF2L1, also among the newly identified CD genes, has previously been shown to be elevated in plasma cells within the lamina propria of patients with CD failing to achieve durable remission on anti-TNF therapy27. SDF2L1, which is induced in response to endoplasmic reticulum (ER) stress, is also expressed in Paneth and goblet cells28. Impairment of ER stress is closely linked to intestinal inflammation29. Therefore, a mutation in this gene could putatively impair epithelial homeostasis in many ways such as preventing goblet cell differentiation, migration and proper production of mucus30 or perturbing the production of antibodies by plasma cells31.

Fig. 2: Schematic representation of inflamed mucosa showing the mesenchymal-related genes with newly identified mutations.

(1) Following mucosal injury, MCs are highly activated by proinflammatory signals such as TNF-α, CCL19/CCL21, PAF and TGF-β1. (2) Among these, TNF-α increases PTAFR expression in intestinal epithelial cells during wound repair46. However, prolonged exposure to PAF dissolves cell junctions and increases epithelial permeability37. In endothelial cells, the PAF-R/PAF axis has a similar effect on vascular endothelial-cadherin (VE-CAD) assembly34. A leaky endothelium can result in an increase in immune cell infiltration and aggravate inflammation at injured sites47. (3) Secretion of CCL19/CCL21 by activated stromal fibroblasts in response to epithelial damage or infection attracts dendritic cells (DCs) and other immune cells which then migrate to mesenteric lymph nodes, where the immune response is coordinated. CCR7+ DCs, macrophages and T cells also exacerbate inflammation through proinflammatory mediators such as CCL19/CCL21. Plasma cells express SDF2L1 in response to inflammation and ER stress due to massive antibody production48. (4) Mucosal repair mediated by TGF-β1/SMAD3 signaling has a key role in epithelial homeostasis after tissue injury49. Importantly, a causal variant in SMAD3 further supports the importance of this pathway in disease susceptibility5. PDLIM5 is a known regulator of SMAD3 stability during EMT23. Uncontrolled EMT increases fibroblast proliferation and excessive extracellular matrix (ECM) production leading to fibrosis43. Active HGF released by HGFAC antagonizes TGF-β1 resulting in a decrease of EMT. (5) HGF secreted by fibroblasts plays a role in maintaining the stem cell niche in intestinal crypts50. SDF2L1 expressed by Paneth cells in response to ER stress may participate in this process. 
(6) The current genetic findings provide support to the scientific rationale for targeting EMT and fibrosis for the treatment of CD, such as with anti-integrin antibodies (anti-αvβ6), recombinant human HGF (rhHGF) and Rho kinase inhibitor (ROCKi); shown in red22.

HGFAC, PAF-R and CCR7 can be linked to IBD-relevant MC functions via their known ligands. Specifically, HGFAC is a serine protease that cleaves hepatocyte growth factor (HGF) to its active form. It has previously been shown that HGF is a paracrine factor secreted by stromal cells (fibroblasts and myofibroblasts) that regulates epithelial homeostasis—in particular, the balance between epithelial proliferation, differentiation and apoptosis—and has been shown to be elevated in the serum of patients with CD20,32. In human kidney epithelial cells, HGF has been shown to have antifibrotic properties by upregulating SMAD co-repressor SnoN, resulting in inhibition of EMT, and likely plays a similar role in the intestinal environment33. Platelet activating factor receptor (PTAFR - also commonly known as PAF-R) is expressed in epithelial and endothelial cells as well as in pericytes, a population of MCs surrounding blood vessels that regulates angiogenesis34,35. PTAFR is a G-protein-coupled receptor for platelet activating factor (PAF), a proinflammatory lipid that is elevated in the mucosa of patients with CD, potentially reflecting disease activity36. The PAF-R/PAF axis is known to regulate endothelial and epithelial permeability, which is associated with inflammatory diseases37,38. Finally, CCR7+ immune cells have a macrophage-like or dendritic cell morphology. It is known that mesenchymal stromal cells induce CCR7+ dendritic cell migration to mesenteric lymph nodes within inflamed mucosa of patients with IBD39. It is also known that CCR7 ligands, CCL19/CCL21, are highly expressed in a recently identified population of proinflammatory stromal cells that appear to prevent the resolution phase that is normally found as part of the wound-healing process22. 
Our identification of CD-associated rare coding variants in these genes suggests that perturbation of these finely balanced cellular processes that are key to intestinal homeostasis causally contributes to CD susceptibility.

The identified coding variants in RELA, TAGAP and SDF2L1 are close to, but not in LD (r2 < 0.05) with, common noncoding variants significantly associated with IBD risk via GWASs (Methods and TAGAP as an example in Extended Data Fig. 6a–c). These very likely pinpoint the genes dysregulated by the associated common variant and provide a focus for uncovering the function of those variants, perhaps leading to allelic series of perturbations further informing on the mechanism of their contribution to CD pathogenesis. The associated missense variant in HGFAC is in partial LD (r2 = 0.35 in 1000 Genomes NFE populations) with a common noncoding variant (rs3752440) previously reported as associated with CD2. Unfortunately, the missense variant was not included in this previous study—precluding formal assessment of whether this explains the previously observed association signal or represents an independent variant directly implicating HGFAC—although we note the missense variant here has a higher odds ratio and greater significance than the variant in the previous report. The two novel NOD2 associations are not in LD with previously reported putative causal variants; one modestly reduces basal activity and has at least twofold reduction in peptidoglycan-induced NF-κB response40, while the other is a splice donor variant (Supplementary Table 6). None of the variants described in Table 2 has reached genome-wide significance in previously published GWASs (variant in HGFAC has almost reached significance in ref. 4, P = 5 × 10−8 for IBD). The nine new CD-associated variants all had an info score of 1 in the UK Biobank (UKBB) GWAS imputation41, except for PTAFR and PDLIM5, which had info scores of 0.72 and 0.9, respectively. Novel variants described in Table 2 together explain around 0.12% of the variance on the liability scale (0.3% on the observed scale). 
In comparison, the 25 independent coding variants that were included in the meta-analysis together explained 2.1% of variance (5.1% on the observed scale). We performed two gene-based rare-variant (MAF < 0.001) burden tests in the full-exome Nextera and Twist datasets using SAIGE-GENE42, one restricted to loss-of-function variants and another using all nonsynonymous variants (Supplementary Table 7). The burden test meta-analysis was performed on 15,823 genes for the nonsynonymous variant analysis and 3,953 for the predicted loss-of-function (pLoF) variant only analysis. Correcting for 20,000 genes, associations with P < 2.5 × 10−6 were considered statistically significant. NOD2 unsurprisingly stood out far above the expected distribution (P values from gene-based burden tests using pLoF and non-synonymous variants are 7.7 × 10−7 and < 10−16, respectively). Only one other gene in either analysis exceeded the threshold expected once in the study by chance (ATG4C, NonSynP = 3.3 × 10−6). This potentially novel signal in ATG4C was driven by three distinct missense variants with individual P < 0.01 (N75S, R80H and C367Y) (Supplementary Table 8) along with two others with P < 0.05 (K371R, R389X). The ATG4C gene burden signal was examined in the Sanger datasets and replicated, with the meta-analysis reaching exome-wide significance (P = 1.5 × 10−7) driven by several of the same variants. Further examination of results from the single-variant tests in ATG4C identified a frameshift variant with frequency of 0.002 (1:62834058-TTG-T)—too high to be included in our burden test—that just missed our threshold for testing in the follow-up cohorts (P = 0.0003, Beta = 0.55 in the Broad meta-analysis). This variant also showed evidence of association in the meta-analysis of the Sanger cohorts (P = 1.3 × 10−5), and also exceeded our study-wide significance threshold in the five-way meta-analysis of all cohorts (P = 1.55 × 10−8). 
Of further note, an additional ATG4C frameshift variant specifically enriched in Finland (1:62819215:C:CT) is associated with IBD (P = 6.91 × 10−8, Beta = 1.20) in the publicly released FinnGen resource (r5.finngen.fi). All variants in burden and individual tests increase risk, and the presence of four truncating variants in these analyses suggests that loss-of-function variants in ATG4C strongly increase CD risk.
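The two gene-level thresholds used above can be made concrete with a short Python sketch. It assumes that the "threshold expected once in the study by chance" corresponds to one divided by the number of genes tested in the nonsynonymous analysis; this reading is an assumption, and the variable names are illustrative.

```python
N_GENES_CORRECTED = 20_000    # number of genes used for the Bonferroni correction
N_GENES_NONSYN = 15_823       # genes tested in the nonsynonymous burden analysis

# Bonferroni threshold for gene-based tests.
bonferroni = 0.05 / N_GENES_CORRECTED

# Assumed interpretation: the P value below which one gene is expected
# by chance among the genes tested.
once_by_chance = 1 / N_GENES_NONSYN

atg4c_discovery_p = 3.3e-6    # ATG4C nonsynonymous burden P value (discovery)
atg4c_meta_p = 1.5e-7         # ATG4C burden P value after adding Sanger data

print(f"Bonferroni threshold: {bonferroni:.1e}")
print(f"once-by-chance threshold: {once_by_chance:.1e}")
print("discovery passes Bonferroni:", atg4c_discovery_p < bonferroni)
print("discovery passes once-by-chance:", atg4c_discovery_p < once_by_chance)
print("meta-analysis passes Bonferroni:", atg4c_meta_p < bonferroni)
```

Under these assumptions, the discovery-stage ATG4C signal falls between the two thresholds, and only the meta-analysis with the Sanger datasets crosses the Bonferroni threshold.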

Discussion

Here, we demonstrate that large-scale exome sequencing can complement GWASs by pinpointing specific genes both indirectly implicated by GWASs as well as those not yet observed in GWASs. With high sensitivity to directly test individual variants down to 0.01% MAF, as well as assess burden of ultra-rare mutations, we begin to fill in the low-frequency and rare-variant component of the genetic architecture of CD. This component was not observable by earlier generations of CD GWAS meta-analyses, which have had more limited coverage of low-frequency and rare variation.

Past findings in IBD5, and most other complex diseases, suggest that while coding variants are vastly outnumbered by noncoding variation, they are highly enriched for associations to common and rare diseases. Furthermore, associated coding variants tend to have stronger effects than their noncoding counterparts, often keeping them lower in frequency via natural selection. While this alone validates the use of exome sequencing for efficiency’s sake, the primary advantage of targeting coding regions for discovery is that coding variants uniquely pinpoint genes, and often pathogenetic mechanisms, in a fashion that is at present far more challenging to achieve routinely for noncoding associations. In the case of several of the new findings (for example, RELA, TAGAP), the coding variation here provides concrete evidence of genes previously indirectly implicated by independent noncoding GWAS associations. These identify the likely gene underlying these associations and build allelic series of natural perturbations at these genes. Moreover, IL10RA and RELA are known to harbor mutations causing rare, Mendelian, inflammatory gastrointestinal disorders, and this study extends the phenotypic spectrum resulting from perturbing genetic variation to more complex forms of CD. From a functional perspective, the novel genes identified in the current study reiterate the central roles of innate and adaptive immune cells as well as autophagy in CD pathogenesis. Moreover, the involvement of PDLIM5, SDF2L1, HGFAC, PAF-R and CCR7 pathways, in addition to the previously reported causal variant in SMAD3 (ref. 5), highlights the emerging role of MCs in the development and maintenance of intestinal inflammation (Fig. 2)18. Also, while previous studies have demonstrated the disruption of MC biology in IBD, the current findings of coding variants in these genes demonstrate that these cells and functions causally contribute to disease susceptibility. 
Furthermore, the association of these pathways with CD pathogenesis provides an additional rationale for development of therapeutic modalities that can re-establish the balance to the mesenchymal niche, as it is believed that genetic evidence for a drug target has a measurable impact on drug development43,44.

We expect that, in the next year, expanded sequencing efforts underway in ulcerative colitis will come to completion, enabling a more comprehensive survey of low-frequency and rare variation in ulcerative colitis, and IBD in general. Integrated with a much larger GWAS spearheaded in parallel by the IIBDGC, we expect a substantial number of conclusively linked genes and informative allelic series to emerge.

Methods

Ethics declarations

All relevant ethical guidelines have been followed, and any necessary institutional review board (IRB) and/or ethics committee approvals have been obtained. The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Study Protocol 2013P002634, The Broad Institute Study of Inflammatory Bowel Disease Genetics, undergoes annual continuing review by the Mass General Brigham Human Research Committee IRB of Mass General Brigham. Ethical approval was given on 27 January 2021 for this study (Mass General Brigham IRB).

All informed consent from participants has been obtained and the appropriate institutional forms have been archived.

DNA samples sequenced at the Sanger Institute were ascertained under the following ethical approvals: 12/EE/0482, 12/YH/0172, 16/YH/0247, 09/H1204/30, 17/EE/0265, 16/WM/0152, 09/H0504/125, 15/EE/0286, 11/YH/0020, 09/H0717/4, REC 22/02, 03/5/012, 2000/4/192, 05/Q1407/274, 05/Q0502/127, 08/H0802/147, LREC/2002/6/18, GREC/03/0273 and YREC/P12/03.

Broad Institute sequencing pipeline

Sample processing

Exome sequencing was performed at the Broad Institute. The sequencing process included sample preparation (Illumina Nextera, Illumina TruSeq and Kapa Hyperprep), hybrid capture (Illumina Rapid Capture Enrichment (Nextera), 37 Mb target, and Twist Custom Capture, 37 Mb target) and sequencing (Illumina HiSeq2000, Illumina HiSeq2500, Illumina HiSeq4000, Illumina HiSeqX, Illumina NovaSeq 6000, 76-base pair (bp) and 150-bp paired reads). Sequencing was performed at a median depth of 85% targeted bases at >20×. Sequencing reads were mapped by BWA-MEM to the hg38 reference using a ‘functional equivalence’ pipeline. The mapped reads were then marked for duplicates, and base quality scores were recalibrated. They were then converted to CRAMs using Picard 2.16.0-SNAPSHOT and GATK 4.0.11.0. The CRAMs were then further compressed using ref-blocking to generate gVCFs. These CRAMs and gVCFs were then used as inputs for joint calling. To perform joint calling, the single-sample gVCFs were hierarchically merged (separately for samples using Nextera and Twist exome capture).

QC

QC analyses were conducted in Hail v.0.2.47 (Extended Data Fig. 2). We first split multiallelic sites and coded genotypes with genotype quality (GQ) < 20 as missing. Variants not annotated as frameshift, inframe deletion, inframe insertion, stop lost, stop gained, start lost, splice acceptor, splice donor, splice region, missense or synonymous were removed from the following analysis. We also removed variants that have known quality issues (have a nonempty FILTER column) in the gnomAD dataset. Sample QC: poor-quality samples that met any of the following criteria were identified and removed: (1) samples with an extremely large number of singletons (≥500); (2) samples with mean GQ < 40; and (3) samples with missingness rates > 10%. Variant QC: low-quality variants that met any of the following criteria were identified and removed: (1) variants with missingness rate > 5%; (2) variants with mean read depth (DP) < 10; (3) variants that failed the Hardy–Weinberg equilibrium test for controls with P < 1 × 10−4; and (4) variants with >10% samples that were heterozygous and with an allelic balance ratio <0.3 or >0.7. Variants with different genotypes in WES and WGS in gnomAD were also removed. For Twist exome capture samples, we additionally removed (1) samples that had a significantly high or low inbreeding coefficient (>0.2 or <−0.2); (2) samples whose heterozygosity was far from the mean (beyond ±5 s.d.); and (3) related samples, which were removed sequentially by removing the individual with the largest number of related samples (in PLINK, the individual with PI_HAT > 0.2 when using the ‘--genome’ option) until no related samples remained. For Nextera capture samples, we additionally removed variants showing a significant heterogeneous effect across Ashkenazi Jewish (AJ), Lithuanian (LIT), Finnish (FIN) and NFE samples (see Population assignment below).
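The core sample-level and variant-level hard filters above can be summarized as simple predicates. The following Python sketch is illustrative only: the study ran QC in Hail, whereas this uses plain dataclasses with hypothetical field names, and it covers only the thresholds shared by both capture platforms.

```python
from dataclasses import dataclass

GQ_MIN = 20  # genotype calls below this GQ are treated as missing


@dataclass
class SampleStats:
    n_singletons: int
    mean_gq: float
    missing_rate: float  # fraction of genotypes missing for this sample


@dataclass
class VariantStats:
    missing_rate: float         # fraction of samples with a missing genotype
    mean_dp: float              # mean read depth across samples
    hwe_p_controls: float       # Hardy-Weinberg P value computed in controls
    frac_het_imbalanced: float  # fraction of samples heterozygous with
                                # allele balance < 0.3 or > 0.7


def mask_low_gq(genotype, gq):
    """Return None (missing) for a genotype call whose GQ is below GQ_MIN."""
    return genotype if gq >= GQ_MIN else None


def sample_passes(s: SampleStats) -> bool:
    """Keep samples with < 500 singletons, mean GQ >= 40, missingness <= 10%."""
    return s.n_singletons < 500 and s.mean_gq >= 40 and s.missing_rate <= 0.10


def variant_passes(v: VariantStats) -> bool:
    """Keep variants that trip none of the four variant-level filters."""
    return (
        v.missing_rate <= 0.05
        and v.mean_dp >= 10
        and v.hwe_p_controls >= 1e-4
        and v.frac_het_imbalanced <= 0.10
    )


good_sample = SampleStats(n_singletons=120, mean_gq=55.0, missing_rate=0.01)
bad_variant = VariantStats(missing_rate=0.01, mean_dp=9.9,
                           hwe_p_controls=0.5, frac_het_imbalanced=0.02)
print(sample_passes(good_sample), variant_passes(bad_variant))
```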

Population assignment

We projected all samples onto principal component (PC) axes generated from the 1000 Genomes Project Phase 3 common variants, and classified their ancestry using a random forest method to the European (CEU, TSI, FIN, GBR, IBS), African (YRI, LWK, GWD, MSL, ESN, ASW, ACB), East Asian (CHB, JPT, CHS, CDX, KHV), South Asian (GIH, PJL, BEB, STU, ITU) and American (MXL, PUR, CLM, PEL) samples. We kept samples that were classified as European with prediction probability greater than 80% (Extended Data Fig. 7). For Nextera samples, we used a second random forest classifier to assign EUR samples to AJ, LIT, FIN or NFE, and a third random forest classifier to clean the AJ/NFE split.

Meta-analysis

We used METAL87 with an IVW fixed-effect model to meta-analyze the SAIGE association statistics from Nextera and Twist samples (Table 1). The heterogeneity test was performed using Cochran’s Q with one degree of freedom.
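For readers unfamiliar with the procedure, a minimal Python sketch of an IVW fixed-effect meta-analysis with Cochran's Q follows. It is not METAL's implementation; the function names and the example effect sizes are hypothetical, and the heterogeneity P value helper is valid only for two studies (one degree of freedom), as in the Nextera and Twist comparison.

```python
from math import erfc, sqrt


def ivw_meta(betas, ses):
    """Inverse-variance weighted fixed-effect meta-analysis.

    Returns the pooled effect, its standard error, a two-sided P value
    and Cochran's Q statistic.
    """
    weights = [1.0 / se ** 2 for se in ses]
    w_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se = sqrt(1.0 / w_sum)
    z = beta / se
    p = erfc(abs(z) / sqrt(2.0))  # two-sided normal P value
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    return beta, se, p, q


def cochran_q_p_two_studies(q):
    """Upper-tail P value of a chi-square with 1 degree of freedom."""
    return erfc(sqrt(q / 2.0))


# Hypothetical per-cohort log odds ratios and standard errors.
beta, se, p, q = ivw_meta(betas=[0.30, 0.20], ses=[0.05, 0.10])
p_het = cochran_q_p_two_studies(q)
print(f"beta = {beta:.3f}, se = {se:.4f}, P = {p:.2e}")
print(f"Q = {q:.2f}, P_het = {p_het:.3f}")
```

In the discovery analysis described in the Results, sites with a heterogeneity P value below 0.0001 were removed.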

Sanger Institute sequencing pipeline

Sample processing

Genome sequencing was performed at the Sanger Institute using the Illumina HiSeqX platform with a combination of PCR (n = 4,751, controls only) and PCR-free library preparation protocols. Sequencing was performed at a median depth of 18.6×. Exome sequencing of cases was performed at the Sanger Institute using the Illumina NovaSeq 6000 and the Agilent SureSelect Human All Exon V5 capture set. Controls from the UK Biobank were sequenced separately as a part of the UKBB WES50K release using Illumina NovaSeq and the IDT xGen Exome Research Panel v1.0 capture set (including supplemental probes). In total, 33,704 UKBB participants were selected for use as controls, excluding participants with recorded or self-reported CD, ulcerative colitis, unspecified noninfective gastroenteritis or colitis; any other immune-mediated disorders; or a history of being prescribed any drugs used to treat IBD. Exome and genome datasets were analyzed separately but followed a similar analysis protocol.

Reads were mapped to hg38 reference using BWA-MEM v.0.7.12 (WGS) and v.0.7.17 (WES). Variant calls were performed using a GATK Best Practices-like pipeline (v.4.0.10.1 (WGS) and v.4.1.8 (WES)); per-sample intermediate variant calling was followed by joint genotyping across the individual genome and exome cohorts. For the exome cohort, variant calling was limited to Agilent extended target regions. Per-region VCF shards were imported into the Hail software and combined. Multiallelic sites were split. For the exome cohort, we subsetted the calls to the intersection of Agilent and IDT exome captures, further excluding regions recommended for exclusion by the UKBB due to an error in read mapping that results in no variant calls made.

Population assignment

We selected a set of ~14,000 well-genotyped common variants to identify the genetic ancestry of individual participants through the projection of 1000 Genomes Project cohort-derived PCs. For genomes, due to primarily European genetic ancestry of the controls, we excluded samples outside of 4 median absolute deviations from the median point of the European ancestry cluster of 1000 Genomes. For exomes, we implemented a Random Forest technique that classified samples based on PCs into broad genetic ancestry groups (EUR, AFR, SAS, EAS, admixed), with self-reported ancestry as training labels. For these analyses, we only retained the EUR samples, as the number of cases for other groups was too small for robust association analysis.

QC

A combination of hard-cutoff filters and per-ancestry/per-batch outlier filters was used to identify low-quality samples. We applied hard filters for sample depth (>12× genomes, >15× exomes), call rate (>0.95), chimerism (<0.5) (WGS) and FREEMIX (<0.02) (WGS). We excluded genotype calls with an allelic imbalance (for heterozygous calls, allelic balance ratio < 0.2 or > 0.8), low depth (<2×) and low GQ (<20). We then performed per-ancestry and per-sequencing protocol (AGILENT versus IDT for WES, PCR versus PCR-free for WGS) filtering of samples falling outside 4 median absolute deviations from the median per-batch heterozygosity rate, transition/transversion rate, number of called SNPs and indels, and insertion and deletion counts/ratio.
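The median absolute deviation (MAD) outlier rule above can be sketched as follows. This is a minimal Python illustration for a single metric within one batch; it assumes the raw, unscaled MAD (the text does not say whether a scaling constant was applied), and the function name and example values are hypothetical.

```python
from statistics import median


def mad_outliers(values, n_mads=4.0):
    """Flag values lying more than n_mads MADs from the batch median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate batch: at least half the values equal the median,
        # so flag anything that differs from it.
        return [v != med for v in values]
    return [abs(v - med) > n_mads * mad for v in values]


# Hypothetical per-sample heterozygosity ratios within one batch.
het_ratio = [1.0, 1.1, 0.9, 1.05, 0.95, 3.0]
flags = mad_outliers(het_ratio)
print(flags)
```

The same function would be applied separately per ancestry group and per sequencing protocol, and per metric.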

An ancestry-aware relatedness calculation (PC-Relate method in Hail88) was used to identify related samples. As our association approach (logistic mixed-models) can control for residual relatedness, we only excluded duplicates or MZ twins from within the cohorts and excluded first-, second- and third-degree relatives when the kinship was across the cohorts (for example, parent in WGS, child in WES; kinship metric > 0.1 calculated via PC-Relate method using 10 PCs). In addition, we removed samples that were also present in the Broad Institute’s cohorts.

Association testing

Association analysis was performed using a logistic mixed-model implemented in REGENIE software v.1.0.6.7 (single-variant) and v.2.0.2 (burden). A set of high-confidence variants (>1% MAF, 99% call rate and in Hardy–Weinberg equilibrium) was used for model fitting. To control for case–control imbalance, Firth correction was applied to P values < 0.05. To control for residual ancestry and sequencing heterogeneity, we calculated 10 PCs on a set of well-genotyped common SNPs, excluding regions with known long-range LD. These were used as covariates for association analyses. Only variants with call rate above 90% after filtering poor calls were included in the association analysis. For WES, we verified that the >90% call rate condition holds true in both AGILENT and IDT samples. Association analysis was performed on QC-passing calls.

Kiel/Regeneron sequencing pipeline

Sample preparation and sequencing

The DNA samples were normalized and 100 ng of genomic DNA was prepared for exome capture with custom reagents from New England Biolabs, Roche/Kapa and IDT using a fully automated approach developed at the Regeneron Genetics Center. Unique, asymmetric 10-bp barcodes were added to each side of the DNA fragment during library preparation to facilitate multiplexed exome capture and sequencing. Equal amounts of sample were pooled before exome capture with a slightly modified version of IDT’s xGen v1 probes; supplemental probes were added to capture regions of the genome well covered by a previous capture reagent (NimbleGen VCRome) but poorly covered by the standard xGen probes, giving the same probe library as that used in UK Biobank exome sequencing. These supplemental probes were included in QC but excluded from the final analysis, as we only looked up variants that were in the standard exome captures and reached nominal significance for replication (Extended Data Fig. 1). Captured fragments were bound to streptavidin-conjugated beads and nonspecific DNA fragments were removed by a series of stringent washes according to the manufacturer’s recommended protocol (IDT). The captured DNA was PCR amplified and quantified by quantitative reverse transcription PCR (Kapa Biosystems). The multiplexed samples were pooled and then sequenced using 75-bp paired-end reads with two index reads on the Illumina NovaSeq 6000 platform using S2 flow cells.

Variant calling and QC

Sample read mapping and variant calling, aggregation and QC were performed via the SPB protocol described by Van Hout et al.89. Briefly, for each sample, NovaSeq WES reads are mapped with BWA-MEM 0.7.17-r1188 to the hg38 reference genome. Small variants are identified with WeCall v.1.1.2 and reported as per-sample gVCFs. These gVCFs are aggregated with GLnexus into a joint-genotyped, multi-sample VCF (pVCF). SNV genotypes with DP less than 7 and indel genotypes with DP less than 10 are changed to no-call genotypes. After the application of the DP genotype filter, a variant-level allele balance filter is applied, retaining only variants that meet either of the following criteria: (1) at least one homozygous variant carrier or (2) at least one heterozygous variant carrier with an allele balance greater than the cutoff.
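
The two filtering steps above (the per-genotype DP filter, then the variant-level allele balance filter) can be sketched as follows. This is an illustration under stated assumptions rather than the SPB implementation: genotypes are represented as plain dictionaries with hypothetical keys `gt`, `dp` and `ab`, and the allele balance cutoff is left as a required argument because the text does not give its value.

```python
def apply_dp_filter(genotypes, is_indel):
    """Set genotypes below the DP threshold to no-call (gt=None).
    Thresholds from the text: DP < 7 for SNVs, DP < 10 for indels."""
    min_dp = 10 if is_indel else 7
    return [g if g["dp"] >= min_dp else {**g, "gt": None} for g in genotypes]


def passes_ab_filter(genotypes, ab_cutoff):
    """Retain a variant if it has at least one homozygous variant carrier,
    or at least one heterozygous carrier with allele balance > ab_cutoff.
    ab_cutoff is not stated in the text and must be supplied by the caller."""
    for g in genotypes:
        if g["gt"] == "hom_alt":
            return True
        if g["gt"] == "het" and g["ab"] > ab_cutoff:
            return True
    return False
```

Because the DP filter runs first, a carrier whose genotype has been set to no-call no longer counts towards either retention criterion.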

Analysis

We combined the gVCF files with bcftools 1.11 using the ‘merge’ command, then imported the joint VCF into Hail. We then split the multiallelic variants and removed variants with ‘<NON_REF>’ alternative alleles. We applied the QC steps and assigned populations as in the Broad Institute sequencing pipeline.

Statistics and reproducibility

Previous studies show that a large sample size is needed for IBD genetic studies; we therefore included all samples available to us. We excluded samples of non-European ancestries due to their very limited sample size when properly matched between cases and controls (Extended Data Fig. 7). We also excluded data of poor quality from the analysis (Extended Data Fig. 2). These exclusions were necessary to ensure the quality of this study. All criteria were pre-established. We used a logistic mixed-model for the association analysis, followed by meta-analyses to combine multiple cohorts. The study includes multiple cohorts that serve the purpose of replication. Two large cohorts sequenced at the Broad Institute on different exome capture platforms were used to discover candidate variants. Two independent cohorts sequenced at Sanger and one Kiel/Regeneron cohort were used to replicate the findings (Extended Data Fig. 1). All reported findings have been replicated. No randomization was conducted. No blinding was carried out. Code and pipelines to reproduce our analysis are available on Zenodo90.

Cross-cohort meta-analysis

We used the Cochran–Mantel–Haenszel test to combine association summary statistics between the Broad Institute, Sanger Institute and Kiel/Regeneron cohorts.

Relation to known IBD causal variants

We assigned each of the 45 study-wide significant variants to one of four categories based on its relation to known IBD associations and/or fine-mapping results (Extended Data Fig. 5 and Supplementary Table 4): (1) Known causal candidate: variants in a fine-mapping credible set5 with PIP > 5%, or reported in the earlier sequencing studies after manual review6,8. (2) New locus: variants implicating a genetic locus in general onset CD that have not been previously reported. (3) Unlikely causal: variants with PIP < 5%, or variants tagging the best PIP variants using conditional analysis (see Conditional analysis below; LRRK2 shown as an example in Extended Data Fig. 6d–g). (4) New variant in known locus: variants in known GWAS loci with MAF < 0.5% (and, thus, no LD to evaluate tagging) remain study-wide significant after conditional analysis using the LD from gnomAD (TAGAP shown as an example in Extended Data Fig. 6a–c), or after manual review (see Exceptions and notes).

Variance explained

Using the Sanger WGS data (6,000 cases, 11,852 controls), we fitted a series of univariate logistic regression (is_case ~ variant_genotype) models and estimated the pseudo-r2. Pseudo-r2 estimates were summed to estimate the observed-scale variance explained by a group of variants. To convert the estimate into an estimate of heritability on the liability scale, we assumed that the prevalence of CD is 276 in 100,000 (UK estimate from ref. 91).
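
A minimal sketch of this calculation is given below. The text does not state which pseudo-r2 definition or which observed-to-liability transformation was used, so the sketch takes per-variant pseudo-r2 values as given and assumes the widely used transformation for ascertained case–control samples, \(h_l^2 = h_o^2 \, K^2(1-K)^2 / \left[ z^2 P(1-P) \right]\), in which K is the population prevalence, P is the fraction of cases in the sample and z is the standard normal density at the liability threshold. The function names are hypothetical.

```python
from statistics import NormalDist


def sum_pseudo_r2(per_variant_r2):
    """Observed-scale variance explained by a group of variants,
    taken as the sum of per-variant pseudo-r2 estimates."""
    return sum(per_variant_r2)


def observed_to_liability(r2_observed, prevalence, case_fraction):
    """Convert observed-scale variance explained to the liability scale.
    Assumption: standard transformation for ascertained case-control data;
    the source does not say which conversion was actually used."""
    normal = NormalDist()
    threshold = normal.inv_cdf(1.0 - prevalence)  # liability threshold
    density = normal.pdf(threshold)               # normal density at threshold
    scale = (prevalence * (1.0 - prevalence)) ** 2 / (
        density ** 2 * case_fraction * (1.0 - case_fraction)
    )
    return r2_observed * scale
```

With the numbers given in the text, `prevalence` would be 276/100,000 and `case_fraction` would be 6,000/(6,000 + 11,852).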

Conditional analysis

For study-wide significant variants not in a previously reported credible set5, we performed a conditional analysis to test whether they are independent from or tagging the known causal variants5. We first classified variants as ‘tagging’ if they had r2 > 0.8 with any variants in the reported credible sets5. For other variants, we performed a conditional analysis using (1) the P value estimates from previous fine-mapping studies for credible set variants and (2) the LD calculated from gnomAD. We were unable to directly fit a multivariate model or use the LD from study subjects, because exome sequencing does not cover the noncoding putative causal variants, and the ImmunoChip does not have good quality for rare coding variants. The conditional z statistic, \(z_{\mathrm{Seq}}^\prime\), for a variant with marginal statistic \(z_{\mathrm{Seq}}\) from our study, was calculated as follows:

$$z_{\mathrm{Seq}}^\prime = \frac{\left| z_{\mathrm{Seq}} \right| - \sum_{i=1}^{n} \left| z_{\mathrm{FM}_i} \, r_i \, \sqrt{N_{\mathrm{Seq}}/N_{\mathrm{FM}}} \right|}{\prod_{i=1}^{n} \sqrt{1 - r_i^2}}$$

in which \(z_{\mathrm{FM}_i}\) is the z statistic of the variant with the best PIP in credible set i, out of n total credible sets, from the fine-mapping study, \(r_i\) is the LD between the two variants, and \(N_{\mathrm{Seq}}\) and \(N_{\mathrm{FM}}\) are the effective sample sizes for our study and the fine-mapping study, respectively. We used absolute values in this equation because of the difficulty of aligning alleles across the sequencing data, the fine-mapping study and the gnomAD reference panel. Taking the absolute value is a conservative approximation (less likely to declare a variant as a novel association) because it assumes that the putative causal variants from fine-mapping have the same direction of effect as the variant being tested when they are in LD. This assumption is very likely to be correct. The effective sample size was calculated as \(4/\left( {1/N_{\mathrm{case}} + 1/N_{\mathrm{control}}} \right)\), in which \(N_{\mathrm{case}}\) and \(N_{\mathrm{control}}\) are the sample sizes for cases and controls, respectively. For each variant, we summed the effective sample sizes across all cohorts in which the variant is observed (thus, \(N_{\mathrm{Seq}}\) can differ from variant to variant). We calculated the conditional P value from \(z_{\mathrm{Seq}}^\prime\) under the standard Gaussian distribution. A variant was classified as ‘tagging’ if the conditional P value failed to reach study-wide significance at 3 × 10−7.
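
The conditional analysis described above can be sketched in a few lines. This is an illustrative sketch, not the study's code: the function names are hypothetical, the P value is assumed to be two-sided (so that only the magnitude of the conditional statistic matters), and each credible set is represented by a `(z_fm, r)` pair for its best-PIP variant.

```python
import math


def effective_sample_size(n_case, n_control):
    """Effective sample size: 4 / (1/N_case + 1/N_control)."""
    return 4.0 / (1.0 / n_case + 1.0 / n_control)


def conditional_z(z_seq, fine_mapped, n_seq, n_fm):
    """Conditional z statistic for a sequenced variant.
    fine_mapped: list of (z_fm, r) pairs, one per credible set, where r is
    the LD between the tested variant and the best-PIP variant.
    Requires |r| < 1; variants with r2 > 0.8 are classified as 'tagging'
    before this step, so the denominator stays away from zero."""
    scale = math.sqrt(n_seq / n_fm)
    numerator = abs(z_seq) - sum(abs(z_fm * r * scale) for z_fm, r in fine_mapped)
    denominator = math.prod(math.sqrt(1.0 - r * r) for _, r in fine_mapped)
    return numerator / denominator


def two_sided_p(z):
    """Two-sided P value under the standard Gaussian (assumption: two-sided)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def is_tagging(z_seq, fine_mapped, n_seq, n_fm, threshold=3e-7):
    """A variant is 'tagging' if its conditional P value fails to reach
    study-wide significance."""
    return two_sided_p(conditional_z(z_seq, fine_mapped, n_seq, n_fm)) >= threshold
```

For a given variant, `n_seq` would be the sum of `effective_sample_size` over the cohorts in which the variant is observed, as described in the text.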

Exceptions and notes

HGFAC: despite this locus having been reported in an earlier GWAS2, the coding variant we identify was not tested for association due to incomplete coverage of this region, and is thus reported in this study as directly implicating this gene (r2 = 0.35 with the previously reported GWAS SNP, rs2073505). We thus assign this variant as ‘New variant in known locus’.

RELA: similar to HGFAC, this locus has been reported in an earlier GWAS2, but the coding variant we identified was not tested for association due to incomplete coverage of this region, and thus is reported in this study as directly implicating this gene (r2 = 0.002 with the previously reported GWAS SNP, rs568617). We thus assign this variant as ‘New variant in known locus’.

SLC39A8: the SLC39A8 A391T variant was not reported in the fine-mapping paper, as its genetic region was not included in the ImmunoChip design. Because this variant has been published in several papers as an IBD causal variant with genetic and functional evidence92,93,94, we assign this variant as ‘Known causal candidate’.

TYK2: the TYK2 A928V variant was not reported in the fine-mapping paper5, likely due to a lack of power. Because this variant is known to be a causal variant for several autoimmune disorders95 and was reported in another IBD study96, we assign this variant as ‘Known causal candidate’.

SDF2L1: this variant has marginal P = 2 × 10−7 and conditional P = 3.4 × 10−4. The r2 between this variant and the noncoding variant with the best PIP from fine-mapping is 0.045. We manually assigned this variant to ‘New variant in known locus’, as this is a missense variant.

NOD2: (1) Previous studies5,6,7 have shown evidence that the NOD2 S431L variant tags the NOD2 V793M variant, with the latter more likely to be the CD causal variant. In this study, however, S431L reached study-wide significance, but V793M failed to meet the significance cutoff. We therefore retained S431L in Fig. 1 for the purpose of keeping this association signal. (2) Due to the complexity of the NOD2 locus, we conducted a haplotype analysis using the Twist subjects and additionally classified signed variants that share the same haplotype with known IBD variants as ‘tagging’. We found that for the NOD2 S47L variant, 18 out of 19 copies of the T allele are on the same haplotype as the fs1007insC variant. We therefore classify S47L as ‘tagging’. (3) The NOD2 A755V variant is in LD with rs184788345, the best PIP variant from fine-mapping (r2 = 0.85). The marginal P value for A755V is one order of magnitude less significant than that for rs184788345. Considering that A755V is a missense variant while none of the variants in the credible set defined by rs184788345 is coding, we assign A755V as a likely ‘Known causal candidate’.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.