6 Discussion

The epigenomics of ageing has enormous potential for growth in the coming years. DNA methylation based ageing biomarkers have a bright future as reliable and convenient broad indicators of biological ageing, with the potential to serve as proxies for longevity in clinical trials of interventions in ageing. For instance, there is growing interest and investment in drugs with potential anti-ageing activities such as senolytics [462], and in re-purposing existing drugs such as metformin for interventions aimed at longevity and increased healthspan [205]. Advances in epigenetic editing [463] promise the possibility of experimentally establishing causal roles for age-related epigenetic changes, and of dissecting the mechanisms by which epigenetic changes are involved in the processes of ageing. Improvements are taking place in DNA methylation assay technologies, including NEBNext Enzymatic Methyl-seq ‘EM-seq’ [135], better DNA methylation calling in nanopore sequencing [464], advances in single-cell DNA methylation assays [465], and Illumina methylation arrays for sites conserved across mammals [203]. These tools offer many additional opportunities to characterise DNA methylomes and to capture changes in DNA methylation which were not previously accessible, as well as to begin more mechanistic studies.

6.1 Epigenomic analysis of the developmental origins of long-term bone health

The developmental origins of health and disease (DOHaD) hypothesis holds that early life environmental influences have long term consequences for the risk of developing various pathologies in adulthood and later life. It is through this lens that the EWAS for relationships between umbilical cord DNA methylation and bone health outcomes in Chapter 3 were framed. DNA methylation, being an epigenetic mark, is influenced by environmental factors and is heritable by subsequent generations of cells; it could thus be a medium through which environmental factors act on long term health in accordance with DOHaD. In that chapter two CpGs were identified as having genome wide significant associations with the outcome of interest in their respective EWAS. The first was CpG cg26559250, which is located at Chr6:157,653,445-157,653,447 at the ZDHHC14 (zinc finger DHHC-type palmitoyltransferase 14) gene. This CpG was identified in the EWAS for total bone mineral content minus head at 6 years adjusted for age and sex, with a p-value of \(2.52\times 10^{-8}\) for an increase of 1.46% per kg in both an uncorrected and a corrected model. The second was CpG cg22570676, located at Chr19:2,527,492-2,527,494 at the GNG7 (G protein subunit gamma 7) gene. This CpG was identified in the EWAS for periosteal circumference at 38% from the distal end of the tibia at 6 years (mm) adjusted for age and sex, with a p-value of \(4.24\times 10^{-8}\) for an increase of 0.370% per mm in both an uncorrected and a corrected model. The corrected models included covariates for: blood cell-type composition, maternal age at time of birth (years), sex, maternal BMI at 11 weeks gestation, parity, whether or not the mother smoked during pregnancy, and gestational age.

EWAS were performed for nine different outcomes across three groups of samples, and in each case four different models were fitted. Within a given EWAS the quite stringent Bonferroni standard for multiple testing correction was applied; however, conducting multiple EWAS across different groups creates a secondary multiple testing problem. Adjusting for the number of tests performed within a given EWAS should in theory keep false positives to a minimum, but when several EWAS are performed the effective number of tests increases without being adjusted for, which raises the probability of a false positive above the nominal level that family-wise correction is designed to maintain. These findings could therefore still be false positives despite the family-wise correction applied to minimise type 1 errors. Confirmation of these associations in another cohort would be necessary before their biological reality could be asserted with confidence. These results are correlational, and experimental follow up would be needed to establish any mechanistic or causal relationships between the DNA methylation state at these sites and bone properties in early life. One could, for example, attempt epigenetic editing of orthologous sites in a model organism and look for an effect on bone measurements [463].
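The scale of this secondary multiple testing problem can be illustrated with a short calculation. The sketch below is illustrative only and rests on several assumptions that are not taken from Chapter 3: a family-wise level of 0.05 per EWAS, independence between EWAS (which likely overstates the inflation, since the outcomes and models are correlated), every outcome being analysed in every group with every model, and a hypothetical probe count `N_PROBES`.

```python
def familywise_error_rate(alpha: float, n_analyses: int) -> float:
    """Probability of at least one false positive across n_analyses
    independent analyses, each controlled at family-wise level alpha."""
    return 1.0 - (1.0 - alpha) ** n_analyses


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold under Bonferroni correction."""
    return alpha / n_tests


ALPHA = 0.05                              # assumed family-wise level per EWAS
N_OUTCOMES, N_GROUPS, N_MODELS = 9, 3, 4  # counts stated in the text
# Upper bound on the number of EWAS, assuming every combination was run.
N_EWAS = N_OUTCOMES * N_GROUPS * N_MODELS
N_PROBES = 800_000                        # hypothetical number of CpG probes

within_ewas = bonferroni_threshold(ALPHA, N_PROBES)
across_ewas = bonferroni_threshold(ALPHA, N_PROBES * N_EWAS)

print(f"Number of EWAS (upper bound): {N_EWAS}")
print(f"Chance of >=1 false positive across all EWAS: "
      f"{familywise_error_rate(ALPHA, N_EWAS):.3f}")
print(f"Threshold within one EWAS:  {within_ewas:.2e}")
print(f"Threshold across all EWAS:  {across_ewas:.2e}")
```

Correcting across every EWAS in this way would be very conservative given the correlation between outcomes and models, which is one reason replication in an independent cohort is the more informative check.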

This study did not find significant associations between the examined bone measurements and DNA methylation at CDKN2A, where an inverse relationship between DNA methylation and bone size, mineral content and mineral density at 4 years had previously been documented [264]. There was ample opportunity to detect changes at this locus in the data, with 95 probe sites in the vicinity of the gene. Nor did this study see significantly reduced DNA methylation in umbilical cord at the 75 probe sites near RXRA with maternal vitamin D supplementation or increased circulating vitamin D [263]. Whilst this study did not replicate these specific findings, it has highlighted two new loci with possible relationships to bone health outcomes.

Epigenome wide association studies are used here as a discovery platform for processes which may be implicated in the interaction between the in utero environment and bone health outcomes as mediated by the epigenome, all of which are complex and multifactorial. There is no strong prior knowledge of the relationships between the systems under investigation with which to make precise predictions; the aim is rather to provide a starting point for building more specific models from which more precise hypotheses can be generated. This presents a challenge, as there are many sources of noise which could obscure any relationships that do exist between these properties, or produce the spurious appearance of a relationship where none exists. Striking the balance between sensitivity and specificity is particularly challenging in this context. Relaxing the significance threshold and admitting some type 1 errors might generate sufficient additional hits to attempt methods such as gene set enrichment analysis, and related approaches, to identify the biologically relevant systems and processes which may mediate the observed association. However, an excessive number of false positive inputs to such analyses can lead to the spurious identification of associated terms. Simply increasing the sample size of studies to reach the power necessary to detect changes of small effect size is expensive and often impractical, and does not help when analysing existing datasets that are underpowered for the analysis as initially conceived. The hypothesis free approach has some advantages when attempting to bring as yet unknown aspects of biology relevant to the association being tested to the attention of investigators for further exploration. However, searching for associations between outcomes and particular genomic locations may be of limited effectiveness even when sufficient statistical power is available to uncover very small effects, because when individual loci make only very minor contributions to a given effect there are many of them, often spread across many systems [466]. Greater temporal and tissue specificity may reveal larger effect sizes in particular tissues at particular times, as time and tissue specific signals may currently be flattened out in the aggregate signal. The combinatorial complexity of possible times and tissues renders a brute force search impractical, so some prior reason grounded in biological understanding is likely to be needed to justify looking for an association in a particular time and/or tissue.

If the primary interest is in identifying pathways or other functional biological units, then dimensionality reduction methods such as weighted correlation network analysis (WGCNA) [467] could help to address some of the power issues faced by these studies, though this approach also has the limitation that biologically relevant effects may be realised through small perturbations across many systems [466], meaning that no single network may stand out. Grouping the outcomes into effects on correlated gene networks rather than individual genes dramatically reduces the number of statistical tests performed. This approach could also be used to narrow the set of tests to perform when looking for gene level associations in other datasets. If changes in a gene network are associated with an outcome of interest in one cohort, it is reasonable to take this prior information to a second cohort and test only the genes in that network for an association with the outcome, dramatically reducing the number of CpG level tests. One could also perform the reciprocal analysis (dimensionality reduction in the second cohort and CpG level tests on a reduced set in the first cohort) as a means of validation. An ongoing collaboration with colleagues at MRC-IEU, University of Bristol is including these data in a meta-analysis and is replicating several of these EWAS in other cohorts. This provides an opportunity to attempt to replicate the sites identified here and to ascertain whether they are sufficiently robust to warrant functional follow-up.
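The reduction in the number of tests obtained by working at the level of correlated modules can be sketched with simulated data. This is not WGCNA and not the analysis performed in this work: module membership is simply given rather than derived from a correlation network, the per-sample mean methylation of a module is used as a simple stand-in for a module eigengene, and all data, module names and CpG identifiers are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def module_means(methylation, modules):
    """Average the per-sample methylation of the CpGs in each module."""
    summaries = {}
    for name, cpgs in modules.items():
        columns = [methylation[cpg] for cpg in cpgs]
        summaries[name] = [sum(vals) / len(vals) for vals in zip(*columns)]
    return summaries


# Simulated first cohort: 50 samples, 3 hypothetical modules of 4 CpGs each.
rng = random.Random(1)
N_SAMPLES = 50
outcome = [rng.random() for _ in range(N_SAMPLES)]
modules = {f"M{m}": [f"cpg_{m}_{i}" for i in range(4)] for m in range(1, 4)}

methylation = {}
for name, cpgs in modules.items():
    for cpg in cpgs:
        if name == "M1":
            # Only module M1 tracks the outcome in this simulation.
            methylation[cpg] = [0.5 + 0.3 * o + rng.gauss(0, 0.01) for o in outcome]
        else:
            methylation[cpg] = [rng.uniform(0.2, 0.8) for _ in outcome]

# Module level screen: one test per module instead of one per CpG.
summaries = module_means(methylation, modules)
module_r = {name: pearson(values, outcome) for name, values in summaries.items()}
top_module = max(module_r, key=lambda name: abs(module_r[name]))

ALPHA = 0.05  # assumed family-wise level
n_cpg_tests = len(methylation)               # CpG level tests in a full scan
n_module_tests = len(modules)                # tests in the module level screen
n_followup_tests = len(modules[top_module])  # CpG tests in a second cohort

print(f"Full CpG scan:      {n_cpg_tests} tests, threshold {ALPHA / n_cpg_tests:.4f}")
print(f"Module screen:      {n_module_tests} tests, threshold {ALPHA / n_module_tests:.4f}")
print(f"Follow-up in {top_module}:   {n_followup_tests} tests, threshold {ALPHA / n_followup_tests:.4f}")
```

In a real dataset the gain would be far larger than in this toy example, because hundreds of thousands of CpG level tests would be replaced by tens of module level tests.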

6.3 Assessing biological ageing by DNA methylation changes within Alu repeat elements

Repetitive elements make up some 45% of the human genome [410] and the global hypomethylation observed with age is driven by these repeats [411], but to date limited coverage of these regions [213] has meant that their potential to contain information relevant to biological ageing has gone underexplored.

The best Alu DNA methylation age predictor constructed in Chapter 5, trained on a set of n=774, predicted chronological age in an unrelated replication set of n=664 with an R of 0.65 and a median absolute error of 8.1 years. Whilst this is less accurate than many other DNA methylation based age predictors [434], the primary goal of this predictor was not to generate the most accurate age predictions but rather to be sufficiently accurate to capture a signature of age acceleration specific to the Alu repeat elements. The difference between the predicted and chronological age, the age acceleration, was strongly correlated with chronological age, such that the age of older individuals is prone to be overestimated and vice versa. One explanation may lie in the limitations of DNA methylation quantitation by MeDIP-seq. The elastic net regression may have identified loci which have a consistent direction of change with age but a variable magnitude. If the regression selected sites whose magnitude of change was relatively consistent in the training set but which vary in a manner that skews higher in the larger population, this could lead to overestimates of the age of older samples and underestimates of the age of younger ones. This is speculative, and it would be interesting to examine it further by looking at how the data at the predictor sites are distributed in the training and prediction groups. In addition, it may be possible to study this bias with a simulation approach to see what properties of the data can produce this pattern of error. The effect observed in the quantile normalised data was not mitigated by binarising the data by locus or by using absolute methylation estimates, and it persisted in the raw reads per million base pairs data. The difficulty with absolute DNA methylation quantitation by MeDIP-seq is not simple to resolve with common data transformations.

The strong association between Alu age acceleration and chronological age limited the interpretability of the GWAS for age acceleration. Associations found in this GWAS could easily be driven by differences in allele frequencies between age strata within the GWAS population rather than by a signal from Alu age acceleration that is independent of chronological age [475].
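The quantities discussed above, the median absolute error and the dependence of age acceleration on chronological age, can be made concrete with a small example. The sketch below uses invented ages, not data from Chapter 5. It computes age acceleration as the simple difference used in this work and, for comparison, as the residual from a linear regression of predicted on chronological age. The residual definition is a commonly used alternative in the epigenetic clock literature and is shown here as an assumption about what could be tried, not as the method applied in this thesis; it removes the linear dependence on chronological age by construction but does nothing to address the underlying quantitation problem.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def median_absolute_error(predicted, chronological):
    """Median of |predicted - chronological| across samples."""
    return median([abs(p - c) for p, c in zip(predicted, chronological)])


def acceleration_difference(predicted, chronological):
    """Age acceleration as predicted minus chronological age."""
    return [p - c for p, c in zip(predicted, chronological)]


def acceleration_residual(predicted, chronological):
    """Age acceleration as the residual from an ordinary least squares
    regression of predicted age on chronological age."""
    mean_c, mean_p = mean(chronological), mean(predicted)
    slope = (sum((c - mean_c) * (p - mean_p) for c, p in zip(chronological, predicted))
             / sum((c - mean_c) ** 2 for c in chronological))
    intercept = mean_p - slope * mean_c
    return [p - (intercept + slope * c) for p, c in zip(predicted, chronological)]


# Hypothetical ages in which older samples are overestimated and younger
# samples underestimated, mirroring the pattern described in the text.
chronological = [30, 40, 50, 55, 60, 65, 70, 80]
predicted = [22, 35, 52, 53, 61, 68, 71, 88]

difference = acceleration_difference(predicted, chronological)
residual = acceleration_residual(predicted, chronological)

print(f"Median absolute error: {median_absolute_error(predicted, chronological)}")
print(f"r(difference, age): {pearson(difference, chronological):.2f}")
print(f"r(residual, age):   {pearson(residual, chronological):.2f}")
```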

The samples used to train and test the models have a median age of approximately 60 years. A concentration of samples in one part of the age distribution is also a potential source of bias, as poor performance among the smaller numbers of older and younger individuals is not penalised as heavily as poor performance at the ages with more samples. It also hampers assessment of the quality of the predictor, as a wide age range is required to reliably estimate the R of a predictor [293]. Mitigating this by equalising the numbers across age groups is a possibility, but a substantial number of samples would have to be left out to achieve it, excessively shrinking the training set. In addition, this is unlikely to be the sole factor behind the poor prediction performance of these clocks, as similarly imperfect age distributions have been used to train better performing clocks.
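The cost of equalising the numbers across age groups can be illustrated by downsampling every age bin to the size of the smallest one. The age distribution below is invented to be concentrated around 60 years and is not the distribution of the cohort used in this work; the bin width of 10 years and the function name `balance_by_age_bin` are likewise assumptions made for illustration.

```python
import random
from collections import defaultdict


def balance_by_age_bin(ages, bin_width=10, seed=0):
    """Return the indices of a random subsample containing the same number
    of individuals in every occupied age bin."""
    bins = defaultdict(list)
    for index, age in enumerate(ages):
        bins[int(age // bin_width)].append(index)
    # Every bin is cut down to the size of the least populated bin.
    target = min(len(members) for members in bins.values())
    rng = random.Random(seed)
    kept = []
    for key in sorted(bins):
        kept.extend(rng.sample(bins[key], target))
    return sorted(kept)


# Hypothetical cohort of 400 individuals, most of them aged 50 to 70.
ages = ([25] * 5 + [35] * 10 + [45] * 40 + [55] * 120
        + [65] * 150 + [75] * 60 + [85] * 15)

kept = balance_by_age_bin(ages)
print(f"Samples before balancing: {len(ages)}")
print(f"Samples after balancing:  {len(kept)}")
print(f"Samples discarded:        {len(ages) - len(kept)}")
```

In this toy cohort the smallest bin holds only 5 individuals, so balancing keeps 35 of 400 samples, which shows how quickly strict equalisation can erode a training set.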

The Twins UK dataset provides a powerful tool for assessing the impact of genetics on age predictors. The age predictors generally performed only marginally better when predicting the ages of the twins of the individuals in the training set than on unrelated individuals, suggesting a minimal impact of genetic factors on the Alu DNA methylation age predictor.

Despite the issues with the correlation of age acceleration with chronological age, this work and Wang et al.’s rDNA clock [189] demonstrate that DNA methylation based age predictors can be trained in repetitive sequence elements. This suggests that constructing DNA methylation based age predictors targeted to particular subsets of the genome is possible, though the aim of capturing signatures of biological ageing specific to those subsets is yet to be adequately explored. It may be possible to revisit this approach once large whole genome bisulfite sequencing datasets, or their enzymatically converted equivalents, are generated, as it seems unlikely that it would be economical to examine the ~1.1 million Alu elements with any of the available targeted methods. Alternatively, if the cost of long-read sequencing drops and the quality of methylated base calls from these methods improves [464], they may also become a viable source of such data. Other possible features on which to attempt to construct age predictors include: MIR repeat elements, some of which have been co-opted as enhancers and which are associated with tissue specific gene expression [476]; Long Terminal Repeat (LTR) elements, which are enriched for chromatin marks that characterise active cis regulatory elements [477]; and histone genes, because of their core function and the widespread genomic implications of alterations in their availability. Histone genes are also located in early-replicating domains [398] and so should generally have high fidelity DNA methylation copying during mitosis [123], making any changes observed there less likely to be the product of epigenetic drift. This would also include alternative histones, as they have functions in genome stability and DNA repair [478]. However, histone genes present a relatively limited set of possible sites with which to predict.

6.4 Conclusion

The unifying theme of this work is the relationship between DNA methylation and healthy ageing, from its possible function as a mediator of the effects of early life environmental influences on long term bone health, through age-related hypermethylation of genes encoding core components of the transcriptional machinery, to signatures of biological ageing in the repetitive regions of the genome. The epigenome sits upon the genome, encoding the annotations necessary for cells with diverse and dynamic functions to arise from a single set of genetic information. The ability to construct epigenetic clocks reveals that this layer of information storage and processing contains much that is important for understanding the molecular and cellular processes of ageing. The environmental malleability of the epigenome is its core strength: it is this plasticity to adopt multiple roles that permits multicellularity [479]. This malleability both leaves the epigenome open to disruption and presents the possibility of correcting any errant changes. An integrative understanding of epigenomics has the potential to contribute many novel scientific insights into the fundamental mechanisms of ageing in the years to come, with profound impacts on our ability to ameliorate chronic and ageing related conditions by intervening in their underlying causes to increase longevity and healthspan.