De novo mutations (DNMs), or mutations that appear in an individual despite not being seen in their parents, are an important source of genetic variation whose impact is relevant to studies of human evolution, genetics, and disease. Utilizing high-coverage whole-genome sequencing data as part of the Trans-Omics for Precision Medicine (TOPMed) Program, we called 93,325 single-nucleotide DNMs across 1,465 trios from an array of diverse human populations, and used them to directly estimate and analyze DNM counts, rates, and spectra. We find a significant positive correlation between local recombination rate and local DNM rate, and that DNM rate explains a substantial portion (8.98 to 34.92%, depending on the model) of the genome-wide variation in population-level genetic variation from 41K unrelated TOPMed samples. Genome-wide heterozygosity does correlate with DNM rate, but only explains <1% of variation. While we are underpowered to see small differences, we do not find significant differences in DNM rate between individuals of European, African, and Latino ancestry, nor across ancestrally distinct segments within admixed individuals. However, we did find significantly fewer DNMs in Amish individuals, even when compared with other Europeans, and even after accounting for parental age and sequencing center. Specifically, we found significant reductions in the number of C→A and T→C mutations in the Amish, which seem to underpin their overall reduction in DNMs. Finally, we calculated near-zero estimates of narrow-sense heritability (h²), which suggest that variation in DNM rate is significantly shaped by nonadditive genetic effects and the environment.
Keywords: Amish; de novo mutations; diversity; mutation rate; recombination.
Purpose: Real-world studies describing the use of first-, second- and third-line therapies for the management and symptomatic treatment of dementia are lacking. This retrospective cohort study describes the first-, second- and third-line therapies used for the management and symptomatic treatment of dementia, and in particular Alzheimer's disease.
Methods: Medical records of patients with newly diagnosed dementia between 1997 and 2017 were collected using four databases from the UK, Denmark, Italy and the Netherlands.
Results: We identified 191,933 newly diagnosed dementia patients in the four databases between 1997 and 2017, with 39,836 (IPCI (NL): 3281, HSD (IT): 1601, AUH (DK): 4474, THIN (UK): 30,480) fulfilling the inclusion criteria; of these, 21,131 had received a specific diagnosis of Alzheimer's disease. The most common first-line therapies initiated within a year (± 365 days) of diagnosis were acetylcholinesterase inhibitors (rivastigmine in IPCI, donepezil in HSD and THIN) and the N-methyl-D-aspartate blocker memantine in AUH.
Conclusion: We provide a real-world insight into the heterogeneous management and treatment pathways of newly diagnosed dementia patients and a subset of Alzheimer's Disease patients from across Europe.
The aim of this study was to develop a simple method to map the French International Statistical Classification of Diseases and Related Health Problems, 10th revision (ICD-10) with the International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10 CM). We sought to map these terminologies forward (ICD-10 to ICD-10 CM) and backward (ICD-10 CM to ICD-10) and to assess the accuracy of these two mappings. We used several terminology resources such as the Unified Medical Language System (UMLS) Metathesaurus, Bioportal, the latest version available of the French ICD-10 and several official mapping files between different versions of the ICD-10. We first retrieved existing partial mapping between the ICD-10 and the ICD-10 CM. Then, we automatically matched the ICD-10 with the ICD-10-CM, using our different reference mapping files. Finally, we used manual review and natural language processing (NLP) to match labels between the two terminologies. We assessed the accuracy of both methods with a manual review of a random dataset from the results files. The overall matching was between 94.2 and 100%. The backward mapping was better than the forward one, especially regarding exact matches. In both cases, the NLP step was highly accurate. When there are no available experts from the ontology or NLP fields for multi-lingual ontology matching, this simple approach enables secondary reuse of Electronic Health Records (EHR) and billing data for research purposes in an international context.
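The two automatic stages described above (direct code matching, then label matching) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the normalization rules (lowercasing, accent stripping, punctuation removal), the function names and the toy codes are all assumptions made for illustration.

```python
import re
import unicodedata


def normalize(label):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", label)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def map_terminologies(source, target):
    """Map each source code to a target code.

    Step 1: identical codes are matched directly.
    Step 2: remaining codes are matched on normalized labels.
    Returns {source_code: (target_code, method)}; unmatched codes are omitted.
    """
    target_by_label = {}
    for code, label in target.items():
        # Keep the first target code seen for a given normalized label.
        target_by_label.setdefault(normalize(label), code)

    mapping = {}
    for code, label in source.items():
        if code in target:
            mapping[code] = (code, "code")
        elif normalize(label) in target_by_label:
            mapping[code] = (target_by_label[normalize(label)], "label")
    return mapping


# Hypothetical toy entries, not taken from either terminology release.
icd10 = {"X01": "Example disorder, unspecified", "X02": "Éxample syndrome"}
icd10_cm = {"X01": "Example disorder, unspecified", "X02.0": "Example syndrome"}
result = map_terminologies(icd10, icd10_cm)
```

Running the same function with the arguments swapped gives the backward mapping. Codes left out of the result are the ones that would go to manual review.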
Background: Paediatric Early Warning Scores (PEWSs) are being used increasingly in hospital wards to identify children at risk of clinical deterioration, but few scores exist that were designed for use in emergency care settings. To improve the prioritisation of children in the emergency department (ED), we developed and validated an ED-PEWS.
Methods: The TrIAGE project is a prospective European observational study based on electronic health record data collected between Jan 1, 2012, and Nov 1, 2015, from five diverse EDs in four European countries (Netherlands, the UK, Austria, and Portugal). This study included data from all consecutive ED visits of children under age 16 years. The main outcome measure was a three-category reference standard (high, intermediate, low urgency) that was developed as part of the TrIAGE project as a proxy for true patient urgency. The ED-PEWS was developed based on an ordinal logistic regression model, with cross-validation by setting. After completing the study, we fully externally validated the ED-PEWS in an independent cohort of febrile children from a different ED (Greece).
Findings: Of 119 209 children, 2007 (1·7%) were of high urgency and 29 127 (24·4%) of intermediate urgency, according to our reference standard. We developed an ED-PEWS consisting of age and the predictors heart rate, respiratory rate, oxygen saturation, consciousness, capillary refill time, and work of breathing. The ED-PEWS showed a cross-validated c-statistic of 0·86 (95% prediction interval 0·82-0·90) for high-urgency patients and 0·67 (0·61-0·73) for high-urgency or intermediate-urgency patients. A cutoff score of at least 15 was useful for identifying high-urgency patients with a specificity of 0·90 (95% CI 0·87-0·92), while a cutoff score of less than 6 was useful for identifying low-urgency patients with a sensitivity of 0·83 (0·81-0·85).
Interpretation: The proposed ED-PEWS can assist in identifying high-urgency and low-urgency patients in the ED, and improves prioritisation compared with existing PEWSs.
Funding: Stichting de Drie Lichten, Stichting Sophia Kinderziekenhuis Fonds, and the European Union's Horizon 2020 research and innovation programme.
Objective: Advancements in human genomics have generated a surge of available data, fueling the growth and accessibility of databases for more comprehensive, in-depth genetic studies.
Methods: We provide a straightforward and innovative methodology to optimize cloud configuration in order to conduct genome-wide association studies. We utilized Spark clusters on both Google Cloud Platform and Amazon Web Services, as well as Hail (http://doi.org/10.5281/zenodo.2646680) for analysis and exploration of genomic variant datasets.
Results: Comparative evaluation of numerous cloud-based cluster configurations demonstrates a successful and unprecedented compromise between speed and cost for performing genome-wide association studies on 4 distinct whole-genome sequencing datasets. Results are consistent across the 2 cloud providers and could be highly useful for accelerating research in genetics.
Conclusions: We present a timely piece for one of the most frequently asked questions when moving to the cloud: what is the trade-off between speed and cost?
Keywords: cloud computing; distributed systems; genome-wide association study; whole genome.
Background: Temporal variability in health-care processes or protocols is intrinsic to medicine. Such variability can potentially introduce dataset shifts, a data quality issue when reusing electronic health records (EHRs) for secondary purposes. Temporal data-set shifts can present as trends, as well as abrupt or seasonal changes in the statistical distributions of data over time. The latter are particularly complicated to address in multimodal and highly coded data. These changes, if not delineated, can harm population and data-driven research, such as machine learning. Given that biomedical research repositories are increasingly being populated with large sets of historical data from EHRs, there is a need for specific software methods to help delineate temporal data-set shifts to ensure reliable data reuse.
Results: EHRtemporalVariability is an open-source R package and Shiny app designed to explore and identify temporal data-set shifts. EHRtemporalVariability estimates the statistical distributions of coded and numerical data over time; projects their temporal evolution through non-parametric information geometric temporal plots; and enables the exploration of changes in variables through data temporal heat maps. We demonstrate the capability of EHRtemporalVariability to delineate data-set shifts in three impact case studies, one of which is available for reproducibility.
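The idea of estimating a distribution per time batch and comparing batches can be sketched in a few lines of Python. This is not the package's code or API (the package itself is written in R). The choice of the Jensen-Shannon distance as the dissimilarity measure is an assumption, since the abstract does not name the metric, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def distribution(values, categories):
    """Relative frequency of each category in one time batch."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]


def js_distance(p, q):
    """Jensen-Shannon distance (base 2), bounded in [0, 1]."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing; y > 0 whenever x > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt((kl(p, mix) + kl(q, mix)) / 2)


# Hypothetical coded variable batched by month.
batches = {
    "2019-01": ["A", "A", "B", "B"],
    "2019-02": ["A", "A", "B", "B"],
    "2019-03": ["C", "C", "C", "C"],
}
categories = ["A", "B", "C"]
months = sorted(batches)
dists = {month: distribution(batches[month], categories) for month in months}

# Distance of each month to the previous one; a jump flags a possible shift.
shifts = {
    cur: js_distance(dists[prev], dists[cur])
    for prev, cur in zip(months, months[1:])
}
```

In this toy series the distance stays at zero between January and February and reaches its maximum in March, when the coding changes completely. An abrupt shift would appear as this kind of jump.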
Objective: To develop scalable natural language processing (NLP) infrastructure for processing the free text in electronic health records (EHRs).
Materials and methods: We extend the open-source Apache cTAKES NLP software with several standard technologies for scalability. We remove processing bottlenecks by monitoring component queue size. We process EHR free text for patients in the PrecisionLink Biobank at Boston Children's Hospital. The extracted concepts are made searchable via a web-based portal.
Results: We processed over 1.2 million notes for over 8000 patients, extracting 154 million concepts. Our largest tested configuration processes over 1 million notes per day.
Discussion: The unique information represented by extracted NLP concepts has great potential to provide a more complete picture of patient status.
Conclusion: NLP of large EHR document collections can be done efficiently, in service of high-throughput phenotyping.
The promise of precision medicine lies in data diversity. More than the sheer size of biomedical data, it is the layering of multiple data modalities, offering complementary perspectives, that is thought to enable the identification of patient subgroups with shared pathophysiology. In the present study, we use autism to test this notion. By combining healthcare claims, electronic health records, familial whole-exome sequences and neurodevelopmental gene expression patterns, we identified a subgroup of patients with dyslipidemia-associated autism.
BACKGROUND: Inflammatory processes have been shown to play a role in dementia. To understand this role, we selected two anti-inflammatory drugs (methotrexate and sulfasalazine) to study their association with dementia risk.
METHODS: A retrospective matched case-control study of patients over 50 with rheumatoid arthritis (486 dementia cases and 641 controls) who were identified from electronic health records in the UK, Spain, Denmark and the Netherlands. Conditional logistic regression models were fitted to estimate the risk of dementia.
RESULTS: Prior methotrexate use was associated with a lower risk of dementia (OR 0.71, 95% CI 0.52-0.98). Furthermore, methotrexate use with therapy longer than 4 years had the lowest risk of dementia (odds ratio 0.37, 95% CI 0.17-0.79). Sulfasalazine use was not associated with dementia (odds ratio 0.88, 95% CI 0.57-1.37).
CONCLUSIONS: Further studies are still required to clarify the relationship of prior methotrexate use and its duration, as well as of biological treatments, with dementia risk.
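To make the reported odds ratios and confidence intervals easier to read, the following Python sketch computes an unadjusted odds ratio with a Wald 95% confidence interval from a 2x2 table. It is a simplification and not the study's method, which used conditional logistic regression on matched sets. The function name and the counts are hypothetical.

```python
import math


def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval (z=1.96 for 95%)."""
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls
    )
    # Standard error of log(OR) from the four cell counts.
    se_log = math.sqrt(
        1 / exposed_cases + 1 / unexposed_cases
        + 1 / exposed_controls + 1 / unexposed_controls
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only; not the study's data.
odds_ratio, lower, upper = odds_ratio_ci(40, 60, 50, 50)
```

An interval that excludes 1, such as the one reported for prior methotrexate use, indicates an association at the 5% level. An interval that spans 1, as reported for sulfasalazine, does not.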
Precision medicine promises to revolutionize treatment, shifting therapeutic approaches from the classical one-size-fits-all to those more tailored to the patient's individual genomic profile, lifestyle and environmental exposures. Yet, to advance precision medicine's main objective (ensuring the optimum diagnosis, treatment and prognosis for each individual), investigators need access to large-scale clinical and genomic data repositories. Despite the vast proliferation of these datasets, locating and obtaining access to many remains a challenge. We sought to provide an overview of available patient-level datasets that contain both genotypic data, obtained by next-generation sequencing, and phenotypic data, and to create a dynamic, online catalog for consultation, contribution and revision by the research community. Datasets included in this review meet six inclusion criteria: they (i) contain data from more than 500 human subjects; (ii) contain both genotypic and phenotypic data from the same subjects; (iii) include whole genome sequencing or whole exome sequencing data; (iv) include at least 100 recorded phenotypic variables per subject; (v) are accessible through a website or collaboration with investigators; and (vi) make access information available in English. Using these criteria, we identified 30 datasets, reviewed them and provided results in the release version of a catalog, which is publicly available through a dynamic Web application and on GitHub. Users can review as well as contribute new datasets for inclusion (Web: https://avillachlab.shinyapps.io/genophenocatalog/; GitHub: https://github.com/hms-dbmi/GenoPheno-CatalogShiny).
SUMMARY: Based on the Genomic Data Sharing Policy issued in August 2007, the National Institutes of Health (NIH) has supported several repositories such as the database of Genotypes and Phenotypes (dbGaP). dbGaP is an online repository that provides access to large-scale genetic and phenotypic datasets with more than 1000 studies. However, navigating the website and understanding the relationship between the studies are not easy tasks. Moreover, the decryption of the files is a complex procedure. In this study we propose the dbgap2x R package that covers a broad range of functions for searching dbGaP studies, exploring the characteristics of a study and easily decrypting the files from dbGaP.
AVAILABILITY AND IMPLEMENTATION: dbgap2x is an R package with the code available at https://github.com/gversmee/dbgap2x. A containerized version including the package, a Jupyter server and with a Notebook example is available at https://hub.docker.com/r/gversmee/dbgap2x.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
PURPOSE: Clinicians and researchers must contextualize a patient's genetic variants against population-based references with detailed phenotyping. We sought to establish globally scalable technology, policy, and procedures for sharing biosamples and associated genomic and phenotypic data on broadly consented cohorts, across sites of care.
METHODS: Three of the nation's leading children's hospitals launched the Genomic Research and Innovation Network (GRIN), with federated information technology infrastructure, harmonized biobanking protocols, and material transfer agreements. Pilot studies in epilepsy and short stature were completed to design and test the collaboration model.
RESULTS: Harmonized, broadly consented institutional review board (IRB) protocols were approved and used for biobank enrollment, creating ever-expanding, compatible biobanks. An open source federated query infrastructure was established over genotype-phenotype databases at the three hospitals. Investigators securely access the GRIN platform for prep-to-research queries, receiving aggregate counts of patients with particular phenotypes or genotypes in each biobank. With proper approvals, de-identified data are exported to a shared analytic workspace. Investigators at all sites enthusiastically collaborated on the pilot studies, resulting in multiple publications. Investigators have also begun to successfully utilize the infrastructure for grant applications.
CONCLUSIONS: The GRIN collaboration establishes the technology, policy, and procedures for a scalable genomic research network.
BACKGROUND: Relatively little is known about antepartum suicidal behaviour and pregnancy outcomes. We examined associations of antepartum suicidal behaviour, alone and in combination with psychiatric disorders, with adverse infant and obstetric outcomes.
METHODS: We included 188 925 singleton livebirths from a retrospective cohort (1996-2016). Suicidal behaviour, psychiatric disorders, and outcomes were derived from electronic medical records. We performed multivariable logistic regressions with generalised estimating equations to estimate adjusted odds ratios (aOR) with 95% confidence intervals (95%CI).
RESULTS: The prevalence of antepartum suicidal behaviour was 152.44 per 100 000 singleton livebirths. Nearly two-thirds (64.24%) of women with suicidal behaviour also had psychiatric disorders. Compared to women without psychiatric disorders and suicidal behaviour, women with psychiatric disorders alone had 1.3-fold to 1.4-fold increased odds of delivering low birthweight or preterm infants and 1.2-fold increased odds of experiencing obstetric complications. Women with suicidal behaviour alone had increased odds of preterm labour (aOR 2.05, 95% CI 1.16, 3.62). Women with both suicidal behaviour and psychiatric disorders had more than twofold increased odds of delivering low birthweight (aOR 2.52, 95% CI 1.40, 4.54), preterm birth (aOR 2.44, 95% CI 1.63, 3.66), and low birthweight/preterm birth (aOR 2.30, 95% CI 1.54, 3.44) infants; the odds of preterm labour (aOR 1.62, 95% CI 1.06, 2.47), placental abruption (aOR 2.33, 95% CI 1.20, 4.51), preterm rupture of membranes (aOR 1.63, 95% CI 1.08, 2.46), and postpartum haemorrhage (aOR 1.93, 95% CI 1.09, 3.40) were also elevated.
CONCLUSIONS: Antepartum suicidal behaviour, when co-occurring with psychiatric disorders, is associated with increased odds of adverse infant and obstetric outcomes. Future studies are warranted to understand the causal roles of suicidal behaviour and psychiatric disorders in pregnancy.
Allele-specific analyses to understand frequency differences across populations, particularly populations not well studied, are important to help identify variants that may have a functional effect on disease mechanisms and phenotypic predisposition, facilitating new Genome-Wide Association Studies (GWAS). We aimed to compare the allele frequency of 11 asthma-associated and 16 liver disease-associated single nucleotide polymorphisms (SNPs) between the Estonian, HapMap and 1000 Genomes Project populations. When comparing EGCUT with HapMap populations, the largest difference in allele frequencies was observed with the Maasai population in Kinyawa, Kenya, with 12 SNP variants reaching statistical significance. Similarly, when comparing EGCUT with 1000 Genomes Project populations, the largest difference in allele frequencies was observed with pooled African populations, with 22 SNP variants reaching statistical significance. For 11 asthma-associated and 16 liver disease-associated SNPs, Estonians are genetically similar to other European populations but significantly different from African populations. Understanding differences in genetic architecture between ethnic populations is important to facilitate new GWAS targeted at underserved ethnic groups to enable novel genetic findings to aid the development of new therapies to reduce morbidity and mortality.
The Estonian Biobank, governed by the Institute of Genomics at the University of Tartu (Biobank), has stored genetic material/DNA and continuously collected data since 2002 on a total of 52,274 individuals, representing ~5% of the Estonian adult population, and this number is increasing. To explore the utility of data available in the Biobank, we conducted a phenome-wide association study (PheWAS) in two areas of interest to healthcare researchers: asthma and liver disease. We used 11 asthma and 13 liver disease-associated single nucleotide polymorphisms (SNPs), identified from published genome-wide association studies, to test our ability to detect established associations. We confirmed 2 asthma and 5 liver disease associated variants at nominal significance and directionally consistent with published results. We found 2 associations that were opposite to what was published before (rs4374383:AA increases risk of NASH/NAFLD, rs11597086 increases ALT level). Three SNP-diagnosis pairs passed the phenome-wide significance threshold: rs9273349 and E06 (thyroiditis, p = 5.50 × 10⁻⁸); rs9273349 and E10 (type-1 diabetes, p = 2.60 × 10⁻⁷); and rs2281135 and K76 (non-alcoholic liver diseases, including NAFLD, p = 4.10 × 10⁻⁷). We have validated our approach and confirmed the quality of the data for these conditions. Importantly, we demonstrate that the extensive amount of genetic and medical information from the Estonian Biobank can be successfully utilized for scientific research.
Clarke DJB, Wang L, Jones A, Wojciechowicz ML, Torre D, Jagodnik KM, Jenkins SL, McQuilton P, Flamholz Z, Silverstein MC, Schilder BM, Robasky K, Castillo C, Idaszak R, Ahalt SC, Williams J, Schurer S, Cooper DJ, de Miranda Azevedo R, Klenk JA, Haendel MA, Nedzel J, Avillach P, Shimoyama ME, Harris RM, Gamble M, Poten R, Charbonneau AL, Larkin J, Brown TC, Bonazzi VR, Dumontier MJ, Sansone S-A, Ma'ayan A. FAIRshake: Toolkit to Evaluate the FAIRness of Research Digital Resources. Cell Syst 2019;9(5):417-421.
As more digital resources are produced by the research community, it is becoming increasingly important to harmonize and organize them for synergistic utilization. The findable, accessible, interoperable, and reusable (FAIR) guiding principles have prompted many stakeholders to consider strategies for tackling this challenge. The FAIRshake toolkit was developed to enable the establishment of community-driven FAIR metrics and rubrics paired with manual and automated FAIR assessments. FAIR assessments are visualized as an insignia that can be embedded within digital-resources-hosting websites. Using FAIRshake, a variety of biomedical digital resources were manually and automatically evaluated for their level of FAIRness.
De novo and inherited rare genetic disorders (RGDs) are a major cause of human morbidity, frequently involving neuropsychiatric symptoms. Recent advances in genomic technologies and data sharing have revolutionized the identification and diagnosis of RGDs, presenting an opportunity to elucidate the mechanisms underlying neuropsychiatric disorders by investigating the pathophysiology of high-penetrance genetic risk factors. Here we seek out the best path forward for achieving these goals. We think future research will require consistent approaches across multiple RGDs and developmental stages, involving both the characterization of shared neuropsychiatric dimensions in humans and the identification of neurobiological commonalities in model systems. A coordinated and concerted effort across patients, families, researchers, clinicians and institutions, including rapid and broad sharing of data, is now needed to translate these discoveries into urgently needed therapies.
OBJECTIVE: To estimate the risk of acute myocardial infarction (AMI) or stroke in adults with non-alcoholic fatty liver disease (NAFLD) or non-alcoholic steatohepatitis (NASH).
DESIGN: Matched cohort study.
SETTING: Population based, electronic primary healthcare databases before 31 December 2015 from four European countries: Italy (n=1 542 672), Netherlands (n=2 225 925), Spain (n=5 488 397), and UK (n=12 695 046).
PARTICIPANTS: 120 795 adults with a recorded diagnosis of NAFLD or NASH and no other liver diseases, matched at time of NAFLD diagnosis (index date) by age, sex, practice site, and visit, recorded at six months before or after the date of diagnosis, with up to 100 patients without NAFLD or NASH in the same database.
MAIN OUTCOME MEASURES: Primary outcome was incident fatal or non-fatal AMI and ischaemic or unspecified stroke. Hazard ratios were estimated using Cox models and pooled across databases by random effect meta-analyses.
RESULTS: 120 795 patients with recorded NAFLD or NASH diagnoses were identified with mean follow-up 2.1-5.5 years. After adjustment for age and smoking the pooled hazard ratio for AMI was 1.17 (95% confidence interval 1.05 to 1.30; 1035 events in participants with NAFLD or NASH, 67 823 in matched controls). In a group with more complete data on risk factors (86 098 NAFLD and 4 664 988 matched controls), the hazard ratio for AMI after adjustment for systolic blood pressure, type 2 diabetes, total cholesterol level, statin use, and hypertension was 1.01 (0.91 to 1.12; 747 events in participants with NAFLD or NASH, 37 462 in matched controls). After adjustment for age and smoking status the pooled hazard ratio for stroke was 1.18 (1.11 to 1.24; 2187 events in participants with NAFLD or NASH, 134 001 in matched controls). In the group with more complete data on risk factors, the hazard ratio for stroke was 1.04 (0.99 to 1.09; 1666 events in participants with NAFLD, 83 882 in matched controls) after further adjustment for type 2 diabetes, systolic blood pressure, total cholesterol level, statin use, and hypertension.
CONCLUSIONS: The diagnosis of NAFLD in current routine care of 17.7 million patients appears not to be associated with AMI or stroke risk after adjustment for established cardiovascular risk factors. Cardiovascular risk assessment in adults with a diagnosis of NAFLD is important but should be done in the same way as for the general population.
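The pooling of database-specific hazard ratios by random-effects meta-analysis, mentioned under main outcome measures, can be sketched as follows. The abstract does not say which estimator was used. The DerSimonian-Laird estimator below is an assumption, and the function name and per-database inputs are hypothetical.

```python
import math


def pool_random_effects(estimates, z=1.96):
    """DerSimonian-Laird pooling of ratio estimates given as (hr, lower, upper).

    Returns (pooled_hr, lower, upper, tau2), where tau2 is the estimated
    between-study variance on the log scale.
    """
    y = [math.log(hr) for hr, _, _ in estimates]
    # Recover each variance of log(HR) from the width of its confidence interval.
    var = [((math.log(hi) - math.log(lo)) / (2 * z)) ** 2 for _, lo, hi in estimates]
    w = [1 / v for v in var]

    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0

    w_star = [1 / (v + tau2) for v in var]
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return (math.exp(pooled), math.exp(pooled - z * se),
            math.exp(pooled + z * se), tau2)


# Hypothetical per-database hazard ratios with 95% CIs; not the study's data.
per_database = [(1.10, 0.90, 1.34), (1.25, 1.05, 1.49),
                (1.05, 0.80, 1.38), (1.20, 1.10, 1.31)]
pooled_hr, pooled_lower, pooled_upper, tau2 = pool_random_effects(per_database)
```

When the databases agree closely, tau2 is estimated as zero and the result coincides with a fixed-effect (inverse-variance) pooled estimate. Heterogeneity across databases widens the pooled interval.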
We developed algorithms to identify pregnant women with suicidal behavior using information extracted from clinical notes by natural language processing (NLP) in electronic medical records. Using both codified data and NLP applied to unstructured clinical notes, we first screened pregnant women in Partners HealthCare for suicidal behavior. Psychiatrists manually reviewed clinical charts to identify relevant features for suicidal behavior and to obtain gold-standard labels. Using the adaptive elastic net, we developed algorithms to classify suicidal behavior. We then validated the algorithms in an independent validation dataset. From 275,843 women with codes related to pregnancy or delivery, 9331 women screened positive for suicidal behavior by either codified data (N = 196) or NLP (N = 9,145). Using expert-curated features, our algorithm achieved an area under the curve of 0.83. By setting a positive predictive value comparable to that of diagnostic codes related to suicidal behavior (0.71), we obtained a sensitivity of 0.34, specificity of 0.96, and negative predictive value of 0.83. The algorithm identified 1423 pregnant women with suicidal behavior among 9331 women screened positive. Mining unstructured clinical notes using NLP resulted in an 11-fold increase in the number of pregnant women identified with suicidal behavior, compared with sole reliance on diagnostic codes.
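The step of fixing a target positive predictive value and then reading off sensitivity, specificity and negative predictive value can be sketched in Python. This is an illustration under stated assumptions, not the authors' code: the threshold rule (lowest score cutoff whose PPV reaches the target), the function names, and the scores and labels are all hypothetical.

```python
def metrics_at(scores, labels, threshold):
    """Confusion-matrix metrics when score >= threshold is called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    tn = sum(1 for s, y in pairs if s < threshold and y == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def threshold_for_ppv(scores, labels, target_ppv):
    """Lowest observed score cutoff whose PPV reaches the target, else None."""
    for cutoff in sorted(set(scores)):
        if metrics_at(scores, labels, cutoff)["ppv"] >= target_ppv:
            return cutoff
    return None


# Hypothetical classifier scores and gold-standard labels (1 = case).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
cutoff = threshold_for_ppv(scores, labels, target_ppv=0.71)
metrics = metrics_at(scores, labels, cutoff)
```

Because PPV is not monotonic in the cutoff, taking the lowest qualifying cutoff keeps sensitivity as high as possible for the chosen PPV. A demanding PPV target can therefore come with modest sensitivity, which is the same trade-off seen in the figures reported above.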