Publications

2022
Serret-Larmande A, Kaltman JR, Avillach P. Streamlining statistical reproducibility: NHLBI ORCHID clinical trial results reproduction [Internet]. JAMIA Open 2022;5(1).
Reproducibility in medical research has been a long-standing issue. More recently, the COVID-19 pandemic has publicly underlined this fact as the retraction of several studies reached out to general media audiences. A significant number of these retractions occurred after in-depth scrutiny of the methodology and results by the scientific community. Consequently, these retractions have undermined confidence in the peer-review process, which is not considered sufficiently reliable to generate trust in the published results. This partly stems from opacity in published results, the practical implementation of the statistical analysis often remaining undisclosed. We present a workflow that uses a combination of informatics tools to foster statistical reproducibility: an open-source programming language, Jupyter Notebook, cloud-based data repository, and an application programming interface can streamline an analysis and help to kick-start new analyses. We illustrate this principle by (1) reproducing the results of the ORCHID clinical trial, which evaluated the efficacy of hydroxychloroquine in COVID-19 patients, and (2) expanding on the analyses conducted in the original trial by investigating the association of premedication with biological laboratory results. Such workflows will be encouraged for future publications from National Heart, Lung, and Blood Institute-funded studies.
2021
Feroe AG, Uppal N, Gutiérrez-Sacristán A, Mousavi S, Greenspun P, Surati R, Kohane IS, Avillach P. Medication Use in the Management of Comorbidities Among Individuals With Autism Spectrum Disorder From a Large Nationwide Insurance Database [Internet]. JAMA Pediatrics 2021.

Importance: Although there is no pharmacological treatment for autism spectrum disorder (ASD) itself, behavioral and pharmacological therapies have been used to address its symptoms and common comorbidities. A better understanding of the medications used to manage comorbid conditions in this growing population is critical; however, most previous efforts have been limited in size, duration, and lack of broad representation.

Objective: To use a nationally representative database to uncover trends in the prevalence of co-occurring conditions and medication use in the management of symptoms and comorbidities over time among US individuals with ASD.

Design, setting, and participants: This retrospective, population-based cohort study mined a nationwide, managed health plan claims database containing more than 86 million unique members. Data from January 1, 2014, to December 31, 2019, were used to analyze prescription frequency and diagnoses of comorbidities. A total of 26 722 individuals with ASD who had been prescribed at least 1 of 24 medications most commonly prescribed to treat ASD symptoms or comorbidities during the 6-year study period were included in the analysis.

Exposures: Diagnosis codes for ASD based on International Classification of Diseases, Ninth Revision, and International Statistical Classification of Diseases and Related Health Problems, Tenth Revision.

Main outcomes and measures: Quantitative estimates of prescription frequency for the 24 most commonly prescribed medications among the study cohort and the most common comorbidities associated with each medication in this population.

Results: Among the 26 722 individuals with ASD included in the analysis (77.7% male; mean [SD] age, 14.45 [9.40] years), polypharmacy was common, ranging from 28.6% to 31.5%. Individuals' prescription regimens changed frequently within medication classes, rather than between classes. The prescription frequency of a specific medication varied considerably, depending on the coexisting diagnosis of a given comorbidity. Of the 24 medications assessed, 15 were associated with at least a 15% prevalence of a mood disorder, and 11 were associated with at least a 15% prevalence of attention-deficit/hyperactivity disorder. For patients taking antipsychotics, the 2 most common comorbidities were combined type attention-deficit/hyperactivity disorder (11.6%-17.8%) and anxiety disorder (13.1%-30.1%).

Conclusions and relevance: This study demonstrated considerable variability and transiency in the use of prescription medications by US clinicians to manage symptoms and comorbidities associated with ASD. These findings support the importance of early and ongoing surveillance of patients with ASD and co-occurring conditions and offer clinicians insight on the targeted therapies most commonly used to manage co-occurring conditions. Future research and policy efforts are critical to assess the extent to which pharmacological management of comorbidities affects quality of life and functioning in patients with ASD while continuing to optimize clinical guidelines, to ensure effective care for this growing population.

Bourgeois FT, Gutiérrez-Sacristán A, Keller MS, Liu M, Hong C, Bonzel CL, Tan ALM, Aronow BJ, Boeker M, Booth J, Cruz RJ, Devkota B, García Barrio N, Geva A, Hanauer DA, Hutch MR, Issitt RW, Klann JG, Luo Y, Mandl KD, Mao C, Moal B, Moshal KL, Murphy SN, Neuraz A, Ngiam KY, Omenn GS, Patel LP, Jiménez MP, Sebire NJ, Balazote PS, Serret-Larmande A, South AM, Spiridou A, Taylor DM, Tippmann P, Visweswaran S, Weber GM, Kohane IS, Cai T, Avillach P. International Analysis of Electronic Health Records of Children and Youth Hospitalized With COVID-19 Infection in 6 Countries [Internet]. JAMA Network Open 2021.

Importance: Additional sources of pediatric epidemiological and clinical data are needed to efficiently study COVID-19 in children and youth and inform infection prevention and clinical treatment of pediatric patients.

Objective: To describe international hospitalization trends and key epidemiological and clinical features of children and youth with COVID-19.

Design, setting, and participants: This retrospective cohort study included pediatric patients hospitalized between February 2 and October 10, 2020. Patient-level electronic health record (EHR) data were collected across 27 hospitals in France, Germany, Spain, Singapore, the UK, and the US. Patients younger than 21 years who tested positive for COVID-19 and were hospitalized at an institution participating in the Consortium for Clinical Characterization of COVID-19 by EHR were included in the study.

Main outcomes and measures: Patient characteristics, clinical features, and medication use.

Results: There were 347 males (52%; 95% CI, 48.5-55.3) and 324 females (48%; 95% CI, 44.4-51.3) in this study's cohort. There was a bimodal age distribution, with the greatest proportion of patients in the 0- to 2-year (199 patients [30%]) and 12- to 17-year (170 patients [25%]) age range. Trends in hospitalizations for 671 children and youth found discrete surges with variable timing across 6 countries. Data from this cohort mirrored national-level pediatric hospitalization trends for most countries with available data, with peaks in hospitalizations during the initial spring surge occurring within 23 days in the national-level and 4CE data. A total of 27 364 laboratory values for 16 laboratory tests were analyzed, with mean values indicating elevations in markers of inflammation (C-reactive protein, 83 mg/L; 95% CI, 53-112 mg/L; ferritin, 417 ng/mL; 95% CI, 228-607 ng/mL; and procalcitonin, 1.45 ng/mL; 95% CI, 0.13-2.77 ng/mL). Abnormalities in coagulation were also evident (D-dimer, 0.78 ug/mL; 95% CI, 0.35-1.21 ug/mL; and fibrinogen, 477 mg/dL; 95% CI, 385-569 mg/dL). Cardiac troponin, when checked (n = 59), was elevated (0.032 ng/mL; 95% CI, 0.000-0.080 ng/mL). Common complications included cardiac arrhythmias (15.0%; 95% CI, 8.1%-21.7%), viral pneumonia (13.3%; 95% CI, 6.5%-20.1%), and respiratory failure (10.5%; 95% CI, 5.8%-15.3%). Few children were treated with COVID-19-directed medications.

Conclusions and relevance: This study of EHRs of children and youth hospitalized for COVID-19 in 6 countries demonstrated variability in hospitalization trends across countries and identified common complications and laboratory abnormalities in children and youth with COVID-19 infection. Large-scale informatics-based approaches to integrate and analyze data across health care systems complement methods of disease surveillance and advance understanding of epidemiological and clinical features associated with COVID-19 in children and youth.

 
2020
Kessler MD, Loesch DP, Perry JA, Heard-Costa NL, Taliun D, Cade BE, Wang H, Daya M, Ziniti J, Datta S, Celedon JC, Soto-Quiros ME, Avila L, Weiss ST, Barnes K, Redline SS, Vasan RS, Johnson AD, Mathias RA, Hernandez R, Wilson JG, Nickerson DA, Abecasis G, Browning SR, Zollner S, O'Connell JR, Mitchell BD, NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, TOPMed Population Genetics Working Group, O'Connor TD. De novo mutations across 1,465 diverse genomes reveal mutational insights and reductions in the Amish founder populations [Internet]. Proc Natl Acad Sci U S A 2020;117(5):2560-2569.

De novo mutations (DNMs), or mutations that appear in an individual despite not being seen in their parents, are an important source of genetic variation whose impact is relevant to studies of human evolution, genetics, and disease. Utilizing high-coverage whole-genome sequencing data as part of the Trans-Omics for Precision Medicine (TOPMed) Program, we called 93,325 single-nucleotide DNMs across 1,465 trios from an array of diverse human populations, and used them to directly estimate and analyze DNM counts, rates, and spectra. We find a significant positive correlation between local recombination rate and local DNM rate, and that DNM rate explains a substantial portion (8.98 to 34.92%, depending on the model) of the genome-wide variation in population-level genetic variation from 41K unrelated TOPMed samples. Genome-wide heterozygosity does correlate with DNM rate, but only explains <1% of variation. While we are underpowered to see small differences, we do not find significant differences in DNM rate between individuals of European, African, and Latino ancestry, nor across ancestrally distinct segments within admixed individuals. However, we did find significantly fewer DNMs in Amish individuals, even when compared with other Europeans, and even after accounting for parental age and sequencing center. Specifically, we found significant reductions in the number of C→A and T→C mutations in the Amish, which seem to underpin their overall reduction in DNMs. Finally, we calculated near-zero estimates of narrow sense heritability (h²), which suggest that variation in DNM rate is significantly shaped by nonadditive genetic effects and the environment.

Keywords: Amish; de novo mutations; diversity; mutation rate; recombination.

James G, Collin E, Lawrance M, Mueller A, Podhorna J, Zaremba-Pechmann L, Rijnbeek P, van der Lei J, Avillach P, Pedersen L, Ansell D, Pasqua A, Mosseveld M, Grosdidier S, Gungabissoon U, Egger P, Stewart R, Celis-Morales C, Alexander M, Novak G, Gordon MF. Treatment pathway analysis of newly diagnosed dementia patients in four electronic health record databases in Europe [Internet]. Soc Psychiatry Psychiatr Epidemiol 2020.

Purpose: Real-world studies to describe the use of first, second and third line therapies for the management and symptomatic treatment of dementia are lacking. This retrospective cohort study describes the first-, second- and third-line therapies used for the management and symptomatic treatment of dementia, and in particular Alzheimer's Disease.

Methods: Medical records of patients with newly diagnosed dementia between 1997 and 2017 were collected using four databases from the UK, Denmark, Italy and the Netherlands.

Results: We identified 191,933 newly diagnosed dementia patients in the four databases between 1997 and 2017 with 39,836 (IPCI (NL): 3281, HSD (IT): 1601, AUH (DK): 4474, THIN (UK): 30,480) fulfilling the inclusion criteria, and of these, 21,131 had received a specific diagnosis of Alzheimer's disease. The most common first-line therapies initiated within a year (± 365 days) of diagnosis were the acetylcholinesterase inhibitors rivastigmine in IPCI and donepezil in HSD and THIN, and the N-methyl-D-aspartate blocker memantine in AUH.

Conclusion: We provide a real-world insight into the heterogeneous management and treatment pathways of newly diagnosed dementia patients and a subset of Alzheimer's Disease patients from across Europe.

Keywords: Alzheimer’s disease; Dementia; Epidemiology; Real-world data.

Sylvestre E, Bouzille G, McDuffie M, Chazard E, Avillach P, Cuggia M. A Semi-Automated Approach for Multilingual Terminology Matching: Mapping the French Version of the ICD-10 to the ICD-10 CM [Internet]. Stud Health Technol Inform 2020.

The aim of this study was to develop a simple method to map the French International Statistical Classification of Diseases and Related Health Problems, 10th revision (ICD-10) with the International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10 CM). We sought to map these terminologies forward (ICD-10 to ICD-10 CM) and backward (ICD-10 CM to ICD-10) and to assess the accuracy of these two mappings. We used several terminology resources such as the Unified Medical Language System (UMLS) Metathesaurus, Bioportal, the latest version available of the French ICD-10 and several official mapping files between different versions of the ICD-10. We first retrieved existing partial mapping between the ICD-10 and the ICD-10 CM. Then, we automatically matched the ICD-10 with the ICD-10-CM, using our different reference mapping files. Finally, we used manual review and natural language processing (NLP) to match labels between the two terminologies. We assessed the accuracy of both methods with a manual review of a random dataset from the results files. The overall matching was between 94.2 and 100%. The backward mapping was better than the forward one, especially regarding exact matches. In both cases, the NLP step was highly accurate. When there are no available experts from the ontology or NLP fields for multi-lingual ontology matching, this simple approach enables secondary reuse of Electronic Health Records (EHR) and billing data for research purposes in an international context.

Keywords: Clinical terminologies; ICD-10; Interoperability; Multilingual matching.

Zachariasse JM, Nieboer D, Maconochie IK, Smit FJ, Alves CF, Greber-Platzer S, Tsolia MN, Steyerberg EW, Avillach P, van der Lei J, Moll HA. Development and validation of a Paediatric Early Warning Score for use in the emergency department: a multicentre study [Internet]. Lancet Child Adolesc Health 2020.

Background: Paediatric Early Warning Scores (PEWSs) are being used increasingly in hospital wards to identify children at risk of clinical deterioration, but few scores exist that were designed for use in emergency care settings. To improve the prioritisation of children in the emergency department (ED), we developed and validated an ED-PEWS.

Methods: The TrIAGE project is a prospective European observational study based on electronic health record data collected between Jan 1, 2012, and Nov 1, 2015, from five diverse EDs in four European countries (Netherlands, the UK, Austria, and Portugal). This study included data from all consecutive ED visits of children under age 16 years. The main outcome measure was a three-category reference standard (high, intermediate, low urgency) that was developed as part of the TrIAGE project as a proxy for true patient urgency. The ED-PEWS was developed based on an ordinal logistic regression model, with cross-validation by setting. After completing the study, we fully externally validated the ED-PEWS in an independent cohort of febrile children from a different ED (Greece).

Findings: Of 119 209 children, 2007 (1·7%) were of high urgency and 29 127 (24·4%) of intermediate urgency, according to our reference standard. We developed an ED-PEWS consisting of age and the predictors heart rate, respiratory rate, oxygen saturation, consciousness, capillary refill time, and work of breathing. The ED-PEWS showed a cross-validated c-statistic of 0·86 (95% prediction interval 0·82-0·90) for high-urgency patients and 0·67 (0·61-0·73) for high-urgency or intermediate-urgency patients. A cutoff score of at least 15 was useful for identifying high-urgency patients with a specificity of 0·90 (95% CI 0·87-0·92), while a cutoff score of less than 6 was useful for identifying low-urgency patients with a sensitivity of 0·83 (0·81-0·85).

Interpretation: The proposed ED-PEWS can assist in identifying high-urgency and low-urgency patients in the ED, and improves prioritisation compared with existing PEWSs.

Funding: Stichting de Drie Lichten, Stichting Sophia Kinderziekenhuis Fonds, and the European Union's Horizon 2020 research and innovation programme.
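
Illustration (not part of the published abstract): the Findings above report specificity and sensitivity at fixed score cutoffs. The sketch below shows, under stated assumptions, how such figures are computed once a score and a reference standard are available. The function name, the toy scores, and the toy labels are all hypothetical; this is not the ED-PEWS itself, and it does not reproduce the study's data or its confidence intervals.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    scores: Sequence[float],
    is_positive: Sequence[bool],
    cutoff: float,
) -> Tuple[float, float]:
    """Treat `score >= cutoff` as a positive test and compare it with
    the reference standard given in `is_positive`."""
    tp = fp = tn = fn = 0
    for score, positive in zip(scores, is_positive):
        flagged = score >= cutoff
        if flagged and positive:
            tp += 1
        elif flagged:
            fp += 1
        elif positive:
            fn += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: eight visits with made-up scores and labels.
toy_scores = [2, 4, 7, 9, 12, 16, 18, 21]
toy_high_urgency = [False, False, False, False, False, True, False, True]

sens, spec = sensitivity_specificity(toy_scores, toy_high_urgency, cutoff=15)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Raising the cutoff generally trades sensitivity for specificity, which is why the abstract reports a high cutoff for ruling in high urgency and a low one for identifying low urgency.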

Krissaane I, De Niz C, Gutiérrez-Sacristán A, Korodi G, Ede N, Kumar R, Lyons J, Manrai A, Patel C, Kohane I, Avillach P. Scalability and cost-effectiveness analysis of whole genome-wide association studies on Google Cloud Platform and Amazon Web Services [Internet]. JAMIA 2020.

Objective: Advancements in human genomics have generated a surge of available data, fueling the growth and accessibility of databases for more comprehensive, in-depth genetic studies.

Methods: We provide a straightforward and innovative methodology to optimize cloud configuration in order to conduct genome-wide association studies. We utilized Spark clusters on both Google Cloud Platform and Amazon Web Services, as well as Hail (http://doi.org/10.5281/zenodo.2646680) for analysis and exploration of genomic variants dataset.

Results: Comparative evaluation of numerous cloud-based cluster configurations demonstrates a successful and unprecedented compromise between speed and cost for performing genome-wide association studies on 4 distinct whole-genome sequencing datasets. Results are consistent across the 2 cloud providers and could be highly useful for accelerating research in genetics.

Conclusions: We present a timely piece for one of the most frequently asked questions when moving to the cloud: what is the trade-off between speed and cost?

Keywords: cloud computing; distributed systems; genome-wide association study; whole genome.

Saez C, Gutiérrez-Sacristán A, Kohane I, Garcia-Gomez JM, Avillach P. EHRtemporalVariability: delineating temporal data-set shifts in electronic health records [Internet]. Gigascience 2020.

Background: Temporal variability in health-care processes or protocols is intrinsic to medicine. Such variability can potentially introduce dataset shifts, a data quality issue when reusing electronic health records (EHRs) for secondary purposes. Temporal data-set shifts can present as trends, as well as abrupt or seasonal changes in the statistical distributions of data over time. The latter are particularly complicated to address in multimodal and highly coded data. These changes, if not delineated, can harm population and data-driven research, such as machine learning. Given that biomedical research repositories are increasingly being populated with large sets of historical data from EHRs, there is a need for specific software methods to help delineate temporal data-set shifts to ensure reliable data reuse.

Results: EHRtemporalVariability is an open-source R package and Shiny app designed to explore and identify temporal data-set shifts. EHRtemporalVariability estimates the statistical distributions of coded and numerical data over time; projects their temporal evolution through non-parametric information geometric temporal plots; and enables the exploration of changes in variables through data temporal heat maps. We demonstrate the capability of EHRtemporalVariability to delineate data-set shifts in three impact case studies, one of which is available for reproducibility.

Conclusions: EHRtemporalVariability enables the exploration and identification of data-set shifts, contributing to the broad examination and repurposing of large, longitudinal data sets. Our goal is to help ensure reliable data reuse for a wide range of biomedical data users. EHRtemporalVariability is designed for technical users who are programmatically utilizing the R package, as well as users who are not familiar with programming via the Shiny user interface. Availability: https://github.com/hms-dbmi/EHRtemporalVariability/ Reproducible vignette: https://cran.r-project.org/web/packages/EHRtemporalVariability/vignettes... demo: http://ehrtemporalvariability.upv.es/.

Keywords: R package; claims data; data quality; data-set shifts; electronic health records; information geometry; research repositories; scientific data sets; temporal variability; visual analytics.

Miller TA, Avillach P, Mandl KD. Experiences implementing scalable, containerized, cloud-based NLP for extracting biobank participant phenotypes at scale [Internet]. JAMIA Open 2020;3(2):185-189.

Objective: To develop scalable natural language processing (NLP) infrastructure for processing the free text in electronic health records (EHRs).

Materials and methods: We extend the open-source Apache cTAKES NLP software with several standard technologies for scalability. We remove processing bottlenecks by monitoring component queue size. We process EHR free text for patients in the PrecisionLink Biobank at Boston Children's Hospital. The extracted concepts are made searchable via a web-based portal.

Results: We processed over 1.2 million notes for over 8000 patients, extracting 154 million concepts. Our largest tested configuration processes over 1 million notes per day.

Discussion: The unique information represented by extracted NLP concepts has great potential to provide a more complete picture of patient status.

Conclusion: NLP on large EHR document collections can be done efficiently, in service of high-throughput phenotyping.

Luo Y, Eran A, Palmer N, Avillach P, Levy-Moonshine A, Szolovits P, Kohane I. A multidimensional precision medicine approach identifies an autism subtype characterized by dyslipidemia [Internet]. Nat Med 2020.

The promise of precision medicine lies in data diversity. More than the sheer size of biomedical data, it is the layering of multiple data modalities, offering complementary perspectives, that is thought to enable the identification of patient subgroups with shared pathophysiology. In the present study, we use autism to test this notion. By combining healthcare claims, electronic health records, familial whole-exome sequences and neurodevelopmental gene expression patterns, we identified a subgroup of patients with dyslipidemia-associated autism.

Newby D, Prieto-Alhambra D, Duarte-Salles T, Ansell D, Pedersen L, van der Lei J, Mosseveld M, Rijnbeek P, James G, Alexander M, Egger P, Podhorna J, Stewart R, Perera G, Avillach P, Grosdidier S, Lovestone S, Nevado-Holgado AJ. Methotrexate and relative risk of dementia amongst patients with rheumatoid arthritis: a multi-national multi-database case-control study. Alzheimers Res Ther 2020;12(1):38.
BACKGROUND: Inflammatory processes have been shown to play a role in dementia. To understand this role, we selected two anti-inflammatory drugs (methotrexate and sulfasalazine) to study their association with dementia risk. METHODS: A retrospective matched case-control study of patients over 50 with rheumatoid arthritis (486 dementia cases and 641 controls) who were identified from electronic health records in the UK, Spain, Denmark and the Netherlands. Conditional logistic regression models were fitted to estimate the risk of dementia. RESULTS: Prior methotrexate use was associated with a lower risk of dementia (OR 0.71, 95% CI 0.52-0.98). Furthermore, methotrexate use with therapy longer than 4 years had the lowest risk of dementia (odds ratio 0.37, 95% CI 0.17-0.79). Sulfasalazine use was not associated with dementia (odds ratio 0.88, 95% CI 0.57-1.37). CONCLUSIONS: Further studies are still required to clarify the relationship between prior methotrexate use and duration as well as biological treatments with dementia risk.
Gutiérrez-Sacristán A, De Niz C, Kothari C, Kong SW, Mandl KD, Avillach P. GenoPheno: cataloging large-scale phenotypic and next-generation sequencing data within human datasets. Brief Bioinform 2020.
Precision medicine promises to revolutionize treatment, shifting therapeutic approaches from the classical one-size-fits-all to those more tailored to the patient's individual genomic profile, lifestyle and environmental exposures. Yet, to advance precision medicine's main objective-ensuring the optimum diagnosis, treatment and prognosis for each individual-investigators need access to large-scale clinical and genomic data repositories. Despite the vast proliferation of these datasets, locating and obtaining access to many remains a challenge. We sought to provide an overview of available patient-level datasets that contain both genotypic data, obtained by next-generation sequencing, and phenotypic data-and to create a dynamic, online catalog for consultation, contribution and revision by the research community. Datasets included in this review conform to six specific inclusion parameters that are: (i) contain data from more than 500 human subjects; (ii) contain both genotypic and phenotypic data from the same subjects; (iii) include whole genome sequencing or whole exome sequencing data; (iv) include at least 100 recorded phenotypic variables per subject; (v) accessible through a website or collaboration with investigators and (vi) make access information available in English. Using these criteria, we identified 30 datasets, reviewed them and provided results in the release version of a catalog, which is publicly available through a dynamic Web application and on GitHub. Users can review as well as contribute new datasets for inclusion (Web: https://avillachlab.shinyapps.io/genophenocatalog/; GitHub: https://github.com/hms-dbmi/GenoPheno-CatalogShiny).
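
Illustration (not part of the published abstract): the six inclusion parameters listed above can be read as one predicate over a dataset description. The sketch below encodes them under stated assumptions. The record type, its field names, and both example records are hypothetical and are not taken from the GenoPheno catalog or its code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical description of a candidate dataset."""
    name: str
    n_subjects: int
    has_genotypic: bool            # genotypic data present
    has_phenotypic: bool           # phenotypic data from the same subjects
    sequencing: frozenset          # e.g. frozenset({"WGS"}) or frozenset({"WES"})
    n_phenotypic_variables: int    # recorded phenotypic variables per subject
    access_route: str              # "website", "collaboration", or "none"
    access_info_in_english: bool


def meets_inclusion_criteria(d: DatasetRecord) -> bool:
    """Apply criteria (i) to (vi) as worded in the abstract."""
    return (
        d.n_subjects > 500                                     # (i)
        and d.has_genotypic and d.has_phenotypic               # (ii)
        and bool(d.sequencing & {"WGS", "WES"})                # (iii)
        and d.n_phenotypic_variables >= 100                    # (iv)
        and d.access_route in {"website", "collaboration"}     # (v)
        and d.access_info_in_english                           # (vi)
    )


# Hypothetical records, for illustration only.
eligible = DatasetRecord(
    "hypothetical cohort A", 1200, True, True,
    frozenset({"WGS"}), 250, "website", True,
)
too_small = DatasetRecord(
    "hypothetical cohort B", 500, True, True,
    frozenset({"WES"}), 250, "collaboration", True,
)

print(meets_inclusion_criteria(eligible))   # eligible on all six criteria
print(meets_inclusion_criteria(too_small))  # fails (i): not more than 500 subjects
```

Criterion (i) is strict ("more than 500") while criterion (iv) is inclusive ("at least 100"), so the two comparisons deliberately differ.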
Mandl KD, Glauser T, Krantz ID, Avillach P, Bartels A, Beggs AH, Biswas S, Bourgeois FT, Corsmo J, Dauber A, Devkota B, Fleisher GR, Heath AP, Helbig I, Hirschhorn JN, Kilbourn J, Kong SW, Kornetsky S, Majzoub JA, Marsolo K, Martin LJ, Nix J, Schwarzhoff A, Stedman J, Strauss A, Sund KL, Taylor DM, White PS, Marsh E, Grimberg A, Hawkes C. Correction: The Genomics Research and Innovation Network: creating an interoperable, federated, genomics learning system. Genet Med 2020;22(2):449.
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
Versmée G, Versmée L, Dusenne M, Jalali N, Avillach P. dbgap2x: an R package to explore and extract data from the database of Genotypes and Phenotypes (dbGaP). Bioinformatics 2020;36(4):1305-1306.
SUMMARY: Based on the Genomic Data Sharing Policy issued in August 2007, the National Institutes of Health (NIH) has supported several repositories such as the database of Genotypes and Phenotypes (dbGaP). dbGaP is an online repository that provides access to large-scale genetic and phenotypic datasets with more than 1000 studies. However, navigating the website and understanding the relationship between the studies are not easy tasks. Moreover, the decryption of the files is a complex procedure. In this study we propose the dbgap2x R package that covers a broad range of functions for searching dbGaP studies, exploring the characteristics of a study and easily decrypting the files from dbGaP. AVAILABILITY AND IMPLEMENTATION: dbgap2x is an R package with the code available at https://github.com/gversmee/dbgap2x. A containerized version including the package, a Jupyter server and with a Notebook example is available at https://hub.docker.com/r/gversmee/dbgap2x. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Mandl KD, Glauser T, Krantz ID, Avillach P, Bartels A, Beggs AH, Biswas S, Bourgeois FT, Corsmo J, Dauber A, Devkota B, Fleisher GR, Heath AP, Helbig I, Hirschhorn JN, Kilbourn J, Kong SW, Kornetsky S, Majzoub JA, Marsolo K, Martin LJ, Nix J, Schwarzhoff A, Stedman J, Strauss A, Sund KL, Taylor DM, White PS, Marsh E, Grimberg A, Hawkes C. The Genomics Research and Innovation Network: creating an interoperable, federated, genomics learning system. Genet Med 2020;22(2):371-380.
PURPOSE: Clinicians and researchers must contextualize a patient's genetic variants against population-based references with detailed phenotyping. We sought to establish globally scalable technology, policy, and procedures for sharing biosamples and associated genomic and phenotypic data on broadly consented cohorts, across sites of care. METHODS: Three of the nation's leading children's hospitals launched the Genomic Research and Innovation Network (GRIN), with federated information technology infrastructure, harmonized biobanking protocols, and material transfer agreements. Pilot studies in epilepsy and short stature were completed to design and test the collaboration model. RESULTS: Harmonized, broadly consented institutional review board (IRB) protocols were approved and used for biobank enrollment, creating ever-expanding, compatible biobanks. An open source federated query infrastructure was established over genotype-phenotype databases at the three hospitals. Investigators securely access the GRIN platform for prep to research queries, receiving aggregate counts of patients with particular phenotypes or genotypes in each biobank. With proper approvals, de-identified data is exported to a shared analytic workspace. Investigators at all sites enthusiastically collaborated on the pilot studies, resulting in multiple publications. Investigators have also begun to successfully utilize the infrastructure for grant applications. CONCLUSIONS: The GRIN collaboration establishes the technology, policy, and procedures for a scalable genomic research network.
2019
Zhong Q-Y, Gelaye B, Karlson EW, Avillach P, Smoller JW, Cai T, Williams MA. Associations of antepartum suicidal behaviour with adverse infant and obstetric outcomes. Paediatr Perinat Epidemiol 2019;33(2):137-144.
BACKGROUND: Relatively little is known about antepartum suicidal behaviour and pregnancy outcomes. We examined associations of antepartum suicidal behaviour, alone and in combination with psychiatric disorders, with adverse infant and obstetric outcomes. METHODS: We included 188 925 singleton livebirths from a retrospective cohort (1996-2016). Suicidal behaviour, psychiatric disorders, and outcomes were derived from electronic medical records. We performed multivariable logistic regressions with generalised estimating equations to estimate adjusted odds ratios (aOR) with 95% confidence intervals (95%CI). RESULTS: The prevalence of antepartum suicidal behaviour was 152.44 per 100 000 singleton livebirths. Nearly two-thirds (64.24%) of women with suicidal behaviour also had psychiatric disorders. Compared to women without psychiatric disorders and suicidal behaviour, women with psychiatric disorders alone had 1.3-fold to 1.4-fold increased odds of delivering low birthweight or preterm infants and 1.2-fold increased odds of experiencing obstetric complications. Women with suicidal behaviour alone had increased odds of preterm labour (aOR 2.05, 95% CI 1.16, 3.62). Women with both suicidal behaviour and psychiatric disorders had > twofold increased odds of delivering low birthweight (aOR 2.52, 95% CI 1.40, 4.54), preterm birth (aOR 2.44, 95% CI 1.63, 3.66), and low birthweight/preterm birth (aOR 2.30, 95% CI 1.54, 3.44) infants; the odds of preterm labour (aOR 1.62, 95% CI 1.06, 2.47), placental abruption (aOR 2.33, 95% CI 1.20, 4.51), preterm rupture of membranes (aOR 1.63, 95% CI 1.08, 2.46), and postpartum haemorrhage (aOR 1.93, 95%CI 1.09, 3.40) were elevated. CONCLUSIONS: Antepartum suicidal behaviour, when co-occurring with psychiatric disorders, is associated with increased odds of adverse infant and obstetric outcomes. Future studies are warranted to understand the causal roles of suicidal behaviour and psychiatric disorders in pregnancy.
Reisberg S, Galwey N, Avillach P, Sahlqvist A-S, Kolberg L, Mägi R, Esko T, Vilo J, James G. Comparison of variation in frequency for SNPs associated with asthma or liver disease between Estonia, HapMap populations and the 1000 genome project populations. Int J Immunogenet 2019;46(2):49-58.
Allele-specific analyses to understand frequency differences across populations, particularly populations not well studied, are important to help identify variants that may have a functional effect on disease mechanisms and phenotypic predisposition, facilitating new Genome-Wide Association Studies (GWAS). We aimed to compare the allele frequency of 11 asthma-associated and 16 liver disease-associated single nucleotide polymorphisms (SNPs) between the Estonian, HapMap and 1000 genome project populations. When comparing EGCUT with HapMap populations, the largest difference in allele frequencies was observed with the Maasai population in Kinyawa, Kenya, with 12 SNP variants reporting statistical significance. Similarly, when comparing EGCUT with 1000 genomes project populations, the largest difference in allele frequencies was observed with pooled African populations with 22 SNP variants reporting statistical significance. For 11 asthma-associated and 16 liver disease-associated SNPs, Estonians are genetically similar to other European populations but significantly different from African populations. Understanding differences in genetic architecture between ethnic populations is important to facilitate new GWAS targeted at underserved ethnic groups to enable novel genetic findings to aid the development of new therapies to reduce morbidity and mortality.
James G, Reisberg S, Lepik K, Galwey N, Avillach P, Kolberg L, Mägi R, Esko T, Alexander M, Waterworth D, Loomis KA, Vilo J. An exploratory phenome wide association study linking asthma and liver disease genetic variants to electronic health records from the Estonian Biobank. PLoS One 2019;14(4):e0215026.
The Estonian Biobank, governed by the Institute of Genomics at the University of Tartu (Biobank), has stored genetic material/DNA and continuously collected data since 2002 on a total of 52,274 individuals representing ~5% of the Estonian adult population and is increasing. To explore the utility of data available in the Biobank, we conducted a phenome-wide association study (PheWAS) in two areas of interest to healthcare researchers: asthma and liver disease. We used 11 asthma and 13 liver disease-associated single nucleotide polymorphisms (SNPs), identified from published genome-wide association studies, to test our ability to detect established associations. We confirmed 2 asthma and 5 liver disease associated variants at nominal significance and directionally consistent with published results. We found 2 associations that were opposite to what was published before (rs4374383:AA increases risk of NASH/NAFLD, rs11597086 increases ALT level). Three SNP-diagnosis pairs passed the phenome-wide significance threshold: rs9273349 and E06 (thyroiditis, p = 5.50 × 10⁻⁸); rs9273349 and E10 (type-1 diabetes, p = 2.60 × 10⁻⁷); and rs2281135 and K76 (non-alcoholic liver diseases, including NAFLD, p = 4.10 × 10⁻⁷). We have validated our approach and confirmed the quality of the data for these conditions. Importantly, we demonstrate that the extensive amount of genetic and medical information from the Estonian Biobank can be successfully utilized for scientific research.
Clarke DJB, Wang L, Jones A, Wojciechowicz ML, Torre D, Jagodnik KM, Jenkins SL, McQuilton P, Flamholz Z, Silverstein MC, Schilder BM, Robasky K, Castillo C, Idaszak R, Ahalt SC, Williams J, Schurer S, Cooper DJ, de Miranda Azevedo R, Klenk JA, Haendel MA, Nedzel J, Avillach P, Shimoyama ME, Harris RM, Gamble M, Poten R, Charbonneau AL, Larkin J, Brown TC, Bonazzi VR, Dumontier MJ, Sansone S-A, Ma'ayan A. FAIRshake: Toolkit to Evaluate the FAIRness of Research Digital Resources. Cell Syst 2019;9(5):417-421.
As more digital resources are produced by the research community, it is becoming increasingly important to harmonize and organize them for synergistic utilization. The findable, accessible, interoperable, and reusable (FAIR) guiding principles have prompted many stakeholders to consider strategies for tackling this challenge. The FAIRshake toolkit was developed to enable the establishment of community-driven FAIR metrics and rubrics paired with manual and automated FAIR assessments. FAIR assessments are visualized as an insignia that can be embedded within digital-resources-hosting websites. Using FAIRshake, a variety of biomedical digital resources were manually and automatically evaluated for their level of FAIRness.
