Publications

2019
Sanders SJ, Sahin M, Hostyk J, Thurm A, Jacquemont S, Avillach P, Douard E, Martin CL, Modi ME, Moreno-De-Luca A, Raznahan A, Anticevic A, Dolmetsch R, Feng G, Geschwind DH, Glahn DC, Goldstein DB, Ledbetter DH, Mulle JG, Pasca SP, Samaco R, Sebat J, Pariser A, Lehner T, Gur RE, Bearden CE. A framework for the investigation of rare genetic disorders in neuropsychiatry. Nat Med 2019;25(10):1477-1487.
De novo and inherited rare genetic disorders (RGDs) are a major cause of human morbidity, frequently involving neuropsychiatric symptoms. Recent advances in genomic technologies and data sharing have revolutionized the identification and diagnosis of RGDs, presenting an opportunity to elucidate the mechanisms underlying neuropsychiatric disorders by investigating the pathophysiology of high-penetrance genetic risk factors. Here we seek out the best path forward for achieving these goals. We think future research will require consistent approaches across multiple RGDs and developmental stages, involving both the characterization of shared neuropsychiatric dimensions in humans and the identification of neurobiological commonalities in model systems. A coordinated and concerted effort across patients, families, researchers, clinicians and institutions, including rapid and broad sharing of data, is now needed to translate these discoveries into urgently needed therapies.
Alexander M, Loomis KA, van der Lei J, Duarte-Salles T, Prieto-Alhambra D, Ansell D, Pasqua A, Lapi F, Rijnbeek P, Mosseveld M, Avillach P, Egger P, Dhalwani NN, Kendrick S, Celis-Morales C, Waterworth DM, Alazawi W, Sattar N. Non-alcoholic fatty liver disease and risk of incident acute myocardial infarction and stroke: findings from matched cohort study of 18 million European adults. BMJ 2019;367:l5367.
OBJECTIVE: To estimate the risk of acute myocardial infarction (AMI) or stroke in adults with non-alcoholic fatty liver disease (NAFLD) or non-alcoholic steatohepatitis (NASH). DESIGN: Matched cohort study. SETTING: Population based, electronic primary healthcare databases before 31 December 2015 from four European countries: Italy (n=1 542 672), Netherlands (n=2 225 925), Spain (n=5 488 397), and UK (n=12 695 046). PARTICIPANTS: 120 795 adults with a recorded diagnosis of NAFLD or NASH and no other liver diseases, matched at time of NAFLD diagnosis (index date) by age, sex, practice site, and visit, recorded at six months before or after the date of diagnosis, with up to 100 patients without NAFLD or NASH in the same database. MAIN OUTCOME MEASURES: Primary outcome was incident fatal or non-fatal AMI and ischaemic or unspecified stroke. Hazard ratios were estimated using Cox models and pooled across databases by random effect meta-analyses. RESULTS: 120 795 patients with recorded NAFLD or NASH diagnoses were identified with mean follow-up 2.1-5.5 years. After adjustment for age and smoking the pooled hazard ratio for AMI was 1.17 (95% confidence interval 1.05 to 1.30; 1035 events in participants with NAFLD or NASH, 67 823 in matched controls). In a group with more complete data on risk factors (86 098 NAFLD and 4 664 988 matched controls), the hazard ratio for AMI after adjustment for systolic blood pressure, type 2 diabetes, total cholesterol level, statin use, and hypertension was 1.01 (0.91 to 1.12; 747 events in participants with NAFLD or NASH, 37 462 in matched controls). After adjustment for age and smoking status the pooled hazard ratio for stroke was 1.18 (1.11 to 1.24; 2187 events in participants with NAFLD or NASH, 134 001 in matched controls). 
In the group with more complete data on risk factors, the hazard ratio for stroke was 1.04 (0.99 to 1.09; 1666 events in participants with NAFLD, 83 882 in matched controls) after further adjustment for type 2 diabetes, systolic blood pressure, total cholesterol level, statin use, and hypertension. CONCLUSIONS: The diagnosis of NAFLD in current routine care of 17.7 million patients appears not to be associated with AMI or stroke risk after adjustment for established cardiovascular risk factors. Cardiovascular risk assessment in adults with a diagnosis of NAFLD is important but should be done in the same way as for the general population.
Zhong Q-Y, Mittal LP, Nathan MD, Brown KM, Knudson González D, Cai T, Finan S, Gelaye B, Avillach P, Smoller JW, Karlson EW, Cai T, Williams MA. Use of natural language processing in electronic medical records to identify pregnant women with suicidal behavior: towards a solution to the complex classification problem. Eur J Epidemiol 2019;34(2):153-162.
We developed algorithms to identify pregnant women with suicidal behavior using information extracted from clinical notes by natural language processing (NLP) in electronic medical records. Using both codified data and NLP applied to unstructured clinical notes, we first screened pregnant women in Partners HealthCare for suicidal behavior. Psychiatrists manually reviewed clinical charts to identify relevant features for suicidal behavior and to obtain gold-standard labels. Using the adaptive elastic net, we developed algorithms to classify suicidal behavior. We then validated algorithms in an independent validation dataset. From 275,843 women with codes related to pregnancy or delivery, 9,331 women screened positive for suicidal behavior by either codified data (N = 196) or NLP (N = 9,145). Using expert-curated features, our algorithm achieved an area under the curve of 0.83. By setting a positive predictive value comparable to that of diagnostic codes related to suicidal behavior (0.71), we obtained a sensitivity of 0.34, specificity of 0.96, and negative predictive value of 0.83. The algorithm identified 1,423 pregnant women with suicidal behavior among 9,331 women screened positive. Mining unstructured clinical notes using NLP resulted in an 11-fold increase in the number of pregnant women identified with suicidal behavior, compared with sole reliance on diagnostic codes.
2018
Gutiérrez-Sacristán A, Guedj R, Korodi G, Stedman J, Furlong LI, Patel CJ, Kohane IS, Avillach P. Rcupcake: an R package for querying and analyzing biomedical data through BD2K PIC-SURE RESTful API [Internet]. Bioinformatics 2018.

Motivation: In the era of big data and precision medicine, the number of databases containing clinical, environmental, self-reported and biochemical variables is increasing exponentially. Enabling the experts to focus on their research questions rather than on computational data management, access and analysis is one of the most significant challenges nowadays.

Results: We present Rcupcake, an R package that contains a variety of functions for leveraging different databases through the BD2K PIC-SURE RESTful API and facilitating its query, analysis and interpretation. The package offers a variety of analysis and visualization tools, including the study of the phenotype co-occurrence and prevalence, according to multiple layers of data, such as phenome, exposome or genome.

Availability and implementation: The package is implemented in R and is available under Mozilla v2 license from GitHub (https://github.com/hms-dbmi/Rcupcake). Two reproducible case studies are also available (https://github.com/hms-dbmi/Rcupcake-case-studies/blob/master/SSCcaseStu..., https://github.com/hms-dbmi/Rcupcake-case-studies/blob/master/NHANEScase...).

Contact: paul_avillach@hms.harvard.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Zhong Q-Y, Karlson EW, Gelaye B, Finan S, Avillach P, Smoller JW, Cai T, Williams MA. Screening pregnant women for suicidal behavior in electronic medical records: diagnostic codes vs. clinical notes processed by natural language processing. BMC Med Inform Decis Mak 2018;18(1):30.
BACKGROUND: We examined the comparative performance of structured, diagnostic codes vs. natural language processing (NLP) of unstructured text for screening suicidal behavior among pregnant women in electronic medical records (EMRs). METHODS: Women aged 10-64 years with at least one diagnostic code related to pregnancy or delivery (N = 275,843) from Partners HealthCare were included as our "datamart." Diagnostic codes related to suicidal behavior were applied to the datamart to screen women for suicidal behavior. Among women without any diagnostic codes related to suicidal behavior (n = 273,410), 5880 women were randomly sampled, of whom 1120 had at least one mention of terms related to suicidal behavior in clinical notes. NLP was then used to process clinical notes for the 1120 women. Chart reviews were performed for subsamples of women. RESULTS: Using diagnostic codes, 196 pregnant women were screened positive for suicidal behavior, among whom 149 (76%) had confirmed suicidal behavior by chart review. Using NLP among those without diagnostic codes, 486 pregnant women were screened positive for suicidal behavior, among whom 146 (30%) had confirmed suicidal behavior by chart review. CONCLUSIONS: The use of NLP substantially improves the sensitivity of screening suicidal behavior in EMRs. However, the prevalence of confirmed suicidal behavior was lower among women who did not have diagnostic codes for suicidal behavior but screened positive by NLP. NLP should be used together with diagnostic codes for future EMR-based phenotyping studies for suicidal behavior.
Alexander M, Loomis KA, Fairburn-Beech J, van der Lei J, Duarte-Salles T, Prieto-Alhambra D, Ansell D, Pasqua A, Lapi F, Rijnbeek P, Mosseveld M, Avillach P, Egger P, Kendrick S, Waterworth DM, Sattar N, Alazawi W. Real-world data reveal a diagnostic gap in non-alcoholic fatty liver disease. BMC Med 2018;16(1):130.
BACKGROUND: Non-alcoholic fatty liver disease (NAFLD) is the most common cause of liver disease worldwide. It affects an estimated 20% of the general population, based on cohort studies of varying size and heterogeneous selection. However, the prevalence and incidence of recorded NAFLD diagnoses in unselected real-world health-care records is unknown. We harmonised health records from four major European territories and assessed age- and sex-specific point prevalence and incidence of NAFLD over the past decade. METHODS: Data were extracted from The Health Improvement Network (UK), Health Search Database (Italy), Information System for Research in Primary Care (Spain) and Integrated Primary Care Information (Netherlands). Each database uses a different coding system. Prevalence and incidence estimates were pooled across databases by random-effects meta-analysis after a log-transformation. RESULTS: Data were available for 17,669,973 adults, of which 176,114 had a recorded diagnosis of NAFLD. Pooled prevalence trebled from 0.60% in 2007 (95% confidence interval: 0.41-0.79) to 1.85% (0.91-2.79) in 2014. Incidence doubled from 1.32 (0.83-1.82) to 2.35 (1.29-3.40) per 1000 person-years. The FIB-4 non-invasive estimate of liver fibrosis could be calculated in 40.6% of patients, of whom 29.6-35.7% had indeterminate or high-risk scores. CONCLUSIONS: In the largest primary-care record study of its kind to date, rates of recorded NAFLD are much lower than expected suggesting under-diagnosis and under-recording. Despite this, we have identified rising incidence and prevalence of the diagnosis. Improved recognition of NAFLD may identify people who will benefit from risk factor modification or emerging therapies to prevent progression to cardiometabolic and hepatic complications.
Zhong QY, Gelaye B, Fricchione GL, Avillach P, Karlson EW, Williams MA. Adverse obstetric and neonatal outcomes complicated by psychosis among pregnant women in the United States [Internet]. BMC Pregnancy Childbirth 2018;18(1).
Adverse obstetric and neonatal outcomes among women with psychosis, particularly affective psychosis, have rarely been studied at the population level. We aimed to assess the risk of adverse obstetric and neonatal outcomes among women with psychosis (schizophrenia, affective psychosis, and other psychoses).
Perera G, Pedersen L, Ansel D, Alexander M, Arrighi MH, Avillach P, Foskett N, Gini R, Gordon MF, Gungabissoon U, Mayer M-A, Novak G, Rijnbeek P, Trifirò G, van der Lei J, Visser PJ, Stewart R. Dementia prevalence and incidence in a federation of European Electronic Health Record databases: The European Medical Informatics Framework resource. Alzheimers Dement 2018;14(2):130-139.
INTRODUCTION: The European Medical Information Framework consortium has assembled electronic health record (EHR) databases for dementia research. We calculated dementia prevalence and incidence in 25 million persons from 2004 to 2012. METHODS: Six EHR databases (three primary care and three secondary care) from five countries were interrogated. Dementia was ascertained by consensus harmonization of clinical/diagnostic codes. Annual period prevalences and incidences by age and gender were calculated and meta-analyzed. RESULTS: The six databases contained 138,625 dementia cases. Age-specific prevalences were around 30% of published estimates from community samples and incidences were around 50%. Pooled prevalences had increased from 2004 to 2012 in all age groups but pooled incidences only after age 75 years. Associations with age and gender were stable over time. DISCUSSION: The European Medical Information Framework initiative supports EHR data on an unprecedented number of people with dementia. Age-specific prevalences and incidences mirror estimates from community samples in pattern, at levels that are lower but increasing over time.
Zhong Q-Y, Gelaye B, Smoller JW, Avillach P, Cai T, Williams MA. Adverse obstetric outcomes during delivery hospitalizations complicated by suicidal behavior among US pregnant women. PLoS One 2018;13(2):e0192943.
OBJECTIVE: The effects of suicidal behavior on obstetric outcomes remain dangerously unquantified. We sought to report on the risk of adverse obstetric outcomes for US women with suicidal behavior at the time of delivery. METHODS: We performed a cross-sectional analysis of delivery hospitalizations from 2007-2012 National (Nationwide) Inpatient Sample. From the same hospitalization record, International Classification of Diseases codes were used to identify suicidal behavior and adverse obstetric outcomes. Adjusted odds ratios (aOR) and 95% confidence intervals (CI) were obtained using logistic regression. RESULTS: Of the 23,507,597 delivery hospitalizations, 2,180 were complicated by suicidal behavior. Women with suicidal behavior were at a heightened risk for outcomes including antepartum hemorrhage (aOR = 2.34; 95% CI: 1.47-3.74), placental abruption (aOR = 2.07; 95% CI: 1.17-3.66), postpartum hemorrhage (aOR = 2.33; 95% CI: 1.61-3.37), premature delivery (aOR = 3.08; 95% CI: 2.43-3.90), stillbirth (aOR = 10.73; 95% CI: 7.41-15.56), poor fetal growth (aOR = 1.70; 95% CI: 1.10-2.62), and fetal anomalies (aOR = 3.72; 95% CI: 2.57-5.40). No significant association was observed for maternal suicidal behavior with cesarean delivery, induction of labor, premature rupture of membranes, excessive fetal growth, and fetal distress. The mean length of stay was longer for women with suicidal behavior. CONCLUSION: During delivery hospitalization, women with suicidal behavior are at increased risk for many adverse obstetric outcomes, highlighting the importance of screening for and providing appropriate clinical care for women with suicidal behavior during pregnancy.
2017
Bourgeois FT, Avillach P, Kong SW, Heinz MM, Tran TA, Chakrabarty R, Bickel J, Sliz P, Borglund EM, Kornetsky S, Mandl KD. Development of the Precision Link Biobank at Boston Children's Hospital: Challenges and Opportunities. J Pers Med 2017;7(4).
Increasingly, biobanks are being developed to support organized collections of biological specimens and associated clinical information on broadly consented, diverse patient populations. We describe the implementation of a pediatric biobank, comprised of a fully-informed patient cohort linking specimens to phenotypic data derived from electronic health records (EHR). The Biobank was launched after multiple stakeholders' input and implemented initially in a pilot phase before hospital-wide expansion in 2016. In-person informed consent is obtained from all participants enrolling in the Biobank and provides permission to: (1) access EHR data for research; (2) collect and use residual specimens produced as by-products of routine care; and (3) share de-identified data and specimens outside of the institution. Participants are recruited throughout the hospital, across diverse clinical settings. We have enrolled 4900 patients to date, and 41% of these have an associated blood sample for DNA processing. Current efforts are focused on aligning the Biobank with other ongoing research efforts at our institution and extending our electronic consenting system to support remote enrollment. A number of pediatric-specific challenges and opportunities are reviewed, including the need to re-consent patients when they reach 18 years of age, the ability to enroll family members accompanying patients, and alignment with disease-specific research efforts at our institution and other pediatric centers to increase cohort sizes, particularly for rare diseases.
Gutiérrez-Sacristán A, Guedj R, Korodi G, Stedman J, Furlong LI, Patel CJ, Kohane IS, Avillach P. Rcupcake: an R package for querying and analyzing biomedical data through the BD2K PIC-SURE RESTful API. Bioinformatics 2017.
Motivation: In the era of big data and precision medicine, the number of databases containing clinical, environmental, self-reported, and biochemical variables is increasing exponentially. Enabling the experts to focus on their research questions rather than on computational data management, access and analysis is one of the most significant challenges nowadays. Results: We present Rcupcake, an R package that contains a variety of functions for leveraging different databases through the BD2K PIC-SURE RESTful API and facilitating its query, analysis and interpretation. The package offers a variety of analysis and visualization tools, including the study of the phenotype co-occurrence and prevalence, according to multiple layers of data, such as phenome, exposome or genome. Availability: The package is implemented in R and is available under Mozilla v2 license from GitHub (https://github.com/hms-dbmi/Rcupcake). Two reproducible case studies are also available (https://github.com/hms-dbmi/Rcupcake-case-studies/blob/master/SSCcaseStu..., https://github.com/hms-dbmi/Rcupcake-case-studies/blob/master/NHANEScase...). Contact: paul_avillach@hms.harvard.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Tran A, Tran L, Geghre N, Darmon D, M R, D B, JM G, H H, K R-S, H C, P A. Health assessment of French university students and risk factors associated with mental health disorders. PLoS One 2017;12(11).

The first year of university is a particularly stressful period and can impact academic performance and students' health. The aim of this study was to evaluate the health and lifestyle of undergraduates and assess risk factors associated with psychiatric symptoms.

Between September 2012 and June 2013, we included all undergraduate students who underwent a compulsory medical visit at the university medical service in Nice (France), during which they were screened for potential diseases in a diagnostic interview. Data were collected prospectively in the CALCIUM database (Consultations Assistés par Logiciel pour les Centres Inter-Universitaire de Médecine) and included information about the students' lifestyle (living conditions, dietary behavior, physical activity, use of recreational drugs). The prevalence of psychiatric symptoms related to depression, anxiety and panic attacks was assessed and risk factors for these symptoms were analyzed using logistic regression.

A total of 4,184 undergraduates were included. The prevalences of depression, anxiety and panic attacks were 12.6%, 7.6% and 1.0%, respectively. During the 30 days preceding the evaluation, 0.6% of the students regularly drank alcohol, 6.3% were frequent-to-heavy tobacco smokers, and 10.0% smoked marijuana. Dealing with financial difficulties and having learning disabilities were associated with psychiatric symptoms. Students who were dissatisfied with their living conditions and those with poor dietary behavior were at risk of depression. Being a woman and living alone were associated with anxiety. Students who screened positive for any of the psychiatric disorders assessed were at a higher risk of having another psychiatric disorder concomitantly.

The prevalence of psychiatric disorders in undergraduate students is low but the rate of students at risk of developing chronic disease is far from being negligible. Understanding predictors for these symptoms may improve students' health by implementing targeted prevention campaigns. Further research in other French universities is necessary to confirm our results.

Murphy SN, Avillach P, Bellazzi R, Phillips L, Gabetta M, Eran A, McDuffie MT, Kohane IS. Combining clinical and genomics queries using i2b2 - Three methods. PLoS One 2017.
We are fortunate to be living in an era of twin biomedical data surges: a burgeoning representation of human phenotypes in the medical records of our healthcare systems, and high-throughput sequencing making rapid technological advances. The difficulty representing genomic data and its annotations has almost by itself led to the recognition of a biomedical “Big Data” challenge, and the complexity of healthcare data only compounds the problem to the point that coherent representation of both systems on the same platform seems insuperably difficult. We investigated the capability for complex, integrative genomic and clinical queries to be supported in the Informatics for Integrating Biology and the Bedside (i2b2) translational software package. Three different data integration approaches were developed: The first is based on Sequence Ontology, the second is based on the tranSMART engine, and the third on CouchDB. These novel methods for representing and querying complex genomic and clinical data on the i2b2 platform are available today for advancing precision medicine.
Jannot A-S, Zapletal E, Avillach P, Mamzer M-F, Burgan A, Degoulet P. The Georges Pompidou University Hospital Clinical Data Warehouse: A 8-years follow-up experience. International Journal of Medical Informatics 2017;102:21-28.

Background

When developed jointly with clinical information systems, clinical data warehouses (CDWs) facilitate the reuse of healthcare data and leverage clinical research.

Objective

To describe both data access and use for clinical research, epidemiology and health service research of the “Hôpital Européen Georges Pompidou” (HEGP) CDW.

Methods

The CDW has been developed since 2008 using an i2b2 platform. It was made available to health professionals and researchers in October 2010. Procedures to access data have been implemented and different access levels have been distinguished according to the nature of queries.

Results

As of July 2016, the CDW contained the consolidated data of over 860,000 patients followed since the opening of the HEGP hospital in July 2000. These data correspond to more than 122 million clinical item values, 124 million biological item values, and 3.7 million free text reports. The ethics committee of the hospital evaluates all CDW projects that generate secondary data marts. Characteristics of the 74 research projects validated between January 2011 and December 2015 are described.

Conclusion

The use of the HEGP CDW is a key facilitator for clinical research studies. However, it required significant methodological and organizational support efforts from a biomedical informatics department.

Becker BFH, Avillach P, Romio S, van Mulligen EM, Weibel D, Sturkenboom MCJM, Kors JA. CodeMapper: semiautomatic coding of case definitions. A contribution from the ADVANCE project. Pharmacoepidemiology & Drug Safety 2017;26(8):990-1005.

Assessment of drug and vaccine effects by combining information from different healthcare databases in the European Union requires extensive efforts in the harmonization of codes as different vocabularies are being used across countries. In this paper, we present a web application called CodeMapper, which assists in the mapping of case definitions to codes from different vocabularies, while keeping a transparent record of the complete mapping process.

CodeMapper builds upon coding vocabularies contained in the Metathesaurus of the Unified Medical Language System. The mapping approach consists of three phases. First, medical concepts are automatically identified in a free-text case definition. Second, the user revises the set of medical concepts by adding or removing concepts, or expanding them to related concepts that are more general or more specific. Finally, the selected concepts are projected to codes from the targeted coding vocabularies. We evaluated the application by comparing codes that were automatically generated from case definitions by applying CodeMapper's concept identification and successive concept expansion, with reference codes that were manually created in a previous epidemiological study.

Automated concept identification alone had a sensitivity of 0.246 and positive predictive value (PPV) of 0.420 for reproducing the reference codes. Three successive steps of concept expansion increased sensitivity to 0.953 and PPV to 0.616.

Automatic concept identification in the case definition alone was insufficient to reproduce the reference codes, but CodeMapper's operations for concept expansion provide an effective, efficient, and transparent way for reproducing the reference codes.
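
As a rough illustration of how the sensitivity and PPV figures above are defined, the following minimal sketch compares a set of automatically generated codes against a set of reference codes. The code values and the function name are hypothetical and are not taken from CodeMapper or the ADVANCE project; only the two metric definitions are standard.

```python
def sensitivity_and_ppv(generated, reference):
    """Compare generated codes with reference codes.

    sensitivity = reference codes that were reproduced / all reference codes
    PPV         = generated codes that are in the reference / all generated codes
    """
    true_positives = len(generated & reference)
    sensitivity = true_positives / len(reference) if reference else 0.0
    ppv = true_positives / len(generated) if generated else 0.0
    return sensitivity, ppv


# Hypothetical code sets, for illustration only.
reference_codes = {"A01", "A02", "A03", "A04"}
generated_codes = {"A01", "A02", "B09"}

sens, ppv = sensitivity_and_ppv(generated_codes, reference_codes)
print(f"sensitivity={sens:.3f}, PPV={ppv:.3f}")
```

Under this reading, expanding concepts adds codes to the generated set, which can raise sensitivity, and PPV rises too whenever the added codes mostly belong to the reference set.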

Kothari C, Wack M, Hassen-Khodja C, Finan S, Savova G, O'Boyle M, Bliss G, Cornell A, Horn E, Davis R, Jacobs J, Kohane I, Avillach P. Phelan-McDermid syndrome data network: Integrating patient reported outcomes with clinical notes and curated genetic reports [Internet]. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics 2017.
The heterogeneity of patient phenotype data is an impediment to research into the origins and progression of neuropsychiatric disorders. This difficulty is compounded in the case of rare disorders such as Phelan-McDermid Syndrome (PMS) by the paucity of patient clinical data. PMS is a rare syndromic genetic cause of autism and intellectual deficiency. In this paper, we describe the Phelan-McDermid Syndrome Data Network (PMS_DN), a platform that facilitates research into phenotype–genotype correlation and progression of PMS by: a) integrating knowledge of patient phenotypes extracted from Patient Reported Outcomes (PRO) data and clinical notes (two heterogeneous, underutilized sources of knowledge about patient phenotypes) with curated genetic information from the same patient cohort and b) making this integrated knowledge, along with a suite of statistical tools, available free of charge to authorized investigators on a Web portal (https://pmsdn.hms.harvard.edu). PMS_DN is a Patient Centric Outcomes Research Initiative (PCORI) where patients and their families are involved in all aspects of the management of patient data in driving research into PMS. To foster collaborative research, PMS_DN also makes patient aggregates from this knowledge available to authorized investigators using distributed research networks such as the PCORnet PopMedNet. PMS_DN is hosted on a scalable cloud based environment and complies with all patient data privacy regulations. As of October 31, 2016, PMS_DN integrates high-quality knowledge extracted from the clinical notes of 112 patients and curated genetic reports of 176 patients with preprocessed PRO data from 415 patients.
2016
Patel CJ, Pho N, McDuffie M, Easton-Marks J, Kothari C, Kohane IS, Avillach P. A database of human exposomes and phenomes from the US National Health and Nutrition Examination Survey. Sci Data 2016;3:160096.
The National Health and Nutrition Examination Survey (NHANES) is a population survey implemented by the Centers for Disease Control and Prevention (CDC) to monitor the health of the United States, whose data is publicly available in hundreds of files. This Data Descriptor describes a single unified and universally accessible data file, merging across 255 separate files and stitching data across 4 surveys, encompassing 41,474 individuals and 1,191 variables. The variables consist of phenotype and environmental exposure information on each individual, specifically (1) demographic information, (2) physical exam results (e.g., height, body mass index), (3) laboratory results (e.g., cholesterol, glucose, and environmental exposures), and (4) questionnaire items. Second, the data descriptor describes a dictionary to enable analysts to find variables by category and human-readable description. The datasets are available on DataDryad and a hands-on analytics tutorial is available on GitHub. Through a new big data platform, BD2K Patient Centered Information Commons (http://pic-sure.org), we provide a new way to browse the dataset via a web browser (https://nhanes.hms.harvard.edu) and provide application programming interface for programmatic access.
Tenenbaum JD, Avillach P, Benham-Hutchins M, Breitenstein MK, Crowgey EL, Hoffman MA, Jiang X, Madhavan S, Mattison JE, Nagarajan R, Ray B, Shin D, Visweswaran S, Zhao Z, Freimuth RR. An informatics research agenda to support precision medicine: seven key areas. J Am Med Inform Assoc 2016;23(4):791-5.
The recent announcement of the Precision Medicine Initiative by President Obama has brought precision medicine (PM) to the forefront for healthcare providers, researchers, regulators, innovators, and funders alike. As technologies continue to evolve and datasets grow in magnitude, a strong computational infrastructure will be essential to realize PM's vision of improved healthcare derived from personal data. In addition, informatics research and innovation affords a tremendous opportunity to drive the science underlying PM. The informatics community must lead the development of technologies and methodologies that will increase the discovery and application of biomedical knowledge through close collaboration between researchers, clinicians, and patients. This perspective highlights seven key areas that are in need of further informatics research and innovation to support the realization of PM.
Gini R, Schuemie M, Brown J, Ryan P, Vacchi E, Coppola M, Cazzola W, Coloma P, Berni R, Diallo G, Oliveira JL, Avillach P, Trifirò G, Rijnbeek P, Bellentani M, van der Lei J, Klazinga N, Sturkenboom M. Data Extraction and Management in Networks of Observational Health Care Databases for Scientific Research: A Comparison of EU-ADR, OMOP, Mini-Sentinel and MATRICE Strategies. EGEMS (Wash DC) 2016;4(1):1189.
INTRODUCTION: We see increased use of existing observational data in order to achieve fast and transparent production of empirical evidence in health care research. Multiple databases are often used to increase power, to assess rare exposures or outcomes, or to study diverse populations. For privacy and sociological reasons, original data on individual subjects can't be shared, requiring a distributed network approach where data processing is performed prior to data sharing. CASE DESCRIPTIONS AND VARIATION AMONG SITES: We created a conceptual framework distinguishing three steps in local data processing: (1) data reorganization into a data structure common across the network; (2) derivation of study variables not present in original data; and (3) application of study design to transform longitudinal data into aggregated data sets for statistical analysis. We applied this framework to four case studies to identify similarities and differences in the United States and Europe: Exploring and Understanding Adverse Drug Reactions by Integrative Mining of Clinical Records and Biomedical Knowledge (EU-ADR), Observational Medical Outcomes Partnership (OMOP), the Food and Drug Administration's (FDA's) Mini-Sentinel, and the Italian network-the Integration of Content Management Information on the Territory of Patients with Complex Diseases or with Chronic Conditions (MATRICE). FINDINGS: National networks (OMOP, Mini-Sentinel, MATRICE) all adopted shared procedures for local data reorganization. The multinational EU-ADR network needed locally defined procedures to reorganize its heterogeneous data into a common structure. Derivation of new data elements was centrally defined in all networks but the procedure was not shared in EU-ADR. Application of study design was a common and shared procedure in all the case studies. Computer procedures were embodied in different programming languages, including SAS, R, SQL, Java, and C++. 
CONCLUSION: Using our conceptual framework we found several areas that would benefit from research to identify optimal standards for production of empirical knowledge from existing databases.
Roberto G, Leal I, Sattar N, Loomis KA, Avillach P, Egger P, van Wijngaarden R, Ansell D, Reisberg S, Tammesoo M-L, Alavere H, Pasqua A, Pedersen L, Cunningham J, Tramontan L, Mayer MA, Herings R, Coloma P, Lapi F, Sturkenboom M, van der Lei J, Schuemie MJ, Rijnbeek P, Gini R. Identifying Cases of Type 2 Diabetes in Heterogeneous Data Sources: Strategy from the EMIF Project. PLoS One 2016;11(8):e0160648.
Due to the heterogeneity of existing European sources of observational healthcare data, data source-tailored choices are needed to execute multi-data source, multi-national epidemiological studies. This makes transparent documentation paramount. In this proof-of-concept study, a novel standard data derivation procedure was tested in a set of heterogeneous data sources. Identification of subjects with type 2 diabetes (T2DM) was the test case. We included three primary care data sources (PCDs), three record linkage of administrative and/or registry data sources (RLDs), one hospital and one biobank. Overall, data from 12 million subjects from six European countries were extracted. Based on a shared event definition, sixteen standard algorithms (components) useful to identify T2DM cases were generated through a top-down/bottom-up iterative approach. Each component was based on one single data domain among diagnoses, drugs, diagnostic test utilization and laboratory results. Diagnoses-based components were subclassified considering the healthcare setting (primary, secondary, inpatient care). The Unified Medical Language System was used for semantic harmonization within data domains. Individual components were extracted and proportion of population identified was compared across data sources. Drug-based components performed similarly in RLDs and PCDs, unlike diagnoses-based components. Using components as building blocks, logical combinations with AND, OR, AND NOT were tested and local experts recommended their preferred data source-tailored combination. The population identified per data source by the resulting algorithms varied from 3.5% to 15.7%; however, age-specific results were fairly comparable. The impact of individual components was assessed: diagnoses-based components identified the majority of cases in PCDs (93-100%), while drug-based components were the main contributors in RLDs (81-100%).
The proposed data derivation procedure allowed the generation of data source-tailored case-finding algorithms in a standardized fashion, facilitated transparent documentation of the process and benchmarking of data sources, and provided bases for interpretation of possible inter-data source inconsistency of findings in future studies.
