AWS - tranSMART pilot study on the cloud using Redshift

The Phelan-McDermid Syndrome Data Network (PMS_DN) aims to advance knowledge about Phelan-McDermid Syndrome (PMS) by integrating diverse, complex data into a central database for patient-centered research. PMS is an extremely rare genetic disease and a known cause of autism.

PMS_DN is an 18-month project, funded since April 2014 by the US Patient-Centered Outcomes Research Institute (PCORI). We aim to port our prototype infrastructure at Harvard Medical School, which enables rapid analyses of patient data, to a HIPAA-compliant, scalable and flexible cloud-based environment for future research.

Individual, nominative Protected Health Information (PHI) sourced directly from patients and their families through the PMS International Registry will be integrated with Electronic Health Record (EHR) data on the i2b2/tranSMART platform. This data will be hosted on the cloud under the HIPAA Business Associate Agreement between Harvard University and AWS. EHRs contain patients' health information and free-text clinical notes describing the medical condition of the patient. The cTAKES tool for Natural Language Processing is used to extract knowledge from clinical notes using medical terminologies (such as SNOMED, ICD9, etc.) prior to loading into i2b2/tranSMART.

The open source i2b2/tranSMART platform comprises:
- i2b2, a data model used to store patient-level information, and
- tranSMART, a web portal providing non-specialist users with tools to interrogate the data and formulate research hypotheses.

To handle increasing data volumes and potential new sources of data, and to promote patient engagement, we intend to conduct a pilot study that involves hosting this infrastructure on AWS, which provides large storage capacity, computing power and high availability.
Our project plan comprises two phases:
- Phase 1 will involve transforming the open access NDAR (National Database for Autism Research) database to the i2b2/tranSMART model. The transformed NDAR and the existing Oracle databases will be ported to the AWS cloud. The computing power of AWS will be leveraged to identify comorbidities shared by PMS and autism patients, and correlations between clinical and genomic data.
- Phase 2 will involve the adaptation of the i2b2 data model to Redshift for the storage and analysis of increasing volumes of patient data. This will entail transferring NDAR phenotypic and annotated genetic data and PMS_DN data to Redshift nodes.
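
As an illustration of what the Phase 2 adaptation could look like, the sketch below generates a Redshift table definition for a subset of the i2b2 observation_fact table. It is a minimal sketch under stated assumptions, not the project's schema: the column subset, the SQL types, the helper name redshift_ddl, and the choice of patient_num as distribution key with a compound sort key on patient_num and concept_cd are all hypothetical.

```python
# Hypothetical sketch: the column subset, SQL types and key choices below
# are illustrative assumptions, not the PMS_DN production schema.
OBSERVATION_FACT_COLUMNS = [
    ("encounter_num", "INTEGER NOT NULL"),
    ("patient_num", "INTEGER NOT NULL"),
    ("concept_cd", "VARCHAR(50) NOT NULL"),
    ("provider_id", "VARCHAR(50) NOT NULL"),
    ("start_date", "TIMESTAMP NOT NULL"),
    ("modifier_cd", "VARCHAR(100) NOT NULL"),
    ("valtype_cd", "VARCHAR(50)"),
    ("tval_char", "VARCHAR(255)"),
    ("nval_num", "DECIMAL(18,5)"),
]


def redshift_ddl(table, columns, dist_key, sort_keys):
    """Build a CREATE TABLE statement with Redshift distribution and sort keys."""
    names = {name for name, _ in columns}
    missing = ({dist_key} | set(sort_keys)) - names
    if missing:
        raise ValueError(f"unknown key column(s): {sorted(missing)}")
    cols = ",\n".join(f"    {name} {sqltype}" for name, sqltype in columns)
    return (
        f"CREATE TABLE {table} (\n{cols}\n)\n"
        "DISTSTYLE KEY\n"
        f"DISTKEY ({dist_key})\n"
        f"COMPOUND SORTKEY ({', '.join(sort_keys)});"
    )


# Distributing on patient_num keeps each patient's facts on one slice,
# so per-patient queries avoid shuffling rows between nodes.
ddl = redshift_ddl(
    "observation_fact",
    OBSERVATION_FACT_COLUMNS,
    dist_key="patient_num",
    sort_keys=["patient_num", "concept_cd"],
)
print(ddl)
```

Whether patient_num is the right distribution key would depend on the actual query mix; concept-centric queries might favor a different layout.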

The massively parallel processing capabilities of Redshift will enable users to query NDAR and PMS_DN data efficiently through i2b2/tranSMART. The high availability of the infrastructure will give researchers, as well as patients and their families, ready access to the platform, promoting their active engagement in governance, prioritization of research questions, and transparency of patient data use in research.
This is our first attempt at using Redshift with patient data, and the work will be extended in the future to ongoing projects:
- NIH/BD2K Patient Centric Information Common
- NIH Neuropsychiatric Genome-Scale and RDoC Individualized Domains (N-GRID)
- Boston Children’s Hospital, Cincinnati Children’s Hospital Medical Center, Children’s Hospital of Philadelphia Genomics Research and Innovation Network (GRIN)

We request that Amazon fund this project by providing the following infrastructure:
- m3.xlarge instances for the NLP and the ETL servers
- c4.4xlarge instance for the i2b2/tranSMART server
- m3.2xlarge instance with 2 TB provisioned IOPS storage to host the Oracle database
- dw1.xlarge on-demand node for the PMS_DN + NDAR low activity Redshift server
- dw2.xlarge on-demand node for high-throughput computations.