The AERS dataset is one of the few remaining large, publicly available medical datasets that until now had not been published as Linked Data. An adverse event (AE) is an adverse change in health, or a side effect, that occurs while the patient is receiving treatment. A serious adverse event (SAE) is one that, amongst other outcomes, is life-threatening, results in death, requires hospitalisation or prolongs an existing hospitalisation, or results in persistent or significant disability or incapacity. Known chemotherapy-related SAEs in breast cancer (US only) were linked to 22% of hospitalisations. From a clinical perspective serious adverse events are clearly very important, and this is where Clinical Decision Support can make a substantial difference.
The AERS data files are published quarterly as zip files containing dollar-separated tables. Each zip file is roughly 20MB in size, and the files are available on the FDA website, spread over two separate static webpages. Converting this data is a process with six steps:
This conversion was implemented as a pipeline that is called through a Python provenance wrapper, PROV-O-Matic (see http://github.com/Data2Semantics). The wrapper generates provenance information expressed in the PROV-O vocabulary.
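To make the input format concrete, the following is a minimal sketch of reading one dollar-separated table out of a quarterly zip file. It is not the pipeline's actual code: the member name DRUG_SAMPLE.TXT, the column names ISR and DRUGNAME, and the latin-1 encoding are assumptions chosen for illustration, and the archive is built in memory so that the example runs without downloading anything.

```python
import csv
import io
import zipfile


def read_dollar_table(zip_bytes, member_name):
    """Return the rows of one dollar-separated table as a list of dicts."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member_name) as raw:
            # Assumption: latin-1 decoding; the real files may use another encoding.
            text = io.TextIOWrapper(raw, encoding="latin-1", newline="")
            # '$' is the field separator; quoting is disabled so that quote
            # characters inside hand-typed values are kept verbatim.
            reader = csv.DictReader(text, delimiter="$", quoting=csv.QUOTE_NONE)
            return [dict(row) for row in reader]


def build_sample_zip():
    """Build a tiny in-memory archive standing in for a quarterly download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Hypothetical member name and columns, for illustration only.
        archive.writestr(
            "DRUG_SAMPLE.TXT",
            "ISR$DRUGNAME\n1001$ASPIRIN\n1002$TAXOL\n",
        )
    return buffer.getvalue()


rows = read_dollar_table(build_sample_zip(), "DRUG_SAMPLE.TXT")
for row in rows:
    print(row["ISR"], row["DRUGNAME"])
```

Each row comes back as a plain dictionary keyed by the header line, which is a convenient intermediate form before the values are turned into RDF resources and literals.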
The AERS dataset is uniquely positioned among other HCLS datasets, as it offers opportunities for linking to drug, location, patient and diagnosis related information. Furthermore, reports in AERS are filled in by hand, so linking out to other datasets could help in identity reconciliation (e.g. of drug names, marketing names and chemical substances) as well as in detecting misspellings (e.g. in manufacturer names). We specified mappings between the UMLS, Sider, LinkedCT, Drugbank, DBPedia and CTCAE datasets using the SILK link specification language, resulting in over 60K links based only on exact string matches. Using less exact matching on drug names can have unwanted consequences, such as links between distinct drugs whose names merely look alike.
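The idea of linking on exact string matches can be sketched in a few lines of Python. This is an illustration only, not the SILK specification used for AERS-LD: the example.org URIs and the labels are hypothetical, and the normalisation step (lower-casing and collapsing whitespace) is an assumption about what counts as an exact match.

```python
from collections import defaultdict


def normalise(label):
    """Lower-case and collapse whitespace; no fuzzy matching is applied."""
    return " ".join(label.lower().split())


def exact_match_links(source_labels, target_labels):
    """Pair source and target URIs whose normalised labels are identical."""
    # Index the target dataset by normalised label for constant-time lookup.
    index = defaultdict(list)
    for uri, label in target_labels.items():
        index[normalise(label)].append(uri)

    links = []
    for uri, label in source_labels.items():
        for target_uri in index.get(normalise(label), []):
            links.append((uri, target_uri))
    return links


# Hypothetical labels standing in for hand-typed report values and a drug dataset.
report_drugs = {
    "http://example.org/aers/drug/1": "Aspirin",
    "http://example.org/aers/drug/2": "Taxol ",
}
reference_drugs = {
    "http://example.org/reference/A": "aspirin",
    "http://example.org/reference/B": "Paclitaxel",
}

links = exact_match_links(report_drugs, reference_drugs)
for source_uri, target_uri in links:
    print(source_uri, "->", target_uri)
```

Only the first report drug is linked here. "Taxol" and "Paclitaxel" are left unlinked even though they name the same substance, which shows the cost of exact matching: it avoids spurious links but misses synonyms and misspellings.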
The fields of health care and life science (HCLS) have traditionally received a lot of attention from the Semantic Web community, and vice versa: Semantic Web languages and their predecessors have proven to be a convenient paradigm for representing biomedical knowledge.
Vocabularies in the HCLS field are highly standardised, and computer analysis and computer-based information exchange are ubiquitous throughout the field (cf. the Humanities). As a result, many (bio)medical databases and terminologies are now published as Linked Data, taking up about a quarter of the Linked Data cloud. Examples are medical vocabularies such as SnomedCT, MeSH, MedDRA and the NCI Thesaurus (all part of the Unified Medical Language System (UMLS)), and datasets such as LinkedCT (clinical trials), Sider, Drugbank and RxNorm (drug information) and Uniprot (protein sequences), to name but a few.
The AERS-LD dataset covers all AERS reports from the years 2005-2012. Reports of other years are available as separate dumps.
The Data2Semantics project is a consortium of VU University Amsterdam, the University of Amsterdam, Elsevier Publishing, Philips Research and Data Archiving and Networked Services (DANS) of the Royal Netherlands Academy of Arts and Sciences. Data2Semantics is funded under COMMIT.