DataSHIELD: Taking the analysis to the data
Irrespective of discipline, barriers to data access and analysis arise in a range of scenarios: ethical-legal restrictions surrounding confidentiality and the sharing of, or access to, disclosive data; intellectual property or licensing issues surrounding research access to raw data; and the physical size of the data, which can itself be a limiting factor. DataSHIELD (www.datashield.ac.uk) was born of the requirement in the biomedical and social sciences to co-analyse individual patient data from different sources without disclosing sensitive information. DataSHIELD comprises a series of R packages that enable the researcher to perform distributed analysis on individual-level data whilst satisfying the strict ethical-legal-governance restrictions on sharing data of this type. Furthermore, under the DataSHIELD infrastructure, which is set up as a client-server model, raw data never leaves the data provider (the server) and no individual-level data can be seen by the researcher (the client). Base functionality in the DataSHIELD R packages includes descriptive statistics (e.g. mean), exploratory statistics (e.g. histogram), contingency tables (1-dimensional and 2-dimensional frequency tables) and modelling (generalised linear models, and survival analysis using piecewise exponential regression). The modular nature of DataSHIELD has allowed additional data types to be scoped, with a view to expanding DataSHIELD functionality to genomic, text and geospatial data. Different infrastructure models are also possible, tailored for pooled co-analysis, single-site analysis and linked data analysis. DataSHIELD has been successfully piloted in two European biomedical studies, sharing data across 14 different biobanks to investigate healthy obesity and the effect of environmental determinants on health. It is of proven value in the biomedical and social science domains, and has potential utility beyond them.
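
The client-server principle described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the DataSHIELD API, which is provided by R packages. The names DataServer, summary and pooled_mean are hypothetical, and the sketch assumes that each server returns only a sum and a count, which the client then combines into a pooled mean.

```python
from typing import List, Sequence, Tuple


class DataServer:
    """Hypothetical data provider: individual-level records stay inside it."""

    def __init__(self, name: str, values: Sequence[float]) -> None:
        self.name = name
        # Individual-level data, never returned to the client.
        self._values = list(values)

    def summary(self) -> Tuple[float, int]:
        """Return only an aggregate (sum, count), not the records themselves."""
        return (sum(self._values), len(self._values))


def pooled_mean(servers: List[DataServer]) -> float:
    """Client side: combine per-server aggregates into one pooled mean."""
    parts = [server.summary() for server in servers]
    total = sum(part[0] for part in parts)
    count = sum(part[1] for part in parts)
    if count == 0:
        raise ValueError("no records available on any server")
    return total / count


if __name__ == "__main__":
    # Two illustrative sites with made-up values (assumption, not study data).
    site_a = DataServer("site_a", [1.0, 2.0, 3.0])
    site_b = DataServer("site_b", [4.0, 5.0, 6.0, 7.0])
    print(pooled_mean([site_a, site_b]))
```

In this sketch the client obtains the same mean it would compute from the pooled records, yet it only ever receives one sum and one count per site. A real deployment would also need disclosure checks on what the servers return, which are omitted here.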