B2361 - Assessing the extent to which disclosure control techniques impact on data utility - 18/12/2014
Cohort studies are required to comply with stringent ethico-legal safeguards when using individual-level personal data, particularly when these data relate to sensitive topics. Many of the CLOSER cohort studies assure study participants that their identity will be known to core study staff but hidden from the research end users of the data. Cohort data managers use a variety of processes to 'hide' participant identities, ranging from removing name and address information from data sets to complex statistical processes used to mask or block access to the underlying individual-level data. In addition to any reassurances made to participants, cohort studies are required to comply with a range of legislation relating to participant confidentiality. The Data Protection Act 1998 makes a distinction between personal data and anonymous data, where personal data is information that relates to an identifiable individual.
This covers data that contain direct identifiers, or from which identity can be determined by linking to other readily available information. This classification is important because the safeguards required for the use of personal information are far more stringent than those required for the use of anonymous information. The Data Protection Act 1998 requires that individuals are informed of the use of their personal information and, in the case of sensitive personal information (such as information relating to health or criminality status), that consent is obtained. Furthermore, even when these safeguards are in place, the Act requires that data are de-identified as early in the research process as possible - ideally prior to the point when data are provided to researchers. Achieving anonymity in a dataset is challenging, and is complicated by the fact that detailed individual-level data are relatively easy to associate back to the individual who provided them.
In 2013 the Health and Social Care Information Centre (HSCIC) released the Anonymisation Standard for the Release of Health and Social Care Data. The HSCIC's chosen methodologies are seen as consistent with the Information Commissioner's Office (ICO) Anonymisation Code of Practice. The ICO have subsequently endorsed the standard's anonymisation protocol. In this context, 'release' is taken to mean the distribution of cohort information from the central collecting organisation (e.g. ALSPAC) to research end users. The Anonymisation Standard adopted a statistical process known as k-anonymisation to control for disclosure risk through the suppression of unique patterns within individual-level data. The process works by transforming individual-level values to ensure that each individual record shares its values with at least k-1 other records, so that every pattern in the data occurs at least k times. Through suppressing uniqueness, k-anonymisation reduces the potential for deductive disclosure. A concern, however, is that the loss of information inevitably involved in this process (the scale of which increases as the k threshold is raised) may lead to a reduction in the epidemiological utility of the data.
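The trade-off described above can be illustrated with a minimal Python sketch. It is not the HSCIC procedure: the records, the variable names (`age`, `sex`, `self_harm`), the 5-year age bands and the helper functions are all hypothetical, and the sketch assumes the usual convention that a data set is k-anonymous when every combination of identifying values is shared by at least k records.

```python
from collections import Counter

def class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    return all(n >= k for n in class_sizes(records, quasi_identifiers).values())

def generalise_age(record, width=5):
    """Coarsen an exact age into a band of the given width (a hypothetical recoding)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

def suppress_small_classes(records, quasi_identifiers, k):
    """Drop records whose combination of values occurs fewer than k times."""
    sizes = class_sizes(records, quasi_identifiers)
    return [r for r in records
            if sizes[tuple(r[q] for q in quasi_identifiers)] >= k]

# Invented toy data; not drawn from any cohort.
records = [
    {"age": 16, "sex": "F", "self_harm": 1},
    {"age": 17, "sex": "F", "self_harm": 0},
    {"age": 16, "sex": "M", "self_harm": 0},
    {"age": 18, "sex": "M", "self_harm": 0},
    {"age": 19, "sex": "M", "self_harm": 1},
    {"age": 21, "sex": "F", "self_harm": 0},
]
QUASI_IDENTIFIERS = ["age", "sex"]

# Step 1: generalise exact ages into bands (loses within-band detail).
banded = [generalise_age(r) for r in records]
# Step 2: suppress any record that is still in a class smaller than k (loses records).
released = suppress_small_classes(banded, QUASI_IDENTIFIERS, k=2)

print(is_k_anonymous(records, QUASI_IDENTIFIERS, 2))
print(is_k_anonymous(released, QUASI_IDENTIFIERS, 2))
print(len(records), len(released))
```

In this toy case, reaching k = 2 costs both the exact ages and one record, and the released data would still fail at k = 3. This is the kind of information loss whose effect on estimates the project sets out to measure.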
This question can be addressed empirically. In ALSPAC we have linked study data to a number of sources of health and social administrative data, including the Hospital Episode Statistics (HES) database. Where linkage to HES is not undertaken with explicit individual consent but is instead permitted under the provisions of Section 251 of the NHS Act 2006, different stipulations apply depending on the sensitivity of the data items linked. Information related to sexual health and mental health, for example, is considered to be particularly sensitive, and in this situation stipulations around "stronger" k-anonymisation are likely to apply. Around 40% of our participants have explicitly consented to data linkage, so considerations around k-anonymisation do not apply to their data in the same way. We are therefore able to examine the influence, if any, of different levels of k-anonymisation (and other privacy protection procedures) on effect estimates derived from a particular dataset. To do this we will apply a series of different anonymisation processes to a data set used in an existing, published ALSPAC project on prevalence and risk factors for self-harm. By undertaking equivalent analyses in the same dataset subject to different levels of disclosure control, we will investigate the effect of the following common strategies:
1) no disclosure control beyond removing direct identifiers,
2) controlling for low cell counts in each individual variable in isolation,
3) applying 'weak' k-anonymisation (at different k thresholds) to all pseudo-identifiers in the data set in combination,
4) applying 'strong' k-anonymisation (at different k thresholds) to all variables in the data set bar the outcome variable,
5) applying an alternative approach to anonymisation, which perturbs the data by adding a known level of 'noise' in order to mask the true underlying values, and
6) using single-site DataSHIELD as a means of restricting access to the underlying individual-level data.
Option 5 will render the data to a state where they are not real in any sense (i.e. the variable values will not relate to any individual, as they have been statistically altered).
We will provide the research analyst with sufficient information to remove the noise from the data in their modelling (akin to the modelling undertaken to control for measurement error problems). Because the artificial noise is known, it will be possible to remove the effect of the transformations through the modelling, and therefore allow the analyst to produce accurate results without ever being aware of the true underlying individual-level data that relate to a study participant. In contrast, Option 6 will not attempt to alter the values of the underlying data. Instead, DataSHIELD will operate as a protective IT framework which gives the analyst a means of extracting statistical information without having access to the underlying individual-level data. This process is consistent with the principles of the ICO's Anonymisation Code of Practice.
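The idea behind the noise-based approach (Option 5) can be sketched with simulated data. The proposal does not specify the noise mechanism or the analysis model, so everything here is an assumption made for illustration: a single continuous exposure, a linear outcome model, independent additive Gaussian noise with a known standard deviation, and a simple method-of-moments correction of the kind used for classical measurement error. All names and values are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Sample variance (n - 1 denominator)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def covariance(a, b):
    """Sample covariance (n - 1 denominator)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def perturb(values, noise_sd, rng):
    """Mask the true values by adding independent Gaussian noise of known SD."""
    return [v + rng.gauss(0.0, noise_sd) for v in values]

def naive_slope(w, y):
    """Regression slope of y on the perturbed exposure, ignoring the noise."""
    return covariance(w, y) / variance(w)

def corrected_slope(w, y, noise_sd):
    """Subtract the known noise variance from the exposure variance before
    forming the slope, which removes the attenuation caused by the noise."""
    return covariance(w, y) / (variance(w) - noise_sd ** 2)

# Simulated data only; values chosen for illustration.
rng = random.Random(2361)
TRUE_SLOPE = 2.0
NOISE_SD = 1.0
x = [rng.gauss(0.0, 1.0) for _ in range(20000)]         # true exposure (never released)
y = [TRUE_SLOPE * xi + rng.gauss(0.0, 1.0) for xi in x]  # outcome
w = perturb(x, NOISE_SD, rng)                            # released, perturbed exposure

naive = naive_slope(w, y)
corrected = corrected_slope(w, y, NOISE_SD)
print(f"naive slope:     {naive:.2f}")
print(f"corrected slope: {corrected:.2f}")
```

Under these assumptions the naive slope is attenuated towards zero (here by roughly half, since the noise variance equals the exposure variance), while the corrected slope is close to the value used in the simulation, even though the analyst only ever sees `w` and the noise standard deviation. A real analysis of a binary outcome such as self-harm would need a correction suited to its own model.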
We will repeat this assessment in a different exemplar setting, using self-reported information on breastfeeding and IQ, and linked educational assessment information from the National Pupil Database. The exposure, outcome and confounder variables are all pre-determined by the choices of the original investigators. This project is not designed as an investigation of these exemplar topics.