Selected topics of statistical disclosure limitation

Date of Completion

January 2011

Keywords

Statistics

Degree

Ph.D.

Abstract

There is an ever-increasing demand from researchers for access to useful microdata files. At the same time, there are growing concerns about the privacy of the individuals represented in the microdata. Ideally, microdata would be released in a way that strikes a balance between the usefulness of the data and privacy. This dissertation begins with a review of proposed methods of statistical disclosure control and of techniques for assessing the privacy of such methods under different definitions of disclosure. One proposed method for accomplishing both goals is to release data sets that do not contain real values but yield the same inferences as the actual data. The idea is to view the confidential data as missing and use multiple imputation techniques to create synthetic data sets. The second chapter compares, from a utility perspective, different techniques for creating synthetic data sets in simple scenarios with a binary variable. One of the most pressing issues in the confidentiality literature, however, is the quantification of privacy. One proposal, ε-differential privacy, moves away from absolute guarantees of privacy to relative guarantees. Selecting an appropriate ε is difficult, though, because its interpretation is unclear. Further, different privacy-preserving techniques cannot be compared directly by simply comparing their respective values of ε. The aim of chapter 3 is to provide a measure that allows for direct comparison across different privacy schemes and is more easily interpreted. In turn, it is hoped that this will aid the policy debate over how much privacy is acceptable. Our proposal sets the problem in a hypothesis testing framework and uses the area under the receiver operating characteristic (ROC) curve as a measure of privacy. The ensuing chapter presents examples of applications of the privacy metric. Examples include the release of several different types of univariate statistics, the addition of different types of noise from the exponential power family, and the release of vectors. The chapter then concludes with several examples of assessing the privacy of synthetic data sets. This is followed by a conclusion and discussion.
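
To make the ROC-based idea concrete, the following is a minimal sketch of one way such a measure could be computed; it is an illustration under stated assumptions, not necessarily the construction used in the dissertation. It assumes a single released statistic with sensitivity 1, two neighboring data sets whose true values are 0 and 1, and Laplace noise with scale sensitivity/ε (the Laplace distribution is a member of the exponential power family). An adversary thresholds the released value to decide which data set produced it, and the area under the resulting ROC curve is estimated by simulation. The function names (laplace_noise, empirical_auc, privacy_auc) and all parameter values are hypothetical.

```python
import bisect
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def empirical_auc(neg, pos):
    # Mann-Whitney estimate of P(pos > neg) + 0.5 * P(pos == neg),
    # which equals the area under the ROC curve of a threshold test.
    neg_sorted = sorted(neg)
    total = 0.0
    for x in pos:
        lo = bisect.bisect_left(neg_sorted, x)   # negatives strictly below x
        hi = bisect.bisect_right(neg_sorted, x)  # negatives at or below x
        total += lo + 0.5 * (hi - lo)
    return total / (len(pos) * len(neg))


def privacy_auc(epsilon, sensitivity=1.0, n=20000, seed=0):
    # Assumption: the true statistic is 0 on one data set and `sensitivity`
    # on its neighbor; both are released with Laplace(sensitivity / epsilon) noise.
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    released_d0 = [laplace_noise(scale, rng) for _ in range(n)]
    released_d1 = [sensitivity + laplace_noise(scale, rng) for _ in range(n)]
    return empirical_auc(released_d0, released_d1)


if __name__ == "__main__":
    for eps in (0.1, 1.0, 5.0):
        print(eps, round(privacy_auc(eps), 3))
```

Under these assumptions, an area near 0.5 means the adversary does little better than guessing, while an area near 1 means the two data sets are almost perfectly distinguishable. Because the area lies on the same 0.5 to 1 scale whatever the noise mechanism, it can be compared across schemes in a way that raw values of ε cannot.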
